Section 8.4.  Fixed-Width Data

8.4. Fixed-Width Data

Use unpack to extract fixed-width fields.

Fixed-width text data:


    X123-S000001324700000199
    SFG-AT000000010200009099
    Y811-Q000010030000000033

is still widely used in many data processing applications. The obvious way to extract this kind of data is with Perl's built-in substr function. But the resulting code is unwieldy and surprisingly slow:

    
    # Specify field locations...
    Readonly my %FIELD_POS => (ident=>0,  sales=>6,   price=>16);
    Readonly my %FIELD_LEN => (ident=>6,  sales=>10,  price=>8);

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract each field...
        my $ident = substr($record, $FIELD_POS{ident}, $FIELD_LEN{ident});
        my $sales = substr($record, $FIELD_POS{sales}, $FIELD_LEN{sales});
        my $price = substr($record, $FIELD_POS{price}, $FIELD_LEN{price});

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

Using regexes to capture the various fields produces slightly cleaner code, but the matches are still not optimally fast:

    
    # Specify order and lengths of fields...
    Readonly my $RECORD_LAYOUT
        => qr/\A (.{6}) (.{10}) (.{8}) /xms;

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = $record =~ m/ $RECORD_LAYOUT /xms;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

The built-in unpack function is optimized for this kind of task. In particular, a series of 'A' specifiers can be used to extract a sequence of multicharacter substrings:


    

    # Specify order and lengths of fields...
    Readonly my $RECORD_LAYOUT
        => 'A6 A10 A8';    # 6 ASCII, then 10 ASCII, then 8 ASCII

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = unpack $RECORD_LAYOUT, $record;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }
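The speed claims above are easy to check on your own system. The following is a minimal sketch, assuming only the core Benchmark module; the subroutine names (via_substr, via_regex, via_unpack) are hypothetical, and plain "my" variables stand in for Readonly so that the script has no non-core dependencies. Relative timings depend on the Perl version and the hardware, so no figures are quoted here:

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# The sample records from the text, repeated to give each run some work to do
my @records = (
    'X123-S000001324700000199',
    'SFG-AT000000010200009099',
    'Y811-Q000010030000000033',
) x 1000;

# The same record layout, expressed three ways
my %FIELD_POS     = (ident => 0, sales => 6,  price => 16);
my %FIELD_LEN     = (ident => 6, sales => 10, price => 8);
my $REGEX_LAYOUT  = qr/\A (.{6}) (.{10}) (.{8}) /xms;
my $UNPACK_LAYOUT = 'A6 A10 A8';

# Hypothetical helper: extract the three fields with substr
sub via_substr {
    my ($record) = @_;
    return (
        substr($record, $FIELD_POS{ident}, $FIELD_LEN{ident}),
        substr($record, $FIELD_POS{sales}, $FIELD_LEN{sales}),
        substr($record, $FIELD_POS{price}, $FIELD_LEN{price}),
    );
}

# Hypothetical helper: extract the three fields with regex captures
sub via_regex {
    my ($record) = @_;
    return $record =~ $REGEX_LAYOUT;
}

# Hypothetical helper: extract the three fields with unpack
sub via_unpack {
    my ($record) = @_;
    return unpack $UNPACK_LAYOUT, $record;
}

# Run each candidate for at least one CPU second and print a comparison table
cmpthese(-1, {
    substr => sub { for my $r (@records) { my @fields = via_substr($r) } },
    regex  => sub { for my $r (@records) { my @fields = via_regex($r)  } },
    unpack => sub { for my $r (@records) { my @fields = via_unpack($r) } },
});
```

Each candidate is called in list context so that all three do the same amount of work per record.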

Some fixed-width formats insert one or more empty columns between the fields of each record, to make the resulting data more readable to humans. For example:

    X123-S  0000013247  00000199
    SFG-AT  0000000102  00009099
    Y811-Q  0000100300  00000033

When extracting fields from such data, you should use the '@' specifier to tell unpack where each field starts. For example:


    

    # Specify order and lengths of fields...
    Readonly my $RECORD_LAYOUT
        => '@0 A6 @8 A10 @20 A8';    # At column zero extract 6 ASCII chars
                                     # then at column 8 extract 10,
                                     # then at column 20 extract 8.

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = unpack $RECORD_LAYOUT, $record;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }
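To see the '@' specifiers at work without a data file, the spaced sample records can be fed to unpack directly. This is a small sketch under a few assumptions: the records are held in an in-memory array rather than read from $sales_data, the call to translate_ID() is omitted because that subroutine is not shown in this section, and a plain "my" variable replaces Readonly to keep the script dependency-free:

```perl
use strict;
use warnings;

# Columns are counted from zero: ident at 0, sales at 8, price at 20
my $RECORD_LAYOUT = '@0 A6 @8 A10 @20 A8';

# The spaced sample records from the text, with their line endings intact
my @lines = (
    "X123-S  0000013247  00000199\n",
    "SFG-AT  0000000102  00009099\n",
    "Y811-Q  0000100300  00000033\n",
);

my @sales;
for my $record (@lines) {

    # Extract all fields (the trailing newline lies past column 27,
    # so the 'A8' field never reaches it)
    my ($ident, $sales, $price) = unpack $RECORD_LAYOUT, $record;

    # Normalize sales, which are stored in 1000s
    push @sales, {
        ident => $ident,
        sales => $sales * 1000,
        price => $price,
    };
}

# Show what was extracted
for my $sale (@sales) {
    printf "%-6s  %12d  %s\n", @{$sale}{qw( ident sales price )};
}
```

Because every field is located by an explicit column, the padding columns between fields are simply never read.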

This approach scales extremely well, and can also cope with non-spaced data or variant layouts (i.e., with reordered fields). In particular, the unpack function doesn't require that '@' specifiers be specified in increasing column order. This means that an unpack can roam back and forth through a string (much like seek-ing a filehandle) and thereby extract fields in any convenient order. For example:


    

    # Specify order and lengths of fields...
    Readonly my %RECORD_LAYOUT => (
        #              Ident    Sales    Price
        Unspaced => '     A6      A10       A8',    # Legacy layout
        Spaced   => ' @0  A6   @8 A10   @20 A8',    # Standard layout
        ID_last  => '@21  A6   @0 A10   @12 A8',    # New, more convenient layout
    );

    # Select record layout...
    my $layout_name = get_layout($filename);

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = unpack $RECORD_LAYOUT{$layout_name}, $record;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

The loop body is very similar to those in the earlier examples, except for the record layout now being looked up in a hash. The three variations in formatting and sequence have been cleanly factored out into a table.
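As a quick check that the table-driven version behaves identically for every layout, the sketch below runs one sample record through each entry. Several things here are assumptions made for illustration: the %SAMPLE_RECORD hash is hypothetical, its ID_last line was constructed by hand to fit that layout's column positions, and the loop iterates over the layout names directly instead of calling get_layout(), which is not defined in this section:

```perl
use strict;
use warnings;

# The three layouts from the text (whitespace in a template is ignored)
my %RECORD_LAYOUT = (
    Unspaced => 'A6 A10 A8',
    Spaced   => '@0 A6 @8 A10 @20 A8',
    ID_last  => '@21 A6 @0 A10 @12 A8',
);

# Hypothetical test data: the same record written in each layout.
# The ID_last line is constructed for illustration only.
my %SAMPLE_RECORD = (
    Unspaced => 'X123-S000001324700000199',
    Spaced   => 'X123-S  0000013247  00000199',
    ID_last  => '0000013247  00000199 X123-S',
);

my %fields_for;
for my $layout_name (sort keys %RECORD_LAYOUT) {

    # Identical extraction code, whatever the layout
    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT{$layout_name}, $SAMPLE_RECORD{$layout_name};

    $fields_for{$layout_name} = {
        ident => $ident,
        sales => $sales * 1000,
        price => $price,
    };

    print "$layout_name: ident=$ident sales=$sales price=$price\n";
}
```

Supporting a further layout then means adding one entry to the table, with no change to the extraction code.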

Note that the entry for $RECORD_LAYOUT{ID_last}:


        ID_last => '@21 A6  @0 A10  @12 A8',

makes use of non-monotonic '@' specifiers. By jumping to column 21 first, then back to column 0, and on again to column 12, this ID_last format ensures that the call to unpack within the loop:


        my ($ident, $sales, $price)
            = unpack $RECORD_LAYOUT{$layout_name}, $record;

will extract the record ID before the sales amount and the price, even though the ID field comes after those other two fields in the file.
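The values come back in the order the template names them, not in the order they occur in the string. The following sketch makes that visible by unpacking one record twice; the sample record and the second, file-order template are assumptions constructed for this illustration and do not come from any real data file:

```perl
use strict;
use warnings;

# Hypothetical ID_last record: sales at column 0, price at 12, ident at 21
my $record = '0000013247  00000199 X123-S';

# Monotonic '@' specifiers return the fields in file order...
my @file_order = unpack '@0 A10 @12 A8 @21 A6', $record;

# ...whereas the ID_last template jumps to column 21 first,
# so the ident is returned ahead of the other two fields
my @ident_first = unpack '@21 A6 @0 A10 @12 A8', $record;

print "File order:  @file_order\n";
print "Ident first: @ident_first\n";
```

With the first template, the caller would have to assign to ($sales, $price, $ident) instead, which is exactly the per-layout special case that the non-monotonic template avoids.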
