Приглашаем посетить

Мамин-Сибиряк (mamin-sibiryak.lit-info.ru)

4.9 Antipatterns

An inexperienced programmer can create code that is far more convoluted than necessary, resulting in decreased maintainability and possibly decreased performance. A team once told me they were about to buy bigger hardware to run a program that was taking more than two hours and 1 GB of RAM each time it ran. When I looked at the program, I saw that it used parallel arrays and linear searches instead of hashes, read large amounts of input into memory, and sorted it there. After editing, a run consumed 20 minutes, 7 MB, and half as much source code. Now I'll cover some inefficient practices to look out for.^[11]

^[11] Thanks to Adam Turoff for pointing out that these are best called "antipatterns" [BROWN98].

4.9.1 Not Using Hashes

The hash has to be the most useful data structure ever invented. This book assumes that you are familiar with their use and therefore know why they are so useful. However, the programmer who wrote your legacy code may not have known this. The complete absence of a hash from that code is a strong clue that the programmer didn't know about hashes, because it's a rare program that can't make good use of a hash somewhere.

Look out for linear searches through arrays. The code may look like this:


my $found = 0;

foreach my $element (@array)

{

  $found = 1 and last if $element eq $search;

}

if ($found) ...

or:


my $found_index = -1;

for (my $i = 0; $i <= $#array; $i++)

{

  $found_index = $i, last if $array[$i] eq $search;

}

if ($found_index >= 0) ...

or even:


if (grep $_ eq $search, @array) ...

It's common to create a hash just so you can do lookups in constant time instead of having to perform linear searches over an array. If you only want to do one lookup, then go ahead and use grep(), because creating a hash just for that purpose will be slower and occupy more code. However, consider whether the data should have been stored in a hash in the first place. Needing to perform a lookup or other random access on an array usually indicates that it should have been a hash to begin with.

If you don't care about the ordering of the data, a hash will work. If you do care about the order, sometimes you can use an array in conjunction with a hash. For instance, suppose you are reading information about American states. The data arrives in a particular order, but you want to store it in a two-level hash for random lookup:


my @COLS = qw(NAME CAPITAL LARGEST_CITY BIRD FLOWER);

my %state;

while (<DATA>)

{

  my ($abbreviation, @vals) = split;

  @{$state{$abbreviation}}{@cols} = @vals;

}

__END__

AL Alabama Montgomery Birmingham Yellowhammer Camellia

...

4.9.2 Magic Numbers

Literal numbers, and some literal strings, in code should be given symbolic names and declared at the head of the program or in some form of configuration input. For example, instead of:


open_server('dbms_temp', 442);

write:


my $DBMS_SERVER = 'dbms_temp';  # Database server name

my $DBMS_PORT   = 442;          # Database server port

.

.

.

open_server($DBMS_SERVER, $DBMS_PORT);

This may occupy more space, but it has three important benefits:

It attaches a meaning to the value being passed (make the variable or constant name as descriptive as possible).
It allows you to identify multiple uses of the same thing as opposed to multiple unrelated occurrences of (coincidentally) the same value.
It puts all the kinds of values that maintenance programmers are likely to want to change in one place where they can change them without having to wade through code you'd rather they didn't change.

These advantages are so key (and applicable to every programming language), you'd wonder why anyone would not follow this rule. Unfortunately, many people don't, and during editing one of the first things you can do is to replace "magic numbers" with symbolic names.

There are cases in which this would be overkill. For instance, subroutine calls with named parameters already tell you what the purpose of a value is:


trim_tree(angels => 1, balls => 14, lights => 35);

However, it is probably still advantageous to give these values names so they can be adjusted in a global configuration section:


my $NUM_ANGELS = 1;

my $NUM_BALLS  = 14;

my $NUM_LIGHTS = 35;

...

trim_tree(angels => $NUM_ANGELS, balls => $NUM_BALLS,

          lights => $NUM_LIGHTS);

or, for that matter:


my %TREE_TRIM_OPTS = (angels => 1, balls => 14, lights => 35);

...

trim_tree(%TREE_TRIM_OPTS);

It may be overkill to use symbolic names for values like 0 and 1, especially if the context is obvious or if you definitely do not want any maintenance programmer changing the assignments, but even here it is possible to make improvements. For instance, subroutines that want a true or false value in a positional parameter won't (or shouldn't) care what kind of true or false value is used, so you can substitute a value that happens to be true but is also symbolic.

So instead of:


my $p = HTML::TreeBuilder->new->parse(get($URL));

$p->traverse(\&element_proc, 1);

say:


my $p = HTML::TreeBuilder->new->parse(get($URL));

$p->traverse(\&element_proc, 'IGNORE_TEXT');

to give you a reminder of what the second parameter to traverse() is for.

4.9.3 Ineffective Regular Expressions

Regular expressions are such a huge topic that they deserve their own book . . . and fortunately they already have one [FRIEDL02]. It's a rare Perl program that can't put a regular expression to good use, and if you inherit a program that doesn't have any, take a hard look at it to see whether it should have some. If someone slices up text using substr() and/or split() (which uses a regular expression anyway, although it can be disguised as a string), or searches for multiple substrings using index(), either they are being very smart in solving a problem that requires maximum performance, or they weren't fluent in the language of regular expressions.

Matching text with a regular expression is an art; you want the simplest regex that will match what you want in every possible case. It may not be necessary to make the regex as restrictive as possible. For instance, suppose you are matching uids in lines from /etc/passwd:


peter:x:500:500:Peter Scott:/home/peter:/bin/tcsh

The colon-separated fields in this sample line are, in order: username, password, uid, gid, GECOS, directory, shell. The non-regex solution looks like this:


($username, $password, $uid, $gid, $gecos, $dir, $shell) =

split /:/;

or, more mercifully:


@fields = split /:/;

$uid = $fields[2];

Now, a regex-based solution looks hideous if it insists on matching the whole line according to rigid syntax rules:


$uid = $1 if /^[a-z][a-z0-9_]*:.*?:(\d+):\d+(?::.*?){3}$/;

but really, why bother? In an /etc/passwd line, the uid is the first string composed solely of digits between two colons, because usernames cannot start with digits. So /:(\d+)/ will match a uid. It would be paranoid code indeed that needed to check that a line in /etc/passwd had valid syntax, considering the consequences for the system if a line was corrupted; you'd already know about it.

Is it possible for the encrypted password to be composed only of digits? Most /etc/passwd files have just an "x" in that field these days, leaving the real encrypted password to a more protected /etc/shadow file or an external authentication source, in which case the answer is no. Even if we assume that it might be possible for that password field to be all digits, the code for matching the uid need be no more complex than


($uid) = /:.*?:(.*?):/;

For another example, suppose we're reading an accounting report containing lines like:


Total expenditure = $14731.00, fiscal July

and we want to extract the dollar amount. What regular expression will do? We don't know, without knowing more about what text could be on those lines. One possibility is to extract text between the $ and ,:


/\$(.*?),/ and $amount = $1;

but we would need to know that there couldn't be lines like, for instance:


surplus of $7443.95 was carried over to August, due to

where our regex will match the underlined text. In that event, a regular expression like /\$[\d.]+/ may suffice. Or maybe not, if the report might resort to (admittedly unfiscal) scientific notation:


Total expenditure = $7.6324E8, fiscal August

in which case we might resort to any of:


/\$(\S+),/

/expenditure = $([^,\s]+)/

/\$([\d.e+- ])/i

depending entirely on our knowledge of the forms the input can take and how much the lines we want to match can vary.

If the program appears to be using regular expressions for matching any kind of common pattern, then be lazy and use a regex that was generated just for that purpose by someone else's code. Damian Conway's module Regexp::Common (now maintained by Abigail) can generate extremely complex regexes for solving common problems (http://search.cpan.org/dist/Regexp-Common/). For matching delimited text—even blocks of Perl code—there's Damian's Text::Balanced (in the Perl core as of version 5.8.1), and for really heavy-duty parsing, there are the full grammar capabilities of his Parse::RecDescent module (http://search.cpan.org/dist/Parse-RecDescent/).

4.9.4 Calling Too Many External Programs

A programmer who has migrated to Perl programming from shell scripting is used to calling external programs to do just about everything. Text manipulation in shell programs is accomplished through carefully combining the uses of sed, awk, head, grep, cut, sort, and other programs. If a Perl programmer of level 6 or lower calls these programs then they are probably unnecessary, whereas a programmer of higher than level 6 expertise may know what they are doing (see the next section).

Text manipulation is far easier to do in Perl than by calling external programs: You only have one language to learn, the code is portable to systems that don't have those programs, and it is likely to be much faster because of avoiding process creation and interprocess communication overhead. Furthermore, the code is likely to be much briefer and clearer in Perl.

Perl was written to be familiar to shell programmers and users of the common filter programs; often you can take a call to an external program and turn it into Perl without any change to the substantive code. So for instance, instead of:


open OUT, ">/tmp/sedin";

print OUT map "$_\n", @entries;

close OUT;

system "sed -e 's/apples/oranges/g' /tmp/sedin > /tmp/sedout";

open IN, "/tmp/sedout";

while (<IN>)

{

  chomp;

  push @new_entries, $_;

}

@entries = @new_entries;

which has many other problems to boot, just say:


s/apples/oranges/g for @entries;

Even the substantive code will likely be briefer in Perl if regular expressions are involved, due to the expressive power of Perl's regular expressions.

4.9.5 Not Calling Enough External Programs

There's no need to be dogmatic about avoiding external programs, however. Some external programs can do a job faster and clearer than Perl and if there's no need for portability, then you should use them. For instance, suppose you have created an entire tree of temporary files under a root directory named by $temptree. The average programmer would concoct a recursive depth-first unlink scheme using File::Find, because the majority aren't aware of File::Path, which lets you say:


use File::Path;

rmtree($temptree);

However, it's even more succinct to say:


system "/bin/rm -r $temptree";

with the added benefit that you can even put the task in the background while you get on with something else:


system "/bin/rm -rf $temptree &";

I decided to suppress error output at the same time. You can decide whether this meets your requirements better than File::Path on a case-by-case basis; it's not portable, and the -f option may not prevent warnings about unremovable directories, in which case you would have to redirect STDERR.

For another example, imagine you need to know whether a file whose name is in $searchfile contains the string "haddock". You could code this in Perl:


open my $fh, $searchfile or die "Can't open $searchfile: $!\n";

my $found = 0;

while (<$fh>)

{

  $found = 1, last if /haddock/;

}

close $fh;

But you could also just do:


system "grep -q haddock $searchfile";

and test to see if $? is zero (means success; i.e., "haddock" was found). The pitfall is lack of portability, and not just to systems that don't have grep; not all greps have the -q (for "quiet") option, and some call it -s (for "silent").

Finally, there are legions of programs that for reasons of proprietary code or lack of interest on developers' parts are essentially black boxes and reverse engineering is not cost-effective.

4.9.6 Superfluous Truth

This miniature point is about storing boolean values. When you want to look up a value that is true or false, use Perl's built-in truth values instead of making up something more complex:

Not so good:


$found_fruit{apple} = ($rec =~ /apple/) ? 'yes' : 'no';

...

if ($found_fruit{apple} eq 'yes') ...

Better:


$found_fruit{apple} = $rec =~ /apple/;

...

if ($found_fruit{apple}) ...

Not so good:


$in_stock{$item} = 1;

...

if (exists $in_stock{pants}) ...

Better:


$in_stock{$item} = 1;

...

if ($in_stock{pants})...

Table of Contents