8.3 Improving Code with Modules

Vast amounts of legacy code have been expended duplicating functionality that now resides in modules. You can slim down programs drastically when you find code that you can excise in favor of a module that does the same job. Here are some common opportunities to look out for.

8.3.1 CGI.pm

For some reason, legions of programmers never got the word that there is a module for doing the common functions they need to perform in a CGI program. If you encounter a CGI program that's not using CGI.pm, unless it has some unusual performance requirements, do yourself a favor and convert it.

The most basic and important functionality of CGI.pm is to return the values supplied by a user to form inputs. When you see code like this:


foreach (split /&/, $ENV{QUERY_STRING})
{
  ($key, $val) = split /=/, $_, 2;
  s/%([A-Fa-f0-9]{2})/pack("c", hex($1))/ge for ($key, $val);
  $data{$key} = $val;
}

you're looking at (flawed) code for decoding (some) form inputs. (It might not be exactly the same as that.)
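
To see why that loop counts as flawed, here is a small sketch that runs the same decoding logic against a made-up query string. The query string and the lexical variable names are assumptions for illustration only. It demonstrates two of the problems: a plus sign, which browsers use to encode a space, is never decoded, and a repeated input name silently keeps only its last value.

```perl
use strict;
use warnings;

# Hypothetical query string of the kind a browser might send.
my $query = 'name=John+Smith&color=red&color=blue&note=50%25+off';

my %data;
foreach my $pair (split /&/, $query) {
    my ($key, $val) = split /=/, $pair, 2;
    # Same percent-decoding as the legacy loop.
    s/%([A-Fa-f0-9]{2})/pack("c", hex($1))/ge for ($key, $val);
    $data{$key} = $val;    # a repeated key overwrites the earlier value
}

print "name:  $data{name}\n";     # name:  John+Smith (plus sign left as-is)
print "color: $data{color}\n";    # color: blue       ("red" has been lost)
print "note:  $data{note}\n";     # note:  50%+off
```

CGI.pm's param() takes care of both cases: it turns the plus signs back into spaces and returns every value of a repeated input.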

CGI.pm makes reading form inputs ridiculously easy. No matter what type the input field is, if its name is, say, myforminput, then your program can retrieve its value with:


use CGI qw(param);
my $myforminput = param('myforminput');

Even when handling a file upload for an input named, say, myfileupload, CGI.pm makes this as easy as could be:


use CGI qw(param);
my $upload = param('myfileupload');

# Use $upload as a string, get the file name:
use File::Basename;
my $saveto = basename($upload);

# Use $upload as a filehandle, get the upload contents:
open(OUT, '>', $saveto) or die $!;
print OUT while <$upload>;
close OUT;

You can get at all of the parameters at once using the Vars() method:


use CGI qw(Vars);
my %param = Vars();

CGI.pm also contains methods that can generate HTML, but don't feel you have to use them; I prefer a templating system such as HTML::Template so I can get an HTML expert to take care of the appearance of a site without any assistance from me.

The other CGI.pm methods that I use routinely with templates are header() to generate the HTTP header, and various reflective functions such as url() for finding out the URL my program was invoked with.

Some programs don't do the input decoding themselves but use an older library called cgi-lib.pl. This exports a function called ReadParse() that places form inputs in a package hash called %in. CGI.pm has a compatibility mode that allows you to migrate such scripts painlessly. Where the program currently says:


require 'cgi-lib.pl';
&ReadParse;

replace those lines with:


use CGI;
CGI::ReadParse;

That's it. Of course, a lot of useful CGI.pm functionality will be missing because it wasn't in cgi-lib.pl. For instance, multiple form inputs with the same name will show up in %in as a single value with null characters separating the different inputs. CGI.pm does it that way because that is what cgi-lib.pl does, whereas the CGI.pm param() method returns those inputs as a true list.

8.3.2 Date Parsing and Manipulation

So many people seem wedded to the idea of calling an external program to find out the date. Granted, this is the way it's done in shell scripts, but we can do better in Perl. You can:

  • Get the values for seconds, month, and so on, from localtime().

  • Format a date string any way you want, using the strftime() function exported from the (core) POSIX module.

  • Parse a date string in virtually any format into a UNIX time (seconds since epoch), using the (CPAN) Date::Parse module. For example:

    
    % perl -MDate::Parse -le \
     'print str2time("Thursday")'
    1057215600
    
    
  • Parse a date string in even more formats (at the expense of noticeable compilation time) using the (CPAN) Date::Manip module. For example:

    
    % perl -MDate::Manip -le \
     'print ParseDateString("Next Wednesday")'
    2003070900:00:00

    % perl -MDate::Manip -le \
     'print ParseDateString("Last Wednesday")'
    2003070200:00:00
    
    

    Also, you can perform numerous operations with Date::Manip, such as working with recurring events and finding out the dates of holidays.

  • Perform calculations on dates and times with the (CPAN) module Date::Calc. Life is too short to write yet more code for handling base-60 calculations. For example, to find the number of days between April Fool's Day and Canada Day in 2003:

    
    % perl -MDate::Calc=:all -le \
     'print Delta_Days(2003, 4, 1, 2003, 7, 1)'
    91
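
The first two items in the list need nothing beyond core Perl. Here is a minimal sketch of both; the variable names and the format strings are illustrative choices, not anything prescribed by the modules.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# localtime() in list context returns the broken-down fields;
# the month is zero-based and the year is an offset from 1900.
my ($mday, $mon, $year) = (localtime)[3, 4, 5];
printf "Today is %04d-%02d-%02d\n", $year + 1900, $mon + 1, $mday;

# strftime() takes a format followed by that same list of fields.
print strftime("%A, %d %B %Y", localtime), "\n";

# A fixed epoch value with gmtime() gives a repeatable result.
my $epoch_start = strftime("%Y-%m-%d %H:%M:%S", gmtime(0));
print "$epoch_start\n";    # 1970-01-01 00:00:00
```

No external date program is started at any point, which is the advantage over the shell-script habit.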
    
    

8.3.3 Socket.pm and IO::Socket

Socket programming in Perl has been simple since 1995, when Graham Barr's Socket.pm module hit the archives, and even simpler since 1996, when his IO::Socket module came out. Instead of using constants from a .ph file and using pack() and unpack(), with Socket.pm you can easily get, say, the IP address of a machine from its name:


% perl -Mstrict -Mwarnings -MSocket -l
my $addr = gethostbyname("www.perlmedic.com")
  or die "Lookup failed";
print inet_ntoa($addr);
^D
204.95.83.7

IO::Socket, and in particular IO::Socket::INET, provides object-oriented packaging for sockets that makes client/server programming a snap.
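
As a rough illustration of how little code that takes, the following sketch puts a tiny server and client in a single process on the loopback interface. The message text and variable names are made up for the example, and a real server would normally run as its own process and loop over accept().

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Server side: listen on the loopback interface; port 0 asks the
# operating system to pick any free port.
my $server = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,
    Proto     => 'tcp',
    Listen    => 5,
) or die "Cannot listen: $@";
my $port = $server->sockport;

# Client side: connect to that port and send one line.
my $client = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $port,
    Proto    => 'tcp',
) or die "Cannot connect: $@";
print $client "hello\n";

# Back on the server: accept the pending connection and answer it.
my $conn = $server->accept or die "Accept failed: $!";
my $line = <$conn>;
print $conn "You said: $line";

my $reply = <$client>;
print $reply;    # You said: hello

close $_ for $client, $conn, $server;
```

The sockets behave like ordinary filehandles, so print and the readline operator are all that is needed to exchange data.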

For a comprehensive treatment of just what you can do with IO::Socket, see [STEIN00].

8.3.4 HTML Parsing

Thousands of people want to parse HTML because they believe that's the only way to get at the result of some kind of operation they want to perform. Unfortunately, they're usually right. Face it, if what you want to know is, say, your bank balance, then just about the most annoying way you could think of for finding it out would be to submit several sets of inputs gleaned from successive HTML forms, finally parsing the quantity out of a morass of tags whose layout changes every other week. It would be much nicer if the bank provided you with an application program interface (API) like this:


$balance = get_balance($account, $PIN);

although, yes, it would be more likely to work like this:


$acct = BankAccount->new($username, $PIN) or die ...;
$balance = $acct->balance($account);

but we're still dreaming; banks don't do that. One day banks might become enlightened enough to provide SOAP interfaces, or at least some kind of XML API, but until then people like me who want to automate this task face some kind of HTML parsing.

You may not be accessing an online banking service, but that doesn't make any difference to the ease or difficulty of the task of parsing HTML. So find out first of all whether the creator of the pages you think you need to parse can provide them in a more palatable format. RSS, for instance, may be available for content resembling news stories.[10] If you are the creator of the HTML pages, don't force yourself to parse the HTML to extract content when you could provide an interface to that content some other way.

[10] RSS stands for RDF Site Summary; RDF stands for Resource Description Framework.

If you're stuck with no alternative but to parse HTML, you're in good company. So don't reinvent the wheel when so many people have constructed entire fleets of articulated trucks before you.

The temptation to parse HTML with regular expressions appears irresistible, judging by how many people do it. Not, to quote Seinfeld, that there's anything wrong with that. If you can accept the risks and limitations of that style of parsing, it can certainly be more succinct and readable than the alternatives. However, you need to know what the risks and limitations are in order to accept them.

The risks are that you will mislabel some part of the content; you might be looking for anchor tags, for instance, and accidentally find some that were embedded in comments or JavaScript that have nothing to do with what you were looking for. If you know the provenance of the content well enough to know that this has no chance of happening, then go ahead:


@links = $content =~ /<A\s+HREF\s*=\s*"(.*?)"/ig;

Or suppose you're scanning for form input tags, and you know the source well enough to know that the TYPE attribute will always precede the NAME attribute. Then you can use:


@inputs = $content =~ /<INPUT\s+TYPE="text"\s+NAME="(.*?)"/ig;

But the list of caveats for these examples builds up rapidly—are we sure that there will always be enclosing quotation marks, for instance? Browsers can work without them. When you get too nervous, it's time to invoke the no-nonsense power of a true HTML parser, such as HTML::Parser, HTML::TreeBuilder, or—if all you want is just hyperlinks—HTML::LinkExtor. For further reading, see [BURKE02]. Randal Schwartz also pointed out advantages to using XML::Parser to parse HTML in [SCHWARTZ03b].

8.3.5 URI Parsing

Uniform Resource Identifier (URI) is the proper technical term now for what 99 percent of the world still calls a Uniform Resource Locator (URL). Common operations you may need to perform on a URI include:

  • Extracting one of the components (scheme, path, server, etc.).

  • Changing one of those components.

  • Turning a relative URI into an absolute one.

Rather than trying to figure these things out via regular expressions, use the URI.pm module, which has methods for all of them.
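
A short sketch of those three operations follows. The example.com addresses are placeholders invented for the illustration, and URI.pm has to be installed from CPAN.

```perl
use strict;
use warnings;
use URI;    # CPAN module

my $uri = URI->new('http://www.example.com:8080/docs/index.html?lang=en');

# Extract components.
print $uri->scheme, "\n";    # http
print $uri->host,   "\n";    # www.example.com
print $uri->port,   "\n";    # 8080
print $uri->path,   "\n";    # /docs/index.html

# Change components in place; the object stringifies to the new URI.
$uri->host('www.example.org');
$uri->path('/docs/faq.html');
print "$uri\n";    # http://www.example.org:8080/docs/faq.html?lang=en

# Turn a relative URI into an absolute one, given a base.
my $abs = URI->new_abs('../images/logo.png',
                       'http://www.example.com/docs/index.html');
print "$abs\n";    # http://www.example.com/images/logo.png
```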

8.3.6 Database Interaction

Too few programs that should be using proper databases really are using them. Whenever concurrent access is a possibility, in particular, a database may be appropriate, since a proper relational database will handle locking for you automatically and provide transaction semantics. Use DBI.pm and whichever of the myriad DBD:: family of modules speaks to the database of your choice. MySQL and PostgreSQL are two excellent free relational databases, so procurement cost need not be an obstacle. You will have to learn SQL (unless you use any of a number of CPAN modules that attempt to insulate you from that task), but it's not that difficult and the gains are well worth it if you're going to make much use of the database.

If concurrency is not an issue and you don't want to go through the trouble of putting up a database server, a DBM file offers significant advantages over a plain text file. You can use core modules such as DB_File to manage random access to the data, and store complex hierarchical structures with MLDBM (http://search.cpan.org/dist/MLDBM/). Best of all, you can do this using tie() as the interface, so if, say, you've got a program that is running out of memory to store a particular hash in, with a couple of lines you can tie it to a DBM file and instantly perform the trade-off of execution time for memory. (Just make sure that the code doesn't call keys() or values() in a list context, or use the hash in a list context, or it'll pull the whole database into memory to do so.)
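
The tie() approach might look like the sketch below. It is an illustration under stated assumptions: it uses SDBM_File, which is built with every Perl, in place of DB_File, which needs the Berkeley DB library. The %stock hash, its contents, and the file location are all invented for the example.

```perl
use strict;
use warnings;
use Fcntl;                     # supplies O_RDWR and O_CREAT
use SDBM_File;                 # assumption: stand-in for DB_File
use File::Temp qw(tempdir);

# Hypothetical location for the DBM files; a temporary directory here.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/inventory";

my %stock;
tie(%stock, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666)
    or die "Cannot tie $file: $!";

# From here on %stock is used like any other hash,
# but its contents live on disk rather than in memory.
$stock{widgets} = 12;
$stock{gadgets} = 7;
$stock{widgets} += 3;

# each() fetches one record at a time instead of building
# the complete key list the way keys() would.
my $total = 0;
while (my ($item, $count) = each %stock) {
    $total += $count;
}
print "Total items: $total\n";    # Total items: 22

untie %stock;
```

Apart from the tie() and untie() calls, nothing in the code that uses the hash has to change.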

If you want to do your own locking to handle concurrency, you can, but this is not easy to get right (see the section "Locking: The Trouble with fd" in the documentation for the DB_File module).

There are many modules on CPAN that provide abstraction layers on top of DBI. One such worth checking is Michael Schwern's Class::DBI, now maintained by Tony Bowden (http://search.cpan.org/dist/Class-DBI/).

For further reading, see [DESCARTES00].

8.3.7 Mail Processing

A plethora of modules exist to make mail sending and reading easy. For sending mail, Mark Overmeer's Mail::Send is very easy to use (http://search.cpan.org/dist/MailTools/). The same distribution includes Mail::Internet, a module for parsing mail messages into objects. Mark's Mail::Box (http://search.cpan.org/dist/Mail-Box/) is a new module that takes this concept and extends it to mail folders. For creating and parsing messages with attachments, use Eryq's MIME::Lite (http://search.cpan.org/dist/MIME-Lite/), maintained by Yves Orton.

To read mail from a server, use the core Net::POP3 module, or David Kernen's Mail::IMAPClient (http://search.cpan.org/dist/Mail-IMAPClient/).

I use Simon Cozens' Mail::Audit (http://search.cpan.org/dist/Mail-Audit/) in conjunction with Justin Mason's wildly popular Mail::SpamAssassin (http://www.spamassassin.org/ and CPAN) to manage the flood of garbage that pollutes the mailbox of anyone with an address published in the Perl change log.

For further reading, see [STEIN00].

8.3.8 XML Manipulation

XML and Perl have a rich history of association. It is impossible to do justice to that in the space I have. XML is a complex technology spawning many even more complex technologies such as XSL, SOAP, WSDL, and so on. There are many modules for making XML easier; in particular see XML::Simple, XML::Parser, XML::TreeBuilder, XML::Writer, XML::SAX, XML::DOM, and XML::Grove. HTML::Parser has an XML parsing mode that doesn't depend on your having the expat library that XML::Parser requires. See Matt Sergeant's AxKit (http://axkit.org/) for a spectacular example of how much can be done with XML and Perl on a web server.

To really explore what you can do with Perl and XML takes a whole book; see [RAY02].
