5.2. Handling English Text

Most of the time when dealing with natural-language processing we don't really need any heavy, state-of-the-art language manipulation algorithms. Indeed, most of what we're doing with Perl involves merely throwing around different chunks of text.

5.2.1. Pluralizations and Inflections

Our introduction to handling English text comes from the perennial user interface disaster:

    You have 1 messages.

If you've been using (or perhaps writing) bad code for long enough, you might not see anything wrong with that, but it is actually somewhat grammatically lacking. Everyone at some point has written code that gets around the problem, perhaps a little like this:

    print "You have " . $messages . " message" . ($messages == 1 ? "" : "s") . ".\n";

This itself should already be looking like a candidate for modularization, but the problem gets worse:

    You have 2 messages in 2 mailboxs.

Another oops. We surely meant mailboxes. We could write another special case for the word mailbox, but what's really needed is a generic routine to make things agree with a number. Unfortunately, of course, due to the hideous complexity of the English language, this is a near-impossible task. Thankfully, the great Dr. Damian Conway speaks Australian English, simplifying the problem dramatically, and has produced the Lingua::EN::Inflect module.

This provides a whole host of subroutines, but perhaps the most useful for us are the PL, NO, and NUMWORDS routines.

The first subroutine, PL, provides a way to get at the plural form of a given word:

    % perl -MLingua::EN::Inflect=PL -le 'print "There are 2 ",PL("aide-de-camp")'
    There are 2 aides-de-camp

Additionally, you can pass in a number as well as a word to be pluralized, and PL will only do the pluralization if the number requires a plural.

    use Lingua::EN::Inflect qw(PL);
    for my $catcount (0..2) {
        print "I saw $catcount ", PL("cat", $catcount), "\n";
    }

    # I saw 0 cats
    # I saw 1 cat
    # I saw 2 cats

Now we're closer to solving our message/mailbox problem:

    print "You have $message ", PL("message", $message),
          " in $mailbox ", PL("mailbox", $mailbox), "\n";

This is a little smarter, although there's a certain amount of repetition in there. This is where we move on to the next subroutine, NO. This combines the number with the appropriate plural and, additionally, translates "0" into the slightly more readable "no":

    use Lingua::EN::Inflect qw(NO);
    my $message = 0; my $mailbox = 4;

    print "You have ".NO("message", $message). " in ".
          NO("mailbox", $mailbox)."\n";

    # You have no messages in 4 mailboxes

I prefer a slightly more refined approach, which takes advantage of the fact that people find it easier to read numbers from one to ten in running text if they're spelled out. For this, we need to bring in the NUMWORDS subroutine, which converts a number to its English equivalent. My preferred pluralization routine looks like this:

    sub pl {
        my ($thing, $number) = @_;
        return NUMWORDS($number). " ".PL($thing, $number)
           if $number >= 1 and $number <= 10;

        NO($thing, $number);
    }

This handles "no cats," "one cat," "two cats," and "65 poets-in-residence" all perfectly well.

Inflections

The whole problem of inflections gets much harder when you're localizing an application for different languages. Sean M. Burke and Jordan Lachler wrote a good article on the subject about Locale::Maketext, a module that helps you deal with localizations in a smart way. You can find the article at http://interglacial.com/~sburke/tpj/as_html/tpj13.html, or in the documentation for Locale::Maketext.

5.2.2. Converting Words to Numbers

The handy NUMWORDS subroutine from Lingua::EN::Inflect turns a number into English text for human-friendly display. A bunch of other modules on CPAN do roughly the same thing, including Lingua::EN::Numbers, Lingua::EN::Nums2Words, and Lingua::Num2Word.

However, if we're really doing natural language work and trying to extract meaning from a chunk of text, we are often called to do precisely the opposite: turn some English text representing a number into its computer-friendly set of digits. The best Perl module for this on CPAN is Joey Hess's Lingua::EN::Words2Nums.

Although it doesn't give you a regular expression for extracting numbers directly, once you have your number, it does a very thorough job of turning it into a digit string. The module exports the words2nums function, which does all the hard work:

    % perl -MLingua::EN::Words2Nums -e 'print words2nums("twenty-five")'
    25

I particularly like this module because it caters to the fact that I can't spell. So, if I misspell forty-two, words2nums still returns the desired result:

    % perl -MLingua::EN::Words2Nums -e 'print words2nums("fourty-two")'
    42

However, the fact that it can't scan through a text and return the first number it sees can be a bit of a pain. It's all very well if we're using it when prompting for a number:

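    my $times;
    # A do { } while block is not a real loop, so "last" cannot be used
    # inside it; a plain while (1) loop behaves as intended.
    while (1) {
       print "How many times should we repeat the process? ";
       $times = words2nums(scalar <STDIN>);
       last if defined $times;
       print "Sorry, I didn't understand that number.\n";
    }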

But if, for instance, we want to write a supply chain program that automatically processes customer orders by email, we need to be able to scan through the text of the email to extract the numbers, so we can turn "I would like to buy forty-five copies of Advanced Perl Programming" into:

    $order = { quantity => 45, title => "Advanced Perl Programming" };

As it stands, Lingua::EN::Words2Nums won't let us do this; it wants the numbers pre-extracted. So we have to do a bit of trickery. Looking at how Lingua::EN::Words2Nums works, we see that it builds up a regular expression from a set of words:

    our %nametosub = (
        naught =>   [ \&num, 0 ],   # Cardinal numbers, leaving out the
        nought =>   [ \&num, 0 ],   # ones that just add "th".
        zero =>     [ \&num, 0 ],
        one =>      [ \&num, 1 ],   first =>    [ \&num, 1 ],
    ...

    );

    # Note the ordering, so that eg, ninety has a chance to match before nine.
    my $numregexp = join("|", reverse sort keys %nametosub);
    $numregexp=qr/($numregexp)/;
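
The ordering note in that excerpt is worth a closer look. Perl tries the alternatives from left to right and settles for the first one that matches, so a shorter word listed ahead of a longer word that begins with it will win. The following minimal sketch uses a two-word list of our own (not the module's full table) to show the effect:

```perl
use strict;
use warnings;

# Illustrative two-word list; the real module knows many more names.
my @names = qw(nine ninety);

my $plain    = join "|", sort @names;          # nine|ninety
my $reversed = join "|", reverse sort @names;  # ninety|nine

# Capture whatever each alternation matches first in the same input.
my ($plain_hit)    = "ninety" =~ /($plain)/;
my ($reversed_hit) = "ninety" =~ /($reversed)/;

print "sorted:   $plain_hit\n";     # sorted:   nine
print "reversed: $reversed_hit\n";  # reversed: ninety
```

Reverse-sorting puts the longer of two such words first, which is what gives "ninety" its chance in the module's regular expression.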

This is a big help, but we can't, unfortunately, steal this regexp directly, for two reasons. First, it's in a private lexical variable, so we can't easily get at it. Second, Words2Nums also does some munging on the text separate to the regular expression, removing non-numbers like "and," hyphens, and so on. But we'll start by grabbing the expression and passing it through the wonderful Regex::PreSuf module to optimize it. This module generates a regular expression from a list of words that matches the same words as the original list. The result looks like this:

    (?-xism:((?:b(?:akers?dozen|illi(?:ard|on))|centillion|d(?:ecilli(?:ard|on)|
    ozen|u(?:o(?:decilli(?:ard|on)|vigintillion)|vigintillion))|e(?:ight(?:een|
    ieth|[yh])?|leven(?:ty(?:first|one))?|s)|f(?:i(?:ft(?:een|ieth|[yh])|rst|ve)|
    o(?:rt(?:ieth|y)|ur(?:t(?:ieth|[yh]))?))|g(?:oogol(?:plex)?|ross)|hundred|mi
    (?:l(?:ion|li(?:ard|on))|nus)|n(?:aught|egative|in(?:et(?:ieth|y)|t(?:een|
    [yh])|e)|o(?:nilli(?:ard|on)|ught|vem(?:dec|vigint)illion))|o(?:ct(?:illi
    (?:ard|on)|o(?:dec|vigint)illion)|ne)|qu(?:a(?:drilli(?:ard|on)|ttuor
    (?:decilli(?:ard|on)|vigintillion))|in(?:decilli(?:ard|on)|tilli(?:ard|on)|
    vigintillion))|s(?:core|e(?:cond|pt(?:en(?:dec|vigint)illion|illi(?:ard|on))|
    ven(?:t(?:ieth|y))?|x(?:decillion|tilli(?:ard|on)|vigintillion))|ix(?:t(?:ieth|
    y))?)|t(?:ee?n|h(?:ir(?:t(?:een|ieth|y)|d)|ousand|ree)|r(?:e(?:decilli(?:ard|
    on)|vigintillion)|i(?:gintillion|lli(?:ard|on)))|w(?:e(?:l(?:fth|ve)|nt(?:ieth|
    y))|o)|h)|un(?:decilli(?:ard|on)|vigintillion)|vigintillion|zero|s)))

It's a start. Now we have to extend this to allow for all the munging that words2nums does on the text. The important bits of the code are:

        s/\b(and|a|of)\b//g; # ignore some common words
        s/[^A-Za-z0-9.]//g;  # ignore spaces and punctuation, except period.
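
To see what those two substitutions do, here is a small sketch that applies them to a sample phrase of our own. It shows only these two steps; whatever else words2nums does to its input is left out.

```perl
use strict;
use warnings;

my $text = "one hundred and twenty-five";   # sample input, not from the module

for ($text) {
    s/\b(and|a|of)\b//g;   # drop the filler words
    s/[^A-Za-z0-9.]//g;    # drop spaces, hyphens, and other punctuation
}

print "$text\n";   # onehundredtwentyfive
```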

This is fine if we can change the text we're matching, but we don't necessarily want to do that. Instead, we have to construct a regular expression around our big optimized list of numbers that allows for and silently ignores these words and spaces. We also need to remember that we want to find numbers that are not in the middle of a word ("zone" does not mean a "z" followed by the number 1) so we use Perl's regular expression boundary condition (\b) to surround the final regexp. Here's what it looks like:

    my $ok_words = qr/\b(and|a|of)\b/;
    my $ok_things = qr/[^A-Za-z0-9.]/;
    my $number = qr/\b(($numbers($ok_words|$ok_things)*)+)\b/i;
    # Where $numbers is the big mad expression above.
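
Here is a scaled-down, runnable sketch of the same construction, so you can try it without the big expression. The short @names list is a hypothetical stand-in for the full set of number words, the inner groups are made non-capturing so that only the whole run of number words is captured, and /i is set on the inner pieces as well as the outer one. Treat it as an illustration of the idea rather than the module's own code.

```perl
use strict;
use warnings;

# Hypothetical short word list standing in for the big optimized expression.
my @names   = qw(one two three four five seven forty score hundred);
my $alt     = join "|", reverse sort @names;
my $numbers = qr/(?:$alt)/i;

my $ok_words  = qr/\b(?:and|a|of)\b/i;
my $ok_things = qr/[^A-Za-z0-9.]/;
# One or more number words, each optionally followed by filler words
# or punctuation, with word boundaries around the whole run.
my $number    = qr/\b((?:$numbers(?:$ok_words|$ok_things)*)+)\b/i;

my $text  = "I would like to buy forty-five copies of Advanced Perl Programming";
my @found = $text =~ /$number/g;
s/[^A-Za-z0-9]+$// for @found;   # the match can swallow a trailing separator

print "found: $_\n" for @found;   # found: forty-five
print "zone: ", ("zone" =~ $number ? "match" : "no match"), "\n";   # zone: no match
```

Note that the captured text may end with a space or other separator, because the filler part of the pattern is greedy; the sketch trims it afterward.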

Bundling this into a package with a couple of utility functions gives you the Lingua::EN::FindNumber CPAN module:

    use Lingua::EN::FindNumber;
    print numify("Fourscore and seven years ago, our four fathers...");

which prints out:

    87 years ago, our 4 fathers...

To go the other way, and turn numbers into words, there is a whole family of modules named Lingua::XX::Numbers where XX is the ISO language code of the language you want your numbers in: Lingua::EN::Numbers for English, for instance. There's also Lingua::EN::Numbers::Ordinate to turn "2" into "2nd". Similar modules exist for other languages.
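
To give a feel for what turning "2" into "2nd" involves, here is a minimal hand-rolled sketch in plain Perl. It is not how Lingua::EN::Numbers::Ordinate is implemented, ordinate_sketch is a name made up for this illustration, and it assumes non-negative integers; in real code the CPAN module is the better choice.

```perl
use strict;
use warnings;

# Hypothetical helper: append the English ordinal suffix to an integer.
sub ordinate_sketch {
    my ($n) = @_;
    my $last_two = $n % 100;

    # 11, 12, and 13 take "th" despite ending in 1, 2, and 3.
    return "${n}th" if $last_two >= 11 && $last_two <= 13;

    my %suffix = (1 => "st", 2 => "nd", 3 => "rd");
    return $n . ($suffix{ $n % 10 } // "th");
}

print join(" ", map { ordinate_sketch($_) } 1, 2, 3, 4, 11, 23, 112), "\n";
# 1st 2nd 3rd 4th 11th 23rd 112th
```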
