5.3. Modules for Parsing English

Parsing ordinary written text is perhaps the ultimate goal of any natural-language processing system, and, to be honest, we're still a long way from it at the moment.

Even so, there are a good number of modules on CPAN that can help us deal with understanding what's going on in a chunk of text.

5.3.1. Splitting Up Text

There are many scenarios in which a large document needs to be split up into some kind of chunks. This can vary from splitting out individual words, to splitting out sentences and paragraphs, and right up to splitting a document into logical subsections: working out which sets of paragraphs refer to a common topic and which others are unrelated.

We'll begin with splitting up sentences, since there are a variety of ways to do this. The naive approach is to assume that a period, question mark, or exclamation mark followed by whitespace or the end of text is the end of a sentence, and to use punctuation and capitals to help this determination. This is what Text::Sentence does, and it's not bad:

    use Text::Sentence qw( split_sentences );
    my $text = <<EOF;
    This is the first sentence. Is this the second sentence? This is the
    third sentence, with an additional clause!
    EOF
    print "#$_\n\n" for split_sentences($text);

This prints out:

    #This is the first sentence.

    #Is this the second sentence?

    #This is the third sentence, with an additional clause!

This punctuation-based assumption is generally good enough, but screws up messily on sentences containing abbreviations followed by capital letters, e.g., This one. It incorrectly identifies the boundary between the punctuation and the capital letter as a sentence boundary:

    #This punctuation-based assumption is generally good enough, but screws
    up messily on sentences containing abbreviations followed by capital
    letters, e.g.,

    #This one.
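
To see why the punctuation-based assumption trips over abbreviations, here is a minimal sketch of such a splitter using only core Perl. It is not the code inside Text::Sentence, and the name naive_split is our own; it simply treats any period, question mark, or exclamation mark followed by whitespace as a boundary.

```perl
use strict;
use warnings;

# Naive splitter (hypothetical helper, not Text::Sentence's implementation):
# a sentence ends at '.', '?' or '!' followed by whitespace.
sub naive_split {
    my ($text) = @_;
    $text =~ s/\s+/ /g;     # flatten line breaks into single spaces
    $text =~ s/^ | $//g;    # trim one leading and one trailing space
    return split /(?<=[.?!])\s+/, $text;
}

my @sentences = naive_split(
    "Sentences can contain abbreviations, e.g. This one. Is that a problem?"
);
print "#$_\n\n" for @sentences;
```

The period that closes "e.g." is indistinguishable from a full stop under this rule, so the first sentence is cut in two.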

Thankfully, the exceptions are sufficiently rare that even if you're doing some kind of statistical analysis on your sentences, with a big enough corpus the effect of the assumption failing is insignificant. For cases where it really does matter, though, Shlomo Yona's Lingua::EN::Sentence does a considerably better job:

    use Lingua::EN::Sentence qw( get_sentences add_acronyms );
    my $text = <<EOF;
    This punctuation-based assumption is generally good enough, but screws
    up messily on sentences containing abbreviations followed by capital
    letters, e.g., This one. Shlomo Yona's Lingua::EN::Sentence does a
    considerably better job:
    EOF
    my $sentences=get_sentences($text);
    foreach my $sentence (@$sentences) {
         print "#", $sentence, "\n\n";
    }

The result of this example is:

    #This punctuation-based assumption is generally good enough, but screws
    up messily on sentences containing abbreviations followed by capital
    letters, e.g., This one.

    #Shlomo Yona's Lingua::EN::Sentence does a considerably better job:
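
One plausible way to get this kind of improvement is to consult a list of known abbreviations before accepting a boundary. The sketch below illustrates that idea in core Perl; it makes no claim to reproduce what Lingua::EN::Sentence does internally, and both the %is_abbrev list and the split_with_abbrevs name are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical abbreviation list for this sketch only.
my %is_abbrev = map { $_ => 1 } qw( e.g. i.e. etc. mr. mrs. dr. vs. );

sub split_with_abbrevs {
    my ($text) = @_;
    my @tokens = split ' ', $text;
    my (@sentences, @current);
    for my $i (0 .. $#tokens) {
        my $token = $tokens[$i];
        push @current, $token;
        # Normalize for the lookup: lower-case, drop trailing , ; :
        (my $bare = lc $token) =~ s/[,;:]+$//;
        # A boundary needs closing punctuation and must not be an abbreviation.
        my $ends = $token =~ /[.?!]["')]*$/ && !$is_abbrev{$bare};
        if ($ends || $i == $#tokens) {
            push @sentences, join ' ', @current;
            @current = ();
        }
    }
    return @sentences;
}

my @sentences = split_with_abbrevs(
    "This fails on abbreviations, e.g. This one. Dr. Smith agrees! Does it work?"
);
print "#$_\n\n" for @sentences;
```

The trade-off is the mirror image of the naive approach: a sentence that genuinely ends in "etc." will now be glued to the one that follows it.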

For things that aren't sentences, my favorite segmentation module is Lingua::EN::Splitter; this can handle paragraph- and word-level segmentation, and its cousin Lingua::Segmenter::TextTiling takes a stab at clustering paragraphs into discrete sections of a document.

The paragraph and word segmentation are done using fairly simple regular expressions, but the paragraph clustering is done using a technique invented by Marti Hearst called TextTiling. This measures the correlation of particular words in order to detect sets of paragraphs with distinct vocabularies.
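
The intuition behind vocabulary-based clustering can be shown with a heavily simplified sketch: score each gap between neighbouring paragraphs by the cosine similarity of their word counts, and treat a low score as a likely section break. This is not Hearst's full algorithm, nor the code in Lingua::Segmenter::TextTiling; the helper names, the sample paragraphs, and the 0.1 threshold are all assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Word-frequency vector for one paragraph.
sub word_counts {
    my ($text) = @_;
    my %count;
    $count{lc $1}++ while $text =~ /([A-Za-z']+)/g;
    return \%count;
}

# Cosine similarity of two frequency vectors (0 means no shared words).
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    $dot += $x->{$_} * ($y->{$_} || 0) for keys %$x;
    $nx  += $_ ** 2 for values %$x;
    $ny  += $_ ** 2 for values %$y;
    return 0 unless $nx && $ny;
    return $dot / sqrt($nx * $ny);
}

my @paragraphs = (
    "The volcano erupted and lava covered the town.",
    "Lava from the volcano reached the harbour.",
    "Stemming prunes words back to their roots.",
    "A stemmer maps words to roots.",
);
my @vectors = map { word_counts($_) } @paragraphs;

# One score per gap between paragraph i and paragraph i+1 (0-based).
my @gap_scores = map { cosine($vectors[$_], $vectors[$_ + 1]) }
                 0 .. $#vectors - 1;

# Hypothetical threshold: gaps scoring below it count as section breaks.
my @breaks = grep { $gap_scores[$_] < 0.1 } 0 .. $#gap_scores;

printf "gap %d-%d: %.2f\n", $_, $_ + 1, $gap_scores[$_] for 0 .. $#gap_scores;
print "section break after paragraph $_\n" for @breaks;
```

Here the only gap with no shared vocabulary is the one between the volcano paragraphs and the stemming paragraphs, so that is where the single break falls.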

We'll use Lingua::EN::Splitter's words routine often in this chapter. It's an excellent building block for analyzing texts, as in this simple concordancer for generating histograms of word frequency:

    use Lingua::EN::Splitter qw(words);

    my $text = "Here is Edward Bear, coming downstairs now, bump, bump,
    bump, on the back of his head, behind Christopher Robin.";

    my %histogram;
    $histogram{lc $_}++ for @{ words($text) };
    use Data::Dumper; print Dumper(\%histogram);

This example correctly counts up three occurrences of bump, and one each of the other words:

    $VAR1 = {
              'robin' => 1,
              'here' => 1,
              'edward' => 1,
              'now' => 1,
              'bear' => 1,
              'coming' => 1,
              'head' => 1,
              'his' => 1,
              'downstairs' => 1,
              'of' => 1,
              'bump' => 3,
              'on' => 1,
              'the' => 1,
              'behind' => 1,
              'back' => 1,
              'is' => 1,
              'christopher' => 1
            };
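
Data::Dumper shows hash keys in no particular order, so for anything longer than a sentence you will probably want the histogram ranked. The following sketch does that with core Perl only; the regular expression is a rough stand-in for words(), not its actual implementation.

```perl
use strict;
use warnings;

my $text = "Here is Edward Bear, coming downstairs now, bump, bump, "
         . "bump, on the back of his head, behind Christopher Robin.";

# Stand-in for words(): pull out runs of letters and apostrophes.
my %histogram;
$histogram{lc $1}++ while $text =~ /([A-Za-z']+)/g;

# Most frequent first; ties broken alphabetically so the order is stable.
my @ranked = sort { $histogram{$b} <=> $histogram{$a} || $a cmp $b }
             keys %histogram;

# Show the top five words with a bar of asterisks for each count.
printf "%-12s %s\n", $_, '*' x $histogram{$_} for @ranked[0 .. 4];
```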

5.3.2. Stemming and Stopwording

Of course, merely building up a histogram of words isn't enough for most serious analyses; our job is complicated by two main factors. First, there's the fact that most languages have some system of inflection where the same root word can appear in multiple forms.

For instance, if you're trying to analyze a mass of scientific articles to find something about what happens when volcanos erupt, you want to find all those that speak about "volcanic eruption," "volcano erupting," "volcanos erupted," and so on. While these are quite obviously different words, we want them all to be treated the same for the purposes of searching.

The usual process for doing this is to stem the words, pruning them back to their roots: all of the "volcanos erupting" phrases should be pruned back to "volcano erupt" or similar. Porter's stemming algorithm, invented by Martin Porter at Cambridge University and first described in the paper "An algorithm for suffix stripping," is by far the most widely used algorithm for stemming English words.^[*]

^[*] That doesn't mean, of course, that it's particularly good. Porter himself says: "It is important to remember that the stemming algorithm cannot achieve perfection. On balance it will (or may) improve IR [information retrieval] performance, but in individual cases it may sometimes make what are, or what seem to be, errors."

However, as with the infamous Brill part-of-speech tagger, once something gets established as the de facto standard tool in NLP, it's very hard to shift it.
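
To make the idea of suffix stripping concrete, here is a deliberately crude sketch. It is emphatically not the Porter algorithm, which applies several ordered rule sets with conditions on the shape of the remaining stem; the four-suffix list and the crude_stem name are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny rule list, tried in this order.
my @suffixes = qw( ing ed es s );

sub crude_stem {
    my ($word) = @_;
    $word = lc $word;
    for my $suffix (@suffixes) {
        # Only strip when at least three characters of stem remain.
        return $1 if $word =~ /^(\w{3,})\Q$suffix\E$/;
    }
    return $word;    # no rule applied
}

print join(" ", map { crude_stem($_) } qw( erupting erupted erupts erupt )), "\n";
print join(" ", map { crude_stem($_) } qw( volcanos volcanic )), "\n";
```

All four forms of "erupt" collapse to the same stem, but "volcanos" and "volcanic" do not meet, which is exactly the kind of individual error that any rule-based stemmer will make somewhere.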

Benjamin Franz has implemented a generic framework for stemmers such as the Porter algorithm in Lingua::Stem; it contains stemming algorithms for many languages, but we'll look at Lingua::Stem::En for the moment.

For a module later in this chapter, I needed to know if a particular word was a dictionary word, as opposed to some kind of proper noun. Of course, thanks to inflections, there are plenty of "dictionary" words that aren't in the dictionary. I employed a Porter stemmer to catch these.

First we need to stem all the words in the dictionary, or else they aren't going to match the stemmed versions we're looking for:

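    sub stem {
        require Lingua::Stem::En;
        # stem() returns an array ref; take the single stemmed word
        my ($stemmed) = @{ Lingua::Stem::En::stem({ -words => [shift] }) };
        return $stemmed;
    }

    # DICT is a filehandle already opened on the dictionary word list
    while (<DICT>) {
        chomp;
        next if /[A-Z]/;            # skip capitalized entries
        $wordlist{stem($_)} = 1;
    }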

We actually make %wordlist a tied hash to a DBM file, so that we only need to stem the dictionary once, no matter how many times we look up words in it. Once that's done, we can now remove all the dictionary words from a list:

    my @proper = grep { !$wordlist{$_} }
        @{ Lingua::Stem::En::stem({ -words => \@words }) };
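
The tied-hash arrangement mentioned above might look something like the following sketch. It uses SDBM_File because that ships with Perl; the original code may well have used a different DBM flavor. The temporary file location and the toy_stem stand-in (used so that the example runs without any CPAN modules) are assumptions.

```perl
use strict;
use warnings;
use Fcntl;                      # exports O_RDWR and O_CREAT
use SDBM_File;                  # DBM implementation bundled with Perl
use File::Temp qw(tempdir);

# Hypothetical cache location; a real program would use a fixed path.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/stemmed_dict";

# Stand-in for the real stemmer, for illustration only.
sub toy_stem { my $w = lc shift; $w =~ s/(?:ing|ed|s)$//; return $w }

{
    # Done once: stem the dictionary and store the stems on disk.
    tie my %wordlist, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $file: $!";
    $wordlist{ toy_stem($_) } = 1 for qw( erupt volcano mountain );
    untie %wordlist;
}

# Later: reopen the same file and look words up without re-stemming
# the dictionary.
tie my %wordlist, 'SDBM_File', $file, O_RDWR, 0666
    or die "Cannot reopen $file: $!";
my @unknown = grep { !$wordlist{ toy_stem($_) } }
              qw( erupting Vesuvius volcanos );
print "Not in dictionary: @unknown\n";
untie %wordlist;
```

Because the stems live in a file rather than in memory, the cost of stemming the whole dictionary is paid on the first run only.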

Similarly, the Plucene Perl-based search engine has an analyzer that stems words so that searches for "erupting" and "erupt" give the same results.

The second problem that arises is that there are a large number of English words that don't carry very much semantic content. For instance, you probably wouldn't miss much from that previous sentence if it were transformed into "The second problem arises large number English words don't carry semantic content." Words like "are," "that," and "very" are called stopwords.

Stopwords don't add much to the underlying meaning of an utterance. In fact, if we're trying to wade through English text with Perl, we probably want to get rid of any such words and concentrate on the ones that are left.

The Lingua::EN::StopWords module contains a handy hash of stopwords so that you can quickly look up whether a word has weight:

    use Lingua::EN::StopWords qw(%StopWords);

    my @words = qw(the second problem that arises is that there are a
    large number of English words that don't carry very much semantic
    content);

    print join " ", grep { !$StopWords{$_} } @words;

This prints:

    second problem arises large number English words don't carry
    semantic content
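
If you need to add domain-specific stopwords of your own, the same lookup-hash idiom is easy to build by hand. The sketch below uses a short, made-up list rather than the module's real one, and lower-cases each word for the lookup only, so that a capitalized "The" at the start of a sentence is still caught.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated stopword list for illustration.
my %stop = map { $_ => 1 } qw( the that is there are a of very much );

my @words = qw( The second problem that arises is that there are a
                large number of English words );

# Lower-case only for the lookup, so the original spelling is preserved.
my @content = grep { !$stop{ lc $_ } } @words;
print "@content\n";
```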

By combining these two modules and Lingua::EN::Splitter, we can get some kind of a metric of the similarity of two sentences:

    use Lingua::EN::StopWords qw(%StopWords);
    use Lingua::Stem::En;
    use Lingua::EN::Splitter qw(words);
    use List::Util qw(sum);
    print compare(
        "The AD 79 volcanic eruption of Mount Vesuvius",
        "The volcano, Mount Vesuvius, erupted in 79AD"
        );

    sub sentence2hash {
        my $words   = words(lc(shift));
        my $stemmed = Lingua::Stem::En::stem({
                        -words => [ grep { !$StopWords{$_} } @$words ]
                      });
        return { map {$_ => 1} grep $_, @$stemmed };
    }

    sub compare {
        my ($h1, $h2) = map { sentence2hash($_) } @_;
        my %composite = %$h1;
        $composite{$_}++ for keys %$h2;
        return 100*(sum(values %composite)/keys %composite)/2;
    }

The compare subroutine tells us the percentage of compatibility between two sentences: in this example, 83%. The sentence2hash subroutine first splits a sentence into individual words, using Lingua::EN::Splitter. Then, after grepping out the stopwords, it stems them, makes sure there's something left after stemming (to get rid of non-words like "79"), and maps them into a hash.

The compare subroutine simply builds up a hash that contains all the stemmed words and the number of times they appear in the two sentences. If the sentences mesh perfectly, then each word will appear precisely twice in the composite hash, and so the average value of the hash will be 2. To find the compatibility of the sentences, we divide the average by 2, and multiply by 100 to get a percentage. Note that two sentences with no stems in common still score 50%, since every value in the composite hash is then 1, so the useful range of this metric is 50 to 100.
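
The arithmetic is easier to follow on ready-made stem lists. The sketch below applies the same composite-hash metric without the splitting, stopwording, and stemming steps. The two stem lists are written by hand to match the shape of the example (four shared stems, one unshared stem on each side) and are an assumption; the real output of Lingua::Stem::En may differ in detail. The overlap_score name is also ours.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Same metric as compare(), applied to two array refs of stems.
sub overlap_score {
    my ($stems1, $stems2) = @_;
    my %composite = map { $_ => 1 } @$stems1;
    my %second    = map { $_ => 1 } @$stems2;    # de-duplicate first
    $composite{$_}++ for keys %second;
    return 0 unless %composite;                  # avoid dividing by zero
    return 100 * (sum(values %composite) / keys %composite) / 2;
}

# Hand-written stems (assumed): four shared, one unshared on each side.
my @first  = qw( ad volcan erupt mount vesuviu );
my @second = qw( volcano mount vesuviu erupt ad );

# Sum of values is 4*2 + 2*1 = 10 over 6 keys; (10/6)/2 is about 0.83.
my $score     = overlap_score(\@first, \@second);
my $disjoint  = overlap_score([qw( a b )], [qw( c d )]);
my $identical = overlap_score([qw( a b )], [qw( a b )]);

printf "example: %.0f%%  disjoint: %.0f%%  identical: %.0f%%\n",
    $score, $disjoint, $identical;
```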

In this case, our two sentences only differed by the fact that the Porter stemmer didn't stem "volcano" to "volcan" as it did for "volcanic." It's not perfect, but it's good enough for NLP.
