Section 5.4.  Categorization and Extraction

Table of Contents

5.4. Categorization and Extraction

It is no exaggeration to say that the most widely used application of NLP, and also the one with the greatest research effort these days, is the fight against spamdeciding on the basis of the content whether a particular email is wanted or unwanted. Some anti-spam software, such as SpamAssassin, takes a relatively straightforward approach to the problem: an email gets points if it contains certain words or phrases, and this is combined with the use of relay blacklists and other non-textual evidence.

However, anti-spam authors are increasingly taking an altogether different, and generally more successful, approach. They take one corpus of mail that is known to be spam and one that is known to be not spam (ham), and use statistical means to identify attributes that make a particular message more spammy or more hammy. When a new message comes in, the same statistical analysis is performed to determine its likely spamminess.

This is an application of the field of natural-language processing called document categorization. We have two categoriesspam and hamand want to determine which category a new document is likely to fall into. Naturally, this can be used for many other purposes, although the fight against spam is arguably the most urgent: automatically determining the language of a document; helping to route enquiries to the most appropriate department of a company; even organizing recipe collections.

We'll look at a few applications of document categorization, and the related field of information extraction.

5.4.1. Bayesian Analysis

Here's one application of document categorization that I found particularly useful. As we saw in the last chapter, I'm a big fan of receiving news and stories from web pages via RSS aggregation. The problem with subscribing to a variety of RSS feeds is that the quality and interest level of the stories varies wildly. While I like to know about upcoming Perl conferences through http://use.perl.org, I don't really care to be told about Perl Mongers meetings on the other side of the world.

So there are a set of news stories I find interesting and a set I find boring. I should not be having to make the decision about whether something is interesting or boring when I have several thousand dollars of processor power in front of me. This is a document categorization problem, not too dissimilar to the spam problem, so we can use the same techniques to solve it.

One technique that's found favor in anti-spam circles recently is Bayesian analysis. This is a very old technique, an application of the Bayesian theory of probability. It is used to invert conditional probabilities.

Bayes and Predicting Diseases

A quick introduction to Bayes's theorem is in order. For example, the chance that someone who has a particular rare genetic disease will test positive to a clinical test is 99%. Let's say John takes the test, and he tests positive. What are the chances that he has the disease? It's not 99%. Nothing like it. The short answer is that, at the moment, we can't say. We need more information.

Now suppose we know that the test yields a 5% false-positive rate. That is, if John doesn't have the disease, there's a 5% chance he'll test positive anyway. We also need to know how common the disease is; let's say it's 0.5% prevalent in the general population. (We'll say P(D) = 0.005 for the probability of the disease.)

Now we can work out the probability that any given test will be positive (P(T)). This is the combination of two situations: that the test is positive for someone who has the disease (0.005 * 0.99) and that the test is positive for someone who doesn't (0.995 * 0.05). This adds up to 4.975% + 0.495% = 5.47%. Should John be worried about this?

To find out, we use Bayes's theorem. This allows us to turn on its head the probability that a person tests positive given that he has the disease (P(T|D) = 0.99) and gives us P(D|T), the probability that he has the disease given that he tests positive.

Bayes's formula is:

    P(D|T) = [P(T|D) x P(D)] / P(T)

And plugging in the numbers, we find that John is only 9.05% likely to have the disease. Bayes's theorem is important because it forces us to take into account everything we knoweven though there was a relatively high chance that the test was accurate, we need to factor in the relative rarity of the disease in the general population.

We want to find out the probability that a document is interesting given the particular set of words in it. We know the probability that any document is interesting, regardless of its words; just count up the number of documents that are interesting, count up all the documents, and divide. We can work out the probability that an interesting document contains a given set of words, and so all that's left is a lot of tedious mathematics.

Thankfully, the Algorithm::NaiveBayes module can help to hide that away. It includes an analyzer that takes a number of attributes and corresponding weights, and associates them with a set of labels. Typically, for a text document, the attributes are the words of the document, and the weight is the number of times each word occurs. The labels are the categories under which we want to file the document.

For instance, suppose we have the news article:

    The YAPC::Taipei website is now open and calls for registration. Welcome to Taipei 
and join the party with us!

We split that up into words, count the occurrence of the relevant words, and produce a hash like this:

    my $news = {
       now => 1, taipei => 2, join => 1, party => 1, us => 1,
       registration => 1, website => 1, welcome => 1, open => 1,
       calls => 1, yapc => 1

Then we add this article to our categorizer's set of known documents:

    my $categorizer = Algorithm::NaiveBayes->new;
    $categorizer->add_instance( attributes => $news,
                                label     => "interesting" );

When we've added a load of documents, the categorizer has a good idea of what kind of words make up stories that I would categorize as interesting. We then ask it to do some sums, and we can find out what it thinks about a new document:

    my $probs = $categorizer->predict(  attributes => $new_news );

This returns a hash mapping each category to the probability that the document fits into that category. So once we have our categorizer trained with a few hand-categorized documents, we can say:

    if ( $probs->{interesting} > 0.5 ) {
        # Probably interesting

Of course, now the main problem is getting the document into the hash of words and weights. We use our now-familiar Lingua::EN::Splitter and Lingua::EN::StopWords techniques to produce a subroutine like this:

    sub invert_string {
       my ($string, $weight, $hash) = @_;
       $hash->{$_} += $weight for
            grep { !$StopWords{$_} }

This inverts a string, adding its component words into a hash. In our RSS aggregator example, the stories will be hashes extracted from XML::RSS objects. We want to give more weight to words that appear in the title of a story than those that appear in the body, so we end up with this:

    sub invert_item {
        my $item = shift;
        my %hash;
        invert_string($item->{title}, 2, \%hash);
        invert_string($item->{description}, 1, \%hash);
        return \%hash;

With these subroutines in place, we can now train our analyzer on two carefully constructed RDF sources:


    use XML::RSS;
    use Algorithm::NaiveBayes;
    use Lingua::EN::Splitter qw(words);
    use Lingua::EN::StopWords qw(%StopWords);

    my $nb = Algorithm::NaiveBayes->new(  );

    for my $category (qw(interesting boring)) {
        my $rss = new XML::RSS;
        $nb->add_instance(attributes => invert_item($_),
                          label      => $category) for @{$rss->{'items'}};

    $nb->train; # Work out all the probabilities

And now we can ask it how we feel about the articles in a third source:

    my $target = new XML::RSS;
    for my $item (@{$target->{'items'}}) {
        print "$item->{title}: ";

        my $predict = $nb->predict(attributes => invert_item($item));
        print int($predict->{interesting}*100)."% interesting\n";

If we train the categorizer from two weblogs written by Bob (interesting) and George (boring) as the data sources and then compare it with the Slashdot feed from a random day, it predicts the following:

    Elektro, the Oldest U.S. Robot: 12% interesting
    French Court Orders Google to Stop Competing Ad Displays: 0% interesting
    Personal Spaceflight Leaders Form New Federation: 43% interesting
    Symantec Antivirus May Execute Virus Code: 0% interesting
    Open-Source Technique for GM Crops: 99% interesting
    North Korea Admits to Having Nuclear Weapons: 19% interesting

    Judge Slams SCO's Lack of Evidence: 24% interesting
    Sci-Fi Channel Renews Battlestar Galactica: 0% interesting
    Tecmo Sues Game Hackers Under DMCA: 15% interesting
    Identifying World's Species With Genetic Bar Codes: 32% interesting

This is a very simple case of categorization where we've only really used one dimensioninteresting versus boring. Now we'll look at a more complex example, using multiple dimensions and a more sophisticated algorithm.

5.4.2. Keyword Extraction and Summary

One extremely common task in language processing is the reduction of a text down to essential keywords or phrases. Perl people are busy people and don't want to read through masses of documents if they don't need to. Why not get a computer to do the bulk of the reading for them?

There are two ways to getting a sense of what a document is about. The first is to extract what appear to be key sentences and strip away anything that doesn't seem so important. This is particularly good for summarizing scientific or factual input. The second is to look for particularly important keywords, nouns, and noun phrases that recur frequently in a text. This doesn't give a readable overview of the sense of the text, but it is often enough to give a good overview of the key concepts involved.

Lingua::EN::Summarize is the module for the first approach, extracting key sentences. Digging into the source, we see it takes a relatively naive approach by looking for clauses that could be declarative:

    my $keywords = "(is|are|was|were|will|have)";
    my @clauses = grep { /\b$keywords\b/i }
                   map { split /(,|;|--)/ } split_sentences( $text );

It then tries to trim them to a required length. To be honest, it isn't great. If we run the summarize subroutine on the first part of this chapter:

    use Lingua::EN::Summarize;
    print summarize($chapter5, maxlength => 300, wrap => 70);

it produces the following summary:

    Is the application of AI techniques to answer questions about text
    written in a human language: what does it mean. What other documents
    is it like. As Perl is often described as a text-processing
    language. It shouldn't be much of a surprise to find that there are a
    great many modules and techniques available in Perl for carrying out
    NLP-related tasks.

To make a better text summarizer, we need to turn to the academic literature, and there's a lot of it. A paper by Kupiec, Pedersen, and Chen[*] uses a Bayesian categorizer, like the one we used to categorize RSS stories, to determine whether or not a given sentence would be in a summary. Implementing this using Algorithm::NaiveBayes should be a relatively straightforward exercise for the interested reader.

[*] Julian Kupiec, Jan O. Pedersen, and Francine Chen. 1995. "A Trainable Document Summarizer." In Research and Development in Information Retrieval, pages 68-73. Available at http://www.dcs.shef.ac.uk/~mlap/teaching/kupiec95trainable.pdf.

Another technique (by Hans Peter Luhn[Section 5.4.  Categorization and Extraction]) uses something called Zipf's Law of Word Distribution, which states that a small corpus of words occur very frequently, some words occur with moderate frequency, and many words occur infrequently. To summarize a text, you take the sentences that contain words that appear very frequently. Unfortunately, this requires a very large corpus of related material, in order to filter out relatively common phrases, but let's implement this in Perl.

[Section 5.4.  Categorization and Extraction] Hans Peter Luhn. 1958. "The Automatic Creation of Literature Abstracts." IBM Journal of Research and Development, 2(2):159-165. Reprinted in H.P. Luhn: Pioneer of Information Science, Selected Works. Edited by Claire K. Schultz. Spartan Books, New York. 1968.

I grabbed a bunch of technical papers on computational linguistics from http://www.arxiv.org and built up background and document-specific frequency histograms using the usual techniques:

    use Lingua::EN::Splitter qw(words);
    use Lingua::EN::Sentence qw(get_sentences);
    use File::Slurp;
    use Lingua::Stem::En;
    use Lingua::EN::StopWords qw(%StopWords);
    use List::Util qw(sum);
    use strict;
    my %base;
    my %per_file;

    my $amount = shift;
    for my $file (<*.txt>) {
        my $sentences = get_sentences ( scalar read_file($file) );
        for my $sentence (@$sentences) {
            my @words = grep { !$StopWords{$_} } @{words(lc $sentence) };
            for my $word (@{ Lingua::Stem::En::stem({ -words => \@words }) }) {

Now we convert these histograms into frequency tables, by dividing the count for each word by the total number of hits in the hash:

    my $sum = sum values %base; $base{$_} /= $sum for keys %base;
    my %totals;

    for my $file (keys %per_file) {
        $sum = sum values %{$per_file{$file}};
        $per_file{$file}{$_} /= $sum for keys %{$per_file{$file}};

Now that we have our frequencies of words relative to the corpus as a whole and relative to individual documents, we can start our second pass: look over a document and score each sentence in terms of the average relative unusualness of its constituent words. We begin as before:

    for my $file (<*.txt>) {
        print $file,":\n";
        my $sentences = get_sentences (scalar read_file($file) );
        my %markings;
        my $order = 0;
        for my $sentence (@$sentences) {
            my @words = grep { !$StopWords{$_} } @{words(lc $sentence) };

But this time we want to mark the sentence with its order in the document and a score for each word; this is the ratio between the expected frequency and the observed frequency:

            my @words = grep { !$StopWords{$_} } @{words(lc $sentence) };
            $markings{$sentence}->{order} = $order++;
            if (!@words) {
              $markings{$sentence}->{score} = 0;
            for my $word (@{ Lingua::Stem::En::stem({ -words => \@words }) }) {
                my $score = $per_file{$file}{$word} / $base{$word};
                $markings{$sentence}->{score} += $score;

Finally, we divide the sentence's score by the number of words to find the average and so that we're not unfairly favoring longer sentences.

            $markings{$sentence}->{score} /= @words;

Now we have a score for each sentence in the document. We can sort the sentences by score to find the 10 highest-scoring, then re-sort them by order so that they appear in the original order:

        my @sorted = sort
                     { $markings{$b}->{score} <=> $markings{$a}->{score} }
                     keys %markings;
        my @selected = sort
                     { $markings{$a}->{order} <=> $markings{$b}->{order} }
        print "@selected\n\n";

And that's all we need for a simple frequency-based summarizer. To take an example of the output, the summarizer reduced a 2,000-word paper on language generation down to the following abstract:

This paper introduces the neca mnlg; a Multimodal Natural Language Generator. The neca mnlg adopts the conventional pipeline architecture for generators (Reiter and Dale, 2000). Feature structures are also used in the fuf/surge generator. See http://www.cs.bgu.ac.il/surge/. Matching trees might in turn have incomplete daughter nodes. These are recursively expanded by matching them with the trees in the repository, until all daughters are complete. The module is based on the work of Krahmer and Theune (2002). The negation is passed on to the VP subtree via the feature. The attributes x and y allow us to capture unbounded dependencies via feature perlocation. The value of the attribute is passed on from the mother node to the daughter nodes. The neca mnlg has been implemented in prolog. Keyword extraction

The second approach to document summary, extracting keywords, is a little easier, and it is instructive to have a look at the Lingua::EN::Keywords module and how its techniques evolved.

The first versions of Lingua::EN::Keywords were very simple. They split the text into words manually, then counted up the frequency of each word, and returned the top five most frequent words. This was, of course, pretty bad.

The second set of versions used a hardwired stopword list, which was a little better, but still not good enough. The third series used the Lingua::EN::StopWords techniques we've seen several times in this chapter. The fourth iteration used the same technique as "Keyword Extraction and Summary" earlier in this chapter to detect and score proper names that wouldn't appear in the dictionary, rather than ordinary words that would. This was because proper nouns are more easily identifiable as the topic of a document.

And there it sat for a while as I looked for ways to extract not just words but key phrases. There are various academic papers about how to do this, but, to be honest, I couldn't get any of them to work accurately. Finally, something caught my eye: Maciej Ceglowski and his team at Middlebury College released Lingua::EN::Tagger, a rather nice English part-of-speech tagger, which contained a noun-phrase extractor. This produces histograms of all the nonstop words in the text, but also tries to group together nouns into phrases:

    use Lingua::EN::Tagger;
    my $tag = Lingua::EN::Tagger->new(longest_noun_phrase => 5,
                                      weight_noun_phrases => 0);

    my %wordlist = $tag->get_words("This is a test of the emergency warning
    system. This is only a test. If this had been an actual emergency, you
    would have been instructed to tune to either Cable Channel 3 or
    local emergency radio stations.");

This produces the following histogram in our word list %wordlist:

    'emergency warning system' => 1,
    'emerg' => 3,
    'cable channel' => 1,
    'warn' => 1,
    'radio stations' => 1,
    'actual emergency' => 1,
    'local emergency radio stations' => 1,
    'radio' => 1,
    'emergency radio stations' => 1,
    'emergency radio' => 1,
    'channel' => 1,
    'warning system' => 1,
    'emergency warning' => 1,
    'station' => 1,
    'system' => 1,
    'test' => 2

As you can see, individual words (emergency) are stemmed (emerg), but noun phrases are kept together. If we collate these histograms over a large document, nouns that appear together as a phrase will rise to the top. For instance, if the document always talks about "local football leagues," then that phrase will be ranked highly. However, if there's a relatively even split between "local football leagues" and "local football matches," then "local football" will be the most frequent phrase.

With this module available, Lingua::EN::Keywords just becomes a matter of collating the histograms and spitting back the most common phrases, with one slight exception. The problem is that when we constructed the histogram, we stemmed the words. To present these words to the user, we need to unstem them, but that's not very easy. The way we do this in Lingua::EN::Keywords is to cache every stemming in a hash, then reverse the hash to go from a stemmed word to the original. By subclassing Lingua::EN::Tagger, we can override the stem method, and provide another method to get access to the cache:

    package My::Tagger;
    use base 'Lingua::EN::Tagger';
    my %known_stems;
    sub stem {
        my ( $self, $word ) = @_;
        return $word unless $self->{'stem'};
        return $known_stems{ $word } if exists $known_stems{$word};
        my $stemref = Lingua::Stem::En::stem( -words => [ $word ] );

        $known_stems{ $word } = $stemref->[0] if exists $stemref->[0];

    sub stems { reverse %known_stems; }

Now we can write a trivial unstem routine that looks up the original word from its stem in the reversed hash, shown in the next block of code.

    use My::Tagger;
    my $tag = My::Tagger->new(longest_noun_phrase => 5,
                              weight_noun_phrases => 0);

    sub unstem {
        my %cache = $tag->stems;
        my $stem = shift;
        return $cache{$stem} || $stem;

This isn't a perfect solution. If two different words have the same stem, the distinction is lost in the reverse operation. A few modifications to the stems subroutine can keep a record of the exact variations found in the original text if that information is important. Finally, the keyword subroutine is as simple as:

    sub keywords {
        my %wordlist = $tag->get_words(shift);
        my %newwordlist;
        $newwordlist{unstem($_)} += $wordlist{$_} for keys %wordlist;
        my @keywords = sort { $newwordlist{$b} <=> $newwordlist{$a} } keys %newwordlist;
        return $keywords[0..4];

This generates a Lingua::EN::Tagger histogram from the single string argument. It then creates a new hash combining the counts from the original histogram with the unstemmed keywords. It sorts the keywords by frequency and returns the five most common keywords. If we pass the keywords subroutine the same text as the first example in this section, the first two results are emergency and test. The remaining three keywords of the top five aren't in a predictable order because they're all equally frequent.

5.4.3. Extracting Names and Places

An extension of the principle of extracting keywords from text involves what's known in the trade as named-entity extraction. A named entity is any proper noun, such as a place, a person's name, an organization's name, and so on. Extracting these proper nouns and categorizing them is a good first step to retrieving pertinent information from a particular source text.

There are any number of well-documented algorithms for named-entity extraction in the academic literature, but, being Perl hackers, we're interested in a quick, easy but efficient ready-made solution.

There are two modules that perform named-entity extraction for English text: GATE::ANNIE::Simple and Lingua::EN::NamedEntity. GATE::ANNIE::Simple is an Inline::Java wrapper around the GATE[*] project's ANNIE[Section 5.4.  Categorization and Extraction] named-entity extraction engine. The GATE documentation talks of this use of ANNIE as a "textual sausage machine"--you feed text in at one end, and out the other end comes a categorized set of entities. This is the sort of solution we're looking for.

[*] General Architecture for Text Engineering (http://www.gate.ac.uk).

[Section 5.4.  Categorization and Extraction] A Nearly-New Information Extraction system. When it comes to acronymic naming schemes, academics can be just as bad as hackers.

    use GATE::ANNIE::Simple;
    $text = <<EOF;
    The United States is to ask the United Nations to approve the creation
    of a multinational force in Iraq in return for ceding some political
    authority, US officials say.
    %entities = annie_extract($text);

This results in the following entries in the %entities hash:

    'Iraq' => 'Location',
    'United Nations' => 'Organization',
    'US' => 'Location',
    'United States' => 'Location'

ANNIE has very high recall and categorization accuracy, meaning that it can find most of the relevant entities in a document (and doesn't find too many things that aren't actually named entities) and can decide whether the entity is a person, a place, an organization, and so on. It can also extract dates, numbers, and amounts of money. The downside of ANNIE is its complexity, requiring a bridge to Java, and its speedbefore the first set of extraction, ANNIE must first load all its rulesets into memory, something that can take a matter of minutes even on a fast machine.

When looking at alternatives to ANNIE for the purposes of writing this chapter, I started writing about a relatively naive approach:

  • First, extract all phrases of capitalized words, storing the context two or three words before and two or three words after.

  • Second, trim out those that are purely sentence-initial dictionary words, using a stemmer (see above) to remove inflections.

  • Third, allocate a score to each entity in each of the following categories: person, place, organization. If the first word in the entity is found in a lexicon of forenames, score it up as a person, and so on.

Then I realized that by the time I'd finished describing this heuristic in English, I could have already written it in Perl, and Lingua::EN::NamedEntity was born. It was originally designed to be a benchmark against which more academically correct solutions could be measured. Although it doesn't have the same precision as ANNIE, Lingua::EN::NamedEntity actually turned out to be good enough for most named-entity work.

When you install the module from CPAN, it sets up a series of files in the .namedentity subdirectory of your home directory, precomputing the forename and lexicon databases to speed up their loading. Once it's all installed and set up, you can use the extract_entities subroutine to produce a list of named entities.

As with most information-extraction techniques, named-entity recognition works well on reasonably long documents, so I ran it on a Perl 6 Summary:

    use File::Slurp;
    use Lingua::EN::NamedEntity;

    my $file = read_file("summary.txt");
    for my $ent (extract_entities($file)) {
        print $ent->{entity}, ": ", $ent->{class}, "\n";

The results weren't perfect, as you'll see:

    Go Melvin: person
    Larry and Damian: organisation
    Hughes: place
    Perlcentricity: place
    Jonathan Worthington: person
    Melvin Smith: person
    Meanwhile Melvin Smith: person
    Piers Cawley: person
    Austin Hastings: person
    Perl Foundation and the Perl: organisation
    Simon Cozens: person
    Robin Redeker: person
    Simon: person
    Right Thing: person
    Leo T: person
    Leon Brocard: person
    Payrard: place
    Perl: place
    London: place

However, one thing to note is that Lingua::EN::NamedEntity ends up erring on the side of recklessness, much more so than ANNIEit will extract a lot of entities, and usually end up catching all the entities in a document, even if it does extract a few things that aren't really entities.

There are three main plans for future work on the module. The first is to tighten up the categorization somewhat and extend the number of categories. At the moment, for instance, it filed Perl as a place, because the summary talks so much about things being "in Perl."

The second is to remove some false positives by checking for supersets of supersets of entities: for instance, seeing "Melvin Smith" as an entity should make it less likely that "Meanwhile Melvin Smith" is really a person. Conversely, the third plan is to correlate subsetsnot only should it know that "Melvin Smith" and "Melvin" are the same person, but it should use the combined ranking to strengthen its certainty that both belong in the person category.

    Table of Contents
    © 2000- NIV