8.6. The Match Variables

So far, when we've put parentheses into patterns, they've been used only for their ability to group parts of a pattern together. But parentheses also trigger the regular expression engine's memory. The memory holds the part of the string matched by the part of the pattern inside parentheses. If there is more than one pair of parentheses, there will be more than one memory. Each regular expression memory holds part of the original string, not part of the pattern.

Since these variables hold strings, they are scalar variables; in Perl, they have names like $1 and $2. There are as many of these variables as there are pairs of memory parentheses in the pattern. As you'd expect, $4 means the string matched by the fourth set of parentheses.[†]

[†] This is the same string that the backreference \4 would refer to during the pattern match. But these aren't two different names for the same thing; \4 refers back to the memory during the pattern while it is trying to match, and $4 refers to the memory of a completed pattern match. For more information on backreferences, see the perlre manpage.
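To make the difference between \1 and $1 concrete, here is a small sketch. The sample string and the variable name $doubled are made up for this illustration; the pattern uses \1 while the match is still in progress, and the code reads $1 only after the match has finished.

```perl
use strict;
use warnings;

my $text = "say hello hello to the neighbors";
my $doubled;    # hypothetical name: will hold the repeated word

# \1 is consulted while the engine is still matching: it has to find
# the same text that the first pair of parentheses just matched.
if ($text =~ /\b(\w+) \1\b/) {
    # $1 is read after the match has completed.
    $doubled = $1;
    print "doubled word: $doubled\n";
}
```

Inside the pattern you write \1; in the code that runs after the match you write $1.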

These match variables are a big part of the power of regular expressions because they let us pull out the parts of a string:

    $_ = "Hello there, neighbor";
    if (/\s(\w+),/) {             # memorize the word between space and comma
      print "the word was $1\n";  # the word was there
    }

Or you could use more than one memory at once:

    $_ = "Hello there, neighbor";
    if (/(\S+) (\S+), (\S+)/) {
      print "words were $1 $2 $3\n";
    }

That tells us that the words were Hello there neighbor. Notice that there's no comma in the output. Because the comma is outside the memory parentheses in the pattern, there is no comma in memory two. Using this technique, we can choose what we want in the memories, as well as what we want to leave out.

You could have an empty match variable[*] if that part of the pattern might be empty. That is, a match variable may contain the empty string:

[*] As opposed to an undefined one. If you have three or fewer sets of parentheses in the pattern, $4 will be undef.

    my $dino = "I fear that I'll be extinct after 1000 years.";
    if ($dino =~ /(\d*) years/) {
      print "That said '$1' years.\n";  # 1000
    }

    $dino = "I fear that I'll be extinct after a few million years.";
    if ($dino =~ /(\d*) years/) {
      print "That said '$1' years.\n";  # empty string
    }
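If you want to see the difference between an empty memory and an undefined one, a sketch along these lines should do it. The names $first_state and $fourth_state are invented for the example; it assumes no earlier successful match in the program has set $4.

```perl
use strict;
use warnings;

my $dino = "I fear that I'll be extinct after a few million years.";
my ($first_state, $fourth_state);    # hypothetical names for this sketch

if ($dino =~ /(\d*) years/) {
    # The first pair of parentheses took part in the match but matched nothing.
    $first_state = defined $1 ? "defined, length " . length($1) : "undef";

    # The pattern has no fourth pair of parentheses, so $4 was never set.
    $fourth_state = defined $4 ? "defined" : "undef";

    print "\$1 is $first_state; \$4 is $fourth_state\n";
}
```

Testing with defined tells the two cases apart; comparing against the empty string alone would not.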

8.6.1. The Persistence of Memory

These match variables generally stay around until the next successful pattern match.[†] That is, an unsuccessful match leaves the previous memories intact, but a successful one resets them all. This correctly implies that you shouldn't use these match variables unless the match succeeded; otherwise, you could be seeing a memory from some previous pattern. The following (bad) example is supposed to print a word matched from $wilma. But if the match fails, it's using whatever leftover string happens to be found in $1:

[†] The scoping rule is more complex (see the documentation if you need it), but as long as you don't expect the match variables to be untouched many lines after a pattern match, you shouldn't have problems.

    $wilma =~ /(\w+)/;  # BAD! Untested match result
    print "Wilma's word was $1... or was it?\n";

This is another reason a pattern match is almost always found in the conditional expression of an if or while:

    if ($wilma =~ /(\w+)/) {
      print "Wilma's word was $1.\n";
    } else {
      print "Wilma doesn't have a word.\n";
    }
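Here is a sketch of the leftover-memory problem in action. The strings and the @reports array are made up for the illustration; both matches sit in the same block, so the scoping subtleties don't come into play.

```perl
use strict;
use warnings;

my @reports;    # hypothetical: collects what $1 held after each match

"Fred Flintstone" =~ /(\w+)/;          # succeeds, so $1 is now "Fred"
push @reports, "after success: $1";

my $wilma = "!!!";                     # no word characters at all
$wilma =~ /(\w+)/;                     # fails, so $1 is left alone
push @reports, "after failure: $1";    # still "Fred", not Wilma's word

print "$_\n" for @reports;
```

The second line of output still talks about Fred, which is exactly the surprise that testing the match result avoids.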

Since these memories don't stay around forever, you shouldn't use a match variable like $1 more than a few lines after its pattern match. If your maintenance programmer adds a new regular expression between your regular expression and your use of $1, you'll be getting the value of $1 for the second match, rather than the first. For this reason, if you need a memory for more than a few lines, copy it into an ordinary variable. Doing this helps make the code more readable at the same time:

    if ($wilma =~ /(\w+)/) {
      my $wilma_word = $1;
      ...
    }
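As a sketch of why the copy matters, suppose someone later adds a second match inside the same block. The sample strings and the names $wilma_word and $later are assumptions for this example only.

```perl
use strict;
use warnings;

my $wilma = "pebbles and bamm-bamm";
my ($wilma_word, $later);    # hypothetical names for this sketch

if ($wilma =~ /(\w+)/) {
    $wilma_word = $1;               # copied right away

    "betty rubble" =~ /(\w+)$/;     # a later, unrelated successful match

    $later = $1;                    # now belongs to the second match
    print "copied: $wilma_word, \$1 is now: $later\n";
}
```

The copy in $wilma_word keeps its value, while $1 has moved on to the second match.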

Later, in Chapter 9, you'll see how to get the memory value directly into the variable at the same time as the pattern match happens, without having to use $1 explicitly.

8.6.2. The Automatic Match Variables

There are three more match variables that you get free,[*] whether the pattern has memory parentheses or not. That's the good news; the bad news is that these variables have weird names.

[*] Yeah, right. There's no such thing as a free match. These are "free" only in the sense that they don't require match parentheses. Don't worry; we'll mention their real cost a little later.

Larry probably would have been happy enough to call these by slightly less weird names, like perhaps $gazoo or $ozmodiar. But those are names you might want to use in your own code. To keep ordinary Perl programmers from having to memorize the names of all of Perl's special variables before choosing their first variable names in their first programs,[†] Larry has given strange names to many of Perl's built-in variables, names that break the rules. In this case, the names are punctuation marks: $&, $`, and $'. They're strange, ugly, and weird, but those are their names.[‡] The part of the string that matched the pattern is automatically stored in $&:

[†] You should still avoid a few classical variable names like $ARGV, but these few are in all-caps. All of Perl's built-in variables are documented in the perlvar manpage.

[‡] If you can't stand these names, check out the English module, which attempts to give all of Perl's strangest variables nearly normal names. But the use of this module has never really caught on; instead, Perl programmers have grown to love the punctuation-mark variable names, strange as they are.

    if ("Hello there, neighbor" =~ /\s(\w+),/) {
      print "That actually matched '$&'.\n";
    }

The part that matched was " there," (with a space, a word, and a comma). Memory one, in $1, has the five-letter word there, but $& has the entire matched section.

Whatever came before the matched section is in $`, and whatever was after it is in $'. Another way to say that is that $` holds whatever the regular expression engine had to skip over before it found the match, and $' has the remainder of the string that the pattern never got to. If you glue these three strings together in order, you'll always get back the original string:

    if ("Hello there, neighbor" =~ /\s(\w+),/) {
      print "That was ($`)($&)($').\n";
    }

The message shows the string as (Hello)( there,)( neighbor), showing the three automatic match variables in action. Any or all of these three automatic match variables may be empty like the numbered match variables. And they have the same scope as the numbered match variables. Generally, that means they'll stay around until the next successful pattern match.
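The claim that the three pieces always glue back into the original string is easy to check. This sketch copies the three variables into ordinary variables ($before, $matched, and $after are names chosen for the example) and compares the rebuilt string with the original.

```perl
use strict;
use warnings;

my $original = "Hello there, neighbor";
my $rebuilt;    # hypothetical name: the three pieces joined together

if ($original =~ /\s(\w+),/) {
    # Copy all three right away, before another match can replace them.
    my ($before, $matched, $after) = ($`, $&, $');

    $rebuilt = $before . $matched . $after;
    print "($before)($matched)($after)\n";
    print "round trip ok\n" if $rebuilt eq $original;
}
```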

Now, we said earlier that these three are "free." Well, freedom has its price. In this case, the price is that once you use any one of these automatic match variables anywhere in your entire program, other regular expressions will run a little more slowly.[*] Now, this isn't a giant slowdown, but it's enough of a worry that many Perl programmers will never use these automatic match variables.[†] Instead, they'll use a workaround. For example, if the only one you need is $&, put parentheses around the whole pattern and use $1 instead. (You may need to renumber the pattern's memories.)

[*] For every block entry and exit, which is practically everywhere.

[†] Most of these folks haven't benchmarked their programs to see if their workarounds save time; it's as though these variables were poisonous or something. But we can't blame them for not benchmarking; many programs that could benefit from these three variables take up only a few minutes of CPU time in a week, so benchmarking and optimizing would be a waste of time. But in that case, why fear a possible extra millisecond? By the way, the Perl developers are working on this problem, but there will probably be no solution before Perl 6.
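The parentheses workaround might look like this sketch, which reuses the earlier pattern /\s(\w+),/. The names $whole and $word are made up for the example; note how the word's memory moves from $1 to $2 once the outer parentheses are added.

```perl
use strict;
use warnings;

my $text = "Hello there, neighbor";
my ($whole, $word);    # hypothetical names for this sketch

# The extra outer parentheses become memory one, so the word's
# parentheses are renumbered to memory two.
if ($text =~ /(\s(\w+),)/) {
    $whole = $1;    # what $& would have held
    $word  = $2;    # the word, formerly in $1
    print "whole match: '$whole', word: '$word'\n";
}
```

Memories are numbered by the position of their opening parenthesis, which is why the outer pair comes first.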

Match variables (the automatic ones and the numbered ones) are most often used in substitutions, which you'll see in the next chapter.
