Pattern Matching Odds and Ends

Now that you can match patterns against $_ and you know the basics of substitution, you're ready for more functionality. To be really effective with regular expressions, you need to match against variables other than $_, be able to do sophisticated substitutions, and work with Perl's functions that are geared toward—but not exclusive to—regular expressions.

Working with Other Variables

In Listing 6.2, the weight gathered from the user is stored in $_ and manipulated with substitution operators and matching operators. This listing does have a problem, however: $_ isn't exactly the best variable name to store "weight" in. It's not very intuitive for starters, and $_ might get altered when you least expect it.

Watch Out!

In general, storing anything in $_ for long is playing with fire; eventually, you will get burned. Many of Perl's operators use $_ as a default argument, and some of them modify $_ as well. $_ is Perl's general-purpose variable, and trying to keep a value in $_ for very long (especially after what you learn in Hour 8, "Functions") will cause bugs eventually.

Using a variable called $weight would have been better in Listing 6.2. To use the match operator and substitution operator against variables other than $_, you must bind them to the variable. You do so by using the binding operator, =~, as shown here:
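A minimal sketch of binding, assuming a sample value for $weight (in the listing it would come from the user):

```perl
use strict;
use warnings;

# Hypothetical input; the sample value is an assumption for illustration.
my $weight = "  185 lbs ";

# Bind the substitutions to $weight instead of $_ to trim whitespace.
$weight =~ s/^\s+//;
$weight =~ s/\s+$//;

# Bind a match to $weight: does it contain at least one digit?
if ($weight =~ /\d/) {
    print "Weight entered: $weight\n";   # prints "Weight entered: 185 lbs"
}
```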

The =~ operator doesn't make assignments; it merely takes the operator on the right and causes it to act on the variable to the left. The entire expression has the same value as it would if $_ were used, as you can see in this example:
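As a rough illustration, the sketch below captures the value of a bound match and a bound substitution; the variable names and sample string are assumptions made for this example:

```perl
use strict;
use warnings;

my $poem = "One fish, two fish, red fish";

# A bound match yields true or false, just as a match against $_ does.
my $found = ($poem =~ /fish/) ? "yes" : "no";

# A bound substitution returns the number of replacements it made.
my $count = ($poem =~ s/fish/frog/g);

print "found=$found count=$count\n";   # prints "found=yes count=3"
print "$poem\n";                       # prints "One frog, two frog, red frog"
```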

Modifiers and Multiple Matching

Until now, all the regular expressions you've seen have been case sensitive. That is, upper- and lowercase characters are distinct in a pattern match. To match words and not care about whether they're in upper- or lowercase would require something like this:
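One way such a pattern might look, using a hypothetical sample line; every letter needs its own character class listing both cases:

```perl
use strict;
use warnings;

my $line = "Is this a dagger? asks MACBETH.";   # hypothetical sample text

# Spell out both cases for each of the seven letters.
my $matched = ($line =~ /[Mm][Aa][Cc][Bb][Ee][Tt][Hh]/) ? 1 : 0;

print "Found the name\n" if $matched;
```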

This example doesn't just look silly; it's error-prone because it would be really easy to mistype an upper-/lowercase pair. The substitution operator (s///) and the match operator (m//) can match regular expressions regardless of case if followed with the letter i:
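A short sketch of the i modifier, assuming a handful of made-up sample names:

```perl
use strict;
use warnings;

my $hits = 0;

# Count how many of the sample names match, ignoring case.
foreach my $name ("Macbeth", "MACBETH", "macbeth", "MaCbEtH", "Hamlet") {
    $hits++ if $name =~ /macbeth/i;
}

print "$hits of 5 names matched\n";   # prints "4 of 5 names matched"
```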

The preceding example matches Macbeth in uppercase, lowercase, or mixed case (MaCbEtH).

Another modifier for matches and substitutions is the global-match modifier, g. The regular expression (or substitution) is applied not just once, but repeatedly through the entire string, with each match (or substitution) starting immediately after the previous one.

The g modifier (and other modifiers) can be combined by simply specifying all of them after the match or substitution operator. For example, gi matches all occurrences of the pattern in the string, whether uppercase or lowercase.
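The difference between no modifier, g, and gi can be sketched as follows; the sample sentence and variable names are assumptions chosen for illustration:

```perl
use strict;
use warnings;

my $text = "the cat and The Dog and THE bird and the fish";

# Copy the string, then substitute on the copy so $text stays intact.
(my $once = $text) =~ s/the/a/;     # first lowercase "the" only
(my $all  = $text) =~ s/the/a/g;    # every lowercase "the"
(my $any  = $text) =~ s/the/a/gi;   # every "the", in any case

print "$once\n";   # a cat and The Dog and THE bird and the fish
print "$all\n";    # a cat and The Dog and THE bird and a fish
print "$any\n";    # a cat and a Dog and a bird and a fish
```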

In a list context, the global-match modifier causes the match to return a list of all the portions of the regular expression that are in parentheses:
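A sketch of a global match in list context; the sample string and the exact pattern are assumptions reconstructed to fit the description that follows:

```perl
use strict;
use warnings;

my $rhyme = "One fish, two frog, red fred, blue foul";

# List context: each match contributes the text captured by its parentheses.
my @F = $rhyme =~ /\W(f\w\w\w)/ig;

print "@F\n";   # prints "fish frog fred foul"
```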

The pattern matches a nonword character, and then the letter f, followed by three word characters. The f and the three word characters form a group, marked by parentheses. After the expression is evaluated, the array variable @F will contain four elements: fish, frog, fred, and foul.

In a scalar context, the g modifier causes the match to iterate through the string, returning true for each match and false when no more matches are made. Now consider the following:
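A sketch of a global match driving a while loop; the sample phrase is an assumption, chosen because it contains 11 word characters:

```perl
use strict;
use warnings;

my $phrase  = "What's my line?";   # assumed sample phrase
my $letters = 0;

# Scalar context: each pass resumes where the previous match ended.
while ($phrase =~ /\w/g) {
    $letters++;
}

print "$letters\n";   # prints "11"
```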

The preceding snippet uses the match operator (//) with a g modifier in a scalar context (which is provided by the condition of while). The pattern matches a word character. The while loop continues (and $letters gets incremented) until the match returns false. When the snippet is all done, $letters will be 11.

By the Way

You'll find much more efficient ways of counting characters than this presented in Hour 9, "More Functions and Operators."

Backreferences

When you use parentheses in regular expressions, Perl remembers the portion of the target string matched by each parenthesized expression. These matched portions are saved in special variables named $1 (for the first set of parentheses), $2 (for the second), $3, $4, and so on, as follows:
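A sketch with three sets of parentheses, assuming a simple digits-and-hyphens phone format; the variable names are made up for the example:

```perl
use strict;
use warnings;

my $phone = "800-555-1212";
my ($area, $exchange, $line);

# Three sets of parentheses fill $1, $2, and $3 on a successful match.
if ($phone =~ /^(\d{3})-(\d{3})-(\d{4})$/) {
    ($area, $exchange, $line) = ($1, $2, $3);
    print "Area $area, exchange $exchange, line $line\n";
}
```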

The pattern shown matches well-formed U.S./Canadian telephone numbers (for example, 800-555-1212) and remembers each portion in $1, $2, and $3. The values are assigned for each set of parentheses found, from left to right. If there are nested or overlapping parentheses, the captures are numbered in the order in which their opening parentheses appear, from left to right. These variables can be used after the expression, as follows:
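The sketch below uses the capture variables after a successful match and also illustrates the numbering of nested parentheses; the nested pattern is an assumption added for illustration:

```perl
use strict;
use warnings;

my $phone  = "800-555-1212";
my $report = "no match";

# Group 1 opens first (whole number), group 2 is the area code,
# and group 3 is the local part.
if ($phone =~ /^((\d{3})-(\d{3}-\d{4}))$/) {
    $report = "Full: $1, area code: $2, local: $3";
}

print "$report\n";   # prints "Full: 800-555-1212, area code: 800, local: 555-1212"
```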

Or they can be used as part of the replacement text in a substitution, as follows:
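A sketch of captures reused in replacement text; the sample sentence and output format are assumptions:

```perl
use strict;
use warnings;

my $contact = "Call 800-555-1212 today";

# $1, $2, and $3 are interpolated into the replacement text.
$contact =~ s/(\d{3})-(\d{3})-(\d{4})/($1) $2-$3/;

print "$contact\n";   # prints "Call (800) 555-1212 today"
```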

Be careful, however; the variables $1, $2, and $3 are reset every time a pattern match is successfully performed (regardless of whether it uses parentheses), and the variables are set if and only if the pattern match succeeds completely. Based on this information, consider the following example:
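A sketch of the hazard, followed by a guarded alternative; both sample strings and all variable names are assumptions made for this example:

```perl
use strict;
use warnings;

my $first  = "Order 42";
my $second = "No digits here";

$first  =~ /(\d+)/;    # succeeds, so $1 is "42"
$second =~ /(\d+)/;    # fails, so $1 is left unchanged
my $risky = $1;        # still "42", which came from $first

# Guarded version: read $1 only when the match succeeded.
my $safe = ($second =~ /(\d+)/) ? $1 : undef;

print "risky=$risky\n";                              # prints "risky=42"
print "safe is ", defined $safe ? $safe : "undefined", "\n";
```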

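In this snippet, $1 was used without making sure the pattern match worked. If the match ever fails, $1 is not cleared; it still holds whatever the last successful match left in it, so the code silently works with stale data. Test the result of the match before using $1.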

A New Function: grep

A common operation in Perl is to search arrays for patterns—for example, if you've read a file into an array and need to know which lines contain a particular word. Perl has one function in particular that you can use in this situation; it's called grep. The syntax for grep is as follows:
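A sketch of the two calling forms, shown first as comments and then in use; the sample data is hypothetical:

```perl
use strict;
use warnings;

# General forms:
#   grep EXPRESSION, LIST
#   grep BLOCK LIST

my @lines = ("alpha beta", "gamma", "beta delta");   # hypothetical data

my @with_expr  = grep /beta/, @lines;      # expression form, note the comma
my @with_block = grep { /beta/ } @lines;   # block form, no comma

print scalar(@with_expr), " and ", scalar(@with_block), " matches\n";
```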

The grep function iterates through each element in list and then executes the expression or block. Within the expression or block, $_ is set to each element of the list being evaluated. If the expression returns true, the element is returned by grep. Consider this example:
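A sketch of grep with a pattern match, assuming a five-element @dogs array:

```perl
use strict;
use warnings;

my @dogs = qw(greyhound bloodhound terrier mutt chihuahua);

# Keep only the elements for which /hound/ matches.
my @hounds = grep /hound/, @dogs;

print "@hounds\n";   # prints "greyhound bloodhound"
```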

In the preceding example, each element of @dogs is assigned, in turn, to $_. The expression /hound/ is then tested against $_. Each of the elements that returns true—that is, each name that contains hound—goes into a list that is returned by grep and stored in @hounds.

You need to remember two points here. First is that $_ within the expression refers to the actual value in the list, not a copy of it. Modifying $_ changes the original element in the list:
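A sketch of this aliasing effect, assuming the same five-element @dogs array:

```perl
use strict;
use warnings;

my @dogs = qw(greyhound bloodhound terrier mutt chihuahua);

# The substitution edits $_, which is an alias for the element in @dogs.
my @hounds = grep s/hound/hounds/, @dogs;

print "@hounds\n";   # prints "greyhounds bloodhounds"
print "@dogs\n";     # prints "greyhounds bloodhounds terrier mutt chihuahua"
```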

After running this example, @hounds contains greyhounds and bloodhounds, with an s on the end. The original array @dogs is also modified—by way of changing $_—and it now contains greyhounds, bloodhounds, terrier, mutt, and chihuahua.

The other point to remember—which Perl programmers forget sometimes—is that grep isn't necessarily used with a pattern match or substitution operator; it can be used with any operator or function. The following example collects just the names of dogs longer than eight characters:
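A sketch of grep without a pattern, assuming the same @dogs array; the name @longdogs is made up for this example:

```perl
use strict;
use warnings;

my @dogs = qw(greyhound bloodhound terrier mutt chihuahua);

# Any expression works as the test; here it is a length comparison.
my @longdogs = grep { length($_) > 8 } @dogs;

print "@longdogs\n";   # prints "greyhound bloodhound chihuahua"
```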

By the Way

The grep function gets its name from a Unix command of the same name that is used for searching for patterns in files. The grep command is so widely used that, among Unix (and hence Perl) programmers, it has become a verb: "to grep." "To grep through a book" means to flip through the pages looking for a pattern.

A related function, map, has an identical syntax to grep, except that the return value from the expression (or block) is returned from map—not the value of $_. You use the map function to produce a second array based on the first. The following is an example:
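A sketch of map building a new array, assuming @input holds a few made-up lines of text:

```perl
use strict;
use warnings;

# Hypothetical input lines.
my @input = ("the quick brown", "fox jumps", "over the lazy dog");

# Each line yields a list of words; map concatenates those lists.
my @words = map { split(' ', $_) } @input;

print scalar(@words), " words: @words\n";
# prints "9 words: the quick brown fox jumps over the lazy dog"
```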

In this example, each element of the array @input (passed to the block as $_) is split apart on spaces, producing a list of words; this list is added to the list that the map function returns. After every consecutive line of @input has been split apart, the accumulated words are stored in @words.
