7.2. Using Simple Patterns

To match a pattern (regular expression) against the contents of $_, put the pattern between a pair of forward slashes (/) as we do here:

    $_ = "yabba dabba doo";
    if (/abba/) {
      print "It matched!\n";
    }

The expression /abba/ looks for that four-letter string in $_; if it finds it, it returns a true value. In this case, it's found more than once, but that doesn't make any difference. If it's found at all, it's a match; if it's not in there at all, it fails.

Because the pattern match is generally being used to return a true or false value, it is almost always found in the conditional expression of if or while.

All of the usual backslash escapes that you can put into double-quoted strings are available in patterns, so you could use the pattern /coke\tsprite/ to match the eleven characters of coke, a tab, and sprite.
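As a minimal sketch of that point, the snippet below checks two made-up sample strings (both are assumptions for illustration), one with a real tab between the words and one with a space. Only the first should match:

```perl
use strict;
use warnings;

# Hypothetical sample strings: the first has a real tab, the second a space.
my @drinks = ("coke\tsprite", "coke sprite");

my @tabbed;
foreach (@drinks) {                  # each string is placed in $_ in turn
  # \t in the pattern means a tab, as it does in a double-quoted string.
  push @tabbed, $_ if /coke\tsprite/;
}
print scalar(@tabbed), " of ", scalar(@drinks), " strings matched\n";
```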

7.2.1. About Metacharacters

If patterns matched only literal strings, they wouldn't be very useful. That's why a number of special characters, called metacharacters, have special meanings in regular expressions.

For example, the dot (.) is a wildcard character: it matches any single character except a newline (which is represented by "\n"). So, the pattern /bet.y/ would match betty. It would also match betsy, bet=y, bet.y, or any other string that has bet, followed by any one character (except a newline), followed by y. It wouldn't match bety or betsey since those don't have one character between the t and the y. The dot always matches exactly one character.
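The strings named in this paragraph can be tried directly. This is only a sketch; the array names are chosen for illustration:

```perl
use strict;
use warnings;

# Sample strings taken from the paragraph above.
my @candidates = ("betty", "betsy", "bet=y", "bet.y", "bety", "betsey");

my @matched;
foreach (@candidates) {              # each string is placed in $_ in turn
  if (/bet.y/) {
    push @matched, $_;
    print "$_ matches\n";
  } else {
    print "$_ does not match\n";
  }
}
```

The first four strings match; bety and betsey do not, because the dot must stand for exactly one character.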

If you wanted to match a period in the string, you could use the dot. But that would match any possible character (except a newline), which might be more than you wanted. If you want the dot to match a period, you can backslash it. That rule goes for all of Perl's regular expression metacharacters: a backslash in front of any metacharacter makes it nonspecial. So, the pattern /3\.14159/ doesn't have a wildcard character.

The backslash is our second metacharacter. If you mean a real backslash, use a pair of them, a rule that applies everywhere else in Perl.
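A short sketch of both backslash rules follows. The second sample string and the 'C:\temp' value are assumptions invented for the example:

```perl
use strict;
use warnings;

my (@wild, @literal);
foreach ("3.14159", "3x14159") {
  push @wild,    $_ if /3.14159/;    # unescaped dot: any character fits
  push @literal, $_ if /3\.14159/;   # escaped dot: only a real period fits
}
print "wildcard dot matched: @wild\n";
print "escaped dot matched:  @literal\n";

# A doubled backslash in the pattern matches one real backslash.
$_ = 'C:\temp';                      # single quotes keep the backslash as-is
my $has_backslash = /C:\\temp/ ? 1 : 0;
print "found a literal backslash\n" if $has_backslash;
```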

7.2.2. Simple Quantifiers

It often happens that you'll need to repeat something in a pattern. The star (*) means to match the preceding item zero or more times. So, /fred\t*barney/ matches any number of tab characters between fred and barney. It matches "fred\tbarney" with one tab, "fred\t\tbarney" with two tabs, "fred\t\t\tbarney" with three tabs, or "fredbarney" with nothing in between at all. That's because the star means "zero or more," so you could have hundreds of tab characters in between but nothing other than tabs. Think of the star as saying, "That previous thing, any number of times, even zero times" (because * is the "times" operator in multiplication).
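To see the star at work, here is a small sketch using strings from the paragraph above, plus one space-separated string added as a counterexample:

```perl
use strict;
use warnings;

my @samples = ("fred\tbarney", "fred\t\t\tbarney", "fredbarney", "fred barney");

my @results;
foreach (@samples) {
  # \t* allows zero or more tabs, and nothing but tabs, between the names.
  push @results, (/fred\t*barney/ ? "match" : "no match");
}
print join(", ", @results), "\n";
```

The first three strings match. The last one fails because a space is not a tab.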

What if you wanted to allow something besides tab characters? The dot matches any character,[*] so .* will match any character, any number of times. That means that the /fred.*barney/ pattern matches "any old junk" between fred and barney. Any line that mentions fred and (somewhere later) barney will match that pattern. We often call .* the "any old junk" pattern because it can match any old junk in your strings.

[*] Except newline. But we're going to stop reminding you of that so often because you know it by now. Most of the time it doesn't matter because your strings usually won't contain newlines. Still, don't forget this detail, because someday a newline will sneak into your string and you'll need to remember that the dot doesn't match newline.
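The "any old junk" pattern and its newline limitation can be checked together. The sample lines below are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample lines; the last one hides a newline between the names.
my @lines = (
  "fred and barney went bowling",
  "fred met wilma, who later called barney",
  "barney arrived before fred",
  "fred\nbarney",
);

my @hits;
foreach (@lines) {
  # .* may swallow anything except a newline between fred and barney.
  push @hits, (/fred.*barney/ ? 1 : 0);
}
print "@hits\n";
```

Only the first two lines match. The third mentions barney before fred, and in the fourth the dot cannot cross the newline.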

The star is formally called a quantifier, meaning that it specifies a quantity of the preceding item. It's not the only quantifier; the plus (+) is another. The plus means to match the preceding item one or more times: /fred +barney/ matches if fred and barney are separated by spaces and only spaces. (The space is not a metacharacter.) This won't match fredbarney since the plus means there must be one or more spaces between the two names, so at least one space is required. Think of the plus as saying, "That last thing, plus (optionally) more of the same thing."
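A quick sketch of the plus quantifier, using made-up sample strings:

```perl
use strict;
use warnings;

my @samples = ("fred barney", "fred     barney", "fredbarney", "fred\tbarney");

my @spaced;
foreach (@samples) {
  # " +" requires at least one space, and only spaces, between the names.
  push @spaced, (/fred +barney/ ? 1 : 0);
}
print "@spaced\n";
```

The strings with one or several spaces match. The one with no separator and the one with a tab do not.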

There's a third quantifier like the star and plus, which is more limited. It's the question mark (?), which means that the preceding item is optional in that it may occur once or not at all. Like the other two quantifiers, the question mark means that the preceding item appears a certain number of times. In this case, the item may match one time (if it's there) or zero times (if it's not). There aren't any other possibilities. So, /bamm-?bamm/ matches either spelling: bamm-bamm or bammbamm. This is easy to remember since it's saying, "That last thing, maybe? Or maybe not?"

All three of these quantifiers must follow something since they tell how many times the previous item may repeat.
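The question mark can be tried the same way. The third string, with two hyphens, is an added counterexample:

```perl
use strict;
use warnings;

my @spellings = ("bamm-bamm", "bammbamm", "bamm--bamm");

my @accepted;
foreach (@spellings) {
  # -? allows one hyphen or none, but never two.
  push @accepted, $_ if /bamm-?bamm/;
}
print "accepted: @accepted\n";
```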

7.2.3. Grouping in Patterns

Parentheses are also metacharacters. As in mathematics, parentheses (( )) may be used for grouping. As an example, the pattern /fred+/ matches strings like freddddddddd, but strings like that don't show up often in real life. But the pattern /(fred)+/ matches strings like fredfredfred, which is more likely to be what you wanted. What about the pattern /(fred)*/? That matches strings like hello, world.[†]

[†] The star means to match zero or more repetitions of fred. When you're willing to settle for zero, it's hard to be disappointed. That pattern will match any string, even the empty string.
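Because none of these patterns is anchored, a plain true-or-false test cannot show how grouping changes what the quantifier repeats. The sketch below therefore relies on Perl's `$&` variable, which holds the text of the last successful match. This section has not introduced it, so treat it here as a diagnostic aid rather than part of the technique:

```perl
use strict;
use warnings;

my ($single, $grouped, $empty);

$_ = "fredfredfred";
$single  = $& if /fred+/;      # the plus applies to the d only
$grouped = $& if /(fred)+/;    # the plus applies to the whole group

$_ = "hello, world";
$empty = $& if /(fred)*/;      # zero repetitions still count as a match

print "fred+   consumed '$single'\n";
print "(fred)+ consumed '$grouped'\n";
print "(fred)* consumed '$empty'\n";
```

The ungrouped pattern consumes only the first fred, the grouped one consumes all three repetitions, and /(fred)*/ succeeds on hello, world by matching nothing at all.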

7.2.4. Alternatives

The vertical bar (|), often pronounced "or" in this usage, means that the left or right side may match. That is, if the part of the pattern on the left of the bar fails, the part on the right gets a chance to match. So, /fred|barney|betty/ will match any string that mentions fred, or barney, or betty.
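A small sketch of alternation follows. The full names are sample data made up for the example:

```perl
use strict;
use warnings;

my @names = ("fred flintstone", "barney rubble", "betty rubble", "wilma flintstone");

my @mentioned;
foreach (@names) {
  # Any one of the three alternatives is enough for the match to succeed.
  push @mentioned, $_ if /fred|barney|betty/;
}
print "$_\n" foreach @mentioned;
```

Every string except wilma flintstone matches.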

Now you can make patterns like /fred( |\t)+barney/, which matches if fred and barney are separated by spaces, tabs, or a mixture of the two. The plus means to repeat one or more times; each time it repeats, the ( |\t) has the chance to match a space or a tab.[*] There must be at least one of those characters between the two names.

[*] This particular match would normally be done more efficiently with a character class, as you'll see later in this chapter.

If you wanted the characters between fred and barney to all be the same, you could rewrite that pattern as /fred( +|\t+)barney/. In this case, the separators must be all spaces or all tabs.
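The difference between the two separator patterns shows up when a string mixes spaces and tabs. A sketch with invented sample strings:

```perl
use strict;
use warnings;

my @samples = ("fred barney", "fred \t barney", "fred\t\tbarney", "fredbarney");

my (@mixed_ok, @uniform_ok);
foreach (@samples) {
  push @mixed_ok,   (/fred( |\t)+barney/   ? 1 : 0);   # any mix of spaces and tabs
  push @uniform_ok, (/fred( +|\t+)barney/ ? 1 : 0);    # all spaces or all tabs
}
print "mixed:   @mixed_ok\n";
print "uniform: @uniform_ok\n";
```

Both patterns reject fredbarney, but only the first accepts the string whose separator mixes spaces with a tab.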

The pattern /fred (and|or) barney/ matches any string containing either of the two possible strings: fred and barney, or fred or barney.[†] You could match the same two strings with the pattern /fred and barney|fred or barney/, but that would be too much typing. It would probably also be less efficient, depending upon what optimizations are built into the regular expression engine.

[†] The words and and or are not operators in regular expressions! They are shown here in a fixed-width typeface because they're part of the strings.
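The short and long forms can be compared side by side. In this sketch the third phrase is an added counterexample:

```perl
use strict;
use warnings;

my @phrases = ("fred and barney", "fred or barney", "fred nor barney");

my (@short_form, @long_form);
foreach (@phrases) {
  push @short_form, $_ if /fred (and|or) barney/;
  push @long_form,  $_ if /fred and barney|fred or barney/;
}
print "short form: ", join(", ", @short_form), "\n";
print "long form:  ", join(", ", @long_form),  "\n";
```

Both patterns accept the same two phrases and reject the third.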
