The Metacharacters

In all the examples that follow, the portions of text matched by patterns are represented with underlines. Remember that the entire target string is said to match even if just a portion of it matches the regular expression. The underline marks are to help demonstrate exactly what part of the target they match.

It's important to read through the following sections; however, don't worry if the information doesn't immediately make sense; it will shortly. The application of these metacharacters will be demonstrated in a bit.

A Simple Metacharacter

The first of the metacharacters is the dot (.). Inside a regular expression, the dot matches any single character except a newline character. For example, in the pattern /p.t/, the . matches any single character. This pattern would match pot, pat, pit, carpet, python, and pup tent. The . requires that one character be there between the p and the t, but no more. Thus, the pattern would not match apt (no character at all between p and t) or expect (too many characters between p and t).
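
To try this out, the following short sketch runs /p.t/ against the words mentioned above. The word list and the variable names are chosen only for illustration.

```perl
use strict;
use warnings;

# Words from the paragraph above; the last two should not match.
my @words = ('pot', 'pat', 'pit', 'carpet', 'python', 'pup tent', 'apt', 'expect');

# Keep the words in which /p.t/ finds a match somewhere.
my @matched = grep { /p.t/ } @words;

for my $word (@words) {
    my $result = $word =~ /p.t/ ? 'matches' : 'does not match';
    print "$word $result\n";
}
```

Note that pup tent matches because the space between the p and the t counts as a character like any other.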

The Unprintables

Earlier you read that, to include a metacharacter inside a regular expression, you have to precede the character with a backslash, as shown here, to make it lose its meta-ness:


/\^\$/;    # A literal caret and dollar sign

When preceded by a backslash, some normal characters become metacharacters. As you saw in Hour 2, some characters take on special meaning in (double-quoted) string literals when they are preceded by a backslash; almost all those same characters represent the same values in regular expressions, as shown in Table 6.1.

Table 6.1. Special Characters

Character    Matches
\n           A newline character
\r           A carriage return
\t           A tab
\f           A formfeed
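
Here is a small sketch of the backslash working in both directions: taking away the special meaning of ^ and $, and giving t the special meaning of a tab. The sample strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Sample strings made up for this sketch.
my $note   = 'The symbols ^$ appear here';   # single quotes: nothing is interpolated
my $record = "name\tvalue";                  # double quotes: \t is a real tab

# Backslashes make ^ and $ match themselves.
my $has_symbols = $note =~ /\^\$/ ? 1 : 0;

# \t in a pattern matches a tab character, just as in a double-quoted string.
my $has_tab = $record =~ /\t/ ? 1 : 0;

# A space is not a tab, so this match fails.
my $space_is_tab = 'name value' =~ /\t/ ? 1 : 0;

print "symbols: $has_symbols, tab: $has_tab, space as tab: $space_is_tab\n";
```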

Quantifiers

Until now, all the characters in patterns, whether text characters or metacharacters, have had a one-to-one relationship with characters in the target string they were trying to match. For example, in /Simon/, S matches an S, i matches an i, m matches an m, and so on. A quantifier is a kind of metacharacter that tells the regular expression how many consecutive occurrences of something to match. A quantifier can be placed after any single character or a group of characters (you'll learn more details on that topic momentarily).

The simplest quantifier is the + metacharacter. The + causes the preceding character to match at least once, or as many times as it can and still have a matching expression. Thus, /do+g/ would:

Match These      But Not These    Why Not
hounddog         badge            The required o is missing.
hotdog           doofus           The g is missing.
doogie howser    Doogie           D is not the same as d.
doooooogdoog     pagoda           The d, o, and g do not appear in order.
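
If you want to check these for yourself, a sketch along the following lines prints the portion of each string that /do+g/ matched. It relies on Perl's special variable $&, which holds the text of the last successful match; the hash name is made up for this example.

```perl
use strict;
use warnings;

# Target strings from the table above.
my @targets = ('hounddog', 'hotdog', 'doogie howser', 'doooooogdoog',
               'badge', 'doofus', 'Doogie', 'pagoda');

my %matched_text;    # target string => the portion that /do+g/ matched

for my $target (@targets) {
    if ($target =~ /do+g/) {
        $matched_text{$target} = $&;    # $& is the text of the last successful match
        print "$target: matched '$&'\n";
    }
    else {
        print "$target: no match\n";
    }
}
```

In doooooogdoog, the + takes all six o's in the first run, so the matched portion is doooooog rather than the shorter doog at the end.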

The * metacharacter is similar to the + metacharacter, but it causes the preceding character to be matched zero or more times. In other words, the /t*/ pattern means to match as many t's as possible, but if none exist, that's okay. Thus, /car*t/ would:

Match These    But Not These    Why Not
carted         carrot           The o comes between the r's and the t, which breaks the pattern. In carted, the ed merely follows the completed match.
cat            carl             The t in the pattern isn't optional, but the r is.
carrrt         caart            The a in the pattern can't be repeated, but the r can.
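
The same check can be sketched with grep, which sorts the strings from the table into those that /car*t/ matches and those that it does not. The array names are only for illustration.

```perl
use strict;
use warnings;

my @targets = ('carted', 'cat', 'carrrt', 'carrot', 'carl', 'caart');

# grep keeps the elements for which the block is true.
my @hits   = grep { /car*t/ }  @targets;
my @misses = grep { !/car*t/ } @targets;

print "matched:     @hits\n";
print "not matched: @misses\n";
```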

One step down from the * metacharacter is ?. The ? metacharacter causes the preceding character to be matched either zero times or once (but no more). So the pattern /c?ola/ causes a c to be matched if it's available; otherwise, that's okay. Then it is followed by o, l, and a; essentially, this pattern matches any string with ola in it, and if ola is preceded by a c, that string is matched as well.

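The difference between the ? and * metacharacters shows up in how much of the target each pattern matches. Both /c?ola/ and /c*ola/ match the strings ola, cola, and ccola, because a pattern needs to match only a portion of the target. In ccola, however, /c?ola/ matches just cola: the ? allows at most one c, so the extra c is left out of the match. The pattern /c*ola/ matches all of ccola because the c can be repeated as many times as necessary, not just zero or one time.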
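
A quick way to see how ? and * behave is to print the portion of the target that each pattern matches. This is only a sketch; matched_part is a helper written for this example, not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: return the portion of $text matched by $pattern,
# or undef if the pattern does not match at all.
sub matched_part {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? $& : undef;
}

for my $text ('ola', 'cola', 'ccola') {
    my $optional = matched_part(qr/c?ola/, $text);
    my $repeated = matched_part(qr/c*ola/, $text);
    print "$text: /c?ola/ matched '$optional', /c*ola/ matched '$repeated'\n";
}
```

Both patterns succeed on all three strings; only the last line of output differs, where /c?ola/ reports cola and /c*ola/ reports ccola.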

If matching zero, one, or many occurrences of a pattern isn't specific enough for you, Perl allows you to match exactly as many occurrences as you need by using braces, {}. The quantifier with braces has the following format:


pat{n,m}

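Here, n is the minimum number of matches, m is the maximum number of matches, and pat is the character or group of characters you're trying to quantify. You can omit m: {n,} means at least n times, and {n} means exactly n times. To leave the minimum open, write {0,m}; the shorter form {,m} is accepted as a quantifier only by Perl 5.34 and later. Consider the following examples: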

/x{5,10}/    x occurs at least 5 times, but no more than 10.
/x{9,}/      x occurs at least 9 times, possibly more.
/x{0,4}/     x occurs up to 4 times, possibly not at all.
/x{8}/       x must occur exactly 8 times.
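
As a rough sketch, here is how those four patterns (plus one extra, /x{10,}/, added for contrast) behave against a string of nine x's. The variable names are made up for this example.

```perl
use strict;
use warnings;

my $nine = 'x' x 9;    # the repetition operator builds "xxxxxxxxx"

my $five_to_ten   = $nine =~ /x{5,10}/ ? 1 : 0;   # 9 falls between 5 and 10
my $nine_or_more  = $nine =~ /x{9,}/   ? 1 : 0;   # 9 is at least 9
my $up_to_four    = $nine =~ /x{0,4}/  ? 1 : 0;   # succeeds by matching the first 4
my $eight_in_nine = $nine =~ /x{8}/    ? 1 : 0;   # succeeds: 8 of the 9 x's match

# Ten x's are required here, so nine are not enough.
my $ten_or_more   = $nine =~ /x{10,}/  ? 1 : 0;

print "x{5,10}: $five_to_ten, x{9,}: $nine_or_more, x{0,4}: $up_to_four, ",
      "x{8}: $eight_in_nine, x{10,}: $ten_or_more\n";
```

Notice that /x{8}/ succeeds even though the string holds nine x's. The quantifier limits how many x's take part in the match, not how many the target may contain.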

A common idiom in regular expressions is .*. You can use it to match anything—usually anything between two other things that you're interested in. For example, /first.*last/ attempts to match the word first, followed by anything, and then the word last. Observe how /first.*last/ matches the following strings:

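first then last
    (matched: first then last)
The good players get picked first, the bad last.
    (matched: first, the bad last)
The first shall be last, and the last shall be first.
    (matched: first shall be last, and the last)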

Look at the match in the third line carefully. The match starts on the word first, as expected. The pattern then reaches the first occurrence of the word last, but the match doesn't stop there. It continues until it finds the second (and final) occurrence of the word last. Here, the * follows the fourth rule listed in the section "Rules of the Game": It matches the largest possible string, while still completing the match. Often, matching the largest string is not what you want, so Perl offers another solution called minimal matching, which is documented further in the perlre manual page.
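
The following sketch compares the two behaviors on the third sentence. It uses the minimal form .*? described in perlre, and it wraps each pattern in parentheses simply to get the matched text back (grouping is covered later in this hour). The variable names are made up for this example.

```perl
use strict;
use warnings;

my $text = 'The first shall be last, and the last shall be first.';

# Greedy: .* takes as much as it can while still letting "last" match.
my ($greedy) = $text =~ /(first.*last)/;

# Minimal: .*? takes as little as it can.
my ($minimal) = $text =~ /(first.*?last)/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```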

Character Classes

Another common practice in regular expressions is to ask for a match of "any of these characters." If you're trying to match numbers, it would be nice to be able to write a pattern that matches "any digit 0–9"; if you're searching a list of names and want to match Van Beethoven and van Beethoven, a pattern that matches "either v or V" would be helpful.

Perl's regular expressions have such a tool; it's called a character class. To write a character class, you enclose the characters it contains in square brackets, []. Characters in a character class are treated as a single character during the match. Inside a character class, you can specify ranges of characters (where ranges make sense) by putting a dash between the upper and lower bounds. The following are some examples:

Character Class    Explanation
[abcde]            Match any of a, b, c, d, or e
[a-e]              Same as above; match any of a, b, c, d, or e
[Gg]               Match an uppercase G or lowercase g
[0-9]              Match a digit
[0-9]+             Match one or more digits in sequence
[A-Za-z]{5}        Match any group of five alphabetic characters
[*!@#$%&()]        Match any of these punctuation marks

The last example is interesting because several of the characters in that class, such as *, $, (, and ), are usually metacharacters. Inside a character class, most metacharacters lose their "meta-ness"; in other words, they behave like any other ordinary character. Thus, the * really represents a literal *.
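
Here is a brief sketch of a few character classes at work. The sample strings and variable names are invented for illustration, and the parentheses around [0-9]+ are there only to hand back the digits that matched.

```perl
use strict;
use warnings;

# [Vv] accepts either capitalization of the first letter.
my @names = ('Ludwig Van Beethoven', 'Ludwig van Beethoven', 'Ludwig von Beethoven');
my @van   = grep { /[Vv]an Beethoven/ } @names;

# [0-9]+ matches a run of digits.
my ($year) = 'Born in 1770 in Bonn' =~ /([0-9]+)/;

# Inside a class, * is an ordinary character.
my $has_star = '6 * 7' =~ /[*]/ ? 1 : 0;
my $no_star  = '6 x 7' =~ /[*]/ ? 1 : 0;

print scalar(@van), " names matched [Vv]an\n";
print "digits found: $year\n";
print "star in '6 * 7': $has_star, star in '6 x 7': $no_star\n";
```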

If a caret (^) occurs as the first character of a character class, the character class is negated. That is, the character class matches any single character that is not in the class, as in this example:


/[^A-Z]/;        # Matches non-uppercase-alphabetic characters.

Because ], ^, and - are special in a character class, some rules apply about trying to match those characters literally in a character class. To match a literal ^ in a character class, you must make sure it does not occur first in the class. To match a literal ], you either need to put it first in the class or put a backslash in front of it (for example, /[abc\]]/). To put a literal hyphen (-) in a character class, you can simply put it first in the class or put a backslash in front of it.
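
These rules are easy to get wrong, so here is a sketch that tries each one and records the character that matched. The labels and target strings are made up for this example.

```perl
use strict;
use warnings;

# Each entry: a label, a target string, and a pattern to try against it.
my @cases = (
    [ 'negated class',       'ABCdEF', qr/[^A-Z]/  ],
    [ 'caret not first',     'a^b',    qr/[x^]/    ],
    [ 'backslashed bracket', 'a]b',    qr/[xyz\]]/ ],
    [ 'bracket first',       'a]b',    qr/[]xyz]/  ],
    [ 'hyphen first',        'a-b',    qr/[-xyz]/  ],
    [ 'backslashed hyphen',  'a-b',    qr/[x\-z]/  ],
);

my %found;    # label => the character that matched
for my $case (@cases) {
    my ($label, $target, $pattern) = @$case;
    $found{$label} = $target =~ $pattern ? $& : '(no match)';
    print "$label: $found{$label}\n";
}
```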

Perl contains shortcuts for certain commonly used character classes. They are represented by a backslash and a nonmetacharacter, as shown in Table 6.2.

Table 6.2. Special Character Classes

Pattern    Matches
\w         A word character; same as [a-zA-Z0-9_]
\W         A nonword character (the inverse of \w)
\d         A digit; same as [0-9]
\D         A nondigit
\s         A whitespace character; same as [ \t\f\r\n]
\S         A nonwhitespace character

The following are some examples:


/\d{5}/;       # Matches 5 digits

/\s\w+\s/;     # Matches a group of word-characters surrounded by whitespace

Be careful, though. The last example here doesn't necessarily match a word; it can also match an underscore surrounded by spaces. Also, not all words are matched by the last pattern; they need to have whitespace around them, and words such as "don't" wouldn't be matched because of the apostrophe. You'll learn better patterns for word matching later in this hour.
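
The cautions above can be seen in a short sketch such as this one. The sample strings are made up, and the parentheses around \d{5} are used only to return the digits that matched.

```perl
use strict;
use warnings;

# \d{5} finds five digits in a row.
my ($zip) = 'Zip code 90210, USA' =~ /(\d{5})/;

# Record what /\s\w+\s/ matches in each sample string.
my %middle;
for my $text ('say hello there', 'a _ b', "I don't know", 'hello') {
    $middle{$text} = $text =~ /\s\w+\s/ ? $& : '(no match)';
    print "'$text' gives '$middle{$text}'\n";
}

print "five digits: $zip\n";
```

The underscore is matched because it counts as a word character, whereas don't fails because the apostrophe is neither a word character nor whitespace, and hello fails because it has no whitespace around it.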

Grouping and Alternation

Sometimes in a regular expression, you might want to know whether any of a set of patterns is found. For example, does this string contain dogs or cats? The regular-expression solution to this problem is called alternation. Alternation happens in a regular expression when possible matches are separated with a | character, as in this example:


if (/dogs|cats/) {
    print "$_ contains a pet\n";    # $_ is the string that was just tested
}

Alternation can be fun, but it also can be tedious when you want to match lots of similar things. For example, if you want to match the words frog, bog, log, flog, or clog, you could try the expression /frog|bog|log|flog|clog/ except that it's horribly repetitive. What you really want is to alternate on just the first part of the string like this:


/fr|b|l|fl|clog/;     # Doesn't QUITE work.

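The preceding example doesn't quite work because Perl has no way of knowing that the alternations are one thing you want to match and og is another. As written, the pattern matches any one of fr, b, l, fl, or clog, so it would match bat as readily as clog.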

To solve this problem, you can use Perl's regular expressions to group parts of the pattern with parentheses, (), as shown here:


/(fr|b|l|fl|cl)og/;

You can nest parentheses to have groups within groups. For example, you could write the preceding expression as /(fr|b|(f|c)?l)og/ as well.
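
To compare the three patterns side by side, you might try a sketch like the following. The word list, including the two words that should be rejected, is invented for illustration.

```perl
use strict;
use warnings;

my @words = ('frog', 'bog', 'log', 'flog', 'clog', 'bat', 'dog');

# Without parentheses, the alternatives are fr, b, l, fl, and clog.
my @ungrouped = grep { /fr|b|l|fl|clog/ } @words;

# With parentheses, one of the prefixes must be followed by og.
my @grouped = grep { /(fr|b|l|fl|cl)og/ } @words;

# The nested form accepts the same prefixes.
my @nested = grep { /(fr|b|(f|c)?l)og/ } @words;

print "ungrouped: @ungrouped\n";
print "grouped:   @grouped\n";
print "nested:    @nested\n";
```

The ungrouped pattern lets bat through because of the lone b alternative; the grouped and nested patterns reject it.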

In a list context, the match operator returns a list of the portions of the target string that were matched by the parenthesized parts of the pattern, one value for each pair of parentheses. If the pattern contains no parentheses, a successful match returns the list (1) instead. Check out this example:


$_="apple is red";

($fruit, $color)=/(.*)\sis\s(.*)/;

In this snippet, the pattern matches anything (as a group), and then whitespace, the word is, more whitespace, and then anything (also as a group). The two grouped expressions are returned to the list on the left side and assigned to $fruit and $color.
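
The sketch below repeats that assignment under use strict and adds two more cases for comparison: a pattern with no parentheses, and a pattern that fails. The names @plain and @failed are made up for this example.

```perl
use strict;
use warnings;

my $sentence = 'apple is red';

# Two pairs of parentheses, two values returned.
my ($fruit, $color) = $sentence =~ /(.*)\sis\s(.*)/;

# No parentheses: a successful match returns the list (1).
my @plain = $sentence =~ /apple/;

# A failed match returns an empty list.
my @failed = $sentence =~ /(.*)\swas\s(.*)/;

print "fruit=$fruit color=$color\n";
print "plain match returned: @plain\n";
print "failed match returned ", scalar(@failed), " values\n";
```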

Anchors

The last two metacharacters (I bet you thought they'd never end) are the anchors. You use anchors to tell the regular expression engine exactly where you want to look for the pattern—at the beginning of a string or at the end.

The first of these anchors is the caret (^). The caret at the beginning of a regular expression causes the expression to match only at the beginning of the target string, which is the beginning of the line when you are processing input one line at a time. For example, /^video/ matches the word video only if it occurs at the beginning of a line.

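Its counterpart is the dollar sign ($). The dollar sign at the end of a regular expression causes the pattern to match only at the end of the line. It also matches just before a newline that ends the string, so a line that still carries its trailing newline matches as well. For example, /earth$/ matches earth, but only at the end of a line.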

Patterns             What They Do
/^Help/              Matches only lines that begin with Help.
/^Frankly.*darn$/    Matches lines that begin with Frankly and end in darn. Everything in between is matched as well.
/^hysteria$/         Matches lines that contain only the word hysteria.
/^$/                 Matches the beginning of a line, followed immediately by the end of the line. That is, it matches only blank lines.
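
Here is a sketch that applies the four patterns from the table to a handful of lines. The sample lines and array names are made up for illustration.

```perl
use strict;
use warnings;

my @lines = (
    'Help wanted',
    'No Help here',
    'Frankly, my dear, I do not give a darn',
    'hysteria',
    'mass hysteria',
    '',
);

my @starts_with_help = grep { /^Help/ }           @lines;
my @frankly_darn     = grep { /^Frankly.*darn$/ } @lines;
my @only_hysteria    = grep { /^hysteria$/ }      @lines;
my @blank            = grep { /^$/ }              @lines;

# $ also matches just before a newline that ends the string.
my $with_newline = "planet earth\n" =~ /earth$/ ? 1 : 0;

print "begin with Help: @starts_with_help\n";
print "only hysteria:   @only_hysteria\n";
print "blank lines:     ", scalar(@blank), "\n";
print "earth before a trailing newline: $with_newline\n";
```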
