7.3. Character Classes

A character class, a list of possible characters inside square brackets ([ ]), matches any single character from within the class. It matches one character, but that character may be any of the ones listed.

For example, the character class [abcwxyz] may match any one of those seven characters. For convenience, you may specify a range of characters with a hyphen (-), so that class may also be written as [a-cw-z]. That didn't save much typing, but it's more common to make a character class like [a-zA-Z] to match any one letter out of that set of 52.[†] You may use the same character shortcuts as in any double-quoted string to define a character, so the class [\000-\177] matches any seven-bit ASCII character.[§]

[†] Notice that those 52 don't include letters such as Å, É, Î, Ø, and Ü. Perl does not widen that range for you, even when Unicode processing is available; a Unicode property such as \p{L} is one way to match letters beyond ASCII.

[§] At least, if you use ASCII and not EBCDIC.

Of course, a character class will be just part of a full pattern and will never stand on its own in Perl. For example, you might see code that says something like this:

    $_ = "The HAL-9000 requires authorization to continue.";
    if (/HAL-[0-9]+/) {
      print "The string mentions some model of HAL computer.\n";
    }
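
As a rough check of the range notation described above, the following sketch (with illustrative variable names that are not from the original text) collects the lowercase letters accepted by each spelling of the class and prints them side by side:

```perl
use strict;
use warnings;

# Illustrative check: test every lowercase ASCII letter against each class.
my @listed = grep { /[abcwxyz]/ } 'a' .. 'z';
my @ranged = grep { /[a-cw-z]/  } 'a' .. 'z';

print "listed: @listed\n";   # listed: a b c w x y z
print "ranged: @ranged\n";   # ranged: a b c w x y z
```

Both lists come out the same, which is all the hyphen shorthand promises.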

Sometimes, it's easier to specify the omitted characters rather than the ones within the character class. A caret ("^") at the start of the character class negates it. That is, [^def] will match any single character except one of those three. And [^n\-z] matches any character except for n, hyphen, or z. (The hyphen is backslashed because it's special inside a character class. But the first hyphen in /HAL-[0-9]+/ doesn't need a backslash because hyphens aren't special outside a character class.)
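
Here is a small sketch of negated classes, using made-up sample characters rather than anything from the text. Each test string is a single character, so a failed match against [^n\-z] means the character is one of the three excluded ones:

```perl
use strict;
use warnings;

# [^def] accepts any single character other than d, e, or f.
my @not_def = grep { /[^def]/ } 'a' .. 'h';
print "@not_def\n";          # a b c g h

# [^n\-z] rejects only n, a literal hyphen, and z (sample list is illustrative).
my @rejected = grep { !/[^n\-z]/ } ('m', 'n', 'o', '-', 'y', 'z');
print "@rejected\n";         # n - z
```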

7.3.1. Character Class Shortcuts

Some character classes appear so frequently that they have shortcuts. For example, the character class for any digit, [0-9], may be abbreviated as \d. Thus, the pattern from the example about HAL could be written /HAL-\d+/ instead.
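
To see that the two spellings behave alike on the HAL string, this sketch captures the digits with each pattern. The capturing parentheses and the variable names are additions for illustration and go slightly beyond what this section has introduced:

```perl
use strict;
use warnings;

my $text = "The HAL-9000 requires authorization to continue.";

# Both patterns capture the same run of digits after "HAL-".
my ($with_class)    = $text =~ /HAL-([0-9]+)/;
my ($with_shortcut) = $text =~ /HAL-(\d+)/;

print "$with_class $with_shortcut\n";   # 9000 9000
```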

The shortcut \w is a so-called "word" character: [A-Za-z0-9_]. If your "words" are made up of ordinary letters, digits, and underscores, you'll be happy with this. The rest of us have words made up of ordinary letters, hyphens, and apostrophes,[*] so we wish we could change this definition of "word".[†] So use this one only when you want ordinary letters, digits, and underscores.

[*] At least, in usual English you do. In other languages, you may have different components of words. Locales recognize these differences to a limited but useful extent. See the perllocale manpage.

[†] When looking at ASCII-encoded English text, you have the problem that the single quote and the apostrophe are the same character, so it's not possible in isolation to tell whether cats' is a word with an apostrophe or a word at the end of a quotation. This is probably one reason that computers haven't been able to take over the world yet.

Of course, \w doesn't match a "word" but matches a single "word" character. To match an entire word, the plus quantifier is handy. A pattern such as /fred \w+ barney/ will match fred and a space, a "word," and then a space and barney. That is, it'll match if there's one word[‡] between fred and barney, set off by single spaces.

[‡] We're going to stop saying "word" in quotes so much; you know by now that these letter-digit-underscore words are the ones we mean.
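
A quick sketch of what /fred \w+ barney/ accepts and rejects. The sample lines are invented for illustration:

```perl
use strict;
use warnings;

# Illustrative inputs: only the first has exactly one \w+ word
# set off by single spaces between fred and barney.
my @lines = ("fred and barney", "fred barney", "fred mary-ann barney");

my @hits = grep { /fred \w+ barney/ } @lines;
print "$_\n" for @hits;      # fred and barney
```

The hyphen in mary-ann is not a word character, so \w+ stops at "mary" and the pattern then fails to find the space it needs.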

As you may have noticed in that previous example, it might be handy to be able to match spaces more flexibly. The \s shortcut is good for whitespace. It's the same as [\f\t\n\r ], which is a class containing the five whitespace characters: form-feed, tab, newline, carriage return, and the space character. These characters move the printing position around and don't use any ink. Like the other shortcuts you've seen, \s matches a single character from the class, so it's usual to use either \s* for any amount of whitespace (including none at all), or \s+ for one or more whitespace characters. (In fact, it's rare to see \s without one of those quantifiers.) Since all of those whitespace characters look about the same, you can treat them all in the same way with this shortcut.
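
The sketch below contrasts the single-space pattern from the earlier example with a version that uses \s+. The input strings are hypothetical, chosen to mix tabs, runs of spaces, and a newline:

```perl
use strict;
use warnings;

# Illustrative inputs with different kinds of whitespace between the words.
my @lines = (
    "fred and barney",
    "fred\tand   barney",
    "fred and\nbarney",
    "fredandbarney",
);

my @strict = grep { /fred \w+ barney/ }       @lines;  # literal single spaces
my @loose  = grep { /fred\s+\w+\s+barney/ }   @lines;  # any whitespace run

print scalar(@strict), " strict, ", scalar(@loose), " loose\n";  # 1 strict, 3 loose
```

The last string has no whitespace at all, so even \s+ (one or more) rejects it; \s* would have let it through.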

7.3.2. Negating the Shortcuts

Sometimes you may want the opposite of one of these three shortcuts. That is, you may want [^\d], [^\w], or [^\s], meaning a nondigit character, a nonword character, or a nonwhitespace character. That's easy enough to accomplish by using their uppercase counterparts: \D, \W, or \S. These match any character that their counterpart would not match.
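
As a small illustration, this sketch runs a handful of sample characters (chosen arbitrarily, not taken from the text) through each uppercase shortcut. The results are joined with a vertical bar so that the space character stays visible:

```perl
use strict;
use warnings;

# Illustrative single-character samples: digit, letter, underscore, space, hyphen.
my @chars = ('7', 'x', '_', ' ', '-');

my $nondigit = join '|', grep { /\D/ } @chars;
my $nonword  = join '|', grep { /\W/ } @chars;
my $nonspace = join '|', grep { /\S/ } @chars;

print "$nondigit\n";   # x|_| |-
print "$nonword\n";    #  |-
print "$nonspace\n";   # 7|x|_|-
```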

Any of these shortcuts will work in place of a character class (standing on their own in a pattern) or inside the square brackets of a larger character class. That means that you could use /[\dA-Fa-f]+/ to match hexadecimal (base 16) numbers, which use letters ABCDEF (or the same letters in lowercase) as additional digits.
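
Here is one way the hexadecimal class might be used to test whole strings. The \A and \z anchors, which pin the match to the start and end of the string, are an addition not covered in this section, and the candidate strings are invented:

```perl
use strict;
use warnings;

# Illustrative candidates; \A and \z force the class to cover the entire string.
my @candidates = ("1F4a", "cafe", "2024", "xyz", "12G");

my @hex = grep { /\A[\dA-Fa-f]+\z/ } @candidates;
print "@hex\n";   # 1F4a cafe 2024
```

Without the anchors, a string such as 12G would still match, because the pattern would be satisfied by the "12" at the front.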

Another compound character class is [\d\D], which means any digit or any non-digit, which means any character at all. This is a common way to match any character, even a newline, as opposed to the dot (.), which matches any character except a newline. The totally useless [^\d\D] matches anything that's not either a digit or a non-digit. Right: nothing!
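
The difference shows up as soon as a newline sits where the pattern expects "any character". A minimal sketch, with an illustrative test string and variable names:

```perl
use strict;
use warnings;

my $text = "a\nb";   # illustrative string: a, newline, b

my $dot   = $text =~ /a.b/       ? "match" : "no match";
my $class = $text =~ /a[\d\D]b/  ? "match" : "no match";
my $never = $text =~ /[^\d\D]/   ? "match" : "no match";

print "dot:   $dot\n";     # dot:   no match
print "class: $class\n";   # class: match
print "never: $never\n";   # never: no match
```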
