


12.7. Brace Delimiters

Use m{...} in preference to /.../ in multiline regexes.

You might have noticed that every regex in this book that spans more than a single line is delimited with braces rather than slashes. That's because it's much easier to identify the boundaries of the brace-delimited form, both by eye and from within an editor[*].

[*] Most editors can be configured to jump to a matching brace (in vi it's %; in Emacs it's a little more complicated; see Appendix C). You can also set most editors to autohighlight matching braces as you type (set the blink-matching-paren variable in Emacs, or the showmatch option in vi).

That ability is especially important in regexes where you need to match a literal slash, or in regexes which use many escape characters. For example, this:


    Readonly my $C_COMMENT => qr{
        / \*   # Opening C comment delimiter
        .*?    # Smallest number of characters (C comments don't nest)
        \* /   # Closing delimiter
    }xms;

is a little easier to read than the more heavily backslashed:

    Readonly my $C_COMMENT => qr/
        \/ \*  # Opening C comment delimiter
        .*?    # Smallest number of characters (delims don't nest)
        \* \/  # Closing delimiter
    /xms;

Using braces as delimiters can also be advantageous in single-line regexes that are heavily laden with slash characters. For example:

    $source_code =~ s/ \/ \* (.*?) \* \/ //gxms;

is considerably harder to unravel than:


    $source_code =~ s{ / \* (.*?) \* / }{}gxms;

In particular, a final empty {} as the replacement text is much easier to detect and decipher than a final empty //. Though, of course, it would be better still to write that substitution as:


    $source_code =~ s{$C_COMMENT}{$EMPTY_STR}gxms;

to ensure maximum maintainability.
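The named-pattern substitution can be tried end to end with a minimal sketch like the one below. It assumes plain lexical variables in place of the Readonly constants (so that no non-core module is needed), and the sample input string is hypothetical:

```perl
use strict;
use warnings;

# Assumption: ordinary lexicals stand in for the Readonly constants in the text
my $EMPTY_STR = q{};
my $C_COMMENT = qr{
    / \*   # Opening C comment delimiter
    .*?    # Smallest number of characters (C comments don't nest)
    \* /   # Closing delimiter
}xms;

# Hypothetical input: one single-line comment and one multiline comment
my $source_code = "int x = 1; /* counter */\nint y = 2; /* multi\nline */\n";

# Remove every C comment, using the brace-delimited form
$source_code =~ s{$C_COMMENT}{$EMPTY_STR}gxms;

print $source_code;
```

Because of the /s flag, the .*? in the pattern also matches newlines, so the comment that spans two lines is removed as well.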

Using braces as regex delimiters has two other advantages. Firstly, in a substitution, the two "halves" of the operation can be placed on separate lines, to further distinguish them from each other. For example:


    $source_code =~ s{$C_COMMENT}
                     {$EMPTY_STR}xms;

The second advantage is that raw braces "nest" correctly within brace delimiters, whereas raw slashes don't nest at all within slash-delimited patterns. This is a particular problem under the /x flag, because it means that a seemingly straightforward regex like:

    
    # Parse a 'set' command in our mini-language...

    m/
        set      \s+  # Keyword
        ($IDENT) \s*  # Name of file/option/mode
        =        \s*  # literal =
        ([^\n]*)      # Value of file/option/mode
    /xms;

is seriously (and subtly) broken. It's broken because the compiler first determines that the regex delimiter is a slash, so it looks ahead to locate the next unescaped slash in the source, and treats the intervening characters as the pattern. Then it looks for any trailing regex flags, after which it continues parsing the next part of the current expression.

Unfortunately, in the previous example, the next unescaped slash in the source is the first unescaped slash in the line:

        ($IDENT) \s*  # Name of file/option/mode

which means that the regex finishes at that point, causing the code to be parsed as if it were something like:

    m/
        set      \s+ # Keyword
        ($IDENT) \s* # Name of file/o

    ption(  ) / mode(  ) = \s*           # literal =
                      ([^\n(  )]*)     # Value of file/option/mode
                      /xms(  )
                   );

whereupon it complains bitterly about the illegal call to the (probably non-existent) ption( ) subroutine, when it was expecting an operator or semicolon after the end of that nice m/.../o pattern. It probably won't be too pleased about the incomplete s*...*...* substitution with the weird asterisk delimiters either, or the dodgy assignment to mode( ).

The problem is that programmers expect comments to have no compile-time semantics[*]. But, within a regex, a comment becomes a comment only after the parser has decided where the surrounding regex finishes. So a slash character that seems to be within a regex comment may actually be a slash delimiter in your code.

[*] That sentence originally read: "The problem is that programmers are used to ignoring the specific content of comments." Which is depressingly true, but not the relevant observation here.

Using braces as delimiters significantly reduces the likelihood of encountering this problem:


    m{
        set       \s+  # Keyword
        ($IDENT)  \s*  # Name of file/option/mode
        =         \s*  # literal =
        ([^\n]*)       # Value of file/option/mode
    }xms;

because the slashes are no longer special to the parser, which consequently parses the entire regex correctly. Furthermore, as matching braces may be nested inside a brace-delimited regex, this variation is okay too:


    m{
        set       \s+  # Keyword
        ($IDENT)  \s*  # Name of file/option/mode
        =         \s*  # literal =
        \{             # literal {
        ([^\n]*)       # Value of file/option/mode
        \}             # literal }
    }xms;
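The difference between the two delimiter styles can be checked with a rough sketch such as the following. It is an illustration only: the definition of $IDENT is a hypothetical stand-in, and string eval is used purely so that the broken slash-delimited variant can be compiled without aborting the whole script (Perl may still print parser diagnostics to STDERR while doing so):

```perl
use strict;
use warnings;

# Hypothetical identifier pattern standing in for the $IDENT used in the text
my $IDENT = qr{ [A-Za-z_]\w* }xms;

# Slash-delimited: the first slash in "file/option/mode" ends the pattern early
my $slash_version = <<'END_CODE';
    sub { $_[0] =~ m/
        set      \s+  # Keyword
        ($IDENT) \s*  # Name of file/option/mode
        =        \s*  # literal =
        ([^\n]*)      # Value of file/option/mode
    /xms }
END_CODE

# Brace-delimited: slashes in the comments are not special to the parser
my $brace_version = <<'END_CODE';
    sub { $_[0] =~ m{
        set      \s+  # Keyword
        ($IDENT) \s*  # Name of file/option/mode
        =        \s*  # literal =
        ([^\n]*)      # Value of file/option/mode
    }xms }
END_CODE

# A string eval returns undef (and sets $@) when its code fails to compile
my $broken  = eval $slash_version;
my $matcher = eval $brace_version;

print 'slash-delimited compiles: ', (defined $broken  ? 'yes' : 'no'), "\n";
print 'brace-delimited compiles: ', (defined $matcher ? 'yes' : 'no'), "\n";

# In list context the match returns its two captures
my ($name, $value) = $matcher->('set mode = fast');
print "$name => $value\n";
```

The exact diagnostics produced for the slash-delimited variant differ between Perl versions, so the sketch only checks whether each variant compiles.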

Of course, unbalanced raw braces still cause problems within regex comments:

    m{
        set       \s+  # Keyword
        ($IDENT)  \s*  # Name of file/option/mode
        =         \s*  # literal =
        ([^\n]*)       # Value of file/option/mode
         \}            # literal }
    }xms;

However, unlike /, unbalanced raw braces are not a valid English punctuation form, and hence they're far rarer within comments than slashes. Besides which, the error message that's generated by that particular mistake:


    Unmatched right curly bracket at demo.pl line 49, at end of line
    (Might be a runaway multi-line {} string starting on line 42)

is much clearer than the sundry lamentations the equivalent slash-delimited version would produce:

    Bareword found where operator expected at demo.pl line 46,
    near "($IDENT)     # File/option"
        (Might be a runaway multi-line // string starting on line 42)
        (Missing operator before ption?)
    Backslash found where operator expected at demo.pl line 49,
    near ")     # File/option/mode value \"
        (Missing operator before \?)
    syntax error at demo.pl line 46, near "($IDENT)     # File/option"
    Unmatched right curly bracket at demo.pl line 7, at end of line

So use m{...}xms in preference to /.../xms wherever possible. Indeed, the only reason to ever use slashes to delimit regexes is to improve the comprehensibility of short, embedded patterns. For example, within the blocks of list operations:


    my @counts = map { m/(\d{4,8})/xms } @count_reports;

slashes are better than braces. A brace-delimited version of the same regex would be using braces to denote "code block", "regex boundary", and "repetition count", all within the space of 20 characters:

    my @counts = map { m{(\d{4,8})}xms } @count_reports;

Using the slashes as the regex delimiters in this case increases the visual distinctiveness of the regex and thereby improves the overall readability of the code.
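As a small check that the choice here is purely one of readability, the two forms can be run side by side. The report strings below are hypothetical sample data:

```perl
use strict;
use warnings;

# Hypothetical report lines; the last one contains no count at all
my @count_reports = ('widgets: 1234', 'gadgets: 56789012', 'none today');

# In list context a successful match returns its capture,
# and a failed match contributes nothing to the result list
my @with_slashes = map { m/(\d{4,8})/xms } @count_reports;
my @with_braces  = map { m{(\d{4,8})}xms } @count_reports;

print "@with_slashes\n";
print "@with_braces\n";
```

Both lines of output are identical, so the only difference between the two versions is how easily the regex can be picked out from the surrounding block.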
