Приглашаем посетить
PHP (php.find-info.ru)

Section 12.23.  Backtracking

Previous
Table of Contents
Next

12.23. Backtracking

Prevent useless backtracking.

In the final example of the previous guideline:


    qr{
       with \s+
       (?: each \s+
           (?:$EXPR
             | $VAR  \s* in \s* [(] $LIST [)]
           )
         | [(] $LIST [)]
       )
       \s* $BLOCK
    }xms

if the match successfully reaches the shared \s* $BLOCK suffix but subsequently fails to match the trailing block, then the regex engine will immediately backtrack. That backtracking will cause it to reconsider the various (nested) alternatives: first by backtracking within the previous successful alternative, and then by trying any remaining unexamined alternatives. That's potentially a lot of expensive matching, all of which is utterly useless. For a start, the syntaxes of the various options are mutually exclusive, so if one of them already matched, none of the subsequent candidates ever will.

Even if that weren't the case, the regex is backtracking only because there wasn't a valid block at the end of the loop specification. But backtracking and messing around with the other alternatives won't change that fact. Even if the regex does find another way to match the first part of the loop specification, there still won't be a valid block at the end of the string when matching reaches that point again.

This particular situation arises every time an alternation consists of mutually exclusive alternatives. The "dumb but fast" behaviour of the regex engine forces it to go back and mindlessly try every other possibility, even whento an outside observerthat's provably a complete waste of time and the engine would do much better to just forget about backtracking into the alternation.

As before, you have to explicitly point that optimization out to Perl. In this case, that's done by enclosing the alternation in a special form of parentheses: (?>...). These are Perl's "don't-ever-backtrack-into-me" markers. They tell the regex engine that the enclosed subpattern can safely be skipped over during backtracking, because you're confident that re-matching the contents either won't succeed or, if it does succeed, won't help the overall match.

In practical terms, you just need to replace the (?:...) parentheses of any mutually exclusive set of alternatives with (?>...) parentheses. For example, like this:


    m{
       with \s+
       (?> each \s+                              
# (?> means:
(?: $EXPR
#     There can be only
| $VAR \s* in \s* [(] $LIST [)]
#     one way to match
)
#     the enclosed set
| [(] $LIST [)]
#     of alternatives
)
# )
\s* $BLOCK }xms;

This kind of optimization is even more important for repeated subpatterns (especially those containing alternations).

Suppose you wanted to write a regular expression to match a parenthesized list of comma-separated items. You might write:

    $str =~ m{ [(]               # A literal opening paren
               $ITEM             # At least one item
               (?:               # Followed by...
                   ,             #     a comma
                   $ITEM         #     and another item
               )*                #  as many times as possible (but none is okay too)
               [)]               # A literal closing paren
             }xms;

That pattern works fine: it matches every parenthesized list you give it and fails to match everything else. But consider what actually happens when you give it nearly the right input. If, for example, $str contains a list of items that's missing its final closing parenthesis, then the regex engine would have to backtrack into the (?: , $ITEM)* and try to match one fewer comma-item sequence. But doing that will leave the matching position at the now-relinquished comma, which certainly won't match the required closing parenthesis. So the regex engine will backtrack again, giving back another comma-item sequence, leaving the matching position at the comma, where it will again fail to find a closing parenthesis. And so on and so on until every other possibility has failed.

There's no point in backtracking the repeated comma-item subpattern at all. Either it must succeed "all the way", or it can never succeed at all. So this is another ideal place to add a pair of non-backtracking parentheses. Like so:


    m{ [(]  $ITEM  (?> (?: , $ITEM )* )  [)] }xms;

Note that the (?>...) have to go around the entire repeated grouping; don't use them simply to replace the existing parentheses:

    m{ [(]  $ITEM      (?> , $ITEM )*    [)] }xms;     # A common mistake

This version is wrong because the repetition marker is still outside the (?>...) and hence it will still be allowed to backtrack (uselessly).

In summary, whenever two subpatterns X and Y are mutually exclusive in terms of the strings they match, then rewrite any instance of:

    
    X | Y

as:


    (?> 
X
|
Y
)

and rewrite:

    
    X* Y

as:


    (?> 
X
* )
Y

    Previous
    Table of Contents
    Next