#39 Regex optimisation does not respect PCRE esc seqs

TODO-3.5
pending-fixed
Stefan Evert
6
2012-05-01
2010-09-03
Andrew Hardie
No


PCRE has many more escape sequences than POSIX regex (including Unicode properties), some of which consist of more than one character after the backslash.

For instance, the following are all single characters / wildcards:

\p{M}
\p{Lu}
\P{Pf}
\x{1a024}

and, even worse, one-letter Unicode properties can legally be abbreviated to

\pL

To quote man pcre:
---------
If only one letter is specified with \p or \P, it includes all the general
category properties that start with that letter. In this case, in
the absence of negation, the curly brackets in the escape sequence are
optional; these two examples have the same effect:

\p{L}
\pL
---------
(Something like /\pLust/ is not ambiguous: only one-letter properties may be abbreviated, not two-letter ones, so it is always read as \pL followed by "ust", never as \p{Lu} followed by "st".)

The CL regex optimiser scans a regex for grains on the assumption that a backslash affects only the character immediately following it and nothing afterwards. This assumption is clearly wrong for PCRE, at least in UTF-8 mode: the optimiser could detect grains that are not real grains, and thus falsely reject candidate strings that the regex would in fact match.
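
To make the failure mode concrete, here is a minimal Python sketch. It is not the CL optimiser's code: the functions escape_end and literal_runs are hypothetical, and "grain" is simplified to mean a run of alphanumeric literals. It compares a scanner that skips exactly one character after a backslash with one that consumes the whole PCRE escape sequence.

```python
def escape_end(pattern, i):
    """Return the index just past the escape sequence that starts at
    pattern[i] == '\\'. Only the escapes from this ticket are handled
    (\\p{..}, \\P{..}, \\pL, \\x{..}); everything else is assumed to be
    a backslash plus one character."""
    nxt = pattern[i + 1:i + 2]
    if nxt in ("p", "P", "x") and pattern[i + 2:i + 3] == "{":
        close = pattern.find("}", i + 3)
        if close != -1:
            return close + 1          # braced form, e.g. \p{Lu} or \x{1a024}
    if nxt in ("p", "P") and pattern[i + 2:i + 3].isalpha():
        return i + 3                  # abbreviated one-letter form, e.g. \pL
    return min(i + 2, len(pattern))   # backslash plus one character


def literal_runs(pattern, pcre_aware):
    """Collect runs of alphanumeric literals (a stand-in for grains).
    With pcre_aware=False, a backslash is assumed to affect only the
    next character, which is the assumption described above."""
    runs, current, i = [], "", 0
    while i < len(pattern):
        ch = pattern[i]
        if ch.isalnum():
            current += ch
            i += 1
            continue
        if current:                   # any non-literal ends the current run
            runs.append(current)
            current = ""
        if ch == "\\":
            i = escape_end(pattern, i) if pcre_aware else min(i + 2, len(pattern))
        else:
            i += 1
    if current:
        runs.append(current)
    return runs


if __name__ == "__main__":
    for regex in (r"\pLust", r"\p{Lu}ber", r"\x{1a024}abc"):
        print(regex, literal_runs(regex, False), literal_runs(regex, True))
```

For \pLust the one-character scanner reports "Lust" as a literal run, whereas the escape-aware scanner reports only "ust". A string such as "must" matches \pLust but does not contain "Lust", so a filter built on the first result would wrongly reject it.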

Note for current users: this bug won't kick in if you stick to basic regex syntax!

Discussion

  • Andrew Hardie
    2011-07-31

    • milestone: --> TODO-3.5
     
  • Stefan Evert
    2012-05-01

    Modified regexp optimiser has been checked into SVN trunk (revision #312). Needs testing!

    CL in trunk/ is currently switched to debug mode, which does not optimise regexp matching and warns about false negatives that the optimiser would have produced. After satisfactory testing, don't forget to re-enable the optimiser for release version 3.5.

     
  • Stefan Evert
    2012-05-01

    • status: open --> pending-fixed