#39 Regex optimisation does not respect PCRE esc seqs

Milestone: TODO-3.5
Status: closed-fixed
Priority: 6
Updated: 2016-07-20
Created: 2010-09-03
Private: No

PCRE has many more escape sequences than POSIX regex, including Unicode properties, and some of them consist of more than one character after the backslash.

For instance, each of the following is a single character or wildcard:

\p{M}
\p{Lu}
\P{Pf}
\x{1a024}

Even worse, one-letter Unicode properties can legally be abbreviated without the curly brackets, for example:

\pL

To quote man pcre:
---------
If only one letter is specified with \p or \P, it includes all the general
category properties that start with that letter. In this case, in the absence
of negation, the curly brackets in the escape sequence are optional; these
two examples have the same effect:

\p{L}
\pL
---------
(Something like /\pLust/ is not ambiguous, because only one-letter properties may be abbreviated, not two-letter ones: it parses as \pL followed by the literal "ust", never as \p{Lu} followed by "st".)
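
To make the syntax above concrete, here is a minimal sketch of how a scanner could measure the length of an escape sequence. The helper name `escape_length` is hypothetical, and the sketch covers only the escape forms quoted in this ticket. Other multi-character PCRE escapes (for example `\x41` or `\cX`) would need further branches and are wrongly treated as one-character escapes here.

```python
import re

# Hypothetical helper for illustration; not part of CWB or PCRE.
# Only the escape forms quoted in this ticket are recognised.
_ESCAPE_FORMS = re.compile(
    r"""\\(?:
        [pP]\{\^?[A-Za-z_&]+\}   # braced property, optionally negated
      | [pP][A-Za-z]             # one-letter abbreviation without braces
      | x\{[0-9A-Fa-f]+\}        # braced hexadecimal code point
      | .                        # ASSUMPTION: any other escape is one character
    )""",
    re.VERBOSE | re.DOTALL,
)

def escape_length(pattern: str, pos: int) -> int:
    """Return the number of characters in the escape sequence starting at pos."""
    m = _ESCAPE_FORMS.match(pattern, pos)
    if m is None:
        raise ValueError("no escape sequence at position %d" % pos)
    return m.end() - pos

if __name__ == "__main__":
    for rx in (r"\p{M}", r"\p{Lu}", r"\P{Pf}", r"\x{1a024}", r"\pLust"):
        print(rx, escape_length(rx, 0))
```

For `\pLust` the helper consumes only `\pL`, leaving `ust` as ordinary literals, which is the reading described above.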

The CL regex optimiser scans a regex for grains on the assumption that a \ affects only the character immediately following it. This assumption is wrong for PCRE, at least in UTF-8 mode: the scanner can detect grains that are not real grains and therefore falsely reject candidate strings.
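
The failure mode can be illustrated with a toy scanner. This is not the CL code: `naive_grain` is a hypothetical function, and it assumes that a grain is a literal substring every match must contain. To stay simple it accepts only patterns made of alphanumeric literals and backslash escapes.

```python
def naive_grain(pattern: str) -> str:
    """Longest literal run, wrongly assuming a backslash affects one character."""
    best = run = ""
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # The flawed assumption: skip the backslash plus exactly one character.
            best, run = max(best, run, key=len), ""
            i += 2
        elif ch.isalnum():
            run += ch
            i += 1
        else:
            # Metacharacters are outside the scope of this toy scanner.
            raise ValueError("unsupported character %r" % ch)
    return max(best, run, key=len)

if __name__ == "__main__":
    grain = naive_grain(r"\pLust")   # the scanner sees \p followed by "Lust"
    candidate = "must"               # PCRE would match: "m" is a letter, then "ust"
    print(grain, grain in candidate)
```

Because the scanner treats `L` as a literal, it proposes the grain `Lust`. A pre-filter based on that grain discards `must`, although /\pLust/ matches it.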

For current users: this bug won't kick in if you stick to basic regex syntax!

Discussion

  • Andrew Hardie

    Andrew Hardie - 2011-07-31
    • milestone: --> TODO-3.5
     
  • Stefan Evert

    Stefan Evert - 2012-05-01

    Modified regexp optimiser has been checked into SVN trunk (revision #312). Needs testing!

    CL in trunk/ is currently switched to debug mode, which does not optimise regexp matching and warns about false negatives that the optimiser would have produced. After satisfactory testing, don't forget to re-enable the optimiser for release version 3.5.

     
  • Stefan Evert

    Stefan Evert - 2012-05-01
    • status: open --> pending-fixed
     
  • Stefan Evert

    Stefan Evert - 2016-07-20
    • status: pending-fixed --> closed-fixed
     
  • Stefan Evert

    Stefan Evert - 2016-07-20

    New implementation of case/accent-folding and regexp optimizer in CWB 3.4.10 should work correctly with full PCRE syntax. The optimizer only recognizes a subset of PCRE, but parses regexps defensively and simply does not apply if there is anything it doesn't understand.
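
The defensive strategy described in the last comment can be sketched as follows. This is an illustration only, not the CWB 3.4.10 implementation: `safe_grain` and its token whitelist are hypothetical, and the recognised subset is far smaller than what the real optimizer handles. The point is the control flow: any construct outside the whitelist makes the function return `None`, meaning "do not optimise".

```python
import re

# Hypothetical whitelist: plain ASCII literals, plus a few escapes that are
# conservatively treated as wildcards (they end a grain, never extend one).
_TOKEN = re.compile(
    r"(?P<lit>[A-Za-z0-9]+)"
    r"|(?P<wild>\.|\\[pP]\{\^?\w+\}|\\[pP][A-Za-z]|\\x\{[0-9A-Fa-f]+\})"
)

def safe_grain(pattern: str):
    """Return the longest literal run, or None if anything is not understood."""
    pos, best = 0, ""
    while pos < len(pattern):
        m = _TOKEN.match(pattern, pos)
        if m is None:
            return None              # unknown syntax: give up, run the full regex
        lit = m.group("lit")
        if lit and len(lit) > len(best):
            best = lit
        pos = m.end()
    return best or None              # no literal run means there is no grain

if __name__ == "__main__":
    for rx in (r"\pLust", r"\p{Lu}ber", r"a+b", r"\dabc"):
        print(rx, safe_grain(rx))
```

With this approach `\pLust` yields the grain `ust`, while patterns containing quantifiers or unrecognised escapes yield no grain at all, so a string is never rejected on the basis of a misparsed regex.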

     
