Regex optimisation does not respect PCRE escape sequences
PCRE has many more escape sequences than POSIX regex (including Unicode properties), some of which consist of more than one character after the backslash.
For instance, the following are all single characters / wildcards:
and, even worse, there is a legal abbreviation of one-letter Unicode properties that omits the curly brackets.
To quote man pcre:
If only one letter is specified with \p or \P, it includes all the general category properties that start with that letter. In this case, in the absence of negation, the curly brackets in the escape sequence are optional; these two examples have the same effect:
  \p{L}
  \pL
(Something like /\pLust/ is not ambiguous, because only one-letter property names may be abbreviated this way, not two-letter ones; it can only mean \p{L} followed by "ust".)
The CL regex optimiser scans for grains in a regex on the assumption that a \ affects only the character immediately following it and nothing after that. This assumption is clearly wrong for PCRE, at least in UTF-8 mode, and could cause grains to be detected that are not real grains, and thus candidate strings to be falsely rejected.
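To illustrate the failure mode, here is a minimal Python sketch. It is not the CL optimiser's actual code: the function literal_runs, the pcre_aware flag and the short table of multi-character escapes are all hypothetical, and the "grains" here are simply runs of alphanumeric characters. It contrasts a scan that lets a backslash consume exactly one character with a scan that knows a few of the longer PCRE escapes.

```python
import re

# A partial, illustrative list of PCRE escapes that extend beyond one
# character after the backslash (assumption: real PCRE has more of these).
_LONG_ESCAPE = re.compile(
    r"\\(?:"
    r"[pP]\{\^?[A-Za-z_&]+\}"   # \p{Lu}, \P{L}, \p{^Lu}
    r"|[pP][A-Za-z]"            # \pL, \PL (one-letter abbreviation)
    r"|x\{[0-9A-Fa-f]+\}"       # \x{263A}
    r"|x[0-9A-Fa-f]{0,2}"       # \x41
    r")"
)


def literal_runs(pattern, pcre_aware):
    """Return runs of alphanumeric characters that a scanner would treat
    as literal text (a stand-in for 'grains'; hypothetical helper)."""
    runs = []
    current = ""
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            m = _LONG_ESCAPE.match(pattern, i) if pcre_aware else None
            # Naive mode: the backslash consumes exactly one character.
            i = m.end() if m else i + 2
            if current:
                runs.append(current)
            current = ""
        elif ch.isalnum():
            current += ch
            i += 1
        else:
            # Any other character is treated as a metacharacter here.
            if current:
                runs.append(current)
            current = ""
            i += 1
    if current:
        runs.append(current)
    return runs


print(literal_runs(r"\pLust", pcre_aware=False))      # ['Lust']
print(literal_runs(r"\pLust", pcre_aware=True))       # ['ust']
print(literal_runs(r"\x{263A}bc", pcre_aware=False))  # ['263A', 'bc']
print(literal_runs(r"\x{263A}bc", pcre_aware=True))   # ['bc']
```

With the naive scan, /\pLust/ appears to require the literal string "Lust", so a candidate such as "must", which the regex does match, would be rejected before the regex is ever run.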
For current users: this bug won't kick in if you stick to basic regex syntax!