#74 Inconsistency in CQP regexp matching

TODO-3.5
closed-fixed
None
6
2022-02-14
2022-02-12
No

The query [pos="PP$"] matches the Penn tag PP$, but the queries [pos="PP$"%c] and [pos="PP$|PP$"] match PP instead.

The reason, of course, is that the first query is matched as a literal string rather than a regexp, so $ isn't interpreted as a metacharacter anchoring the regexp at end-of-string. CQP heuristically checks for metacharacters in do_flagged_string() (in cqp/parse_actions.c), but the list doesn't include the “useless” metacharacters $ and ^. This raises three questions:
1. Should we change behaviour to ensure consistency between the three queries? This might break existing applications (and users) who have unwittingly relied on the current inconsistent behaviour.
2. If we do, perhaps the current list "[](){}.*+|?\\" has further gaps?
3. Is do_flagged_string() the only place where this test is run or do we need to patch other functions as well?
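
The inconsistency can be reproduced outside CQP. What follows is only a rough Python sketch of the heuristic, not the C code in cqp/parse_actions.c: the names OLD_METACHARS, looks_literal and matches are made up for illustration, and Python's re module stands in for PCRE. It also assumes that a %c flag simply forces the regexp path, which is a guess at why that query behaves differently.

```python
import re

# Metacharacter list quoted in the ticket (note: no $ and no ^).
OLD_METACHARS = set("[](){}.*+|?\\")


def looks_literal(pattern, metachars=OLD_METACHARS):
    """Heuristic: a pattern without any listed metacharacter is a plain string."""
    return not any(ch in metachars for ch in pattern)


def matches(pattern, tag, ignore_case=False, metachars=OLD_METACHARS):
    """Match a whole attribute value, literally or as an anchored regexp."""
    if not ignore_case and looks_literal(pattern, metachars):
        return pattern == tag  # literal string comparison, $ is just a character
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(pattern, tag, flags) is not None  # $ is an anchor here


print(matches("PP$", "PP$"))                   # True: literal path
print(matches("PP$", "PP", ignore_case=True))  # True: regexp path
print(matches("PP$|PP$", "PP"))                # True: | forces the regexp path
```

So the same three characters PP$ select different tags depending on whether something else in the query happens to trigger regexp compilation.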

Discussion

  • Andrew Hardie - 2022-02-14
    1. My vote is to add $ and ^ to the list, backwards compatibility be damned.
    2. No, that's the whole set. Other characters with special meaning in PCRE are dependent on one of those (e.g. : is only special after ?)
    3. No, there's another case of the same thing in do_XMLTag(). My inclination would be to define CHARS_MAKING_REGEX_NONLITERAL as a macro used in both locations.
     
  • Stephanie Evert - 2022-02-14

    Fixed in r1705. The macro CL_REGEX_METACHARACTERS is defined in cl/cl.h (and the documentation explains that it's not a list of all "unsafe" characters despite the name).
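
The effect of sharing one metacharacter list between both call sites can be sketched roughly as below. This is Python rather than the C in cl/cl.h, and the contents of CL_REGEX_METACHARACTERS are not quoted in this ticket, so the value here is an assumption: the old list plus $ and ^, as proposed in the discussion. The two match_* functions are hypothetical stand-ins for the checks in do_flagged_string() and do_XMLTag(), and Python's re module stands in for PCRE.

```python
import re

# ASSUMED contents: the list from the ticket with $ and ^ added.
CL_REGEX_METACHARACTERS = "[](){}.*+|?\\^$"


def is_plain_string(pattern):
    """True if the pattern contains none of the shared metacharacters."""
    return not any(ch in CL_REGEX_METACHARACTERS for ch in pattern)


def _match(pattern, value):
    # Single decision point, so both call sites always agree.
    if is_plain_string(pattern):
        return pattern == value
    return re.fullmatch(pattern, value) is not None


def match_token_attribute(pattern, value):
    """Hypothetical stand-in for the check in do_flagged_string()."""
    return _match(pattern, value)


def match_xml_tag_value(pattern, value):
    """Hypothetical stand-in for the check in do_XMLTag()."""
    return _match(pattern, value)


print(match_token_attribute("PP$", "PP"))     # True: $ now anchors
print(match_token_attribute("PP$", "PP$"))    # False
print(match_token_attribute(r"PP\$", "PP$"))  # True: escaped $ is literal
```

Under this assumption, a query meant to find the tag PP$ itself has to escape the dollar sign, e.g. [pos="PP\$"].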

     
  • Stephanie Evert - 2022-02-14
    • status: open --> closed-fixed
     