#50 Numbered backrefs in string-level regex not working

TODO-3.5
closed-fixed
Andrew Hardie
7
2012-05-01
2012-04-02
Andrew Hardie
No

Only one of the various possible syntaxes for backrefs within a regex seems to be working.

For instance:

"(.)\1"

should in theory find all forms consisting of the same character twice. However, it's not working. Neither are the following, which according to man pcre should be equivalent:

"(.)\g1"
"(.)\g{1}"

The following, however, DOES work, even though it SHOULD be identical to the preceding:

"(?P<name>.)\g{name}"

I suspect this is due to the regex optimiser and its lack of full PCRE-awareness (in cl/regopt.c) -- i.e. it is doing an incorrect optimisation and doing simple string matching on the first three but not on the fourth -- but cannot be sure without further investigation.

Discussion

  • Serge Heiden
    2012-04-03

    "(.)\2" does what "(.)\1" should do actually.
    There is apparently a +1 shift in the RE group buffers in PCRE.
    [TXM 0.6b2, CQP 3.4, Linux]

     
  • Stefan Evert
    2012-04-03

    The CL internally rewrites the entered regexp <r> into ^(<r>)$ to enforce the anchoring. This was implemented in order to use standard regexp libraries; previously, the CWB included a specially hacked regexp implementation able to enforce anchoring.

    Solution for this: change the rewrite to ^(?:<r>)$ (or whatever the correct PCRE syntax for non-capturing parentheses is).

    The possibility of regexp optimiser problems should be investigated, though. I tried to be very conservative, but I'm not sure how escape sequences like \g are handled.

     
  • Andrew Hardie
    2012-04-03

    Hmm, that would explain it!

    I will change the wrap to a non-capturing bracket as soon as I get a chance.

     
  • Stefan Evert
    2012-05-01

    Fixed in trunk in revision #313 (as suggested in comments).

     
  • Stefan Evert
    2012-05-01

    • status: open --> closed-fixed
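
The group-numbering shift discussed in this ticket can be reproduced outside CWB. The following is a minimal sketch in Python, whose `re` module is not PCRE and is used here only to illustrate the numbering. The helper names (`wrap_capturing`, `wrap_noncapturing`, `matches`) are hypothetical and are not part of the CL code. Python writes a named backreference as `(?P=name)` rather than `\g{name}`, and it rejects a reference to a still-open group at compile time, whereas the ticket reports that PCRE simply failed to match.

```python
import re


def wrap_capturing(r):
    # Anchoring wrap as described in the ticket: ^(<r>)$
    # The added parentheses become group 1, so the user's first group is group 2.
    return "^(" + r + ")$"


def wrap_noncapturing(r):
    # Proposed fix: ^(?:<r>)$ adds no group, so user group numbers are unchanged.
    return "^(?:" + r + ")$"


def matches(wrapped, s):
    # True if the wrapped (anchored) pattern matches the string s.
    return re.search(wrapped, s) is not None


# With the capturing wrap, the user's "(.)" is group 2, so "\2" behaves as "\1" should.
shifted_ok = matches(wrap_capturing(r"(.)\2"), "aa")

# With the non-capturing wrap, "\1" refers to the user's first group again.
fixed_ok = matches(wrap_noncapturing(r"(.)\1"), "aa")

# Named references do not depend on group numbers, so they survive either wrap.
named_ok = matches(wrap_capturing(r"(?P<name>.)(?P=name)"), "aa")

# Under the capturing wrap, "\1" points at the wrapper group itself, which is
# still open at that position; Python's re refuses to compile such a pattern.
try:
    re.compile(wrap_capturing(r"(.)\1"))
    open_group_rejected = False
except re.error:
    open_group_rejected = True
```

This is consistent with the observations above: `"(.)\2"` doing the job of `"(.)\1"`, the named form working regardless, and the non-capturing wrap restoring the expected numbering.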