#50 Numbered backrefs in string-level regex not working


Only one of the various possible syntaxes for backrefs within a regex seems to be working.

For instance:


should in theory find all forms consisting of the same character twice. However, it's not working. Neither are the following, which according to man pcre should be equivalent:


The following, however, DOES work, even though it SHOULD be identical to the preceding:


I suspect this is due to the regex optimiser and its lack of full PCRE-awareness (in cl/regopt.c) -- i.e. it is doing an incorrect optimisation and doing simple string matching on the first three but not on the fourth -- but cannot be sure without further investigation.


  • Serge Heiden

    Serge Heiden - 2012-04-03

    "(.)\2" does what "(.)\1" should do actually.
    There is apparently a +1 shift in the RE groups buffers in PCRE
    [TXM 0.6b2, CQP 3.4, Linux]

  • Stefan Evert

    Stefan Evert - 2012-04-03

    The CL internally rewrites the entered regexp <r> into ^(<r>)$ to enforce the anchoring. This was implemented in order to use standard regexp libraries; previously, the CWB included a specially hacked regexp implementation able to enforce anchoring.

    Solution for this: change the rewrite to ^(?:<r>)$ (or whatever the correct PCRE syntax for non-capturing parentheses is).

    The possibility of regexp optimiser problems should be investigated, though. I tried to be very conservative, but I'm not sure how escape sequences like \g are handled.

  • Andrew Hardie

    Andrew Hardie - 2012-04-03

    Hmm, that would explain it!

    I will change the wrap to a non-capturing bracket as soon as I get a chance.

  • Stefan Evert

    Stefan Evert - 2012-05-01

    Fixed in trunk in revision #313 (as suggested in comments).

  • Stefan Evert

    Stefan Evert - 2012-05-01
    • status: open --> closed-fixed
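
The group-number shift diagnosed in the comments above can be reproduced outside CWB. The following is only a minimal sketch using Python's `re` module rather than PCRE or the CL code itself, on the assumption that both number capturing groups the same way (by order of opening parenthesis). The helper names `wrap_capturing`, `wrap_noncapturing` and `matching` are hypothetical and not taken from cl/. One known difference: Python rejects a backreference to a group that is still open at compile time, so the sketch treats a compile error as "no match".

```python
import re

def wrap_capturing(r):
    # Old-style anchoring rewrite: the wrapper becomes group 1,
    # so every group in the user's regex is shifted up by one.
    return "^(" + r + ")$"

def wrap_noncapturing(r):
    # Proposed rewrite: (?:...) adds no group, numbering is unchanged.
    return "^(?:" + r + ")$"

def matching(pattern, words):
    # Return the words matched by pattern; an uncompilable pattern
    # (e.g. a backreference to a still-open group) matches nothing.
    try:
        rx = re.compile(pattern)
    except re.error:
        return []
    return [w for w in words if rx.search(w)]

WORDS = ["aa", "ab", "bb", "a"]

# With the capturing wrapper, \1 refers to the wrapper itself, not to (.)
shifted = matching(wrap_capturing(r"(.)\1"), WORDS)
# ... whereas \2 reaches the user's first group, as reported in the thread
workaround = matching(wrap_capturing(r"(.)\2"), WORDS)
# With the non-capturing wrapper, \1 behaves as the user expects
fixed = matching(wrap_noncapturing(r"(.)\1"), WORDS)

print(shifted, workaround, fixed)
```

Under these assumptions only the last two queries find the doubled-character forms, which is consistent with the `\2` observation and with the non-capturing fix.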
