Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo

Close

#3055 Regexp backreference fail with a * closure

obsolete: 8.4.9
open
Pavel Goran
8
2009-05-23
2005-02-03
James Bonfield
No

The documentation has an example:

"([bc])\1 matches bb or cc but not `bc'". So I tried:

% regexp {^([bc])\1$} bb
1
% regexp {^([bc])\1$} cc
1
% regexp {^([bc])\1$} bc
0

All well and good. But what if we want b* or c* but not
[bc]* (ie disallow bbcb). The obvious solution is \1*:

% regexp {^([bc])\1*$} bbb
1
% regexp {^([bc])\1*$} bcb
1

bcb should not match!

Oddly, using \1+ works (but means of course that "b" or
"c" solo does not match).

James

Discussion

  • Jan Wielemaker
    Jan Wielemaker
    2005-09-18

    Logged In: YES
    user_id=1222641

    Looks related to what I found:

    % set x {cxdaxa}
    cxdaxa
    % regexp {((.)x\2)+} $x match; set match
    cxdaxa
    % set x {cydaxa}
    cydaxa
    % regexp {((.)x\2)+} $x match; set match
    axa

    TCL version 8.4.7, SuSE Linux 9.2

     
  • Logged In: YES
    user_id=79902

    There are comments in the RE engine that describe backrefs
    as "the feature from the black lagoon". I feel that's
    instructive (if scary).

    FWIW (and the -inline flag is great for this):
    % regexp -inline {^([bc])\1+?$} bbb
    bbb b
    % regexp -inline {^([bc])\1+?$} bcb
    % info patch
    8.4.7

     
  • Logged In: YES
    user_id=79902

    Oops, +? means non-greedy as opposed to the somewhat unusual
    thing I wanted. Need more clutter to get that I suppose...
    % regexp -inline {^([bc])(?:\1+)?$} b
    b b
    % regexp -inline {^([bc])(?:\1+)?$} bc
    % regexp -inline {^([bc])(?:\1+)?$} bb
    bb b
    % regexp -inline {^([bc])(?:\1+)?$} bbbbbb
    bbbbbb b

     
  • Logged In: YES
    user_id=32759
    Originator: NO

    A slightly simpler set of examples is:

    % regexp {^(.)\1*$} abc
    1
    % regexp {^(a)\1*$} abc
    0
    % regexp {^(.)(\1)*$} abc
    0

    All of these should return 0.

     
  • There are still serious problems here with the HEAD.

    $ make shell
    DYLD_LIBRARY_PATH=`pwd`: TCLLIBPATH="/Users/dkf/Documents/software/tcl/unix/pkgs" TCL_LIBRARY="/Users/dkf/Documents/software/tcl/library" ./tclsh
    % info patch
    8.6b1.1
    % regexp -all -inline {(.)\1*} aaabbbcd
    aaabbbcd a
    % regexp -all -inline {(.)\1+) aaabbbcd
    ^Cmake: *** [shell] Interrupt

    All I was after was an expression that would allow me to get all the runs of any character (as a first stage to writing a run-length encoder). With a “*”, it matched the wrong thing. With a “+” it hangs! (Or at least it wasn't producing an answer in a reasonable amount of time so I killed it...) Elevating priority because of the “+” problem.

     
    • priority: 5 --> 8
     
  • Tom Lane
    Tom Lane
    2012-02-14

    I tried to upload this as a patch, but that didn't seem to work amazingly well, so let's see about pasting it right into the comment box ...

    The attached patch fixes the case of "\n*", that is a backref with *
    quantification directly attached to it.

    The core of the problem is in regcomp.c, which first changes "\n*" to
    "\n+|" (lines 1167ff as of 8.5.11) and then applies repeat() to the NFA
    representing the backref atom (line 1198). repeat() thinks that any arc
    leading into its "rp" argument is part of the sub-NFA to be repeated ---
    see moveins() call at line 1363. Unfortunately, since line 1170 already
    created the arc that was intended to represent the empty bypass around
    "\n+", this arc gets moved too, so that it now leads into the state loop
    created by repeat(). Thus, what was supposed to be an "empty" bypass gets
    turned into something that represents zero or more repetitions of the NFA
    representing the backref atom. In the original example, in place of
    ^([bc])\1*$
    we now have something that acts like
    ^([bc])(\1+|[bc]*)$
    At runtime, the branch involving the actual backref fails, as it's supposed
    to, but then the other branch succeeds anyway.

    We could no doubt fix this by some rearrangement of the operations in
    parseqatom(), but that code is plenty ugly already. What I propose instead
    is to allow the m == 0 case to be handled at runtime, so that there is no
    need to install the empty-alternative branch at all for the backref case.
    This makes the patch in regcomp.c a one-liner, at the cost of having to
    tweak cbrdissect() a little. In the attached patch I went a bit further
    than that and rewrote cbrdissect() to check all the string-length-related
    conditions before it starts comparing characters. It seems a bit stupid to
    possibly iterate through many many copies of an n-character backreference,
    only to fail at the end because the target string's length isn't a multiple
    of n --- we could have found that out before starting. The existing coding
    could only be a win if integer division is hugely expensive compared to
    character comparison, but I don't know of any modern machine where that
    might be true.

    The patch also corrects an obviously bogus comment at regexec.c:802.
    Perhaps at one time it was valid, but now this is the only caller of
    cbrdissect() ...

    This does not fix all the problems with quantified back-references. In
    particular, the code is completely broken for back-references that appear
    within a larger expression that is quantified, so several of the cases
    shown in the comments to this bug entry still don't work. I do not have
    a solution for that yet; it looks to me like it may require major surgery.
    However, the changes I propose now seem appropriate to make anyway. Doing
    direct handling of quantification in a simple back-ref is a performance win
    compared to anything we can possibly do at a higher level, so this code
    path seems worth preserving even if we need another solution for the more
    general case.

    diff -cr tcl8.5.11.orig/generic/regcomp.c tcl8.5.11/generic/regcomp.c
    *** tcl8.5.11.orig/generic/regcomp.c Tue Nov 1 10:04:51 2011
    --- tcl8.5.11/generic/regcomp.c Mon Feb 13 16:03:05 2012
    ***************
    *** 1161,1170 ****
    }

    /*
    ! * It's quantifier time; first, turn x{0,...} into x{1,...}|empty
    */

    ! if (m == 0) {
    EMPTYARC(s2, atom->end);/* the bypass */
    assert(PREF(qprefer) != 0);
    f = COMBINE(qprefer, atom->flags);
    --- 1161,1172 ----
    }

    /*
    ! * It's quantifier time. If the atom is just a BACKREF, we'll let it deal
    ! * with quantifiers internally. Otherwise, the first step is to turn
    ! * x{0,...} into x{1,...}|empty
    */

    ! if (m == 0 && atomtype != BACKREF) {
    EMPTYARC(s2, atom->end);/* the bypass */
    assert(PREF(qprefer) != 0);
    f = COMBINE(qprefer, atom->flags);
    diff -cr tcl8.5.11.orig/generic/regexec.c tcl8.5.11/generic/regexec.c
    *** tcl8.5.11.orig/generic/regexec.c Tue Nov 1 10:04:51 2011
    --- tcl8.5.11/generic/regexec.c Mon Feb 13 16:03:11 2012
    ***************
    *** 799,805 ****
    case '|': /* alternation */
    assert(t->left != NULL);
    return caltdissect(v, t, begin, end);
    ! case 'b': /* back ref -- shouldn't be calling us! */
    assert(t->left == NULL && t->right == NULL);
    return cbrdissect(v, t, begin, end);
    case '.': /* concatenation */
    --- 799,805 ----
    case '|': /* alternation */
    assert(t->left != NULL);
    return caltdissect(v, t, begin, end);
    ! case 'b': /* back reference */
    assert(t->left == NULL && t->right == NULL);
    return cbrdissect(v, t, begin, end);
    case '.': /* concatenation */
    ***************
    *** 1065,1076 ****
    chr *begin, /* beginning of relevant substring */
    chr *end) /* end of same */
    {
    - int i;
    int n = t->subno;
    ! size_t len;
    ! chr *paren;
    chr *p;
    - chr *stop;
    int min = t->min;
    int max = t->max;

    --- 1065,1076 ----
    chr *begin, /* beginning of relevant substring */
    chr *end) /* end of same */
    {
    int n = t->subno;
    ! size_t numreps;
    ! size_t tlen;
    ! size_t brlen;
    ! chr *brstring;
    chr *p;
    int min = t->min;
    int max = t->max;

    ***************
    *** 1081,1091 ****

    MDEBUG(("cbackref n%d %d{%d-%d}\n", t->retry, n, min, max));

    if (v->pmatch[n].rm_so == -1) {
    return REG_NOMATCH;
    }
    ! paren = v->start + v->pmatch[n].rm_so;
    ! len = v->pmatch[n].rm_eo - v->pmatch[n].rm_so;

    /*
    * No room to maneuver -- retries are pointless.
    --- 1081,1092 ----

    MDEBUG(("cbackref n%d %d{%d-%d}\n", t->retry, n, min, max));

    + /* Get the backreferenced string */
    if (v->pmatch[n].rm_so == -1) {
    return REG_NOMATCH;
    }
    ! brstring = v->start + v->pmatch[n].rm_so;
    ! brlen = v->pmatch[n].rm_eo - v->pmatch[n].rm_so;

    /*
    * No room to maneuver -- retries are pointless.
    ***************
    *** 1096,1146 ****
    }
    v->mem[t->retry] = 1;

    ! /*
    ! * Special-case zero-length string.
    ! */
    !
    ! if (len == 0) {
    ! if (begin == end) {
    return REG_OKAY;
    }
    return REG_NOMATCH;
    }
    !
    ! /*
    ! * And too-short string.
    ! */
    !
    ! assert(end >= begin);
    ! if ((size_t)(end - begin) < len) {
    return REG_NOMATCH;
    }
    - stop = end - len;

    /*
    ! * Count occurrences.
    */

    ! i = 0;
    ! for (p = begin; p <= stop && (i < max || max == INFINITY); p += len) {
    ! if ((*v->g->compare)(paren, p, len) != 0) {
    ! break;
    ! }
    ! i++;
    }
    - MDEBUG(("cbackref found %d\n", i));
    -
    - /*
    - * And sort it out.
    - */

    ! if (p != end) { /* didn't consume all of it */
    ! return REG_NOMATCH;
    ! }
    ! if (min <= i && (i <= max || max == INFINITY)) {
    ! return REG_OKAY;
    ! }
    ! return REG_NOMATCH; /* out of range */
    }

    /*
    --- 1097,1145 ----
    }
    v->mem[t->retry] = 1;

    ! /* Special cases for zero-length strings */
    ! if (brlen == 0) {
    ! /*
    ! * Matches only if target is zero length, but any number of
    ! * repetitions can be considered to be present
    ! */
    ! if (begin == end && min <= max) {
    ! MDEBUG(("cbackref matched trivially\n"));
    return REG_OKAY;
    }
    return REG_NOMATCH;
    }
    ! if (begin == end) {
    ! /* Matches only if zero repetitions are okay */
    ! if (min == 0) {
    ! MDEBUG(("cbackref matched trivially\n"));
    ! return REG_OKAY;
    ! }
    return REG_NOMATCH;
    }

    /*
    ! * Check target length to see if it could possibly be an allowed number of
    ! * repetitions of brstring
    */
    + assert(end > begin);
    + tlen = end - begin;
    + if (tlen % brlen != 0)
    + return REG_NOMATCH;
    + numreps = tlen / brlen;
    + if (numreps < min || (numreps > max && max != INFINITY))
    + return REG_NOMATCH;

    ! /* Okay, compare the actual string contents */
    ! p = begin;
    ! while (numreps-- > 0) {
    ! if ((*v->g->compare) (brstring, p, brlen) != 0)
    ! return REG_NOMATCH;
    ! p += brlen;
    }

    ! MDEBUG(("cbackref matched\n"));
    ! return REG_OKAY;
    }

    /*

     
  • Tom Lane
    Tom Lane
    2012-02-14

    Sigh, that doesn't look very nice either. If you can't extract the patch from patch item 3487443, email me at tgl at sss dot pgh dot pa dot us, and I'll send you a clean copy.

     
  • Tom Lane
    Tom Lane
    2012-02-14

    Some notes about fixing the more general issue ...

    For context, realize that Tcl's regex engine is a hybrid DFA/NFA
    implementation. Mostly it relies on the DFA logic, which has fast and
    stable performance because it never has to backtrack. But there are two
    significant regex features that DFA can't handle: capturing parentheses and
    back-references. To deal with those things, the regex compilation code
    builds a tree of "subRE"s, which will be searched at runtime in NFA style.
    The existing types of subREs are:

    * plain: matches any string described by a DFA.
    * backref: matches any string that is equal to a string identified by a
    previous capture (or actually, any string that is N repetitions of such
    a string, but I digress).
    * capture: matches any string that matches its child subRE, then saves the
    match for later output or backref-matching.
    * concatenation: matches any string that matches the concatenation of its
    two child subREs.
    * alternation: matches any string that matches either of its two child
    subREs.

    concatenation implies searching, since it has to consider each possible
    division of the presented string into two parts to see if they can match
    its two child subREs. Similarly, alternation requires searching, although
    it has only two choices to consider not N. To reduce unnecessary
    searching, Henry Spencer made the compilation code create a DFA for each
    subRE node, not only the "plain" nodes. The DFA describes what could
    possibly match the subRE in as much detail as DFA can manage (which is
    actually everything except the effects of back-refs). At runtime, the
    subRE tree searching code first executes the associated DFA, and gives up
    immediately if it finds no match. Otherwise it starts working through the
    possible match candidates.

    Now, if you're a programmer, it probably struck you as odd that the subRE
    types listed above don't include any iteration capability. How did Henry
    expect to deal with quantified regexes? The answer appears in the "it's
    quantifier time" portion of parseqatom(). He transforms the general case
    x{m,n} as follows:

    * first, turn x{0,...} into x{1,...}|empty

    * next, knowing we have m>0 and n>0, turn x{m,n} into x{m-1,n-1}x, with
    capturing parens in only the second x

    The reference to capturing parens is telling --- he wasn't thinking about
    back-refs when he wrote this. What the code is actually doing is pushing
    all the "messy" stuff (that is, subREs possibly involving capture or
    backref) into the second child of the concatenation subRE it creates here.
    The first child subRE, for "x{m-1,n-1}", is always implemented as a plain
    DFA node. Now, that's fine for messiness that consists of capturing
    parens; capturing parens don't affect match behavior, so there's no need
    to consider them in the first child subRE. (What this implementation
    effectively says is that if you write capturing parens inside a quantifier,
    say "(x)+", what will be captured is the text that matched the last
    iteration of the quantifier. No objection from me, though I do not see
    this behavioral detail documented in the re_syntax man page ... and come to
    think of it, I wonder whether the POSIX regex spec has anything to say
    about the subject.)

    BUT: that logic falls down completely when you consider back-refs.
    A back-ref absolutely does affect match behavior, but what the first child
    subRE has to work with is only a copy of the DFA for the back-ref's
    referent expression. That is, if you write something like
    (\w+)(\s+\1)+
    (intended to match 2 or more repetitions of the same word), the actual
    behavior you get is equivalent to
    (\w+)(\s+\w+)*(\s+\1)
    ie, the last word is checked to be equal to the first, but any words in the
    middle are only checked to be words, because the subRE for "x{m-1,n-1}" is
    only a DFA for "(\s+\w+)*".

    Note that this only comes into play when the back-ref is a part of a larger
    quantified expression --- a quantifier attached directly to a back-ref uses
    a different code path, which has a problem with "\n*" as I documented
    earlier, but not this more general bug. If you work through the examples
    mentioned so far, I think they are all explained by one issue or the other.

    I'm inclined to think that the way to attack this is to invent an
    "iteration" subRE type, drop the two compilation-time transforms mentioned
    above in favor of just plopping an iteration subRE atop any messy
    expression that has to be quantified, and do the search work
    straightforwardly at runtime. However, this looks like a sizable amount of
    work that will result in a very nontrivial patch. So before starting, I'm
    soliciting some feedback about whether this seems like a sane solution that
    would get accepted.

     
  • You comprehend this far more deeply than I do, but yes, if it works we'll accept it. Do you want commit privileges to the repository so you can work on a public branch?

     
  • Tom Lane
    Tom Lane
    2012-02-14

    Yeah, that seems more reliable than trying to pass patches through this unfriendly tracker. What branch should I use?

     
  • I've committed tests (14.21 through 14.23 in reg.test) to the core-8-5-branch and trunk that characterize this bug. The one that _really_ characterizes the bug is marked with knownBug.

    The best place to start committing will be off the core-8-5-branch tip since we prefer to "merge forward"; contact me if you need any account setting up on core.tcl.tk

     
  • A side note, commenting on "But there are two significant regex features that DFA can't handle: capturing parentheses and back-references."

    While I agree with the above on the back-references the capturing parentheses can actually be handled by a DFA style RE matcher. Russ Cox, who wrote about the performance differences between automata and backtracking RE matchers at the beginning of 2007 wrote two followup articles about this a few years later, as I found out last week.

    His article describing how to handle capturing parens is at http://swtch.com/~rsc/regexp/regexp2.html, written at the end of 2009.