Menu

#1566 Capturing parens are very slow

obsolete: 8.4.4
open
3
2003-11-13
2001-07-31
Anonymous
No

See attached archive (sample script with text data).
If you can't see attached file, look at
http://ifast.narod.ru/files/regbug.zip

Discussion

  • miguel sofer

    miguel sofer - 2001-07-31
    • labels: 105659 --> 43. Regexp
     
  • miguel sofer

    miguel sofer - 2001-07-31

    Logged In: YES
    user_id=148712

    Bug confirmed: tclsh seems to get in an infinite loop.
    Attaching the file now

     
  • miguel sofer

    miguel sofer - 2001-07-31
     
  • miguel sofer

    miguel sofer - 2001-07-31

    Logged In: YES
    user_id=148712

    The regular expression

    {<script(.(?!</script>))*?(doubleclick|flycast|burstnet|spylog)+?.*?</script>}
    is invalid: it contains a constraint (the lookahead)
    followed by a quantifier, which is expressly forbidden in
    the man page for re_syntax.

    I think the author actually wanted something like
    {<script
    (?=.*</script>).*?(doubleclick|flycast|burstnet|spylog).*?</script>}
    which runs OK (albeit slowly; there is probably a better way
    to express this).

    OTOH, the regexp engine should complain about the bad syntax
    and return an error, instead of getting into an infinite
    loop.

     
  • Sergey Kuzmin

    Sergey Kuzmin - 2001-08-03

    Logged In: YES
    user_id=286083

    >The regular expression
    >{<script(.(?!</script>))*?
    >(doubleclick|flycast|burstnet|spylog)+?.*?</script>}
    >is invalid: it contains a constraint (the lookahead)
    >followed by a quantifier, which is expressly forbidden in
    >the man page for re_syntax.

    You aren't right. If you follow lookahead constraint
    directly (NB!) with quantifier, interpreter reports an
    error. And my construction works fine (interpreter doesn't
    report any error), because lookahead atom doesn't follow
    any quantifier. But works quickly only on small buffer.
    Also, if you wait a long-long time (you can decrease text
    size between <script> and </script> in example),
    interpreter un-hungs, and return correct result.

    >I think the author actually wanted something like
    >{<script(?=.*</script>).*?
    >(doubleclick|flycast|burstnet|spylog).*?</script>}
    >which runs OK (albeit slowly; there is probably a
    >better way to express this).

    You aren't right again. :)
    Try:
    # my expression
    set my {<script(.(?!</script>))*?(123)+?.*?</script>}
    # your expression
    set mo {<script(?=.*</script>).*?(123).*?</script>}
    # sample text
    set text {<script> 123 </script> <script> 345 </script>
    <script> 123 </script>}

    # Correct result
    regsub -all $my $text --- res
    2
    set res
    --- <script> 345 </script> ---

    # Not correct result:
    regsub -all $mo $text --- res
    2
    set res
    --- ---

    Btw, if you know, how to get first result without using of
    constraint lookahead with non-greedy quantifier, let me
    know. :)

    >OTOH, the regexp engine should complain about the
    >bad syntax
    >and return an error, instead of getting into an infinite
    >loop.

    IMHO, itsn't infinite loop, but only very slow work with
    lookahead constraints.

     
  • miguel sofer

    miguel sofer - 2001-08-06

    Logged In: YES
    user_id=148712

    Indeed, I am both wrong and sorry.

    The "capturing parens" are very costly in this case;
    replacing them with the "non-capturing" variant produces the
    desired result. The regex becomes then

    {<script(?:.(?!</script>))*?(?:doubleclick|flycast|burstnet|spylog)+?.*?</script>}

     
  • miguel sofer

    miguel sofer - 2001-08-13
    • milestone: 102437 -->
     
  • miguel sofer

    miguel sofer - 2001-08-13
    • priority: 5 --> 3
    • summary: Interpreter hungs on regexp --> Capturing parens are very slow
     
  • miguel sofer

    miguel sofer - 2001-08-13

    Logged In: YES
    user_id=148712

    The originally submitted script does perform as intended,
    only very slowly. The use of non-capturing parens solves the
    issue.

    It now seems clear that the bug is really a performance and
    not a functionality issue.

    I am therefore changing the summary description and reducing
    the priority.

     
  • Don Porter

    Don Porter - 2003-11-12
    • assigned_to: nobody --> pvgoran
     
  • Don Porter

    Don Porter - 2003-11-13
    • milestone: --> obsolete: 8.4.4