#3932 Regexp error

Milestone: obsolete: 8.4.15
Status: closed-invalid
Owner: Don Porter
Priority: 5
Updated: 2008-03-26
Created: 2008-02-25
Private: No

Hello,

I'm using the standard version 8.4.15 of Tcl on an Ubuntu 7.10 (Gutsy) i386 operating system.

Code:

set line "is_a: GO:0048308 ! organelle inheritance"
regexp {^((?:\\:|[^:])+?): ((?:\\!|[^!])*)} $line regMatch tag tagValue

$tag contains "is_a" as expected, but $tagValue is empty (it should be "GO:0048308 "), as if the last * operator were acting as the non-greedy *? operator.

I have tested this pattern in different languages and tools, and it worked there.

Cheers,
Nicolas

Discussion

  • Don Porter
    2008-02-26

    Logged In: YES
    user_id=80530
    Originator: NO

    You do know that different languages
    have different specifications for what
    they call "regexp"s, right? Especially
    out in the vaguely specified realm of
    greedy/non-greedy extensions to the
    core regular expression functionality.

    Can you explain in terms of Tcl's
    regexp documentation

    http://www.tcl.tk/man/tcl8.4/TclCmd/re_syntax.htm

    what you expect your pattern to match, and
    how you see it fail?
    Then we might get to the core of your
    bug, or discover there's no bug at all.

     
  • Logged In: YES
    user_id=79902
    Originator: NO

    FWIW, with that RE (and Tcl's RE engine) the right thing to do is to make the '+' quantifier greedy. Indeed, (non-)greediness isn't doing anything for you there because there's a strong bound on the matched string.

     
  • Logged In: YES
    user_id=2019382
    Originator: YES

    >dkf: I can't make the '+' quantifier greedy because $line can contain something like "def: a definition with a semi colon : right in the middle". In that case $tag would be "def: a definition with a semi colon " instead of "def". Anyway, I have corrected my parser by using two separate regexps.

    >dgp: I have already read re_syntax, and I think the problem is 'A branch has the same preference as the first quantified atom in it which has a preference'. But the question is: why aren't parentheses considered to start new branches? Does '|' have higher precedence?
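The preference rule quoted from re_syntax can be observed with the line and pattern from the original report. This is only a minimal sketch that reuses the reported code: the first quantifier with a preference is the non-greedy +?, so the RE as a whole prefers the shortest match, and the trailing * is left with nothing to match.

```tcl
set line "is_a: GO:0048308 ! organelle inheritance"

# Pattern exactly as in the report: the first quantifier (+?) is
# non-greedy, so the overall match prefers the shortest result.
regexp {^((?:\\:|[^:])+?): ((?:\\!|[^!])*)} $line regMatch tag tagValue

# Show what each variable received.
puts "match=<$regMatch> tag=<$tag> value=<$tagValue>"
```

As described in the report, $tag holds "is_a" and $tagValue is empty; the overall match stops right after ": ".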

     
  • Logged In: YES
    user_id=79902
    Originator: NO

    It can't match past that first colon, since that colon is not preceded by a backslash; your RE will only match its first part up to the first colon without a \ before it, regardless of the greediness. Indeed, in the non-greedy case it could actually stop at the first colon even if there is a \ in front of it! That first clause should be greedy for all RE engines (try matching it against some real examples if you don't believe me).

    The rest of it stems from the fact that the Tcl RE engine is an automata-theoretic engine that includes a strong optimization step, allowing Tcl to efficiently match things that other advanced RE engines (which tend to use recursive matchers behind the scenes) don't handle well at all. This means that the Tcl RE engine's interpretation of (non-)greediness switching is sometimes surprising. It's almost always the right thing to just use better greedy REs and get the anchoring correct; your RE is virtually there already. :-)
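The suggestion above can be tried with a small sketch in which the only change to the reported pattern is +? becoming +. The first two lines come from this thread; the third line, with a backslashed colon in the tag, is a made-up example and not taken from any real file.

```tcl
# Reported pattern with the first quantifier made greedy (+ instead of +?).
set re {^((?:\\:|[^:])+): ((?:\\!|[^!])*)}

foreach line {
    {is_a: GO:0048308 ! organelle inheritance}
    {def: a definition with a semi colon : right in the middle}
    {na\:me: value ! comment}
} {
    # The tag stops at the first colon that has no backslash before it;
    # the value stops at the first "!" that has no backslash before it.
    if {[regexp $re $line -> tag tagValue]} {
        puts "tag=<$tag> value=<$tagValue>"
    }
}
```

With every quantifier greedy there is no mixing of preferences, so the second group is free to take the longest run of characters up to the comment marker.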

     
  • Logged In: YES
    user_id=2019382
    Originator: YES

    >dkf: in fact this is a real example (sorry, but biological file formats can be a nightmare...). The specification of OBO format v1.2 (http://www.geneontology.org/GO.format.obo-1_2.shtml#S.1.3) says that a 'tag' can contain backslashed colons, so the first branch is for backslashed colons and the other is for all characters except a colon. Then I bound the non-greedy expression with ': ' because the 'tag value' can contain colons too... The second part of my expression matches the tag value and trims a possible comment after a possible exclamation mark (and, for fun, the tag value can contain a backslashed '!'). I hope my explanations are clear enough.

    But it's OK. I resolved my problem in a different way. From now on, I won't mix greedy and non-greedy expressions, to avoid being confused :-D
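The submitter's actual two-regexp parser is not shown in this thread. As a rough illustration only, a two-step split that uses nothing but greedy quantifiers might look like the sketch below; the proc name parseTagLine and its return format are assumptions made for this example.

```tcl
# Hypothetical helper (not the submitter's code): split a "tag: value ! comment"
# line in two steps, using only greedy quantifiers.
proc parseTagLine {line} {
    # Step 1: the tag is a run of backslashed colons or non-colon
    # characters, followed by ": "; everything after that is the rest.
    if {![regexp {^((?:\\:|[^:])+): (.*)$} $line -> tag rest]} {
        return {}
    }
    # Step 2: the value is a run of backslashed "!" or non-"!" characters,
    # which drops any trailing "! comment" part.
    regexp {^((?:\\!|[^!])*)} $rest -> value
    return [list $tag $value]
}

set parsed [parseTagLine "is_a: GO:0048308 ! organelle inheritance"]
puts "tag=<[lindex $parsed 0]> value=<[lindex $parsed 1]>"
```

Keeping each step to a single, all-greedy pattern avoids the preference mixing discussed in this thread, at the cost of a second regexp call per line.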

     
  • Don Porter
    2008-03-11

    • assigned_to: pvgoran --> dgp
    • status: open --> pending-invalid
     
    • status: pending-invalid --> closed-invalid
     
  • Logged In: YES
    user_id=1312539
    Originator: NO

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 14 days (the time period specified by
    the administrator of this Tracker).