Hello,
I'm using the standard version 8.4.15 of Tcl on an Ubuntu 7.10 (Gutsy) i386 system.
Code:
set line "is_a: GO:0048308 ! organelle inheritance"
regexp {^((?:\\:|[^:])+?): ((?:\\!|[^!])*)} $line regMatch tag tagValue
$tag contains "is_a" as expected, but $tagValue is empty (it should be "GO:0048308 "), as if the last * operator acted as the non-greedy *? operator.
I have tested this pattern in other languages and tools, and it worked there.
Cheers,
Nicolas
Logged In: YES
user_id=80530
Originator: NO
You do know that different languages
have different specifications for what
they call "regexp"s, right? Especially
out in the vaguely specified realm of
greedy/non-greedy extensions to the
core regular expression functionality.
Can you explain in terms of Tcl's
regexp documentation
http://www.tcl.tk/man/tcl8.4/TclCmd/re_syntax.htm
what you expect your pattern to match, and
how you see it fail?
Then we might get to the core of your
bug, or discover there's no bug at all.
Logged In: YES
user_id=79902
Originator: NO
FWIW, with that RE (and Tcl's RE engine) the right thing to do is to make the '+' quantifier greedy. Indeed, (non-)greediness isn't doing anything for you there because there's a strong bound on the matched string.
Logged In: YES
user_id=2019382
Originator: YES
>dkf: I can't make the '+' quantifier greedy because $line can contain something like "def: a definition with a semi colon : right in the middle". In that case $tag would be "def: a definition with a semi colon " instead of "def". Anyway, I have fixed my parser by using two separate regexps.
>dgp: I have already read re_syntax, and I think the relevant rule is 'A branch has the same preference as the first quantified atom in it which has a preference'. But my question is: why aren't parenthesized subexpressions treated as new branches? Does '|' have higher precedence?
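For reference, the rule quoted above can be checked with a minimal sketch. The variable name `mixed` is only illustrative. Because the first quantified atom in the branch is the non-greedy `+?`, the whole RE takes the non-greedy preference, so the engine settles for the shortest overall match and the trailing `*` consumes nothing.

```tcl
set line "is_a: GO:0048308 ! organelle inheritance"

# The pattern from the original report: +? is the first quantified atom,
# so the branch as a whole prefers the shortest match.
set mixed {^((?:\\:|[^:])+?): ((?:\\!|[^!])*)}

regexp $mixed $line regMatch tag tagValue

# As reported in this thread: tag is "is_a" and tagValue is empty.
puts "regMatch=<$regMatch> tag=<$tag> tagValue=<$tagValue>"
```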
Logged In: YES
user_id=79902
Originator: NO
It can't match past that first colon since no backslash is put first; your RE will only match its first part up to the first colon without a \ before it, regardless of the greediness. Indeed, in the non-greedy case it could actually stop at the first colon even if there is a \ in front of it! That first clause should be greedy for all RE engines (try matching it against some real examples if you don't believe me).
The rest of it stems from the fact that the Tcl RE engine is an automata-theoretic engine that includes a strong optimization step, allowing Tcl to efficiently match things that other advanced RE engines (which tend to use recursive matchers behind the scenes) don't handle well at all. This means that the Tcl RE engine's interpretation of (non-)greediness switching is sometimes surprising. It's almost always the right thing to just use better greedy REs and get the anchoring correct; your RE is virtually there already. :-)
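A minimal sketch of the all-greedy variant suggested here, run against the two lines quoted in this thread plus a third line with backslashed characters. The third line is made up for illustration and is not taken from a real OBO file; the names `greedy` and `results` are likewise only illustrative.

```tcl
# Every quantifier is greedy, so the RE as a whole prefers the longest match.
# The tag still stops at the first colon that has no backslash before it,
# because neither alternative can consume a bare colon.
set greedy {^((?:\\:|[^:])+): ((?:\\!|[^!])*)}

set results {}
foreach line {
    {is_a: GO:0048308 ! organelle inheritance}
    {def: a definition with a semi colon : right in the middle}
    {a\:b: value with \! escaped ! comment}
} {
    if {[regexp $greedy $line -> tag tagValue]} {
        lappend results [list $tag $tagValue]
        puts "tag=<$tag> tagValue=<$tagValue>"
    }
}
# Prints:
#   tag=<is_a> tagValue=<GO:0048308 >
#   tag=<def> tagValue=<a definition with a semi colon : right in the middle>
#   tag=<a\:b> tagValue=<value with \! escaped >
```

Note that the value keeps its trailing space before the comment marker; trimming it, if wanted, is a separate step.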
Logged In: YES
user_id=2019382
Originator: YES
>dkf: In fact this is a real example (sorry, but biological file formats can be a nightmare...). The specification of the OBO format v1.2 (http://www.geneontology.org/GO.format.obo-1_2.shtml#S.1.3) says that a 'tag' can contain backslashed colons, so the first alternative is for backslashed colons and the other is for any character except a colon. Then I bounded the non-greedy expression with ': ' because the 'tag value' can contain colons too... The second part of my expression matches the tag value and trims a possible comment after a possible exclamation mark (and, for fun, the tag value can contain a backslashed '!'). I hope my explanations are clear enough.
But it's OK, I resolved my problem in a different way. From now on, I won't mix greedy and non-greedy expressions, to avoid getting confused :-D
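The poster's actual two-regexp parser is not shown in this thread. The following is only a sketch of what such a two-step approach might look like, using greedy quantifiers throughout; the proc name `parseTagLine` and the final `string trim` are assumptions made for illustration.

```tcl
# Hypothetical helper: split an OBO-style "tag: value ! comment" line.
# Returns a two-element list {tag value}, or an empty list on no match.
proc parseTagLine {line} {
    # Step 1: the tag runs up to the first colon without a backslash
    # before it; everything after ": " is kept for the second step.
    if {![regexp {^((?:\\:|[^:])+): (.*)$} $line -> tag rest]} {
        return {}
    }
    # Step 2: the value runs up to the first '!' without a backslash
    # before it; anything after that is treated as a comment.
    regexp {^((?:\\!|[^!])*)} $rest -> tagValue
    return [list $tag [string trim $tagValue]]
}

set parsed [parseTagLine "is_a: GO:0048308 ! organelle inheritance"]
puts "tag=<[lindex $parsed 0]> value=<[lindex $parsed 1]>"
# Prints: tag=<is_a> value=<GO:0048308>
```

Keeping each pattern purely greedy avoids any dependence on how the engine resolves mixed preferences.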
Logged In: YES
user_id=1312539
Originator: NO
This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 14 days (the time period specified by
the administrator of this Tracker).