From: SourceForge.net <no...@so...> - 2010-08-11 21:20:29
Bugs item #2826551, was opened at 2009-07-24 14:53
Message generated for change (Comment added) made by lars_h
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=2826551&group_id=10894

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: 43. Regexp
Group: current: 8.6b1
Status: Open
Resolution: None
Priority: 9
Private: No
Submitted By: Lars Hellström (lars_h)
Assigned to: Pavel Goran (pvgoran)
Summary: Line-sensitivity at -start, but also -all

Initial Comment:
I was going to use [regexp -all -inline] to split a unified diff into chunks, but found that this would skip every second chunk. This seems to be related to ^ not matching the beginning of line if the newline is before the -start position.

SMALL EXAMPLE:

set data {
@1
 2
+3
@4
-5
+6
 7
@8
 9
}
regexp -all -inline {(?n)^@.*\n(?:[^@].*\n?)*} $data

RETURNS

{@1
 2
+3
} {@8
 9
}

RATHER THAN

{@1
 2
+3
} {@4
-5
+6
 7
} {@8
 9
}

EXPLANATION:

A diff chunk consists of one header line beginning with @, followed by a body consisting of lines not beginning with @ (but rather space, plus, or minus). The header line is thus in line-sensitive mode matched by ^@.* and a body line is matched by ^[^@].*, so a sensible combined regexp is

(?n)^@.*\n(?:[^@].*\n?)*

Each chunk is indeed matched by this, but when -all is used, only every second chunk is found. In this case, dropping the ^ from the regexp allows all chunks to be found:

% regexp -all -inline {(?n)@.*\n(?:[^@].*\n?)*} $data
{@1
 2
+3
} {@4
-5
+6
 7
} {@8
 9
}

This suggests the problem may be that ^ fails to match when at the -start position, as one obvious way to implement [regexp -all] is as the equivalent of a loop around [regexp -start].
Indeed,

% regexp -start 1 -indices {(?n)^a} "\nab" match
0
% regexp -start 0 -indices {(?n)^a} "\nab" match
1
% set match
1 1

----------------------------------------------------------------------

>Comment By: Lars Hellström (lars_h)
Date: 2010-08-11 23:20

Message:
To answer dgp's question: the bug was originally encountered under 8.5 (possibly even 8.4), so a fix there would be nice too. Given how little tends to change in the regexp engine, I would be surprised if fixing 8.5 is different from fixing 8.6.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2010-08-09 18:23

Message:
Status? Is/was this bug on the 8.6 branch only? Or is there still a need to fix a bug for new 8.5.* releases too?

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2010-02-11 12:15

Message:
That patch is now applied.

----------------------------------------------------------------------

Comment By: Mo DeJong (mdejong)
Date: 2010-02-09 12:28

Message:
Added patch 2948425 to address this issue.

----------------------------------------------------------------------

Comment By: Joachim Kock (jkock)
Date: 2009-10-16 00:25

Message:
Here is another description of the bug, and a contrast:

In regexp with the -all option, it is clear that the matches must not overlap. But what is the rule if the expression starts with a zero-width constraint, such as ^ or \m?

To my surprise, you cannot find a linestart after a matched \n:

% regexp -all -inline -indices {(?n)^a\n} "a\na\na\n"
{0 1} {4 5}

The behaviour I expected is the one seen with \m instead of ^: you _can_ find a wordstart after a matched \n:

% regexp -all -inline -indices {(?n)\ma\n} "a\na\na\n"
{0 1} {2 3} {4 5}

So somehow the regexp engine is allowed to look back to see that a \n preceded and hence constitutes a word boundary, but it is not allowed to look back and detect that this same \n also constitutes a line boundary...
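
For interpreters where ^ is not recognised at the position where [regexp -all] restarts, the chunk splitting from the initial comment can be done without a regexp at all. The following is only a sketch of such a workaround, not something taken from this thread or from the patch: it swaps the regexp for a line-by-line scan, and the proc name splitDiffChunks is made up for illustration. It assumes, like the original regexp, that a chunk is a header line starting with @ followed by non-empty lines not starting with @.

```tcl
# Hypothetical helper (not part of Tcl, not from the patch): split a
# unified diff into chunks by scanning lines instead of using
# [regexp -all], so it does not depend on how ^ behaves at -start.
proc splitDiffChunks {data} {
    set chunks {}
    set current ""
    set inChunk 0
    foreach line [split $data \n] {
        if {[string match "@*" $line]} {
            # A header line closes any open chunk and starts a new one.
            if {$inChunk} {lappend chunks $current}
            set current "$line\n"
            set inChunk 1
        } elseif {$inChunk && $line ne ""} {
            # A non-empty body line (space, plus, or minus) extends the chunk.
            append current "$line\n"
        } elseif {$inChunk} {
            # An empty line ends the chunk, as [^@] cannot match a newline
            # in the (?n) regexp either.
            lappend chunks $current
            set inChunk 0
        }
    }
    if {$inChunk} {lappend chunks $current}
    return $chunks
}

set data "@1\n 2\n+3\n@4\n-5\n+6\n 7\n@8\n 9\n"
puts [llength [splitDiffChunks $data]]
```

With the data above this finds all three chunks. One difference from the regexp: every line of a chunk is stored with a trailing newline, so a final line that lacked one in the input gains it.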