From: Rademacher, Gunther <Gunther.R<ademacher@so...>  2008-01-16 02:03:25

My understanding was that a regular expression will always match the longest possible substring, unless the reluctant quantifiers are used, in which case it will match as short as possible.

Now I found that Saxon (tested with both 8.8 and 9.0.01) behaves differently, in that it chooses the first matching branch of a choice, regardless of the length consideration, e.g.

    replace("ABC", "A|AB", "X")

returns "XBC", but when using the longest match, it should be "XC". Similarly, when the reluctant quantifier is used,

    replace("ABC", "(AB|A){1}?", "X")

returns "XC", but with the shortest possible match, it should be "XBC".

Best regards,
Gunther

Software AG - Sitz/Registered office: Uhlandstraße 12, 64297 Darmstadt, Germany - Registergericht/Commercial register: Darmstadt HRB 1562 - Vorstand/Management Board: Karl-Heinz Streibich (Vorsitzender/Chairman), David Broadbent, Mark Edwards, Dr. Peter Kürpick, David Mitchell, Arnd Zinnhardt - Aufsichtsratsvorsitzender/Chairman of the Supervisory Board: Frank F. Beelitz - http://www.softwareag.com
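For readers who want to reproduce the two results reported above without Saxon, here is a minimal sketch. It assumes Python's `re` module as a stand-in; it is not Saxon's engine, but it also picks the first alternative that matches rather than the longest one, so it gives the same two results. The variable names are illustrative only.

```python
import re

# Assumption: Python's re module stands in for Saxon's regex engine here.
# Both try the alternatives of a choice from left to right.

# "A" is tried first and succeeds, so the longer "AB" is never considered.
first_branch = re.sub(r"A|AB", "X", "ABC")

# {1}? asks for exactly one repetition, so being reluctant changes nothing;
# "AB" is still the first alternative that matches.
reluctant = re.sub(r"(AB|A){1}?", "X", "ABC")

print(first_branch)  # XBC
print(reluctant)     # XC
```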
From: Michael Kay <mike@sa...>  2008-01-16 08:17:56

This is as specified. See http://www.w3.org/TR/xpath-functions/#func-replace :

    If two alternatives within the pattern both match at the same position in the $input, then the match that is chosen is the one matched by the first alternative.

This rule also appears in my XPath book, page 448.

The "longest match" rule applies only to the interpretation of quantifiers, not to the treatment of alternatives.

Michael Kay
http://www.saxonica.com/
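The distinction between quantifiers and alternatives can be illustrated with a small sketch. It again assumes Python's `re` module as a stand-in for an XPath regex engine; the patterns and variable names are made up for the illustration.

```python
import re

# Assumption: Python's re module is used only to illustrate the rule.

# Quantifiers: greedy takes as many repetitions as possible,
# reluctant takes as few as possible.
greedy = re.sub(r"A+", "X", "AAAB")    # one match covering "AAA"
lazy = re.sub(r"A+?", "X", "AAAB")     # three matches of a single "A"

# Alternatives: the order in which they are written decides the match,
# not their length.
short_first = re.sub(r"A|AB", "X", "ABC")
long_first = re.sub(r"AB|A", "X", "ABC")

print(greedy, lazy)             # XB XXXB
print(short_first, long_first)  # XBC XC
```

Writing the longer alternative first gives the longest match in this small case, but only because it is known in advance how the two alternatives overlap.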
From: Rademacher, Gunther <Gunther.R<ademacher@so...>  2008-01-16 12:28:17

Thanks for explaining this to me. Obviously I missed the "first match" rule, and I must admit that I don't like it. In fact I was relying on the "longest match" rule in a piece of code that composes more complex regular expressions from simpler ones.

As an example, consider the Wildcard production from the XQuery spec:

    [80] Wildcard ::= "*" | (NCName ":" "*") | ("*" ":" NCName)

Suppose you want to transform this into a regular expression that matches the longest possible input; you might end up with something like

    "^(\*|N:\*|\*:N)"

where N is the subexpression corresponding to NCName. Now under the "first match" rule, I can't see a way to express "the longest of either" in a single regular expression without understanding how the subexpressions might overlap.

Of course the production can be rewritten to completely avoid the overlap, but my general approach breaks here, because I cannot properly map the "|" EBNF operator.

Best regards,
Gunther
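One possible way to get "the longest of either" for the Wildcard production without analysing how the alternatives overlap is to step outside a single regular expression: match each alternative separately at the same position and keep the longest result. The sketch below is in Python rather than XPath, and everything in it is an assumption made for illustration: the `longest_match` helper is hypothetical, and `NCNAME` is a simplified ASCII-only approximation of the real NCName production.

```python
import re

# Assumption: simplified, ASCII-only approximation of NCName.
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# One pattern per branch of: "*" | (NCName ":" "*") | ("*" ":" NCName)
WILDCARD_ALTERNATIVES = [r"\*", NCNAME + r":\*", r"\*:" + NCNAME]

def longest_match(alternatives, text, pos=0):
    """Hypothetical helper: try every alternative at pos, keep the longest."""
    best = None
    for alt in alternatives:
        m = re.compile(alt).match(text, pos)
        if m is not None and (best is None or m.end() > best.end()):
            best = m
    return best

# A single combined pattern stops at the first alternative that matches.
combined = re.match("|".join(WILDCARD_ALTERNATIVES), "*:foo")
print(combined.group())                                        # *
print(longest_match(WILDCARD_ALTERNATIVES, "*:foo").group())   # *:foo
```

This keeps the mapping from the EBNF "|" operator mechanical, at the cost of running one match per alternative instead of one match overall.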