Re: [saxon] Size of matches of a regular expression


 [saxon] Size of matches of a regular expression From: Rademacher, Gunther - 2008-01-16 02:03:25
```
My understanding was that a regular expression will always match the longest
possible substring, unless the reluctant quantifiers are used, in which case it
will match as short as possible.

Now I found that Saxon (tested with both 8.8 and 9.0.01) behaves differently, in
that it chooses the first matching branch of a choice, regardless of the length
consideration, e.g.

    replace("ABC", "A|AB", "X")

returns "XBC", but when using the longest match, it should be "XC". Similarly,
when the reluctant quantifier is used,

    replace("ABC", "(AB|A){1}?", "X")

returns "XC", but with the shortest possible match, it should be "XBC".

Best regards,
Gunther

Software AG - Sitz/Registered office: Uhlandstraße 12, 64297 Darmstadt, Germany, - Registergericht/Commercial register: Darmstadt HRB 1562 - Vorstand/Management Board: Karl-Heinz Streibich (Vorsitzender/Chairman), David Broadbent, Mark Edwards, Dr. Peter Kürpick, David Mitchell, Arnd Zinnhardt; - Aufsichtsratsvorsitzender/Chairman of the Supervisory Board: Frank F. Beelitz - http://www.softwareag.com
```
 Re: [saxon] Size of matches of a regular expression From: Michael Kay - 2008-01-16 08:17:56
```
This is as specified. See http://www.w3.org/TR/xpath-functions/#func-replace :

    If two alternatives within the pattern both match at the same position in
    the $input, then the match that is chosen is the one matched by the first
    alternative.

This rule also appears in my XPath book - page 448.

The "longest match" rule applies only to the interpretation of quantifiers, not
to the treatment of alternatives.

Michael Kay
http://www.saxonica.com/
```
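The distinction between alternatives and quantifiers can be tried out in a small sketch. It uses Python's `re` module as a stand-in, on the assumption that it applies the same first-alternative rule as the XPath `replace` function quoted above; it is not Saxon itself, and the variable names are only illustrative.

```python
import re

# Alternation: the first alternative that matches at a position wins,
# even when a later alternative would match more text.
first_alt = re.sub(r"A|AB", "X", "ABC")        # "XBC"

# Listing the longer alternative first yields the longer match.
longer_first = re.sub(r"AB|A", "X", "ABC")     # "XC"

# Quantifiers: a greedy quantifier takes as much as it can ...
greedy = re.sub(r"AB?", "X", "ABC")            # "XC"
# ... and a reluctant quantifier takes as little as it can.
reluctant = re.sub(r"AB??", "X", "ABC")        # "XBC"

# {1}? still requires exactly one repetition, so the reluctant marker has
# nothing to shorten; the choice inside the group is again "first alternative".
exactly_once = re.sub(r"(AB|A){1}?", "X", "ABC")   # "XC"

print(first_alt, longer_first, greedy, reluctant, exactly_once)
```

Under that assumption, both results reported in the original question follow from the ordering of the alternatives, not from any length-based rule.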
 Re: [saxon] Size of matches of a regular expression From: Rademacher, Gunther - 2008-01-16 12:28:17
```
Thanks for explaining this to me. Obviously I missed the "first match" rule,
and I must admit that I don't like it. In fact I was relying on the "longest
match" rule in a piece of code that composes more complex regular expressions
from simpler ones.

As an example, consider the Wildcard production from the XQuery spec:

    [80] Wildcard ::= "*" | (NCName ":" "*") | ("*" ":" NCName)

Suppose you want to transform this into a regular expression that matches the
longest possible input; you might end up with something like

    "^(\*|N:\*|\*:N)"

where N is the subexpression corresponding to NCName. Now under the "first
match" rule, I can't see a way to express "the longest of either" in a single
regular expression without understanding how the subexpressions might overlap.

Of course the production can be rewritten to completely avoid the overlap, but
my general approach breaks here, because I cannot properly map the "|" EBNF
operator.

Best regards,
Gunther
```
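One way to get "the longest of either" without analysing how the subexpressions overlap is to leave the single-regex approach: match each alternative separately at the same position and keep the longest result. The sketch below is not from the thread. It is written in Python, the helper `longest_alternative` is hypothetical, and `N` is a deliberately simplified ASCII-only stand-in for NCName.

```python
import re

def longest_alternative(alternatives, text, pos=0):
    """Return the longest match among the alternatives at pos, or None."""
    best = None
    for pattern in alternatives:
        m = re.compile(pattern).match(text, pos)
        # Keep a match only if it ends later than the best one so far.
        if m is not None and (best is None or m.end() > best.end()):
            best = m
    return best

# Simplified stand-in for NCName (assumption: ASCII names only).
N = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# One pattern per branch of: "*" | (NCName ":" "*") | ("*" ":" NCName)
WILDCARD = [r"\*", N + r":\*", r"\*:" + N]

text = "*:foo bar"

# Single regex with "|": the first alternative "*" wins.
single = re.match("^(" + "|".join(WILDCARD) + ")", text)
print(single.group(0))   # *

# Separate matches: the longest branch wins regardless of order.
longest = longest_alternative(WILDCARD, text)
print(longest.group(0))  # *:foo
```

The cost is one match attempt per branch, and the technique only maps the top-level "|" directly; a choice nested inside a larger expression would need the same treatment applied at that level.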