From: Rademacher, Gunther <Gunther.R<ademacher@so...>  20080116 02:03:25
My understanding was that a regular expression always matches the longest possible substring, unless reluctant quantifiers are used, in which case it matches the shortest possible one.

I have now found that Saxon (tested with both 8.8 and 9.0.01) behaves differently: it chooses the first matching branch of a choice, regardless of length. For example, replace("ABC", "A|AB", "X") returns "XBC", whereas under the longest-match rule it should be "XC". Similarly, when a reluctant quantifier is used, replace("ABC", "(AB|A){1}?", "X") returns "XC", whereas under the shortest-match rule it should be "XBC".

Best regards,
Gunther
From: Michael Kay <mike@sa...>  20080116 08:17:56
This is as specified. See http://www.w3.org/TR/xpath-functions/#func-replace :
 
If two alternatives within the pattern both match at the same position in the $input, then the match that is chosen is the one matched by the first alternative. 
 
This rule also appears in my XPath book, page 448.
 
The "longest match" rule applies only to the interpretation of quantifiers, not to the treatment of alternatives. 
 
Michael Kay http://www.saxonica.com/ 
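
The distinction above can be reproduced outside Saxon. The following is a minimal sketch using Python's `re` module, which is not the XPath regex flavor but is also documented to try alternatives from left to right; it is offered only as an analogy, not as a description of Saxon's implementation.

```python
import re

# Alternation: the first alternative that matches wins, even if a later
# alternative would match a longer substring at the same position.
first_wins = re.sub(r"A|AB", "X", "ABC")        # "A" is tried first
reordered = re.sub(r"AB|A", "X", "ABC")         # "AB" is tried first

# The reluctant quantifier does not change which alternative is tried first.
reluctant = re.sub(r"(AB|A){1}?", "X", "ABC")

# Quantifiers are where greedy (longest) versus reluctant (shortest) applies.
greedy = re.sub(r"A+", "X", "AAAB")
lazy = re.sub(r"A+?", "X", "AAAB")

print(first_wins)  # XBC
print(reordered)   # XC
print(reluctant)   # XC
print(greedy)      # XB
print(lazy)        # XXXB
```

Reordering the alternatives changes the result, while switching between greedy and reluctant quantifiers only affects how many repetitions are consumed.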
From: Rademacher, Gunther <Gunther.R<ademacher@so...>  20080116 12:28:17
Thanks for explaining this to me. Obviously I missed the "first match" rule, and I must admit that I don't like it. In fact, I was relying on the "longest match" rule in a piece of code that composes more complex regular expressions from simpler ones.

As an example, consider the Wildcard production from the XQuery spec:

[80] Wildcard ::= "*" | (NCName ":" "*") | ("*" ":" NCName)

Suppose you want to transform this into a regular expression that matches the longest possible input. You might end up with something like "^(\*|N:\*|\*:N)", where N is the subexpression corresponding to NCName. Under the "first match" rule, I can't see a way to express "the longest of either" in a single regular expression without understanding how the subexpressions might overlap.

Of course the production can be rewritten to avoid the overlap completely, but my general approach breaks here, because I cannot properly map the "|" EBNF operator.

Best regards,
Gunther
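
One possible workaround for the problem described above, when a single regular expression is not strictly required, is to try each alternative separately and keep the longest match. The sketch below uses Python's `re` module as a stand-in for the XPath regex flavor. `NCNAME` is a simplified, ASCII-only approximation of NCName, and the function names are hypothetical; none of this comes from the original code.

```python
import re

# Simplified, ASCII-only stand-in for NCName (an assumption for illustration;
# the real production allows many more characters).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.-]*"

# The three alternatives of the Wildcard production, in EBNF order.
WILDCARD_ALTERNATIVES = [
    r"\*",
    NCNAME + r":\*",
    r"\*:" + NCNAME,
]


def first_alternative_match(alternatives, text):
    """Join the alternatives with '|' and match at the start of text."""
    combined = "(?:" + "|".join(alternatives) + ")"
    m = re.match(combined, text)
    return m.group(0) if m else None


def longest_alternative_match(alternatives, text):
    """Match each alternative on its own and keep the longest result."""
    best = None
    for alt in alternatives:
        m = re.match("(?:" + alt + ")", text)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best


print(first_alternative_match(WILDCARD_ALTERNATIVES, "*:foo"))    # *
print(longest_alternative_match(WILDCARD_ALTERNATIVES, "*:foo"))  # *:foo
```

This only compares the top-level alternatives; any choice nested inside a subexpression is still resolved by the first-match rule. For this particular production, listing the bare `\*` alternative last would also give the longest match, but that relies on knowing how the alternatives overlap.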