Thanks for explaining this to me. Obviously I missed the "first match" rule, and
I must admit that I don't like it. In fact I was relying on the "longest match" rule
in a piece of code that composes more complex regular expressions from
simpler ones.

As an example, consider the Wildcard production from the XQuery spec:

[80] Wildcard ::= "*"
                | (NCName ":" "*")
                | ("*" ":" NCName )

Suppose you want to transform this into a regular expression that matches
the longest possible input. You might end up with something like

"^(\*|N:\*|\*:N)"

where N is the subexpression corresponding to NCName. Now under the "first
match" rule, I can't see a way to express "the longest of either" in a single
regular expression without understanding how the subexpressions might
overlap.

Of course the production can be rewritten to completely avoid the overlap, but
my general approach breaks here, because I cannot properly map the "|" EBNF
operator.
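
One way to get "the longest of either" without analysing the overlaps is to keep the "|" out of the regular expression altogether: match each alternative separately at the same position and keep the longest result. Below is a minimal sketch in Python, whose re module also picks the first matching alternative. The helper longest_match and the ASCII-only NCNAME pattern are made up for illustration; the latter is a simplification, not the real NCName definition.

```python
import re

# Simplified, ASCII-only stand-in for NCName (an assumption for this sketch).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# The three alternatives of the Wildcard production, kept as separate patterns.
WILDCARD = [r"\*", NCNAME + r":\*", r"\*:" + NCNAME]


def longest_match(alternatives, text, pos=0):
    """Match every alternative at pos and return the longest match, or None."""
    best = None
    for alt in alternatives:
        m = re.compile(alt).match(text, pos)
        if m is not None and (best is None or m.end() > best.end()):
            best = m
    return best


# Joining the alternatives with "|" gives first-match behaviour.
combined = "|".join(WILDCARD)
first = re.match(combined, "*:foo").group()           # "*"

# Trying them one by one and keeping the longest gives the EBNF reading.
longest = longest_match(WILDCARD, "*:foo").group()    # "*:foo"
```

Each alternative is still matched with the engine's own rules, so a choice nested inside an alternative would need the same treatment.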

Best regards,
Gunther



From: saxon-help-bounces@lists.sourceforge.net [mailto:saxon-help-bounces@lists.sourceforge.net] On Behalf Of Michael Kay
Sent: Wednesday, January 16, 2008 9:18 AM
To: 'Mailing list for SAXON XSLT queries'
Subject: Re: [saxon] Size of matches of a regular expression

This is as specified. See http://www.w3.org/TR/xpath-functions/#func-replace :
 
If two alternatives within the pattern both match at the same position in the $input, then the match that is chosen is the one matched by the first alternative.
 
This rule also appears in my XPath book - page 448.
 
The "longest match" rule applies only to the interpretation of quantifiers, not to the treatment of alternatives.
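
The difference between the two rules can be seen in a small sketch. It uses Python's re module rather than Saxon, on the assumption that both treat quantifiers and alternatives the same way in these simple cases; the input string and variable names are illustrative only.

```python
import re

text = "AAAB"

# Quantifiers: greedy takes as much as possible, reluctant as little as possible.
greedy = re.match(r"A+", text).group()       # "AAA"
reluctant = re.match(r"A+?", text).group()   # "A"

# Alternatives: the first one that matches wins, whatever its length.
first_alt = re.match(r"A|AA", text).group()  # "A"
reordered = re.match(r"AA|A", text).group()  # "AA"
```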
 
Michael Kay
http://www.saxonica.com/


From: saxon-help-bounces@lists.sourceforge.net [mailto:saxon-help-bounces@lists.sourceforge.net] On Behalf Of Rademacher, Gunther
Sent: 16 January 2008 02:03
To: saxon-help@lists.sourceforge.net
Subject: [saxon] Size of matches of a regular expression

My understanding was that a regular expression always matches the longest
possible substring, unless reluctant quantifiers are used, in which case it
matches the shortest possible one.

Now I found that Saxon (tested with both 8.8 and 9.0.01) behaves differently, in
that it chooses the first matching branch of a choice, regardless of the length
consideration, e.g.

        replace("ABC", "A|AB", "X")

returns "XBC", but when using the longest match, it should be "XC". Similarly,
when the reluctant quantifier is used,

        replace("ABC", "(AB|A){1}?", "X")

returns "XC", but with the shortest possible match, it should be "XBC".
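
The same two calls can be reproduced outside Saxon. The sketch below uses Python's re.sub as a stand-in for replace(), on the assumption that it resolves alternatives the same way; it is not Saxon itself, and the variable names are illustrative.

```python
import re

# Equivalent of replace("ABC", "A|AB", "X"): the first alternative "A" is taken.
first = re.sub(r"A|AB", "X", "ABC")              # "XBC"

# Equivalent of replace("ABC", "(AB|A){1}?", "X"): {1}? still needs exactly one
# repetition, and the first alternative "AB" is taken.
reluctant = re.sub(r"(AB|A){1}?", "X", "ABC")    # "XC"

# Listing the longer alternative first yields the "longest" result here.
longest_first = re.sub(r"AB|A", "X", "ABC")      # "XC"
```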

Best regards,
Gunther

