Thanks for explaining this to me. Obviously I missed the "first match" rule, and I must admit that I don't like it. In fact, I was relying on the "longest match" rule in a piece of code that composes more complex regular expressions from simpler ones.

As an example, consider the Wildcard production from the XQuery spec:

[80] Wildcard ::= "*"
                | (NCName ":" "*")
                | ("*" ":" NCName)

Suppose you want to transform this into a regular expression that matches the longest possible input. You might end up with something like

"^(\*|N:\*|\*:N)"

where N is the subexpression corresponding to NCName. Now under the "first match" rule, I can't see a way to express "the longest of either" in a single regular expression without understanding how the subexpressions might overlap.
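To make the problem concrete, here is a minimal sketch in Python, whose `re` module also picks the first matching alternative. It is not Saxon, and `NCNAME` is a simplified, ASCII-only stand-in for the real NCName production; both `NCNAME` and `matched_prefix` are names invented for this illustration.

```python
import re

# Assumption: simplified, ASCII-only approximation of NCName.
NCNAME = r"[A-Za-z_][A-Za-z0-9_.-]*"

# Direct translation of the Wildcard production, alternatives in EBNF order.
WILDCARD = re.compile(rf"^(\*|{NCNAME}:\*|\*:{NCNAME})")

def matched_prefix(text):
    """Return the prefix of text matched by WILDCARD, or None if nothing matches."""
    m = WILDCARD.match(text)
    return m.group(0) if m else None

# The first alternative "*" succeeds, so the longer "*:local" is never tried.
print(matched_prefix("*:local"))   # *
# No overlap here: only the second alternative can match.
print(matched_prefix("prefix:*"))  # prefix:*
```

The input `*:local` is a valid Wildcard as a whole, yet only its first character is consumed, because nothing after the alternation forces the engine to backtrack into the later branches.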

Of course the production can be rewritten to completely avoid the overlap, but my general approach breaks here, because I cannot properly map the "|" EBNF operator.
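Two possible ways around this can be sketched in Python (again using a simplified, ASCII-only `NCNAME` as an assumption). The first is a hand-rewritten pattern in which no alternative is a prefix of another. The second is a hypothetical helper, `longest_of`, that matches each alternative separately and keeps the longest result. Neither is taken from the original code, and the helper only covers a top-level choice, not alternations nested inside a larger expression.

```python
import re

# Assumption: simplified, ASCII-only approximation of NCName.
NCNAME = r"[A-Za-z_][A-Za-z0-9_.-]*"

# Rewritten by hand: the bare "*" case becomes "*" with an optional ":NCName" tail.
REWRITTEN = re.compile(rf"^({NCNAME}:\*|\*(:{NCNAME})?)")

def longest_of(alternatives, text):
    """Match each alternative at the start of text; return the longest matched prefix or None."""
    best = None
    for pattern in alternatives:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best

# The three branches of the Wildcard production, kept as separate patterns.
WILDCARD_ALTERNATIVES = [r"\*", rf"{NCNAME}:\*", rf"\*:{NCNAME}"]

print(REWRITTEN.match("*:local").group(0))           # *:local
print(longest_of(WILDCARD_ALTERNATIVES, "*:local"))  # *:local
print(longest_of(WILDCARD_ALTERNATIVES, "*"))        # *
```

The rewrite requires knowing how the branches overlap, which is exactly the knowledge a generic composer lacks. The helper avoids that, at the cost of leaving the single-regular-expression model.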

Best regards,

Gunther

This is as specified. See http://www.w3.org/TR/xpath-functions/#func-replace :

If two alternatives within the pattern both match at the same position in the `$input`, then the match that is chosen is the one matched by the first alternative.

This rule also appears in my XPath book, on page 448.

The "longest match" rule applies only to the interpretation
of quantifiers, not to the treatment of alternatives.
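The distinction can be illustrated with a short sketch in Python, whose `re` module follows the same convention of greedy quantifiers and first-match alternation. This is an analogy rather than Saxon itself, and the variable names are chosen only for this example.

```python
import re

# Alternation: the first alternative that matches wins, whatever its length.
first_alt_short = re.sub(r"A|AB", "X", "ABC")   # "A" is tried first and matches
first_alt_long = re.sub(r"AB|A", "X", "ABC")    # reordered: "AB" is tried first

# Quantifiers: greedy takes as much as possible, reluctant as little as possible.
greedy = re.sub(r"AB?", "X", "ABC")             # optional "B" is consumed
reluctant = re.sub(r"AB??", "X", "ABC")         # optional "B" is skipped

# A reluctant quantifier governs only the repetition count, not the choice of branch.
reluctant_group = re.sub(r"(AB|A){1}?", "X", "ABC")

print(first_alt_short, first_alt_long)  # XBC XC
print(greedy, reluctant)                # XC XBC
print(reluctant_group)                  # XC
```

With `{1}?` the group still has to match exactly once, so reluctance has nothing to shorten and the first alternative `AB` is taken.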

Michael Kay

From: saxon-help-bounces@lists.sourceforge.net [mailto:saxon-help-bounces@lists.sourceforge.net] On Behalf Of Rademacher, Gunther
Sent: 16 January 2008 02:03
To: saxon-help@lists.sourceforge.net
Subject: [saxon] Size of matches of a regular expression

My understanding was that a regular expression will always match the longest possible substring, unless the reluctant quantifiers are used, in which case it will match as short as possible.

Now I found that Saxon (tested with both 8.8 and 9.0.01) behaves differently, in that it chooses the first matching branch of a choice, regardless of the length consideration, e.g.

replace("ABC", "A|AB", "X")

returns "XBC", but when using the longest match, it should be "XC". Similarly, when the reluctant quantifier is used,

replace("ABC", "(AB|A){1}?", "X")

returns "XC", but with the shortest possible match, it should be "XBC".

Best regards,

Gunther
