#636 regex backreferences to undefined groups

v8.8
closed
Michael Kay
5
2012-10-08
2006-12-21
Michael Kay
No

In a regular expression, when a backreference uses a group number that does not correspond to any existing group (for example "(.)\2"), the specification states that the backreference should be treated as matching a zero-length string (which means in practice that it is treated as if it were not there).
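The rule above can be sketched as a small pre-processing step. This is only an illustration, not Saxon's actual translator: the class and method names are hypothetical, it handles single-digit backreferences only, and it does not treat parentheses inside square brackets specially. Emitting an empty non-capturing group "(?:)" for an undefined group is one possible way to produce something that matches the zero-length string.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: replace backreferences to undefined groups with an
// empty non-capturing group, which matches the zero-length string.
final class UndefinedBackrefSketch {

    // Counts capturing groups: every unescaped '(' that is not followed by '?'.
    // Simplification: a '(' inside [...] is counted too.
    static int countGroups(String regex) {
        int count = 0;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '(' && !(i + 1 < regex.length() && regex.charAt(i + 1) == '?')) {
                count++;
            }
        }
        return count;
    }

    // Rewrites single-digit backreferences whose number exceeds the number
    // of capturing groups in the whole expression.
    static String dropUndefinedBackrefs(String regex) {
        int groups = countGroups(regex);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(i + 1);
                i++; // consume the character after the backslash
                if (next >= '1' && next <= '9' && (next - '0') > groups) {
                    out.append("(?:)"); // undefined group: match nothing
                    continue;
                }
                out.append(c).append(next);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String translated = dropUndefinedBackrefs("(.)\\2");
        System.out.println(translated);                       // (.)(?:)
        System.out.println(Pattern.matches(translated, "a")); // true
    }
}
```

A backreference to a group that does exist, such as "(.)\1", passes through this sketch unchanged.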

For a single-digit backreference, Saxon is translating the backreference into an identical Java or .NET backreference. Although this ought to be OK according to the Java spec, it isn't: Java for some reason accepts "(.)\2" but rejects "(.)\3" as a syntax error.

Under JDK 1.4 and .NET (but not JDK 1.5) Saxon inserts the string ".{0}" after a backreference, to ensure that subsequent digits are not treated as part of the backreference. This is incorrect if the backreference is followed by a quantifier. It should also be unnecessary.
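To illustrate the quantifier problem: one alternative to appending ".{0}" is to wrap the backreference in a non-capturing group, so that a following digit stays a literal and a following quantifier still applies to the backreference. This is a minimal sketch under that assumption, not necessarily what the Saxon patch does; the class and the isolate helper are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: isolate a backreference with a non-capturing group.
final class IsolatedBackrefSketch {

    // Produces "(?:\N)" so that whatever follows (a digit or a quantifier)
    // is applied to, or kept separate from, the backreference as a unit.
    static String isolate(int groupNumber) {
        return "(?:\\" + groupNumber + ")";
    }

    public static void main(String[] args) {
        // Source expression "(a)\1+b": the quantifier follows the backreference.
        String quantified = "(a)" + isolate(1) + "+b";
        System.out.println(Pattern.matches(quantified, "aaab")); // true

        // Source expression "(a)\1" followed by the literal digit 0.
        String digit = "(a)" + isolate(1) + "0";
        System.out.println(Pattern.matches(digit, "aa0"));       // true
    }
}
```

With ".{0}" inserted instead, a quantifier such as "+" would attach to ".{0}" rather than to the backreference, which is the error described above.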

In the course of investigating this I found an oddity in the spec: it seems that backreferences are allowed within square brackets (for example "(abc)[\1]") but with no defined semantics. Normally things within square brackets are constrained to match a single character. I have raised this as a bug on the spec, see

http://www.w3.org/Bugs/Public/show_bug.cgi?id=4106

The above errors are being patched in Subversion.

Another deviation from the specification is that Saxon does not allow more than two digits in a backreference. This is not being patched at this stage.
