#636 regex backreferences to undefined groups

Michael Kay

In a regular expression, when a backreference has a group number that does not correspond to any existing group (for example "(.)\2"), the specification states that the backreference should be treated as matching a zero-length string (which means in practice that it is treated as if it were not there).
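As a rough illustration of that rule, the sketch below rewrites a regex before handing it to Java, replacing each backreference to a nonexistent group with "(?:)", an empty non-capturing group that matches the zero-length string. The class and method names are hypothetical and this is not Saxon's actual code. It assumes single-digit backreferences only and uses a simplified scan that does not handle nested character classes such as character class subtraction.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not Saxon's implementation.
class BackrefStripper {

    // Counts capturing groups: unescaped '(' outside [...] and not followed by '?'.
    static int countGroups(String regex) {
        int groups = 0;
        boolean inClass = false;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '[') {
                inClass = true;
            } else if (c == ']') {
                inClass = false;
            } else if (c == '(' && !inClass
                    && !(i + 1 < regex.length() && regex.charAt(i + 1) == '?')) {
                groups++;
            }
        }
        return groups;
    }

    // Replaces each single-digit backreference to a nonexistent group with "(?:)".
    // An empty group is used rather than plain deletion so that a following
    // quantifier does not attach itself to the preceding term.
    static String neutralize(String regex) {
        int groups = countGroups(regex);
        StringBuilder out = new StringBuilder();
        boolean inClass = false;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(++i);
                if (!inClass && next >= '1' && next <= '9' && next - '0' > groups) {
                    out.append("(?:)"); // undefined group: match the empty string
                } else {
                    out.append(c).append(next);
                }
                continue;
            }
            if (c == '[') inClass = true;
            else if (c == ']') inClass = false;
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fixed = neutralize("(.)\\2");
        System.out.println(fixed);                       // (.)(?:)
        System.out.println(Pattern.matches(fixed, "a")); // true
    }
}
```

With this rewrite, "(.)\2" behaves like "(.)", which is the reading of the specification given above.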

For a single-digit backreference, Saxon translates the backreference into an identical Java or .NET backreference. Although this ought to be OK according to the Java spec, it isn't: Java for some reason accepts "(.)\2" but rejects "(.)\3" as a syntax error.
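To check how a particular JDK treats these patterns, a small probe along the following lines may help. The class and method names are made up for illustration. No expected output is given for the undefined-group cases, because the result depends on the Java version in use.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical probe class; reports whether java.util.regex accepts a pattern.
class BackrefProbe {

    // Returns true if the pattern compiles, false if Java rejects it as a syntax error.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "\\1" refers to a defined group; "\\2" and "\\3" do not.
        String[] candidates = { "(.)\\1", "(.)\\2", "(.)\\3" };
        for (String regex : candidates) {
            System.out.println(regex + " compiles: " + compiles(regex));
        }
    }
}
```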

Under JDK 1.4 and .NET (but not JDK 1.5) Saxon inserts the string ".{0}" after a backreference, to ensure that subsequent digits are not treated as part of the backreference. This is incorrect if the backreference is followed by a quantifier. It should also be unnecessary.
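The quantifier problem can be reproduced in Java with a short sketch. If the XPath regex is "(a)\1+", appending ".{0}" yields "(a)\1.{0}+", where the "+" now applies to ".{0}" (Java reads it as a possessive quantifier) instead of to the backreference. One possible alternative, shown here purely as an assumption and not as the patch that was applied, is to wrap the backreference in a non-capturing group, which both terminates the digit sequence and keeps any following quantifier attached to the backreference. The names below are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not the fix applied in Saxon.
class BackrefTermination {

    // Wraps a backreference so that following digits or quantifiers are unambiguous.
    static String wrap(int groupNumber) {
        return "(?:\\" + groupNumber + ")";
    }

    public static void main(String[] args) {
        // Result of appending ".{0}" to the backreference in "(a)\1+".
        String padded = "(a)\\1.{0}+";
        // Same source regex with the backreference wrapped instead.
        String wrapped = "(a)" + wrap(1) + "+";

        System.out.println(Pattern.matches(padded, "aaa"));   // false: "+" no longer repeats \1
        System.out.println(Pattern.matches(wrapped, "aaa"));  // true

        // A literal digit after the backreference is also kept separate.
        System.out.println(Pattern.matches("(a)" + wrap(1) + "0", "aa0")); // true
    }
}
```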

In the course of investigating this I found an oddity in the spec: it seems that backreferences are allowed within square brackets (for example "(abc)[\1]") but with no defined semantics. Normally things within square brackets are constrained to match a single character. I have raised this as a bug on the spec, see


The above errors are being patched in Subversion.

Another deviation from the specification is that Saxon does not allow more than two digits in a back-reference. This is not being patched at this stage.

