#658 Hyphens in regular expressions

Michael Kay

The XML Schema specification has never been completely clear about exactly when and how hyphens are allowed within square brackets in a regular expression. There is a sorry history of errata being raised and then withdrawn, and the current spec is still self-contradictory. As a consequence, Saxon's own position has never been very clear either.

In Saxon 8.9 this has been rationalized, erring on the side of being permissive rather than restrictive.

A hyphen can play three roles between [] in a regex: it can represent itself, it can act as a range operator as in [A-Z], or it can act as a subtraction operator as in [A-Z-[I]].

In Saxon 8.9 on Java 1.5, the rules are now:

"-" after a single character is recognized as a range operator.

"-" before "[" is recognized as a subtraction operator.

"-" anywhere else (including, for example, the second hyphen in [A-Z-0-9]) is recognized as representing itself.
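
To make the three rules concrete, here is a minimal sketch in Java that labels each hyphen in a bracketed character class. It is an illustration only, not the Saxon code: the class HyphenRoles and the method classify are hypothetical names. It also makes several assumptions that the rules above do not state: the subtraction rule is checked before the range rule, a hyphen directly before the closing "]" represents itself, a hyphen that represents itself cannot start a range, and backslash escapes are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the three rules; not the Saxon implementation.
final class HyphenRoles {

    enum Role { RANGE, SUBTRACTION, LITERAL }

    /**
     * Classifies each hyphen in a bracketed character class such as "[A-Z-0-9]".
     * Assumes the input starts with '[' and ends with ']'; escapes are not handled.
     */
    static List<Role> classify(String charClass) {
        List<Role> roles = new ArrayList<>();
        int end = charClass.length() - 1;   // index of the closing ']'
        int i = 1;                          // skip the opening '['
        if (i < end && charClass.charAt(i) == '^') {
            i++;                            // skip a leading negation marker
        }
        // True when the previous item was a single character that could start a range.
        boolean afterSingleChar = false;
        while (i < end) {
            char c = charClass.charAt(i);
            if (c != '-') {
                afterSingleChar = true;
                i++;
            } else if (charClass.charAt(i + 1) == '[') {
                // Assumption: the subtraction rule is tested before the range rule.
                roles.add(Role.SUBTRACTION);
                // The subtracted group runs up to the outer ']'; classify it recursively.
                roles.addAll(classify(charClass.substring(i + 1, end)));
                break;
            } else if (afterSingleChar && i + 1 < end) {
                roles.add(Role.RANGE);
                afterSingleChar = false;    // a range end cannot start another range
                i += 2;                     // skip the hyphen and the range end
            } else {
                // Assumption: a hyphen that represents itself does not start a range.
                roles.add(Role.LITERAL);
                afterSingleChar = false;
                i++;
            }
        }
        return roles;
    }

    public static void main(String[] args) {
        System.out.println(classify("[A-Z-0-9]"));  // [RANGE, LITERAL, RANGE]
        System.out.println(classify("[A-Z-[I]]"));  // [RANGE, SUBTRACTION]
        System.out.println(classify("[-A]"));       // [LITERAL]
    }
}
```

In this sketch the second hyphen in [A-Z-0-9] is reported as LITERAL because it follows a completed range rather than a single character, which matches the example given in the third rule.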

This new code, however, was implemented only for the JDK 1.5 version of the regex parser, making it inconsistent with the JDK 1.4 and .NET versions. The new code has now been retrofitted to the Subversion source for the JDK 1.4 and .NET versions of the module.


  • Michael Kay

    Fixed in