#621 Hyphens in regular expressions

Michael Kay

This affects regular expressions in both XML Schema and XPath (and therefore also XSLT), and it affects all platforms.

The Schema spec is very muddled about exactly where unescaped hyphens can be used within a character class expression (that is, within square brackets) in a regular expression. At one stage, by means of an erratum, they were banned entirely except in their regular roles of defining a range (such as [0-9]) or a subtraction (such as [A-Z-[io]]). This erratum was later withdrawn, but the rules in the second edition of XML Schema Part 2 are still unclear: the grammar says they are allowed, one note in the text says they are not allowed, and another note says they are allowed at the beginning or end of a positive character group only.

In Saxon (8.8 and probably earlier releases) an unescaped hyphen at the start of a character group is treated as a literal hyphen, as expected, but at the end of a character group it is accepted and then silently ignored. So the syntax [+-][0-9]* is accepted, but it will not match a string that starts with a hyphen or minus sign. A patch is being placed in Subversion to fix this. With this patch, the rule is that an unescaped hyphen represents a hyphen character if it appears at the start of a positive character group (that is, after '[' or '^') or at the end (before ']' or a nested '[').
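To make the intended behaviour concrete, here is a minimal sketch using Python's `re` module. This is an assumption made purely for illustration: Python's dialect is not the XML Schema / XPath dialect that Saxon implements, but it does treat a trailing hyphen in a character group as a literal, which is the behaviour the patch adopts. The helper name `matches` is hypothetical. The second pattern escapes the hyphen as `\-`, which the XML Schema regex grammar also permits and which sidesteps the ambiguity altogether.

```python
import re

# Python's `re` dialect is used here only as a stand-in for illustration;
# it is NOT the XML Schema / XPath regex engine used by Saxon.
UNESCAPED = re.compile(r"[+-][0-9]*")   # hyphen at the end of the group
ESCAPED = re.compile(r"[+\-][0-9]*")    # hyphen escaped, so no ambiguity


def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern` (hypothetical helper)."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    # Print, for each sample, whether each pattern matches the whole string.
    for sample in ("-42", "+42", "42"):
        print(sample, matches(UNESCAPED, sample), matches(ESCAPED, sample))
```

Under the behaviour described above for Saxon 8.8, the unescaped form would fail on "-42"; writing the hyphen as `\-` is a way to avoid depending on how a given processor reads an unescaped hyphen.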

With this patch, a hyphen is accepted as the first endpoint s of a range s-e, but not as the second endpoint e. There's no particular justification for this, but it will do until the W3C clarifies the rules. A hyphen that appears immediately after a range (for example [A-Z-... ) is taken as a subtraction operator, causing an error if it is not immediately followed by '['.