The spec says the regex in the match pattern must be a regular expression according to W3C XML Schema Language. But that dialect of regex was tuned for a significantly different use. It has features not in mainline regex processors such as XSLT 2.0 and Javascript. And it takes some things as implicit that must be explicit in other regex dialects. For example, XML Schema's regex presupposes what in other regex dialects would be a ^ at the beginning and a $ at the end.
A helpful comparison chart is here: http://www.regular-expressions.info/refflavors.html
The lack of explicit opening and closing anchors is probably the difference most likely to cause trouble.
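To illustrate the anchoring difference, here is a minimal sketch in Python. Python's `re` module is neither the XML Schema nor the XPath dialect; it is used here only as an example of a mainline processor, and the function names are made up for illustration. `re.fullmatch` behaves as if the pattern had a ^ at the beginning and a $ at the end, which is what XML Schema presupposes, while `re.search` is unanchored.

```python
import re


def xsd_style_match(pattern: str, text: str) -> bool:
    """Emulate implicit anchoring: the pattern must match the whole string."""
    return re.fullmatch(pattern, text) is not None


def unanchored_match(pattern: str, text: str) -> bool:
    """Default behaviour of many processors: match anywhere in the string."""
    return re.search(pattern, text) is not None


pattern = r"[a-z]+\d"

print(xsd_style_match(pattern, "abc1"))         # True
print(xsd_style_match(pattern, "xx abc1 yy"))   # False
print(unanchored_match(pattern, "xx abc1 yy"))  # True
```

The same pattern gives different answers for the same input depending on whether the anchors are implied, which is why a pattern written with one dialect in mind can silently misbehave in another.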
I suggest the current language be changed to:
@matchPattern should use only common-denominator features widely available in regular expression processors.
Unfortunately, there is no spec for a common-denominator subset. If it's felt the TEI spec must cite some standard and preferably one in the XML family, cite XPath 2.0. Most of the unique features it has (such as its Unicode support) are unlikely to be used on datapointers.
I think the XML Schema flavour of regex is suitable for this application; remember, we're presupposing a very short string which constitutes a token or a sequence of tokens (that's the point -- it's a method of keeping URIs short), and the XML Schema language is well specified and suitable for this. If there were an established standard which specified a common-denominator subset, it would be worth considering, but right now I think what we have is clear and understandable.
Checking back, this was also inherited from cRefPattern/@matchPattern. If we were to change it, we'd have to make sure we didn't make values in existing documents technically invalid (although there's no way I know to check the validity of a regular expression using a regular expression, and no other way to check it that I can think of).
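Outside schema validation, one rough check is simply to try compiling the expression with a regex engine. The sketch below does that with Python's `re` module; note that this only tests Python's own dialect, so it is an approximation at best: a pattern accepted here may still be invalid as XML Schema or XPath regex, and vice versa. The function name is hypothetical.

```python
import re


def compiles_in_python(pattern: str) -> bool:
    """Return True if Python's re engine accepts the pattern.

    This says nothing definite about validity in other dialects.
    """
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


print(compiles_in_python("[a-z]+"))  # True
print(compiles_in_python("(abc"))    # False: unbalanced parenthesis
```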
When Sebastian (using XSLT) and I (using Javascript) independently started using this element, we both inadvertently started with regular expressions that turned out to be invalid as XML Schema regex, but that could be processed by the language each of us was using. What programming language can accurately process a match/replacement using unmodified XML Schema regular expressions? I think none. If that's right, it seems worth noting somewhere.
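As a sketch of the kind of adaptation a processor ends up doing, the following Python applies a match pattern and a replacement pattern to a short string. It assumes the replacement refers to captured groups as $1, $2 and so on, and that the match is anchored to the whole string; the function name, the sample patterns and the sample input are invented for illustration, and Python's dialect stands in for whatever the processor actually supports.

```python
import re
from typing import Optional


def resolve(match_pattern: str, replacement_pattern: str, text: str) -> Optional[str]:
    """Apply a match/replacement pair to text, or return None if no match.

    Assumption: group references in the replacement are written $1..$9.
    """
    # Anchor the match to the whole string, as XML Schema regex implies.
    m = re.fullmatch(match_pattern, text)
    if m is None:
        return None
    # Protect literal backslashes, then turn $n into Python's \g<n> syntax.
    template = replacement_pattern.replace("\\", "\\\\")
    template = re.sub(r"\$(\d)", r"\\g<\1>", template)
    return m.expand(template)


print(resolve(r"([A-Za-z]+)\.(\d+)", "#$1-$2", "Gen.3"))      # #Gen-3
print(resolve(r"([A-Za-z]+)\.(\d+)", "#$1-$2", "see Gen.3"))  # None
```

Both the anchoring and the group-reference syntax had to be translated by hand here, which is the sort of mismatch being described.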
If we do standardize on something other than XML Schema regex, I think it should probably be XPath, since that's the rendering method most likely to be applied. So we could refer to this:
http://www.w3.org/TR/xpath-functions/#regex-syntax
I think we would also have to consider the option of an additional replacementFlags attribute.
I think XPath (or specifically XPath 2.0) is a good choice. It has a few features that other dialects don't have but anyone using them is sophisticated enough to know that there are regex dialects and to know the risks of using unique features.
There seems to be agreement on the proposed changes here. Marking ticket green for implementation.
There seems to be general agreement that using the regex dialect from the current XPath Recommendation is a good idea (though is this better than specifying a particular version, such as XPath 2.0?). Lou marked the ticket GREEN, so assigning it to him.
I agree with "XPath", rather than specifying a version and having to re-evaluate periodically when versions change. Since we can't validate this anyway, as far as I can see, no schema/validation issues should arise.
Same as bugs/601
Council 2013-11-13: MH will write to TEI-L to check whether anyone has actually depended on the limitations of the XML Schema version of regex; if not, implement this. Noted that processing in the Stylesheets is already being done with XSLT2, so is assuming XPath regex patterns.
Message sent to TEI-L 2013-11-13. If no objections by 2013-11-20, the change should be made.
Regexps are not available in:
- XPath 1.0
- XSLT 1.0
They appear in "XQuery 1.0 and XPath 2.0 Functions and Operators (Second Edition)", http://www.w3.org/TR/xpath-functions/#regex-syntax
as an enhanced form of XML Schema regexps (with ^, $, grouping, etc.).
XQuery-XPath is probably the most frequent environment in which people will need to match.
Other candidate "standard" technologies could have been:
- PCRE : http://www.pcre.org
- ICU : http://userguide.icu-project.org/strings/regexp
We're not in the business of linking this to any particular technology. We're just saying "this is a regexp, according to the conventions of XPath; evaluate it however you find convenient". If you use pure XSLT 1.0, it is likely you won't be able to produce an implementation of this TEI markup, but that is true whichever regexp notation we choose.
OK, let's keep designing ethereal conventions.
"Ethereal conventions"? That's deeply unfair, Serge! This TEI feature simply uses regular expressions, which have been more or less unchanged for 30 years or more and are implemented in almost every language under the sun. So XSLT 1.0 does not support them: it is the outlier, not the TEI.
All we have done here is say which formulation of regex rules to follow, but the vast, vast majority of regex is the same the world over. We are NOT, at all, saying that you must use XSLT 2 / XPath 2 to implement this. It's just as easy to do in Javascript, as John M notes in one of these tickets.
Serge, I get that you're annoyed about something, but I'm not really clear on exactly what is upsetting you. Are you saying that we must specify e.g. XPath 2.0? Since there's no way (to my knowledge) to validate a regular expression with a schema or a Schematron rule, I see no reason why we shouldn't leave this to the user; right now it could only mean XPath 2.0, but a future version of XPath might introduce more features, and I see no reason not to allow people to use them without our having to change our specification. We say that @style contains (by default) CSS code, but we don't specify a CSS version, or a specific collection of CSS modules. Anyone who wants to be more precise than P5 cares to be can be so in their ODD, surely?
Martin, your suggestion is sound and I completely agree with it.
Implemented rev 12660. Closing the ticket.