
#432 Change regex flavor on @matchPattern

GREEN
closed-fixed
None
1(low)
2013-11-21
2013-02-05
No

The spec says the regex in the match pattern must be a regular expression according to the W3C XML Schema Language. But that dialect of regex was tuned for a significantly different use. It has features not found in mainline regex processors such as XSLT 2.0 and JavaScript, and it treats as implicit some things that must be explicit in other regex dialects. For example, XML Schema's regex presupposes what in other regex dialects would be a ^ at the beginning and a $ at the end.

A helpful comparison chart is here: http://www.regular-expressions.info/refflavors.html

The lack of opening and closing anchors would probably be the most trouble-making difference.
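
To make the anchoring difference concrete, here is a minimal Python sketch. Python's `re` module is neither the XML Schema dialect nor the XPath one, so this only imitates the two behaviours: `re.fullmatch` stands in for XML Schema's implied ^ and $, and `re.search` stands in for the unanchored matching that most other dialects use by default. The pattern, the sample value, and the function names are assumptions made up for the illustration.

```python
import re

# Hypothetical pattern, chosen only for illustration.
PATTERN = r"[a-z]+\d"


def matches_implicitly_anchored(pattern: str, value: str) -> bool:
    """Whole-string match, imitating XML Schema's implied ^ and $."""
    return re.fullmatch(pattern, value) is not None


def matches_unanchored(pattern: str, value: str) -> bool:
    """Substring match, the default behaviour in most other dialects."""
    return re.search(pattern, value) is not None


def anchor_explicitly(pattern: str) -> str:
    """Spell out the anchors that XML Schema leaves implicit."""
    return "^(?:" + pattern + ")$"


if __name__ == "__main__":
    value = "see chap3 now"
    print(matches_implicitly_anchored(PATTERN, value))            # False
    print(matches_unanchored(PATTERN, value))                     # True
    print(matches_unanchored(anchor_explicitly(PATTERN), value))  # False
```

The same unanchored pattern therefore accepts a value that an implicitly anchored processor would reject, which is the kind of silent disagreement the ticket is concerned with.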

I suggest the current language be changed to:

@matchPattern should use only common-denominator features widely available in regular expression processors.

Unfortunately, there is no spec for a common-denominator subset. If it's felt the TEI spec must cite some standard and preferably one in the XML family, cite XPath 2.0. Most of the unique features it has (such as its Unicode support) are unlikely to be used on datapointers.

Discussion

  • Martin Holmes

    Martin Holmes - 2013-02-05

    I think the XML Schema flavour of regex is suitable for this application; remember, we're presupposing a very short string which constitutes a token or a sequence of tokens (that's the point -- it's a method of keeping URIs short), and the XML Schema language is well specified and suitable for this. If there were an established standard which specified a common-denominator subset, it would be worth considering, but right now I think what we have is clear and understandable.

     
  • Martin Holmes

    Martin Holmes - 2013-02-05

    Checking back, this was also inherited from cRefPattern/@matchPattern. If we were to change it, we'd have to make sure we didn't make values in existing documents technically invalid (although there's no way I know to check the validity of a regular expression using a regular expression, and no other way to check it that I can think of).
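
As a rough illustration of the validation point above: a schema cannot check a regular expression, but a program can try to compile it. The sketch below does that with Python's `re` module. This is an assumption-laden stand-in, since Python's dialect differs from both XML Schema and XPath, so a pattern accepted here may still be rejected by an XPath processor and vice versa. The function name `is_compilable` is hypothetical.

```python
import re


def is_compilable(pattern: str) -> bool:
    """Return True if Python's own regex engine accepts the pattern.

    This checks validity in Python's dialect only; it says nothing
    definitive about XML Schema or XPath validity.
    """
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


if __name__ == "__main__":
    print(is_compilable(r"(\w+)\.(\d+)"))  # True
    print(is_compilable(r"(\w+"))          # False: unclosed group
    print(is_compilable(r"[a-z"))          # False: unclosed character class
```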

     
  • John P. McCaskey

    When Sebastian (using XSLT) and I (using JavaScript) independently started using this element, we both inadvertently started with regular expressions that turned out to be invalid as XML Schema regexes but that the languages we were using could process. What programming language can accurately process a match/replacement using unmodified XML Schema regular expressions? I think none. If that's right, it seems worth noting somewhere.

     
  • Martin Holmes

    Martin Holmes - 2013-02-05

    If we do standardize on something other than XML Schema regex, I think it should probably be XPath, since that's the rendering method most likely to be applied. So we could refer to this:

    http://www.w3.org/TR/xpath-functions/#regex-syntax

    I think we would also have to consider the option of an additional replacementFlags attribute.
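
For readers wondering what a flags attribute would change, the following Python sketch maps the XPath flag letters `i`, `m`, `s` and `x` onto their nearest Python equivalents and applies a match/replacement pair. It is only an approximation under stated assumptions: the mapping is not exact (Python's verbose mode, for instance, also allows comments), Python writes back-references as `\1` where XPath writes `$1`, and the function name, pattern, and sample values are all hypothetical.

```python
import re

# Approximate Python counterparts of the XPath flag letters.
FLAG_MAP = {
    "i": re.IGNORECASE,  # case-insensitive matching
    "m": re.MULTILINE,   # ^ and $ match at line boundaries
    "s": re.DOTALL,      # . also matches newline
    "x": re.VERBOSE,     # whitespace in the pattern is ignored
}


def apply_replacement(value: str, match_pattern: str,
                      replacement: str, flags: str = "") -> str:
    """Apply a match/replacement pair with XPath-style flag letters."""
    combined = 0
    for letter in flags:
        if letter not in FLAG_MAP:
            raise ValueError(f"unsupported flag: {letter!r}")
        combined |= FLAG_MAP[letter]
    return re.sub(match_pattern, replacement, value, flags=combined)


if __name__ == "__main__":
    # Python uses \1 for the first group; XPath would write $1.
    print(apply_replacement("Matt.5", r"matt\.(\d+)", r"#mt-\1", "i"))  # #mt-5
    print(apply_replacement("Matt.5", r"matt\.(\d+)", r"#mt-\1"))       # Matt.5
```

Without the `i` flag the pattern does not match the capitalised value, so the input comes back unchanged.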

     
  • John P. McCaskey

    I think XPath (or specifically XPath 2.0) is a good choice. It has a few features that other dialects don't have but anyone using them is sophisticated enough to know that there are regex dialects and to know the risks of using unique features.

     
  • Lou Burnard

    Lou Burnard - 2013-03-30
    • labels: TEI: New or Changed Element -->
    • milestone: --> GREEN
     
  • Lou Burnard

    Lou Burnard - 2013-03-30

    There seems to be agreement on the proposed changes here. Marking ticket green for implementation.

     
  • James Cummings

    James Cummings - 2013-11-09
    • assigned_to: Lou Burnard
    • Priority: 5 --> 1(low)
     
  • James Cummings

    James Cummings - 2013-11-09

    There seems to be agreement that using regex from the current XPath Recommendation is a good idea (is this better than specifying a particular version, such as XPath 2.0?). In general, though, there is agreement. Lou marked this as GREEN, so assigning to him.

     
  • Martin Holmes

    Martin Holmes - 2013-11-09

    I agree with "XPath", rather than specifying a version and having to re-evaluate periodically when versions change. Since we can't validate this anyway, as far as I can see, no schema/validation issues should arise.

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2013-11-13

    Same as bugs/601

     
  • Martin Holmes

    Martin Holmes - 2013-11-13

    Council 2013-11-13: MH will write to TEI-L to check whether anyone has actually depended on the limitations of the XML Schema version of regex; if not, implement this. Noted that processing in the Stylesheets is already being done with XSLT2, so is assuming XPath regex patterns.

     
  • Martin Holmes

    Martin Holmes - 2013-11-13
    • assigned_to: Lou Burnard --> Martin Holmes
     
  • Martin Holmes

    Martin Holmes - 2013-11-13

    Message sent to TEI-L 2013-11-13. If no objections by 2013-11-20, the change should be made.

     
  • Serge Heiden

    Serge Heiden - 2013-11-13

    Regular expressions are not available in:
    - XPath 1.0
    - XSLT 1.0
    They appear in "XQuery 1.0 and XPath 2.0 Functions and Operators (Second Edition)", http://www.w3.org/TR/xpath-functions/#regex-syntax
    as an enhanced form of XML Schema regexps (with ^, $, grouping, etc.).

    XQuery-XPath is probably the most frequent environment in which people will need to match.

    Contender "standard" technologies could have been:
    - PCRE : http://www.pcre.org
    - ICU : http://userguide.icu-project.org/strings/regexp

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2013-11-13

    We're not in the business of linking this to any particular technology. We're just saying "this is a regexp, according to the conventions of XPath; evaluate it however you find convenient". If you use pure XSLT 1.0, it is likely you won't be able to produce an implementation of this TEI markup, but that is true whichever regexp notation we choose.

     
  • Serge Heiden

    Serge Heiden - 2013-11-13

    OK, let's keep designing ethereal conventions.

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2013-11-13

    "ethereal conventions"? that's deeply unfair, Serge! this TEI feature simply uses regular expressions, which have been more or less unchanged for 30 years or more, implemented in almost every language under the sun. so XSLT 1.0 does not support them - it is the outlier, not the TEI.

    All we have done here is say which formulation of regex rules to follow, but the vast vast majority of regex is the same the world over. We are NOT, at all, saying that you must use XSLT 2 / XPath 2 to implement this. It's just as easy to do in JavaScript, as John M notes in one of these tickets.

     
  • Martin Holmes

    Martin Holmes - 2013-11-13

    Serge, I get that you're annoyed about something, but I'm not really clear on exactly what is upsetting you. Are you saying that we must specify e.g. XPath 2.0? Since there's no way (to my knowledge) to validate a regular expression with a schema or a Schematron rule, I see no reason why we shouldn't leave this to the user; right now it could only mean XPath 2.0, but a future version of XPath might introduce more features, and I see no reason not to allow people to use them without our having to change our specification. We say that @style contains (by default) CSS code, but we don't specify a CSS version, or a specific collection of CSS modules. Anyone who wants to be more precise than P5 cares to be can be so in their ODD, surely?

     
  • Serge Heiden

    Serge Heiden - 2013-11-13

    Martin, your suggestion is sound and I completely agree with it.

     
  • Martin Holmes

    Martin Holmes - 2013-11-21

    Implemented rev 12660. Closing the ticket.

     
  • Martin Holmes

    Martin Holmes - 2013-11-21
    • status: open --> closed-fixed