#134 Name, NMTOKEN, NCName: wrong validation regexes

PyXB 1.1.4
Binding model
PyXB 1.1.3
Yuri Khan

The grammar rules for Name and Nmtoken in the XML spec, and for NCName in the XML Namespaces spec, are defined in terms of Letter, which is itself a union of many character ranges from all over the Unicode repertoire.

PyXB, on the other hand, validates these types against substantially simpler regexes:

$ grep -n 'A-Za-z' datatypes.py 
920:    _ValidRE = re.compile('^[-_.:A-Za-z0-9]*$')
932:    _ValidRE = re.compile('^[A-Za-z_:][-_.:A-Za-z0-9]*$')
940:    _ValidRE = re.compile('^[A-Za-z_][-_.A-Za-z0-9]*$')

This causes PyXB-generated bindings to reject well-formed, valid documents whose IDs contain non-ASCII characters, for example IDs written in languages other than English.
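
To make the failure concrete, here is a minimal sketch that copies the pattern from line 940 of the grep output (presumably the NCName one, since it is the only pattern that excludes the colon) and tries it on an ASCII and a Cyrillic identifier. The helper name is_valid_ncname is hypothetical and is not part of PyXB.

```python
import re

# Pattern copied verbatim from the grep output above (datatypes.py, line 940).
ncname_re = re.compile('^[A-Za-z_][-_.A-Za-z0-9]*$')

def is_valid_ncname(value):
    """Hypothetical helper: True if value matches the ASCII-only pattern."""
    return ncname_re.match(value) is not None

print(is_valid_ncname('section1'))  # True
print(is_valid_ncname('раздел1'))   # False, although Cyrillic letters are Letters in the XML spec
```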


  • Peter A. Bigot

    Peter A. Bigot - 2012-05-05
    • status changed from new to assigned

    That's a valid complaint. The impact on performance when changing those REs to be more complex will have to be evaluated to see whether this needs to be a configuration option.

  • Yuri Khan

    Yuri Khan - 2012-05-05

    Since the Python re library lacks Unicode property matching, the correct fix will require:

    • either depending on another regular expression library (regex?),
    • or hardcoding a character set and dealing with Unicode spec updates,
    • or generating regexes at binding generation time using unicodedata.

    And regarding regex complexity, conformance beats performance any day.
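
The third option, generating the pattern from unicodedata, could look roughly like the sketch below. It is an approximation and not what PyXB does: it selects Basic Multilingual Plane code points by general category (Ll, Lu, Lo, Lt, Nl for the first character, plus Mc, Me, Mn, Lm, Nd for the rest), which is the rule of thumb behind the XML 1.0 character classes, and it ignores the spec's explicit exceptions. The result also depends on the Unicode version bundled with the running Python. The names char_class, START, REST and ncname_re are made up for illustration.

```python
import re
import unicodedata

def char_class(categories, extra=''):
    """Return a regex character-class body covering every BMP code point
    whose general category is in `categories`, plus the literals in `extra`."""
    ranges = []
    start = None
    for cp in range(0x10000):
        ok = unicodedata.category(chr(cp)) in categories
        if ok and start is None:
            start = cp                      # a new range opens here
        elif not ok and start is not None:
            ranges.append((start, cp - 1))  # the open range ended at cp - 1
            start = None
    if start is not None:
        ranges.append((start, 0xFFFF))
    parts = [re.escape(c) for c in extra]
    for lo, hi in ranges:
        if lo == hi:
            parts.append('\\u%04x' % lo)
        else:
            parts.append('\\u%04x-\\u%04x' % (lo, hi))
    return ''.join(parts)

# Assumed approximation of the XML 1.0 name character classes.
START = {'Ll', 'Lu', 'Lo', 'Lt', 'Nl'}
REST = START | {'Mc', 'Me', 'Mn', 'Lm', 'Nd'}

# NCName-like pattern: no colon allowed anywhere.
ncname_re = re.compile('[%s][%s]*' % (char_class(START, '_'),
                                      char_class(REST, '_.-')))

print(ncname_re.fullmatch('раздел1') is not None)
print(ncname_re.fullmatch('1раздел') is not None)
```

Building the class once at import or binding-generation time keeps the per-value cost to a single compiled-regex match, which is relevant to the performance concern raised earlier in this ticket.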

  • Peter A. Bigot

    Peter A. Bigot - 2012-05-05

    PyXB already has the infrastructure to manage Unicode character classes in order to support them in the XML regular expression language, so hooking it in for these cases should not be too difficult. See pyxb.utils.unicode.

    I agree about correctness superseding speed, but there is such a thing as a "performance bug". If existing applications that don't need the feature get slowed down too much when the bug is fixed, it's reasonable to provide an option to disable it so they can continue to use the tool. See, for example, #33, which used an algorithm that may have been more correct (the fix probably introduced #112) but made PyXB unusable for certain common situations.

  • Peter A. Bigot

    Peter A. Bigot - 2012-05-05

    Based on a comment in pyxb.utils.unicode, apparently I already noticed this in the context of schema REs, but hadn't applied it to the validation of the schema documents themselves. Note that I'm now using the 5th edition of the XML spec, rather than the 2nd, and the set of valid characters is described differently.

  • Peter A. Bigot

    Peter A. Bigot - 2012-05-07
    • status changed from assigned to closed
    • resolution set to fixed

    If you are able to use the version in the next branch of the PyXB git repository, please validate the following fix.

    commit 03d8c8a5366c18998dc84b4eeeea000ce747d092
    Author: Peter A. Bigot <pabigot@…>
    Date: Mon May 7 15:43:18 2012 -0500

    trac/134: Name, NMTOKEN, NCName: wrong validation regexes

    In pyxb.utils.unicode, replace the inappropriate use of the rules from the
    5th edition of the XML Specification with those from the 2nd edition, which
    the XML Schema specification references. Update the validation expressions
    using Name, NCName, and NmToken to refer to the official patterns.

    Beware: This update removes wide unicode characters from the valid lexical
    space for these datatypes and their derivatives.

  • Yuri Khan

    Yuri Khan - 2012-05-08

    I will test the fix after the holidays.

    I don’t believe I’ve ever seen the term “wide unicode character” before. From the source I understand you mean characters outside the Basic Multilingual Plane.
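
For readers unfamiliar with the distinction, a character is outside the Basic Multilingual Plane when its code point is above U+FFFF. A tiny sketch, with a made-up helper name:

```python
def is_outside_bmp(ch):
    """Hypothetical helper: True if the single character ch lies above U+FFFF."""
    return ord(ch) > 0xFFFF

print(is_outside_bmp('\u044f'))      # CYRILLIC SMALL LETTER YA (U+044F): False
print(is_outside_bmp('\U0001D49C'))  # MATHEMATICAL SCRIPT CAPITAL A (U+1D49C): True
```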

    My understanding is that XML spec 2nd edition was published before the addition of higher planes to the Unicode standard. In fact, XML Schema spec is just lagging behind XML spec lagging behind Unicode spec. (In my opinion, the actual XML spec version to use would make a good customization point for the library.)

  • Peter A. Bigot

    Peter A. Bigot - 2012-05-08

    I don't really do much with unicode, hence the limited support (also, PyXB was initially developed at a time when Python's unicode support was still a little weak). "Wide unicode character" is what you describe; I adopted the term from http://www.python.org/dev/peps/pep-0261/. I'm a little concerned about that change, although it didn't break the test case for #108.

    If there's a good reason to make the unicode version customizable, it'd be pretty simple to do: except for messes like the explicit code point sets in the second edition of XML, which I had to cut-and-paste-and-reformat, PyXB can build version-specific unicode data files from the Unicode raw data files quite easily.

  • Yuri Khan

    Yuri Khan - 2012-05-11

    I’ve tested the fix and it seems to work.

