Name, NMTOKEN, NCName: wrong validation regexes
The grammar rules for Name and Nmtoken in the XML spec, and NCName in the XML Namespaces spec, are defined in terms of Letter, which is in turn defined as a union of many character ranges from all over the Unicode repertoire.
PyXB, on the other hand, validates these types against substantially simpler regexes:
$ grep -n 'A-Za-z' datatypes.py
920: _ValidRE = re.compile('^[-_.:A-Za-z0-9]*$')
932: _ValidRE = re.compile('^[A-Za-z_:][-_.:A-Za-z0-9]*$')
940: _ValidRE = re.compile('^[A-Za-z_][-_.A-Za-z0-9]*$')
This causes PyXB-generated bindings to reject technically well-formed and valid documents that contain IDs in languages other than English.
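To illustrate the rejection, here is a minimal sketch that reuses the NCName pattern from the grep output above. The helper name `is_ascii_ncname` and the sample identifiers are made up for this example; they are not part of PyXB.

```python
import re

# Pattern copied from the grep output above (the NCName check).
ascii_ncname = re.compile('^[A-Za-z_][-_.A-Za-z0-9]*$')

def is_ascii_ncname(value):
    """Hypothetical helper: True if value passes the ASCII-only check."""
    return ascii_ncname.match(value) is not None

# An ASCII identifier passes the check.
print(is_ascii_ncname('section_1'))
# A Cyrillic identifier is a legal NCName per the spec, but fails the check.
print(is_ascii_ncname('раздел_1'))
```

The second call returns False even though every character of the identifier is a Letter in the XML sense.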
That's a valid complaint. The impact on performance when changing those REs to be more complex will have to be evaluated to see whether this needs to be a configuration option.
Since the Python re library lacks Unicode property matching, the correct fix will require either a different regular expression library (regex?) or character classes built with unicodedata.
And regarding regex complexity, conformance beats performance any day.
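As a rough sketch of the unicodedata route, the following collapses all code points in a general category into ranges that could be pasted into a character class. The function name `category_ranges` and the `limit` parameter are hypothetical. Note that general category L only approximates the Letter production, which the XML spec spells out as explicit code point ranges.

```python
import unicodedata

def category_ranges(prefix, limit=0x10000):
    """Collapse code points whose general category starts with `prefix`
    into inclusive (first, last) ranges, scanning code points below `limit`."""
    ranges = []
    start = None
    for cp in range(limit):
        matches = unicodedata.category(chr(cp)).startswith(prefix)
        if matches and start is None:
            start = cp                      # a new range opens here
        elif not matches and start is not None:
            ranges.append((start, cp - 1))  # the open range ended at cp - 1
            start = None
    if start is not None:
        ranges.append((start, limit - 1))   # range still open at the limit
    return ranges

letter_ranges = category_ranges('L')
print(len(letter_ranges), letter_ranges[:2])
```

The exact number of ranges depends on the Unicode version shipped with the Python interpreter, which is one reason a fixed, spec-derived table may be preferable.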
PyXB already has the infrastructure to manage Unicode character classes in order to support them in the XML regular expression language, so hooking it in for these cases should not be too difficult. See pyxb.utils.unicode.
I agree about correctness superseding speed, but there is such a thing as a "performance bug". If existing applications that don't need the feature get slowed down too much when the bug is fixed, it's reasonable to provide an option to disable it so they can continue to use the tool. See, for example, #33, which used an algorithm that may have been more correct (the fix probably introduced #112) but made PyXB unusable for certain common situations.
Based on a comment in pyxb.utils.unicode, apparently I already noticed this in the context of schema REs, but hadn't applied it to the validation of the schema documents themselves. Note that I'm now using the 5th edition of the XML spec, rather than the 2nd, and the set of valid characters is described differently.
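For reference, the fifth edition describes names through the NameStartChar and NameChar productions. The sketch below builds validation patterns from those ranges for Python 3. It is not the code in the commit that follows; all names in it (`START_RANGES`, `EXTRA_RANGES`, `char_class`, `is_name`, `is_ncname`, `is_nmtoken`) are invented for the illustration.

```python
import re

# NameStartChar ranges from XML 1.0 (Fifth Edition), without ':' so that
# the same table can serve NCName, which forbids colons.
START_RANGES = [
    (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional characters allowed by NameChar: '-', '.', digits, middle dot,
# combining marks, and two connector characters.
EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]

def char_class(ranges):
    """Render ranges as the body of a regex character class, using
    \\UXXXXXXXX escapes so no character needs special-case escaping."""
    return ''.join('\\U%08x-\\U%08x' % (lo, hi) for lo, hi in ranges)

_start = char_class(START_RANGES)
_rest = _start + char_class(EXTRA_RANGES)

NAME_RE = re.compile('[:%s][:%s]*' % (_start, _rest))
NCNAME_RE = re.compile('[%s][%s]*' % (_start, _rest))
NMTOKEN_RE = re.compile('[:%s]+' % _rest)

def is_name(value):
    return NAME_RE.fullmatch(value) is not None

def is_ncname(value):
    return NCNAME_RE.fullmatch(value) is not None

def is_nmtoken(value):
    return NMTOKEN_RE.fullmatch(value) is not None
```

Using `fullmatch` instead of `^...$` avoids accepting a value with a trailing newline, and the Nmtoken pattern uses `+` because the production requires at least one character.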
If you are able to use the version in the next branch of the PyXB git repository, please validate the following fix.
commit 03d8c8a5366c18998dc84b4eeeea000ce747d092
Author: Peter A. Bigot <pabigot@…>
Date: Mon May 7 15:43:18 2012 -0500
I will test the fix after the holidays.
I don’t believe I’ve ever seen the term “wide unicode character” before. From the source I understand you mean characters outside the Basic Multilingual Plane.
My understanding is that XML spec 2nd edition was published before the addition of higher planes to the Unicode standard. In fact, XML Schema spec is just lagging behind XML spec lagging behind Unicode spec. (In my opinion, the actual XML spec version to use would make a good customization point for the library.)
I don't really do much with unicode, hence the limited support (also, PyXB was initially developed at a time when Python's unicode support was still a little weak). "Wide unicode character" is what you describe; I adopted the term from http://www.python.org/dev/peps/pep-0261/. I'm a little concerned about that change, although it didn't break the test case for #108.
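For anyone unfamiliar with the distinction, a small sketch of how the build type can be detected at runtime. The function names are made up for this example; `sys.maxunicode` is the standard attribute described in PEP 261.

```python
import sys

def supports_wide_unicode():
    """True when a single string element can hold a code point above
    U+FFFF, that is, on a "wide" build in PEP 261 terms."""
    return sys.maxunicode > 0xFFFF

def is_outside_bmp(ch):
    """True if the single character ch lies outside the Basic
    Multilingual Plane. Only meaningful on a wide build."""
    return ord(ch) > 0xFFFF

print(supports_wide_unicode())
```

On a narrow build `sys.maxunicode` is 0xFFFF and characters outside the BMP are stored as surrogate pairs, so character classes that name higher-plane ranges directly cannot be compiled as written.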
If there's a good reason to make the unicode version customizable, it'd be pretty simple to do: except for messes like the explicit code point sets in the second edition of XML, which I had to cut-and-paste-and-reformat, PyXB can build version-specific unicode data files from the Unicode raw data files quite easily.
I’ve tested the fix and it seems to work.