Name, NMTOKEN, NCName: wrong validation regexes

Brought to you by: pabigot

#134 Name, NMTOKEN, NCName: wrong validation regexes

Milestone: PyXB 1.1.4

Status: closed

Owner: Peter A. Bigot

Labels: None

Resolution: fixed

Component: Binding model

Priority: major

Version: PyXB 1.1.3

Type: defect

Updated: 2012-05-11

Created: 2012-05-05

Creator: Yuri Khan

Private: No

The grammar rules for Name and Nmtoken in the XML spec, and for NCName in the XML Namespaces spec, are defined in terms of Letter, which is itself a union of many character ranges spread across the Unicode repertoire.

PyXB, on the other hand, validates these types against substantially simpler regexes:

$ grep -n 'A-Za-z' datatypes.py 
920:    _ValidRE = re.compile('^[-_.:A-Za-z0-9]*$')
932:    _ValidRE = re.compile('^[A-Za-z_:][-_.:A-Za-z0-9]*$')
940:    _ValidRE = re.compile('^[A-Za-z_][-_.A-Za-z0-9]*$')

This causes PyXB-generated bindings to reject technically well-formed and valid documents that contain IDs in languages other than English.
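
A minimal sketch of the failure, using the three patterns copied from the grep output above. The NMTOKEN/Name/NCName labels are inferred from the shape of each pattern rather than taken from datatypes.py, and the sample identifiers are made up for illustration:

```python
import re

# Patterns as quoted in the grep output; the variable names are assumptions.
NMTOKEN_RE = re.compile('^[-_.:A-Za-z0-9]*$')
NAME_RE = re.compile('^[A-Za-z_:][-_.:A-Za-z0-9]*$')
NCNAME_RE = re.compile('^[A-Za-z_][-_.A-Za-z0-9]*$')

# An ASCII identifier and a Cyrillic one; XML treats both as names,
# because Cyrillic letters are part of the Letter production.
samples = ['section1', 'раздел1']

for pattern in (NMTOKEN_RE, NAME_RE, NCNAME_RE):
    for sample in samples:
        accepted = pattern.match(sample) is not None
        print(pattern.pattern, repr(sample), accepted)
```

Each pattern accepts the ASCII sample and rejects the Cyrillic one, which is the behavior reported here.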

Discussion

Peter A. Bigot - 2012-05-05

status changed from new to assigned

That's a valid complaint. The impact on performance when changing those REs to be more complex will have to be evaluated to see whether this needs to be a configuration option.
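
One rough way to evaluate that impact is to time the current ASCII-only pattern against a wider one on the same input. The sketch below is only illustrative: WIDER_NCNAME is a hypothetical pattern with a handful of extra letter ranges, not the full Letter production and not what PyXB uses, and time_pattern is a made-up helper.

```python
import re
import timeit

# Current ASCII-only NCName-style pattern, as quoted in the report.
ASCII_NCNAME = re.compile('^[A-Za-z_][-_.A-Za-z0-9]*$')

# Hypothetical wider pattern: adds a few Latin-1 and Cyrillic ranges only.
EXTRA = '\u00c0-\u00d6\u00d8-\u00f6\u0400-\u04ff'
WIDER_NCNAME = re.compile('^[A-Za-z_' + EXTRA + '][-_.A-Za-z0-9' + EXTRA + ']*$')


def time_pattern(pattern, sample, number=10000):
    """Return the seconds taken to match `sample` `number` times."""
    return timeit.timeit(lambda: pattern.match(sample), number=number)


sample = 'some_identifier-01'
ascii_cost = time_pattern(ASCII_NCNAME, sample)
wider_cost = time_pattern(WIDER_NCNAME, sample)
print('ascii: %.4fs  wider: %.4fs' % (ascii_cost, wider_cost))
```

The absolute numbers depend on the machine and the Python version, so only the ratio between the two is of interest.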

Yuri Khan - 2012-05-05

Since the Python re library lacks Unicode property matching, a correct fix will require one of the following:

depending on another regular expression library (regex?),

hardcoding a character set and dealing with Unicode spec updates,

or generating regexes at binding generation time using unicodedata.
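
The third option could look roughly like the sketch below. It is an approximation under stated assumptions: it selects characters by Unicode general category (the categories that Appendix B of XML 1.0 names as the basis for name-start and name characters), it ignores the exceptions listed there, it uses whatever Unicode version the running Python ships rather than the one the spec was written against, and it stops at the Basic Multilingual Plane. The helper names ranges_for and char_class are made up.

```python
import re
import unicodedata


def ranges_for(categories, limit=0x10000):
    """Collapse code points below `limit` whose general category is in
    `categories` into a list of inclusive (low, high) ranges."""
    ranges = []
    start = None
    for cp in range(limit):
        if unicodedata.category(chr(cp)) in categories:
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges


def char_class(ranges):
    """Render ranges as the body of a regex character class using \\uXXXX escapes."""
    parts = []
    for low, high in ranges:
        if low == high:
            parts.append('\\u%04x' % low)
        else:
            parts.append('\\u%04x-\\u%04x' % (low, high))
    return ''.join(parts)


# Approximate Letter, plus the extra characters allowed after the first one.
LETTER = char_class(ranges_for({'Lu', 'Ll', 'Lt', 'Lo', 'Nl'}))
NAME_EXTRA = char_class(ranges_for({'Nd', 'Mc', 'Me', 'Mn', 'Lm'}))

NCNAME_RE = re.compile('^[_' + LETTER + '][-_.' + LETTER + NAME_EXTRA + ']*$')
```

Because the character class is computed once, the cost of walking the code point table is paid at generation (or import) time, not on every validation.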

And regarding regex complexity, conformance beats performance any day.

Peter A. Bigot - 2012-05-05

PyXB already has the infrastructure to manage Unicode character classes in order to support them in the XML regular expression language, so hooking it in for these cases should not be too difficult. See pyxb.utils.unicode.

I agree that correctness supersedes speed, but there is such a thing as a "performance bug": if existing applications that don't need the feature are slowed down too much when the bug is fixed, it's reasonable to provide an option to disable it so they can continue to use the tool. See, for example, #33, which used an algorithm that may have been more correct (the fix probably introduced #112) but made PyXB unusable for certain common situations.


Peter A. Bigot - 2012-05-05

Based on a comment in pyxb.utils.unicode, apparently I already noticed this in the context of schema REs, but hadn't applied it to the validation of the schema documents themselves. Note that I'm now using the 5th edition of the XML spec, rather than the 2nd, and the set of valid characters is described differently.


Peter A. Bigot - 2012-05-07

status changed from assigned to closed

resolution set to fixed

If you are able to use the version in the next branch of the PyXB git repository, please validate the following fix.

commit 03d8c8a5366c18998dc84b4eeeea000ce747d092
Author: Peter A. Bigot <pabigot@…>
Date: Mon May 7 15:43:18 2012 -0500

trac/134: Name, NMTOKEN, NCName: wrong validation regexes

In pyxb.utils.unicode, replace the inappropriate use of the rules from the
5th edition of the XML Specification with those from the 2nd edition, which
the XML Schema specification references. Update the validation expressions
using Name, NCName, and NmToken to refer to the official patterns.

Beware: This update removes wide unicode characters from the valid lexical
space for these datatypes and their derivatives.

Yuri Khan - 2012-05-08

I will test the fix after the holidays.

I don’t believe I’ve ever seen the term “wide unicode character” before. From the source I understand you mean characters outside the Basic Multilingual Plane.
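
For illustration, a small check for such characters might look like this. It assumes Python 3, where a string is a sequence of code points; has_non_bmp is a made-up helper name, not part of PyXB.

```python
def has_non_bmp(text):
    """Return True if `text` contains a code point above U+FFFF,
    that is, outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


# U+0434 (Cyrillic) is inside the BMP; U+10400 (Deseret) is not.
print(has_non_bmp('\u0434'))
print(has_non_bmp('\U00010400'))
```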

My understanding is that the 2nd edition of the XML spec was published before the higher planes were added to the Unicode standard. In effect, the XML Schema spec lags behind the XML spec, which in turn lags behind the Unicode spec. (In my opinion, the XML spec version to use would make a good customization point for the library.)


Peter A. Bigot - 2012-05-08

I don't really do much with unicode, hence the limited support (also, PyXB was initially developed at a time when Python's unicode support was still a little weak). "Wide unicode character" is what you describe; I adopted the term from http://www.python.org/dev/peps/pep-0261/. I'm a little concerned about that change, although it didn't break the test case for #108.

If there's a good reason to make the Unicode version customizable, it'd be pretty simple to do: except for messes like the explicit code point sets in the second edition of XML, which I had to cut, paste, and reformat, PyXB can build version-specific Unicode data files from the Unicode raw data files quite easily.


Yuri Khan - 2012-05-11

I’ve tested the fix and it seems to work.
