UTF-8 Characters don't store correctly
Brought to you by:
pabigot
The PyXB API doesn't cope correctly with UTF-8 encoded byte strings such as 'Sign of Leser-Tr\xc3\xa9lat'. Assigning this data to an element or an attribute, or using it in a constructor, results in the same error in each case.
We can propose a fix that takes care of the first case:
instance = bar()
instance.e = 'Sign of Leser-Tr\xc3\xa9lat'
But we have no solution for the other two situations.
The proposed fix is in basis.py at line ~286:
if str == value_type:
A test case has been attached.
Three test cases that reproduce element, attribute and constructor failures
revised demonstration
Thanks for the detailed test case. After reviewing things, though, I'm
going to rule this non-discrepant. The error is in the test, which is using
a non-Unicode string in a context where a Unicode string is required.
Python and PyXB both accept Unicode as the natural representation of XML
text. I believe the correct solution is to add a coding declaration to the
script file, and to use a Unicode string, as:
so that the string is correctly interpreted by Python in all contexts.
(Note that the above does not produce the same string as your proposed
patch; I don't know which is correct in your case.)
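One way two such strings can differ, sketched under assumptions: the thread is from the Python 2 era, but the snippet below uses Python 3, where `bytes` plays the role of Python 2's `str` and `str` the role of `unicode`. It assumes the difference lies in how the `\xc3\xa9` escapes are read; the variable names are illustrative only.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
raw = b'Sign of Leser-Tr\xc3\xa9lat'   # UTF-8 encoded byte string

# Decoding the bytes as UTF-8 yields a single e-acute (U+00E9).
decoded = raw.decode('utf-8')

# The same escapes written in a text literal are two separate code points,
# U+00C3 and U+00A9, so the resulting string is not the same.
literal = 'Sign of Leser-Tr\xc3\xa9lat'

print(decoded == literal)          # the two strings differ
print(len(decoded), len(literal))  # the literal is one character longer
```

Which of the two is wanted depends on what the original data actually was: UTF-8 bytes, or text that really contains those two characters.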
The ticket did let me investigate the behavior of the underlying XML
infrastructure that PyXB depends on with respect to document processing.
PyXB supports three parsing mechanisms: minidom, saxdom, and saxer, with
saxer being the default. The saxdom style is capable of processing the
unicode string without conversion. minidom and saxer use expat underneath,
and expat only works on non-unicode strings. In that situation, whatever's
passed to CreateFromDocument must be converted to a binary string first.
See http://www.evanjones.ca/python-utf8.html,
http://bytes.com/topic/python/answers/41153-xml-unicode-what-am-i-doing-wrong,
and the attached revised test case.
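A minimal sketch of the conversion step, assuming Python 3 and using the standard library's `xml.dom.minidom` as a stand-in for PyXB's `CreateFromDocument` (which is not called here). The document text and variable names are made up for illustration.

```python
from xml.dom import minidom

# The document held as text (Python 3 str, i.e. Python 2 unicode).
text = '<e>Sign of Leser-Tr\u00e9lat</e>'

# Encode to a byte string before handing it to the expat-backed parser.
# With no XML declaration, the parser treats the bytes as UTF-8.
xml_bytes = text.encode('utf-8')

doc = minidom.parseString(xml_bytes)
content = doc.documentElement.firstChild.data
print(content == 'Sign of Leser-Tr\u00e9lat')
```

The parsed content comes back as text, so the encoding step only affects what is passed into the parser, not what the application reads out.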
Alternatively, use the version below. The point is that PyXB does not know what encoding was used for your byte string text. For it to assume utf-8 internally would be wrong. I believe assuming that the encoding of the schema restricts the encoding of documents conforming to the schema would also be wrong. So your application has to do the conversion.
{{
class TestTrac_utf8 (unittest.TestCase):
# basis.py line ~286
if str == value_type:
    value = unicode(value, 'utf-8')  # <----- proposed addition
    value_type = unicode
}}
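Doing the conversion in the application, rather than inside PyXB, could look roughly like the sketch below. It is written for Python 3 (`bytes`/`str` in place of Python 2's `str`/`unicode`), and `to_text` is a hypothetical helper, not part of PyXB.

```python
def to_text(value, encoding):
    """Return text, decoding byte strings with the caller-supplied encoding.

    Hypothetical helper: the caller, not the library, states the encoding.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text, pass through unchanged


# The same name arrives in two different encodings; only the caller knows which.
from_utf8 = to_text(b'Sign of Leser-Tr\xc3\xa9lat', 'utf-8')
from_latin1 = to_text(b'Sign of Leser-Tr\xe9lat', 'latin-1')
print(from_utf8 == from_latin1)
```

The decoded value can then be assigned to an element or attribute, or passed to a constructor, without the binding having to guess an encoding.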