PyXB: Python XML Schema Bindings / Tickets / #131 UTF-8 Characters don't store correctly

Harold Solbrig - 2012-03-20

attachment _test-201103171200.py_ added

Three test cases that reproduce element, attribute and constructor failures

Peter A. Bigot - 2012-03-21

attachment _test-trac-0131.py_ added

revised demonstration

Peter A. Bigot - 2012-03-21

status changed from new to closed

resolution set to non-discrepant

Thanks for the detailed test case. After reviewing things, though, I'm
going to rule this non-discrepant. The error is in the test, which is using
a non-Unicode string in a context where a Unicode string is required.

Python and PyXB both accept Unicode as the natural representation of XML
text. I believe the correct solution is to add a coding declaration to the
script file, and to use a Unicode string, as:

    # -*- coding: utf-8 -*-
    testdata = u'Sign of Leser-Tr\xc3\xa9lat'

so that the string is correctly interpreted by Python in all contexts.
(Note that the above does not produce the same string as your proposed
patch; I don't know which is correct in your case.)
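
To illustrate the parenthetical note, here is a minimal sketch of why the two spellings differ. It is written in Python 3 syntax, where `bytes` and `str` play the roles of Python 2's `str` and `unicode`; the variable names are only illustrative.

```python
# Python 3 syntax: bytes/str stand in for Python 2 str/unicode.

# UTF-8 encoded bytes: \xc3\xa9 is the two-byte encoding of U+00E9.
raw = b'Sign of Leser-Tr\xc3\xa9lat'

# Decoding the bytes as UTF-8 yields a single character U+00E9.
decoded = raw.decode('utf-8')

# Writing the same escapes inside a text literal yields two separate
# characters, U+00C3 and U+00A9, not U+00E9.
literal = 'Sign of Leser-Tr\xc3\xa9lat'

print(len(raw), len(decoded), len(literal))
print(decoded == literal)
```

The literal form is what you would get by decoding the bytes as Latin-1 rather than UTF-8, so which one is "correct" depends on what the data is meant to contain.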

The ticket did let me investigate the behavior of the underlying XML
infrastructure that PyXB depends on with respect to document processing.
PyXB supports three parsing mechanisms: minidom, saxdom, and saxer, with
saxer being the default. The saxdom style is capable of processing the
Unicode string without conversion. minidom and saxer use expat underneath,
and expat only works on non-Unicode strings. In that situation, whatever's
passed to CreateFromDocument must be converted to a binary string first.
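
As a rough sketch of the "convert to a binary string first" step, the snippet below encodes the document text before handing it to an expat-based parser. It uses the standard library's `xml.dom.minidom` as a stand-in and does not call PyXB's CreateFromDocument, since that needs generated bindings; the element name `e` and the variable names are assumptions made for illustration. It is written in Python 3 syntax, where minidom happens to accept text as well, so the explicit encode here only mirrors the advice above.

```python
import xml.dom.minidom

# Document held as text (Python 2: a unicode object).
xml_text = '<e>Sign of Leser-Tr\u00e9lat</e>'

# Convert to a binary string in a known encoding before parsing.
# With no XML declaration, the parser treats the bytes as UTF-8.
xml_bytes = xml_text.encode('utf-8')

doc = xml.dom.minidom.parseString(xml_bytes)
value = doc.documentElement.firstChild.data

# The parsed character data comes back as text, not bytes.
print(repr(value))
```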

See http://www.evanjones.ca/python-utf8.html,
http://bytes.com/topic/python/answers/41153-xml-unicode-what-am-i-doing-wrong,
and the attached revised test case.

Peter A. Bigot - 2012-03-21

Alternatively, use the version below. The point is that PyXB does not know what encoding was used for your byte string text. For it to assume utf-8 internally would be wrong. I believe assuming that the encoding of the schema restricts the encoding of documents conforming to the schema would also be wrong. So your application has to do the conversion itself.

{{
import unittest

# Python 2 code. bar and foo are presumably provided by the bindings
# set up in the attached test case.
class TestTrac_utf8 (unittest.TestCase):

    texts = 'Sign of Leser-Tr\xc3\xa9lat'    # UTF-8 encoded byte string
    textu = unicode(texts, 'utf-8')          # same text as a Unicode string

    def testElementEncode (self):
        # This test can be corrected by:
        #
        #   basis.py line ~286
        #
        #   if str == value_type:
        #       value = unicode(value, 'utf-8')   <-----
        #       value_type = unicode
        instance = bar()
        instance.e = self.textu
        self.assertEqual(instance.e.encode('utf-8'), self.texts)

    def testAttributeEncode (self):
        # Fix not known for this case
        instance = bar()
        instance.a = self.textu
        self.assertEqual(instance.a.encode('utf-8'), self.texts)

    def testDataEncode (self):
        # Fix not known for this case
        instance = foo(self.textu)
        self.assertEqual(instance.encode('utf-8'), self.texts)
}}
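
The point that the application, not PyXB, has to know the encoding can be sketched as follows. This is only an illustration in Python 3 syntax: `to_text` is a hypothetical helper, not part of PyXB, and the sample strings are made up.

```python
def to_text(value, encoding):
    """Return value as text, decoding bytes with the caller-supplied encoding.

    Hypothetical helper: the caller must state the encoding, because the
    bytes themselves do not record it.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# The same text stored as bytes under two different encodings.
utf8_bytes = 'Tr\u00e9lat'.encode('utf-8')      # b'Tr\xc3\xa9lat'
latin1_bytes = 'Tr\u00e9lat'.encode('latin-1')  # b'Tr\xe9lat'

# Decoding with the right encoding recovers the same text either way.
from_utf8 = to_text(utf8_bytes, 'utf-8')
from_latin1 = to_text(latin1_bytes, 'latin-1')

# Blindly assuming UTF-8 fails for the Latin-1 bytes.
try:
    to_text(latin1_bytes, 'utf-8')
    assumed_utf8_ok = True
except UnicodeDecodeError:
    assumed_utf8_ok = False

print(from_utf8 == from_latin1, assumed_utf8_ok)
```

Once the value is text, it can be assigned to a binding attribute as in the tests above and encoded again only when bytes are needed.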

