#131 UTF-8 Characters don't store correctly

PyXB 1.1.4
closed
non-discrepant
Binding model
minor
PyXB 1.1.3
defect
2012-03-21
2012-03-20
No

The PyXB API doesn't cope correctly with UTF-8 encoded characters such as 'Sign of Leser-Tr\xc3\xa9lat'. Assigning this data to an element or an attribute, or using it in a constructor, results in the same error each time.

We can propose a fix that takes care of the first case:

instance = bar()
instance.e = 'Sign of Leser-Tr\xc3\xa9lat'

But we have no solution for the other two situations.

The proposed fix is in basis.py at line ~286:

if str == value_type:
    value = unicode(value, 'utf-8')  # <----- added line
    value_type = unicode

A test case has been attached.

2 Attachments

Discussion

  • Harold Solbrig

    Harold Solbrig - 2012-03-20

    Three test cases that reproduce element, attribute and constructor failures

     
  • Peter A. Bigot

    Peter A. Bigot - 2012-03-21

    revised demonstration

     
  • Peter A. Bigot

    Peter A. Bigot - 2012-03-21
    • status changed from new to closed
    • resolution set to non-discrepant

    Thanks for the detailed test case. After reviewing things, though, I'm
    going to rule this non-discrepant. The error is in the test, which is using
    a non-Unicode string in a context where a Unicode string is required.

    Python and PyXB both accept Unicode as the natural representation of XML
    text. I believe the correct solution is to add a coding declaration to the
    script file, and to use a Unicode string, as:

    # -*- coding: utf-8 -*-
    
    testdata = u'Sign of Leser-Tr\xc3\xa9lat'
    

    so that the string is correctly interpreted by Python in all contexts.
    (Note that the above does not produce the same string as your proposed
    patch; I don't know which is correct in your case.)
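
The difference between the two strings can be seen in a small sketch. It is written in Python 3 syntax as an assumption (the ticket itself uses Python 2, where the byte string would be a plain '...' literal and the text string a u'...' literal), and the variable names are illustrative only.

```python
# UTF-8 encoded bytes, as in the original report.
raw = b'Sign of Leser-Tr\xc3\xa9lat'

# Decoding as UTF-8 (what the proposed patch does) turns the two bytes
# 0xC3 0xA9 into the single code point U+00E9.
decoded = raw.decode('utf-8')

# A text literal with the same escapes holds two separate code points,
# U+00C3 and U+00A9, so it is a different string.
literal = 'Sign of Leser-Tr\xc3\xa9lat'

print(decoded == literal)                  # the two strings differ
print(len(literal) - len(decoded))         # the literal is one character longer
print(literal.encode('latin-1') == raw)    # the literal matches a Latin-1 reading
```

Which of the two is intended depends on how the original bytes were produced, which only the application can know.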

    The ticket did let me investigate the behavior of the underlying XML
    infrastructure that PyXB depends on with respect to document processing.
    PyXB supports three parsing mechanisms: minidom, saxdom, and saxer, with
    saxer being the default. The saxdom style is capable of processing the
    unicode string without conversion. minidom and saxer use expat underneath,
    and expat only works on non-unicode strings. In that situation, whatever's
    passed to CreateFromDocument must be converted to a binary string first.
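
A minimal sketch of that conversion step follows. It uses the standard library's xml.dom.minidom as a stand-in for an expat-backed parser rather than PyXB's CreateFromDocument, and it is written in Python 3 syntax. The element name e and the variable names are assumptions for illustration.

```python
import xml.dom.minidom

# Text held as a Unicode string inside the application.
text = 'Sign of Leser-Tr\u00e9lat'
document = '<?xml version="1.0" encoding="utf-8"?><e>%s</e>' % text

# Encode to a byte string before handing the document to the parser,
# using the same encoding that the XML declaration names.
xml_bytes = document.encode('utf-8')

dom = xml.dom.minidom.parseString(xml_bytes)
parsed = dom.documentElement.firstChild.data

print(parsed == text)
```

The parser returns Unicode text again, so the encoding step is needed only at the boundary where the document is passed in.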

    See http://www.evanjones.ca/python-utf8.html,
    http://bytes.com/topic/python/answers/41153-xml-unicode-what-am-i-doing-wrong,
    and the attached revised test case.

     
  • Peter A. Bigot

    Peter A. Bigot - 2012-03-21

    Alternatively, use the version below. The point is that PyXB does not know what encoding was used for your byte string text. For it to assume utf-8 internally would be wrong. I believe assuming that the encoding of the schema restricts the encoding of documents conforming to the schema would also be wrong. So your application has to do the decoding.

    {{
    import unittest

    # bar and foo come from the bindings in the attached test case.

    class TestTrac_utf8 (unittest.TestCase):

        texts = 'Sign of Leser-Tr\xc3\xa9lat'
        textu = unicode(texts, 'utf-8')

        def testElementEncode (self):
            """This test can be corrected by:

            basis.py line ~286

                if str == value_type:
                    value = unicode(value, 'utf-8') <-----
                    value_type = unicode
            """
            instance = bar()
            instance.e = self.textu
            self.assertEqual(instance.e.encode('utf-8'), self.texts)

        def testAttributeEncode (self):
            """Fix not known for this case"""
            instance = bar()
            instance.a = self.textu
            self.assertEqual(instance.a.encode('utf-8'), self.texts)

        def testDataEncode (self):
            """Fix not known for this case"""
            instance = foo(self.textu)
            self.assertEqual(instance.encode('utf-8'), self.texts)

    }}
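
Why the library cannot guess the encoding can be sketched without PyXB at all. The helper to_text below is hypothetical (it is not part of PyXB), and the code uses Python 3 syntax as an assumption. It decodes byte strings with an encoding that the caller must supply, which is the step the application is being asked to perform.

```python
def to_text(value, encoding):
    """Return text, decoding byte strings with the caller-supplied encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

raw = b'Sign of Leser-Tr\xc3\xa9lat'

# The same bytes are valid in more than one encoding and give different text.
as_utf8 = to_text(raw, 'utf-8')
as_latin1 = to_text(raw, 'latin-1')
print(as_utf8 == as_latin1)

# Once decoded with the right encoding, the value round-trips, which is
# the property the assertions in the test case above check.
print(as_utf8.encode('utf-8') == raw)
```

Both decodes succeed without error, so nothing in the bytes themselves tells a library which reading is the intended one.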

     
