PyXB: Python XML Schema Bindings / Discussion / Help: Class names with Unicode

This is perhaps a newbie question, but I would really appreciate it if somebody could point me in the right direction.

I'm unable to generate binding classes with PyXB when element names are non-ASCII.

The minimal reproducible example:

<?xml version="1.0" encoding="utf8"?>
<xs:schema elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Address">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Country" type="xs:string" />
        <xs:element name="Street" type="xs:string" />
        <xs:element name="Town" type="xs:string" />       
        <xs:element name="Дом" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Note the line <xs:element name="Дом" type="xs:string" />, where I use Cyrillic. The file itself is encoded as UTF-8. However, when I try:

pyxbgen -u example.xsd -m example

I get the error:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/lib/python3.5/xml/sax/expatreader.py", line 210, in feed
    self._parser.Parse(data, isFinal)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 9, column 26

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/bin/pyxbgen", line 52, in <module>
    generator.resolveExternalSchema()
.......

What is the right way to approach element names in unicode?
I'm using Python 3.5.

(I have posted a similar question at http://stackoverflow.com/questions/40428247/pyxb-generating-classes-with-unicode but unfortunately have not received any response there.)

Thanks in advance!

Update

When I change the declared encoding to cp1251 I get:

WARNING:pyxb.binding.generate:Element use None.Дом renamed to emptyString
Python for AbsentNamespace0 requires 1 modules

and I am able to generate a document like:

'<?xml version="1.0" ?><Address><Country>US</Country><Town>City</Town><Street>Main Street</Street><Дом>13</Дом></Address>'

by assigning to emptyString.
Is there a way to keep the original name Дом?

Last edit: Sergey Bushmanov 2016-11-05

Peter A. Bigot - 2016-11-04

As noted in my response on stackoverflow (which I don't monitor regularly), utf8 is spelled utf-8, and at this time there is no fix to support non-ASCII identifiers. It's tracked as issue 67 and will probably be fixed for PyXB 1.2.6 but that's likely to be sometime next year.
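
A minimal sketch of the encoding-name half of this, using only the standard library's expat bindings (the same parser that appears in the traceback above). The helper name and the one-element document are made up for illustration, and the behaviour shown is what I would expect from CPython's pyexpat rather than anything specific to PyXB:

```python
import xml.parsers.expat


def parses_with_declared_encoding(declared):
    """Return True if expat accepts a UTF-8 encoded document that declares
    `declared` as its encoding (hypothetical helper, for illustration only)."""
    text = '<?xml version="1.0" encoding="%s"?>\n<Дом>13</Дом>' % declared
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The bytes are always real UTF-8; only the declared name varies.
        parser.Parse(text.encode("utf-8"), True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


if __name__ == "__main__":
    for name in ("utf-8", "utf8"):
        print(name, parses_with_declared_encoding(name))
```

Correcting the spelling should only address the parse error; the naming of the generated identifier is the separate problem described above.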

Unidecode seems to work if I further simplify pyxbgen_jp:

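import pyxb.utils.utility
from pyxb.utils import six  # the copy of six bundled with PyXB
from unidecode import unidecode

def ConvertRuIdentifier(identifier):
    # Transliterate a non-ASCII XML name to ASCII, e.g. Дом -> Dom.
    identifier = six.text_type(identifier)
    return unidecode(identifier)

# Install the converter as PyXB's XML-name-to-Python-identifier hook.
pyxb.utils.utility._SetXMLIdentifierToPython(ConvertRuIdentifier)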

Then, I am able to:

!pyxbgen_ru -u example.xsd -m example
Element use None.Дом renamed to Dom
Python for AbsentNamespace0 requires 1 modules

import example as e
address = e.Address()
address.Country = "US"
address.Dom = "13"

from lxml import etree
xml_bytes = address.toxml(encoding="utf-8")
print(etree.tostring(etree.fromstring(xml_bytes), encoding="utf-8", pretty_print=True)
      .decode("utf-8"))

with output:

<Address>
  <Country>US</Country>
  <Дом>13</Дом>
</Address>
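
To illustrate what the converter is doing, here is a minimal stand-alone sketch that needs neither PyXB nor unidecode. Everything in it is hypothetical: the three-letter table stands in for unidecode, and strip_non_ascii is only my guess at why an ASCII-only conversion would leave nothing behind for Дом (which would fit the emptyString rename seen earlier), not PyXB's actual code:

```python
# Hypothetical stand-in for unidecode: a hand-written table that covers only
# the Cyrillic letters used in this example.
_RU_TO_LATIN = {"Д": "D", "о": "o", "м": "m"}


def transliterate(identifier):
    """Replace each mapped character; leave all other characters untouched."""
    return "".join(_RU_TO_LATIN.get(ch, ch) for ch in identifier)


def strip_non_ascii(identifier):
    """Assumed approximation of an ASCII-only conversion: drop everything else."""
    return "".join(ch for ch in identifier if ord(ch) < 128)


if __name__ == "__main__":
    print(repr(strip_non_ascii("Дом")))   # nothing usable is left
    print(repr(transliterate("Дом")))     # an ASCII name that can be an attribute
```

Only the Python attribute name changes; as the output above shows, the element in the generated XML is still written as Дом.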

Last edit: Sergey Bushmanov 2016-11-05

Peter A. Bigot - 2016-11-05

Thanks; I've updated the issue to note that approach should be considered.

Sergey Bushmanov - 2016-11-05

Thank YOU man, your package really helps!
