#139 alternative XML parsers like lxml (-> XMLStyle_lxml)

PyXB 1.1.4
closed
None
fixed
Code
major
PyXB 1.1.3
enhancement
2012-06-15
2012-06-11
hhsprings
No

For encodings that expat parsers can't process, such as `iso-2022-jp', we need to replace the parser with an alternative such as lxml.

1 Attachments

Discussion

  • hhsprings

    hhsprings - 2012-06-13

    After some trial and error, I found that libxml2 and xml.dom.minidom work
    well together, at least for my encoding problem.

    Like this:

    def parse_document(filename):
        # Read the raw bytes once so every fallback sees the same data.
        with open(filename, "rb") as fi:
            data = fi.read()
        try:  # try using minidom with the libxml2 SAX driver
            import libxml2  # raises ImportError if the bindings are missing
            import xml.sax
            import xml.dom.minidom
            return xml.dom.minidom.parseString(data, xml.sax.make_parser(["drv_libxml2"]))
        except ImportError:
            try:  # try using lxml
                import xml.dom.pulldom
                import lxml.etree, lxml.sax
                tree = lxml.etree.fromstring(data)
                handler = xml.dom.pulldom.SAX2DOM()
                lxml.sax.saxify(tree, handler)
                handler.documentElement = handler.document
                return handler.documentElement
            except ImportError:
                # fall back to the PyXB utilities
                import pyxb.utils.domutils as domutils
                doc = domutils.StringToDOM(data)
                return doc.documentElement
    

    So I guess a better approach is to add a way to control which SAX2 reader
    is used when pyxb._XMLStyle == pyxb.XMLStyle_minidom.
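
The snippet above imports libxml2 explicitly before asking for the driver. A minimal sketch of why that matters, using only the standard library: xml.sax.make_parser tries the named driver modules in order and then quietly falls back to its default list, so a missing driver does not raise ImportError. The helper name pick_reader is hypothetical, and the code is written for Python 3 rather than the Python 2 used in this ticket.

```python
import xml.sax


def pick_reader(preferred):
    """Return (reader, module_name) for the first SAX driver that imports."""
    # make_parser tries each named driver module in order, then falls back
    # to the standard library's default driver list.
    reader = xml.sax.make_parser(list(preferred))
    return reader, type(reader).__module__
```

Calling pick_reader(["drv_libxml2"]) and inspecting the returned module name is one way to check whether the libxml2 driver was really selected or whether the default expat reader was used instead.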

     
  • Peter A. Bigot

    Peter A. Bigot - 2012-06-13
    • status changed from new to accepted

    Could you attach to this ticket a simple schema file with the encodings you want supported? It'd help make sure the solution I'm looking at works.

     
  • hhsprings

    hhsprings - 2012-06-13
     
  • hhsprings

    hhsprings - 2012-06-13

    Attached.
    The originals are written in `Shift_JIS', and we also need `euc-jp' and `iso-2022-jp'.

     
  • Peter A. Bigot

    Peter A. Bigot - 2012-06-13

    Thanks. Do you need to create the euc-jp and iso-2022-jp encodings, or was that a side effect of trying to work around the pyxb/xml/expat stuff?

    I've verified a patch that works with:

    export PYXB_ARCHIVE_PATH='&pyxb/bundles/opengis//:+'
    pyxbgen -u original/FGD_GMLSchema.xsd -m fgd
    python check.py
    

    where check.py has:

    import fgd
    
    xmls = file('original/FG-GML-13-RailCL25000-20080331-0001.xml').read()
    instance = fgd.CreateFromDocument(xmls)
    

    And where "works" means it builds the schema and processes the document without error. I don't know either Japanese or GML well enough to tell whether the instance is valid.

    (Essentially the only thing needed is to specify 'drv_libxml2' as the preferred parser; you don't even need to use minidom as the style, because saxer works fine. Unless you need to create DOM for some reason.)
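
As a rough illustration of the point about not needing a DOM, the sketch below processes a document through SAX callbacks only, with a preferred driver list passed to xml.sax.make_parser. It uses only the standard library and is not PyXB's saxer implementation; the names NameCollector and element_names are hypothetical, and the code targets Python 3.

```python
import io
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Record element names in document order; no DOM tree is built."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def element_names(xml_bytes, preferred=("drv_libxml2",)):
    # The preferred drivers are tried first; the default reader is the fallback.
    parser = xml.sax.make_parser(list(preferred))
    handler = NameCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names
```

Which encodings such a parse accepts depends on the driver that was actually selected, which is the reason for preferring 'drv_libxml2' here.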

     
  • Peter A. Bigot

    Peter A. Bigot - 2012-06-13

    A question and a comment:

    - Do you mind if that schema and document are added to the test suite?

    - Sorry, but as of 1.1.4 the OpenGIS bindings will not be part of the full download. The bundle will still be there, but you'll have to run the script that fetches the schema and translates them yourself. I updated the bundle this morning, and OpenGIS now takes 50MB, about 10x more than the rest of PyXB. If that's going to be a big problem, I can see if I can provide them as an add-on when #130 gets addressed for 1.1.5.

     
  • hhsprings

    hhsprings - 2012-06-14

    English is difficult for me (though probably not as difficult as Japanese is for you),
    so I didn't understand your comment well.
    (I can't speak it at all.)

    Do you need to create the euc-jp and iso-2022-jp encodings, or was that
    a side effect of trying to work around the pyxb/xml/expat stuff?

    I can't understand this sentence well...
    Does `create the ... encodings' mean something like this?

    # -*- coding: iso-2022-jp -*-
    # ./raw/fgd.py
    
    # ...
    

    Assuming that is what you meant, I don't need it. If pyxbgen always generates
    utf-8, that is probably no problem for us.
    (There is a gap between the ideal (utf-8) and reality (Shift_JIS)...)

    Do you mind if that schema and document are added to the test suite?

    I can't understand this well either, sorry.

    If you mean `Do you have a test suite for this issue?', the answer is no, not at all.
    Actually, my current mission is to investigate and evaluate technologies and
    infrastructure for the GIS world (GDAL, geos, Shapely, pyproj, postgis, etc.)
    for a future project of ours, so I can't spend much time on each one.

    the OpenGIS bindings will not be part of the full download

    No problem for me. I think GML is too heavy to bundle.
    Describing how to build the bindings on the project web page is enough.
    (Even without that, we probably wouldn't mind.)

     
  • hhsprings

    hhsprings - 2012-06-14

    BTW, it seems that `utf8' is wrong. See http://docs.python.org/library/pyexpat.html?highlight=utf8

    ./pyxb/binding/generate.py
    ./build/lib/pyxb/binding/generate.py

    It should be `utf-8', not `utf8', though Python can handle it.

     
  • Peter A. Bigot

    Peter A. Bigot - 2012-06-14

    I will try again.

    Encodings: All the encodings should produce the same Unicode once the data is in Python. So you do not need to convert the schema to other encodings. Leave it in Shift_JIS. PyXB will generate bindings where utf-8 is used. Source documents can be in any encoding as long as the XML parser can convert them to "real" Unicode internally before PyXB processes them. When PyXB supports Python3, the encoding from the schema can be used in the bindings, and the Unicode identifiers from the schema can be used too. That will be done in PyXB version 2.0.
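
A small sketch of the claim that all the encodings produce the same Unicode once the data is in Python. It uses Python 3's built-in codecs; the sample text and the helper name decode_all are illustrative assumptions and are not taken from the attached schema or document.

```python
# Sample text chosen for illustration only ("railway").
TEXT = "鉄道"
ENCODINGS = ("shift_jis", "euc_jp", "iso2022_jp")


def decode_all(text, encodings=ENCODINGS):
    """Encode `text` in each encoding, then decode it back to Unicode."""
    results = {}
    for name in encodings:
        raw = text.encode(name)  # the bytes differ from encoding to encoding
        results[name] = (raw, raw.decode(name))
    return results
```

The byte strings differ between the three encodings, but each decodes back to the same str, which is the form the bindings need to see.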

    Test suite: I would like to add the schema and document you posted to the test suite. Some people do not want the files they provide to be given away to others like this, for security or intellectual property reasons. Anybody can see the files here on the trac site, but more people will see them if they are put into PyXB's tests directory. May I add them to PyXB?

    I have made many changes in the last two days improving Unicode support. Yes, utf8 was wrong and is now utf-8. If you can use git, try:

      git clone -b next git://pyxb.git.sourceforge.net/gitroot/pyxb/pyxb
    

    to see what has changed.

    I hope to have a solution to this problem and #141 later today, and it will be supported in PyXB 1.1.4 to be released tomorrow.

     
  • hhsprings

    hhsprings - 2012-06-14

    I will try again.

    Thanks.

    All the encodings should produce the same Unicode once the data is in Python.
    So you do not need to convert the schema to other encodings.
    Leave it in Shift_JIS.
    PyXB will generate bindings where utf-8 is used.

    Yes, that's what I need. (Someone may complain, but IMO we should use utf-8 as far as possible.)

    Source documents can be in any encoding as long as the XML parser can convert them
    to "real" Unicode internally before PyXB processes them.

    Yes, so I needed to replace the parser.

    When PyXB supports Python3, the encoding from the schema can be used in the bindings,
    and the Unicode identifiers from the schema can be used too. That will be done in PyXB
    version 2.0.

    I don't necessarily need it, but we would be happier to have it.
    I'll be waiting in anticipation.

    I would like to add the schema and document you posted to the test suite.

    Ahh... I see. The schema and data are not mine, so I looked up the
    license for distribution and found no problem.
    (Sorry, it is in Japanese: http://www.gsi.go.jp/LAW/2930-index.html)

    May I add them to PyXB?

    No problem, I think.

    I have made many changes in the last two days improving Unicode support.
    Yes, utf8 was wrong and is now utf-8.

    I saw. Thanks.

     
  • Peter A. Bigot

    Peter A. Bigot - 2012-06-14

    Replying to hhsprings:

    PyXB will generate bindings where utf-8 is used.

    Yes, that's what I need. (Someone may complain, but IMO we should use utf-8 as far as possible.)

    Python2 does not allow non-ASCII identifiers. The only parts that could be left in Shift_JIS would be enumeration values and other strings. It is difficult to find out the encoding of the schema, so for now the bindings will be encoded in utf-8. People using the bindings in their own Python scripts can use any compatible encoding in their scripts.

    When using Python3, I think PyXB should allow the encoding from the schema to be used, so the identifiers do not change.
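
To make the identifier restriction concrete, here is a hedged sketch contrasting Python 2's ASCII-only identifier rule with Python 3's str.isidentifier. The function names and sample identifiers are hypothetical, keywords are ignored for simplicity, and this is not how PyXB itself derives identifiers from schema names.

```python
import re

# Python 2 identifiers: an ASCII letter or underscore, then ASCII letters,
# digits, or underscores.
_PY2_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_python2_identifier(name):
    """True if `name` fits Python 2's ASCII-only identifier rule."""
    return _PY2_IDENTIFIER.fullmatch(name) is not None


def is_python3_identifier(name):
    """True if Python 3 would accept `name` (keywords are not excluded here)."""
    return name.isidentifier()
```

A schema name written in Japanese passes the second check but not the first, which is why such a name has to be changed when bindings are generated for Python 2.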

    Source documents can be in any encoding as long as the XML parser can convert them
    to "real" Unicode internally before PyXB processes them.

    Yes, so I needed to replace the parser.

    There will be a clean way to replace the parser in a patch I will send in a few hours.

    When PyXB supports Python3, the encoding from the schema can be used in the bindings,
    and the Unicode identifiers from the schema can be used too. That will be done in PyXB
    version 2.0.

    I don't necessarily need it, but we would be happier to have it.
    I'll be waiting in anticipation.

    I would like to add the schema and document you posted to the test suite.

    Ahh... I see. The schema and data are not mine, so I looked up the
    license for distribution and found no problem.
    (Sorry, it is in Japanese: http://www.gsi.go.jp/LAW/2930-index.html)

    That makes it much more clear ;-)

    May I add them to PyXB?

    No problem, I think.

    Good. That example will show you how to solve this problem and #141. I will update this ticket when the example is ready.

     
  • Peter A. Bigot

    Peter A. Bigot - 2012-06-14
    • status changed from accepted to closed
    • resolution set to fixed

    Fixed in the following commit. The unicode_jp example shows how to use it.

    The "remaining issue" is that the solution doesn't work when parsing documents that have been built up in memory; see #147. I expect to fix this soon, but it'll take too long and I promised to have 1.1.4 out tomorrow.

    If you can, please check out the next branch from git and see whether the example works. I hope you'll be pleased with it; I think it's really neat, especially that you can use shift_jis in the Python code that interacts with the bindings.

    Thank you for the schema and the suggestions that led to this. I hope the need for a customized "pyxbgen_jp" isn't a problem; the one in the example should do what you want.

    commit 9b48a3122c5d8bcd38ebcdfbc614da19eeade530
    Author: Peter A. Bigot <pabigot@…>
    Date: Thu Jun 14 12:18:17 2012 -0500

    trac/139: support alternative XML parsers

    For the purpose of solving this problem, it is sufficient to use an
    alternative XmlReader; it is not necessary to support XMLStyle_lxml which
    would support lxml in the DOM domain. A crude but usable interface has been
    added to configure alternatives.

    Note that there is a remaining issue which has been opened as trac/147.

     
