PyXB: Python XML Schema Bindings / Tickets / #139 alternative XML parsers like lxml (-> XMLStyle_lxml)

After some trial and error, I found that combining libxml2 with xml.dom.minidom
works well, at least for my encoding problem.

Like this:

def parse_to_dom(filename):
    # Read the raw bytes once so every fallback parses the same data.
    with open(filename, "rb") as fi:
        data = fi.read()
    try:  # try using minidom with the libxml2 SAX driver
        import libxml2  # imported only to check that the bindings exist
        import xml.sax
        import xml.dom.minidom
        return xml.dom.minidom.parseString(data, xml.sax.make_parser(["drv_libxml2"]))
    except ImportError:
        pass
    try:  # try using lxml, replaying the tree as SAX events into a DOM
        import xml.dom.pulldom
        import lxml.etree, lxml.sax
        tree = lxml.etree.fromstring(data)
        handler = xml.dom.pulldom.SAX2DOM()
        lxml.sax.saxify(tree, handler)
        return handler.document
    except ImportError:
        pass
    # fall back to the PyXB utilities; return the document, like the branches above
    import pyxb.utils.domutils as domutils
    return domutils.StringToDOM(data)

So I think a better approach is to add a way to control which SAX2 reader is used
when pyxb._XMLStyle == pyxb.XMLStyle_minidom.

Peter A. Bigot - 2012-06-13

status changed from new to accepted

Could you attach to this ticket a simple schema file with the encodings you want supported? It'd help make sure the solution I'm looking at works.

hhsprings - 2012-06-13

attachment _schemas.zip_ added

hhsprings - 2012-06-13

Attached.
The originals are written in `Shift_JIS', and we also need `euc-jp' and `iso-2022-jp'.


Peter A. Bigot - 2012-06-13

Thanks. Do you need to create the euc-jp and iso-2022-jp encodings, or was that a side effect of trying to work around the pyxb/xml/expat stuff?

I've verified a patch that works with:

export PYXB_ARCHIVE_PATH='&pyxb/bundles/opengis//:+'
pyxbgen -u original/FGD_GMLSchema.xsd -m fgd
python check.py

where check.py has:

import fgd
xmls = file('original/FG-GML-13-RailCL25000-20080331-0001.xml').read()
instance = fgd.CreateFromDocument(xmls)

And where "works" means it builds the schema and processes the document without error. I don't know either Japanese or GML well enough to tell whether the instance is valid.

(Essentially the only thing needed is to specify 'drv_libxml2' as the preferred parser; you don't even need to use minidom as the style, because saxer works fine. Unless you need to create DOM for some reason.)
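
A minimal sketch of preferring one SAX driver while keeping a fallback, assuming the libxml2 Python bindings provide the `drv_libxml2` driver module named above. `xml.sax.make_parser` tries the listed modules first and falls back to the standard library's default reader when a listed module cannot be imported. The `ElementCounter` handler, the `count_elements` helper, and the sample document are illustrative only and are not part of PyXB.

```python
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Illustrative handler: counts element start events."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_elements(data, preferred=("drv_libxml2",)):
    """Parse bytes with the first importable driver from `preferred`.

    If none of the preferred driver modules can be imported,
    make_parser falls back to the standard library default.
    """
    parser = xml.sax.make_parser(list(preferred))
    handler = ElementCounter()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    # Report which module supplied the parser that was actually used.
    return handler.count, type(parser).__module__


sample = b'<?xml version="1.0" encoding="utf-8"?><root><a/><b/></root>'
count, driver = count_elements(sample)
print(count, driver)
```

The second value printed shows which driver was actually selected, which is a quick way to confirm whether the preferred parser is installed on a given machine.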

Peter A. Bigot - 2012-06-13

A question and a comment:

- Do you mind if that schema and document are added to the test suite?

- Sorry, but as of 1.1.4 the OpenGIS bindings will not be part of the full download. The bundle will still be there, but you'll have to run the script that fetches the schema and translates them yourself. I updated the bundle this morning, and OpenGIS now takes 50MB, about 10x more than the rest of PyXB. If that's going to be a big problem, I can see if I can provide them as an add-on when #130 gets addressed for 1.1.5.


hhsprings - 2012-06-14

English is difficult for me (though perhaps not as much as Japanese is for you),
so I didn't understand your comment well.
(I can't speak it at all.)

> Do you need to create the euc-jp and iso-2022-jp encodings, or was that
> a side effect of trying to work around the pyxb/xml/expat stuff?

I can't understand this sentence well...
Does `create the ... encodings' mean something like this?

# -*- coding: iso-2022-jp -*-
# ./raw/fgd.py
# ...

If that is what you meant, I don't need it. If pyxbgen always generates
utf-8, I think that is no problem for us.
(This is the gap between the ideal (utf-8) and reality (Shift_JIS)...)

> Do you mind if that schema and document are added to the test suite?

I can't understand this well either... sorry.

If you mean `Do you have a test suite for this issue?', the answer is no.
My current task is to investigate and evaluate technologies and
infrastructure for the GIS world (GDAL, geos, Shapely, pyproj, postgis, etc.)
for a future project of ours, so I can't spend much time on each one.

> the OpenGIS bindings will not be part of the full download

No problem for me. I agree that GML is too heavy to bundle.
Describing how to build the bindings on the project web page would be enough.
(Even without that, we wouldn't mind.)

hhsprings - 2012-06-14

BTW, it seems that `utf8' is wrong; see http://docs.python.org/library/pyexpat.html?highlight=utf8

./pyxb/binding/generate.py
./build/lib/pyxb/binding/generate.py

It should be `utf-8', not `utf8', though Python accepts both.
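
A small check of that last point, using only the standard library and assuming Python 3: the codec registry treats `utf8` as an alias and reports `utf-8` as the canonical name. Consumers outside Python may be stricter about the spelling, so the hyphenated form is the safer one to write into generated files.

```python
import codecs

# Every spelling resolves to the same codec; print its canonical name.
for spelling in ("utf8", "utf-8", "UTF-8"):
    print(spelling, "->", codecs.lookup(spelling).name)

# True when the alias and the hyphenated name share one canonical name.
same_codec = codecs.lookup("utf8").name == codecs.lookup("utf-8").name
```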


Peter A. Bigot - 2012-06-14

I will try again.

Encodings: All the encodings should produce the same Unicode once the data is in Python. So you do not need to convert the schema to other encodings. Leave it in Shift_JIS. PyXB will generate bindings where utf-8 is used. Source documents can be in any encoding as long as the XML parser can convert them to "real" Unicode internally before PyXB processes them. When PyXB supports Python3, the encoding from the schema can be used in the bindings, and the Unicode identifiers from the schema can be used too. That will be done in PyXB version 2.0.
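
To illustrate the encodings point with a sketch (assuming Python 3 and an arbitrary sample string that is not taken from the attached schema): the same text has a different byte form in each encoding, yet decoding any of them gives the identical Unicode string.

```python
text = "日本語"  # illustrative sample string

# Encode the same text in each encoding mentioned in this ticket.
encodings = ("shift_jis", "euc_jp", "iso2022_jp", "utf-8")
encoded = {name: text.encode(name) for name in encodings}

# The byte sequences differ from one encoding to the next...
distinct_byte_forms = len(set(encoded.values()))

# ...but decoding each one yields the identical Unicode string.
all_equal = all(data.decode(name) == text for name, data in encoded.items())
print(distinct_byte_forms, all_equal)
```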

Test suite: I would like to add the schema and document you posted to the test suite. Some people do not want the files they provide to be given away to others like this, for security or intellectual property reasons. Anybody can see the files here on the trac site, but more people will see them if they are put into PyXB's tests directory. May I add them to PyXB?

I have made many changes in the last two days improving Unicode support. Yes, utf8 was wrong and is now utf-8. If you can use git, try:

git clone -b next git://pyxb.git.sourceforge.net/gitroot/pyxb/pyxb

to see what has changed.

I hope to have a solution to this problem and #141 later today, and it will be supported in PyXB 1.1.5 to be released tomorrow.

hhsprings - 2012-06-14

> I will try again.

Thanks.

> All the encodings should produce the same Unicode once the data is in Python.
> So you do not need to convert the schema to other encodings.
> Leave it in Shift_JIS.
> PyXB will generate bindings where utf-8 is used.

Yes, that's what I need. (Someone may complain, but IMO we should use utf-8 as far as possible.)

> Source documents can be in any encoding as long as the XML parser can convert them
> to "real" Unicode internally before PyXB processes them.

Yes, so I needed to replace the parser.

> When PyXB supports Python3, the encoding from the schema can be used in the bindings,
> and the Unicode identifiers from the schema can be used too. That will be done in PyXB
> version 2.0.

I don't strictly need it, but we would be happier to have it.
I'll be waiting in anticipation.

> I would like to add the schema and document you posted to the test suite.

Ahh... I see. The schema and data are not mine, so I looked up the
license terms for distribution and found no problem.
(Sorry, it is in Japanese: http://www.gsi.go.jp/LAW/2930-index.html)

> May I add them to PyXB?

No problem, I think.

> I have made many changes in the last two days improving Unicode support.
> Yes, utf8 was wrong and is now utf-8.

I saw. Thanks.


Peter A. Bigot - 2012-06-14

Replying to hhsprings:

> > PyXB will generate bindings where utf-8 is used.
>
> Yes, that's what I need. (Someone may complain, but IMO we should use utf-8 as far as possible.)

Python2 does not allow non-ASCII identifiers. The only parts that could be left in Shift_JIS would be enumeration values and other strings. It is difficult to find out the encoding of the schema, so for now the bindings will be encoded in utf-8. People using the bindings in their own Python scripts can use any compatible encoding in their scripts.
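
A rough sketch of the identifier restriction, written for Python 3 and using hypothetical helper names and sample names chosen only for illustration: Python 2 identifiers are limited to ASCII letters, digits, and underscores, while Python 3 also accepts many non-ASCII letters.

```python
import keyword
import re

# Python 2 identifier syntax: ASCII letter or underscore, then ASCII word chars.
_ASCII_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def usable_in_python2(name):
    """True if `name` fits Python 2's ASCII-only identifier syntax."""
    return _ASCII_IDENTIFIER.fullmatch(name) is not None


def usable_in_python3(name):
    """True if `name` is a valid, non-keyword identifier in Python 3."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ("RailCL", "鉄道", "2ndTrack"):
    print(candidate, usable_in_python2(candidate), usable_in_python3(candidate))
```

A binding generator targeting Python 2 would have to rename or transliterate any schema name for which the first check fails.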

When using Python3, I think PyXB should allow the encoding from the schema to be used, so the identifiers do not change.

> > Source documents can be in any encoding as long as the XML parser can convert them
> > to "real" Unicode internally before PyXB processes them.
>
> Yes, so I needed to replace the parser.

There will be a clean way to replace the parser in a patch I will send in a few hours.

> > When PyXB supports Python3, the encoding from the schema can be used in the bindings,
> > and the Unicode identifiers from the schema can be used too. That will be done in PyXB
> > version 2.0.
>
> I don't strictly need it, but we would be happier to have it.
> I'll be waiting in anticipation.

> > I would like to add the schema and document you posted to the test suite.
>
> Ahh... I see. The schema and data are not mine, so I looked up the
> license terms for distribution and found no problem.
> (Sorry, it is in Japanese: http://www.gsi.go.jp/LAW/2930-index.html)

That makes it much more clear ;-)

> > May I add them to PyXB?
>
> No problem, I think.

Good. That example will show you how to solve this problem and #141. I will update this ticket when the example is ready.


Peter A. Bigot - 2012-06-14

status changed from accepted to closed

resolution set to fixed

Fixed in the following commit. The unicode_jp example shows how to use it.

The "remaining issue" is that the solution doesn't work when parsing documents that have been built up in memory; see #147. I expect to fix this soon, but it'll take too long and I promised to have 1.1.4 out tomorrow.

If you can, please checkout the next branch from git and see whether the example works. I hope you'll be pleased with it; I think it's really neat, especially that you can use shift_jis in the Python code that interacts with the bindings.

Thank you for the schema and the suggestions that led to this. I hope the need for a customized "pyxbgen_jp" isn't a problem; the one in the example should do what you want.

commit 9b48a3122c5d8bcd38ebcdfbc614da19eeade530
Author: Peter A. Bigot <pabigot@…>
Date: Thu Jun 14 12:18:17 2012 -0500

trac/139: support alternative XML parsers

For the purpose of solving this problem, it is sufficient to use an
alternative XmlReader; it is not necessary to support XMLStyle_lxml which
would support lxml in the DOM domain. A crude but usable interface has been
added to configure alternatives.

Note that there is a remaining issue which has been opened as trac/147.

hhsprings - 2012-06-15

One correction:
the schema and data are available from http://fgd.gsi.go.jp/download/,
not from http://www.gsi.go.jp/LAW/2930-index.html.
The latter page, http://www.gsi.go.jp/LAW/2930-index.html, covers the legal terms
for distribution.

Thank you for your excellent work.


Peter A. Bigot - 2012-06-15

Replying to hhsprings:

> Schema, and data is available from http://fgd.gsi.go.jp/download/,
> not http://www.gsi.go.jp/LAW/2930-index.html

Thanks; I'll update the readme.

> Thank you for your excellent work.

You're welcome.

