After some trial and error, I noticed that libxml2 and xml.dom.minidom work well together,
at least for my encoding problem.
Like this:
    with open(filename, "rb") as fi:
        try:
            # try using minidom with libxml2 SAX
            import libxml2
            import xml.sax
            import xml.dom.minidom
            return xml.dom.minidom.parseString(
                fi.read(), xml.sax.make_parser(["drv_libxml2"]))
        except ImportError:
            try:
                # try using lxml
                import xml.dom.pulldom
                import lxml.etree, lxml.sax
                tree = lxml.etree.fromstring(fi.read())
                handler = xml.dom.pulldom.SAX2DOM()
                lxml.sax.saxify(tree, handler)
                handler.documentElement = handler.document
                return handler.documentElement
            except ImportError:
                # using pyXB utils
                import pyxb.utils.domutils as domutils
                doc = domutils.StringToDOM(fi.read())
                return doc.documentElement
So I guess a better approach is to add a way to control the SAX2 reader
when pyxb._XMLStyle == pyxb.XMLStyle_minidom.
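Where neither libxml2 nor lxml can be installed, one possible stdlib-only workaround is to transcode the document to UTF-8 before handing it to the default expat-based parser. This is only a minimal sketch and not what this ticket or PyXB does. It assumes Python 3, that the encoding named in the XML declaration is a codec Python knows, and that the declaration itself is ASCII-compatible. The helper name `parse_transcoded` is hypothetical.

```python
import re
import xml.dom.minidom

# Matches an XML declaration and captures the declared encoding name.
_XML_DECL = re.compile(
    br'^(<\?xml[^>]*?encoding\s*=\s*["\'])([A-Za-z][A-Za-z0-9._-]*)(["\'])'
)


def parse_transcoded(data, default_encoding="utf-8"):
    """Parse XML bytes in any Python-supported encoding (hypothetical helper)."""
    match = _XML_DECL.match(data)
    encoding = match.group(2).decode("ascii") if match else default_encoding
    # Decode with Python's own codec, then re-encode as UTF-8 for expat.
    utf8_data = data.decode(encoding).encode("utf-8")
    if match:
        # Rewrite the declaration so it agrees with the new byte encoding.
        utf8_data = _XML_DECL.sub(br'\g<1>utf-8\g<3>', utf8_data, count=1)
    return xml.dom.minidom.parseString(utf8_data)


# Example: a small Shift_JIS document (sample data made up for illustration).
sample = '<?xml version="1.0" encoding="Shift_JIS"?><name>東京</name>'.encode("shift_jis")
doc = parse_transcoded(sample)
print(doc.documentElement.firstChild.data)
```

The trade-off is that the whole document is held in memory twice, so replacing the parser, as discussed in this ticket, is probably the better fit for large GML files.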
Could you attach to this ticket a simple schema file with the encodings you want supported? It'd help make sure the solution I'm looking at works.
Attached.
The originals are written in `Shift_JIS', and we also need `euc-jp' and `iso-2022-jp'.
Thanks. Do you need to create the euc-jp and iso-2022-jp encodings, or was that a side effect of trying to work around the pyxb/xml/expat stuff?
I've verified a patch that works with:
where check.py has:
And where "works" means it builds the schema and processes the document without error. I don't know either Japanese or GML well enough to tell whether the instance is valid.
(Essentially the only thing needed is to specify 'drv_libxml2' as the preferred parser; you don't even need to use minidom as the style, because saxer works fine. Unless you need to create DOM for some reason.)
A question and a comment:
- Do you mind if that schema and document are added to the test suite?
- Sorry, but as of 1.1.4 the OpenGIS bindings will not be part of the full download. The bundle will still be there, but you'll have to run the script that fetches the schema and translates them yourself. I updated the bundle this morning, and OpenGIS now takes 50MB, about 10x more than the rest of PyXB. If that's going to be a big problem, I can see if I can provide them as an add-on when #130 gets addressed for 1.1.5.
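The idea of naming a preferred SAX driver can be sketched with the standard library alone. `xml.sax.make_parser` takes a list of driver module names to try first and falls back to its built-in defaults when they cannot be imported. This sketch assumes Python 3; `parse_with_preferred` is a hypothetical helper, not part of PyXB, and `drv_libxml2` is only usable where the libxml2 Python bindings are installed.

```python
import io
import xml.dom.minidom
import xml.sax


def parse_with_preferred(data, preferred=("drv_libxml2",)):
    """Build a minidom document from XML bytes, trying preferred SAX drivers first."""
    # make_parser tries each named driver module, then its built-in default list.
    parser = xml.sax.make_parser(list(preferred))
    # minidom.parse accepts a file-like object plus an explicit SAX parser.
    return xml.dom.minidom.parse(io.BytesIO(data), parser)


doc = parse_with_preferred(b'<?xml version="1.0" encoding="utf-8"?><root>ok</root>')
print(doc.documentElement.tagName)
```

Whether a Shift_JIS document then parses depends on which driver was actually selected, so code relying on this may want to check the type of the parser that `make_parser` returned.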
English is difficult for me, though perhaps not as much as Japanese is for you,
so I didn't understand your comment well.
(I can't speak it at all.)
I can't understand this sentence well...
Does `create the ... encodings' mean something like this?
Assuming that is what you meant, I don't need it. I think if pyxbgen always generates
utf-8, there is no problem for us, maybe.
(That is the disconnect between the ideal (utf-8) and reality (Shift_JIS)...)
> Do you mind if that schema and document are added to the test suite?
I can't understand this well either, ...sorry.
If you mean `Do you have a test suite for this issue?', the answer is no, not at all.
Actually, my current mission is to investigate and evaluate technologies and
infrastructure for the GIS world (like GDAL, geos, Shapely, pyproj, postgis, etc.)
for a future project of ours, so I can't spend much time on each technology.
> the OpenGIS bindings will not be part of the full download
No problem for me. I think GML is too heavy to bundle.
Mentioning how to build those on the project web page is enough.
(Even without that, we won't mind, maybe.)
BTW, it seems that `utf8' is wrong; see http://docs.python.org/library/pyexpat.html?highlight=utf8
It should be `utf-8', not `utf8', though Python can handle it.
I will try again.
Encodings: All the encodings should produce the same Unicode once the data is in Python. So you do not need to convert the schema to other encodings. Leave it in Shift_JIS. PyXB will generate bindings where utf-8 is used. Source documents can be in any encoding as long as the XML parser can convert them to "real" Unicode internally before PyXB processes them. When PyXB supports Python3, the encoding from the schema can be used in the bindings, and the Unicode identifiers from the schema can be used too. That will be done in PyXB version 2.0.
Test suite: I would like to add the schema and document you posted to the test suite. Some people do not want the files they provide to be given away to others like this, for security or intellectual property reasons. Anybody can see the files here on the trac site, but more people will see them if they are put into PyXB's tests directory. May I add them to PyXB?
I have made many changes in the last two days improving Unicode support. Yes, utf8 was wrong and is now utf-8. If you can use git, try:
to see what has changed.
I hope to have a solution to this problem and #141 later today, and it will be supported in PyXB 1.1.5 to be released tomorrow.
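A quick way to see why the source encoding stops mattering once the data is in Python: the same text encoded in different Japanese encodings decodes to identical Unicode strings, and `utf8' and `utf-8' resolve to the same Python codec. This is an illustrative sketch assuming Python 3; the sample string and the helper names are made up for the example.

```python
import codecs

SAMPLE = "基盤地図情報"  # arbitrary Japanese sample text
ENCODINGS = ("shift_jis", "euc-jp", "iso-2022-jp", "utf-8")


def decode_all(text, encodings=ENCODINGS):
    """Encode `text` in each encoding, then decode it back to Unicode."""
    return {enc: text.encode(enc).decode(enc) for enc in encodings}


def same_codec(name_a, name_b):
    """Return True if both names resolve to the same Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


# The byte sequences differ per encoding, but the decoded strings do not.
print(len({SAMPLE.encode(enc) for enc in ENCODINGS}))
print(set(decode_all(SAMPLE).values()) == {SAMPLE})
print(same_codec("utf8", "utf-8"))
```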
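Thanks.
> All the encodings should produce the same Unicode once the data is in Python.
> So you do not need to convert the schema to other encodings.
> Leave it in Shift_JIS.
> PyXB will generate bindings where utf-8 is used.
Yes, that's what I need. (Someone may complain, but IMO we should use utf-8 as far as possible.)
> Source documents can be in any encoding as long as the XML parser can convert them
> to "real" Unicode internally before PyXB processes them.
Yes. So I needed to replace the parser.
> When PyXB supports Python3, the encoding from the schema can be used in the bindings,
> and the Unicode identifiers from the schema can be used too. That will be done in PyXB
> version 2.0.
I don't necessarily need it, but we'd be happier with it.
I'll be waiting in anticipation.
> I would like to add the schema and document you posted to the test suite.
Ahh... I see. That schema and data are not mine, so I looked for the
license for distribution, and found no problem.
(Sorry, it is in Japanese: http://www.gsi.go.jp/LAW/2930-index.html)
> May I add them to PyXB?
No problem, I think.
> I have made many changes in the last two days improving Unicode support.
> Yes, utf8 was wrong and is now utf-8.
I saw. Thanks.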
Replying to hhsprings:
> Yes, that's what I need. (Someone may complain, but IMO we should use utf-8 as far as possible.)
Python2 does not allow non-ASCII identifiers. The only parts that could be left in Shift_JIS would be enumeration values and other strings. It is difficult to find out the encoding of the schema, so for now the bindings will be encoded in utf-8. People using the bindings in their own Python scripts can use any compatible encoding in their scripts.
When using Python3, I think PyXB should allow the encoding from the schema to be used, so the identifiers do not change.
> Yes. So I needed to replace the parser.
There will be a clean way to replace the parser in a patch I will send in a few hours.
> (Sorry, it is in Japanese: http://www.gsi.go.jp/LAW/2930-index.html)
That makes it much more clear ;-)
> No problem, I think.
Good. That example will show you how to solve this problem and #141. I will update this ticket when the example is ready.
Fixed in the following commit. The unicode_jp example shows how to use it.
The "remaining issue" is that the solution doesn't work when parsing documents that have been built up in memory; see #147. I expect to fix this soon, but it'll take too long and I promised to have 1.1.4 out tomorrow.
If you can, please check out the next branch from git and see whether the example works. I hope you'll be pleased with it; I think it's really neat, especially that you can use shift_jis in the Python code that interacts with the bindings.
Thank you for the schema and the suggestions that led to this. I hope the need for a customized "pyxbgen_jp" isn't a problem; the one in the example should do what you want.
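commit 9b48a3122c5d8bcd38ebcdfbc614da19eeade530
Author: Peter A. Bigot <pabigot@…>
Date: Thu Jun 14 12:18:17 2012 -0500
    trac/139: support alternative XML parsers
    For the purpose of solving this problem, it is sufficient to use an
    alternative XmlReader; it is not necessary to support XMLStyle_lxml which
    would support lxml in the DOM domain. A crude but usable interface has been
    added to configure alternatives.
    Note that there is a remaining issue which has been opened as trac/147.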
One point to note:
The schema and data are available from http://fgd.gsi.go.jp/download/,
not from http://www.gsi.go.jp/LAW/2930-index.html;
http://www.gsi.go.jp/LAW/2930-index.html covers the legal terms
for distribution.
Thank you for your excellent work.
Replying to hhsprings:
Thanks; I'll update the readme.
You're welcome.