Py++ code generation crashes on non-ASCII characters
Brought to you by:
mbaas,
roman_yakovenko
If the generated *.cc file contains non-ASCII characters (e.g. 0xe4 -- the Finnish a-umlaut), Py++ crashes with a UnicodeDecodeError. This also happens if the *previous* version of the file contains non-ASCII characters, presumably because of the comparison "fcontent == fcontent_new" in pyplusplus/file_writers/writer.py, line 109.
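For illustration, here is a minimal sketch of the kind of failure described above. It assumes the crash comes from decoding the file's bytes as ASCII, as the report suspects. It is written for Python 3 and is not taken from Py++; the helper name decode_or_none and the sample bytes are made up for this example.

```python
# Hypothetical reproduction; none of these names exist in Py++.
# The sample bytes are Latin-1 encoded; 0xe4 is a-umlaut, 0xf6 is o-umlaut.
raw = b"K\xe4ytt\xf6oikeus"


def decode_or_none(data, encoding):
    """Return the decoded text, or None if the bytes do not fit the codec."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xe4 is outside the 7-bit ASCII range, so the ASCII codec rejects it.
ascii_result = decode_or_none(raw, "ascii")

# Every byte value is valid Latin-1, so this decode succeeds.
latin1_result = decode_or_none(raw, "latin-1")
```

The point is only that the bytes themselves are fine; what fails is the assumption that they are ASCII.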
Logged In: YES
user_id=1084190
Originator: NO
This is a known bug ( http://language-binding.net/pyplusplus/peps/peps_index.html#unicode-support ), and I think it is time to solve it.
Would you mind attaching a file with non-ASCII characters in it? Also, as I don't have a very good understanding of Unicode, it would be nice if you could create an example I could learn from, showing what I have to do.
Also, can you suggest where I should put the encoding id?
Thanks.
Logged In: YES
user_id=122400
Originator: YES
File Added: unicode_bug.hh
Header file to reproduce the bug
Python file to reproduce the bug
Logged In: YES
user_id=122400
Originator: YES
File Added: unicode_bug.py
C++ file to reproduce the bug
Logged In: YES
user_id=122400
Originator: YES
The file unicode_bug.cc contains a non-ASCII character in the license text. Because of the comparison, this causes a crash regardless of what the new contents of the file will be.
The file unicode_bug.py contains the same text in the license, which causes a crash when the file contents are written out.
I have not dealt much with Unicode in Python, but I'll see if I can find the correct place where the encoding should be specified.
File Added: unicode_bug.cc
Logged In: YES
user_id=1084190
Originator: NO
I could be wrong, but it seems I have found the only place I need to change:
pyplusplus/file_writers/writer.py - writer_t.write_file
I have attached the patch file. Can you temporarily change the encoding to your own and test? If it works for you, then I think I will add the encoding information to the module_builder_t.__init__ method.
Thanks.
File Added: writer.py.patch
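The attached patch is not reproduced in this thread. As a rough sketch of the general idea only (compare the old and new contents as text under an explicit encoding, and write only when they differ), something along these lines could work. The function name write_file_if_changed, its parameters, and the default of utf-8 are assumptions for illustration, not the actual writer_t.write_file code.

```python
import io
import os
import tempfile


def write_file_if_changed(path, new_content, encoding="utf-8"):
    """Write new_content (text) to path unless the file already holds it.

    Hypothetical helper, not part of Py++. Returns True if the file
    was written and False if it was left untouched.
    """
    if os.path.exists(path):
        # Decode the previous contents explicitly instead of relying on
        # an implicit ASCII conversion during the comparison.
        with io.open(path, "r", encoding=encoding) as f:
            if f.read() == new_content:
                return False
    with io.open(path, "w", encoding=encoding) as f:
        f.write(new_content)
    return True


# Usage: the second call finds identical contents and skips the write.
tmp_dir = tempfile.mkdtemp()
target = os.path.join(tmp_dir, "unicode_bug.cc")
license_line = u"// License text with a non-ASCII character: \u00e4\n"
first = write_file_if_changed(target, license_line)
second = write_file_if_changed(target, license_line)
```

Reading and writing through the same explicit encoding keeps the comparison between two text values, so a non-ASCII character in either the old or the new file no longer matters.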
Logged In: YES
user_id=122400
Originator: YES
Yes, this seems to work.
The file unicode_bug.py does not declare character encoding, for which Python 2.4 issues a deprecation warning. In Python 2.5 this has been changed to a syntax error, so the following line needs to be added to the beginning of the file:
# This file uses encoding: utf-8
With this change, code generation seems to work.
To be really bulletproof, the code in writer.py should check the type of fcontent_new. If the license text in unicode_bug.py is given as u"This file..." instead of "This file...", then the type of fcontent_new becomes <type 'unicode'> instead of <type 'str'>, resulting in "TypeError: decoding Unicode is not supported".
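A type check of that kind could look roughly like the sketch below. The helper name to_text and the default encoding are assumptions made for this example; nothing here is taken from writer.py.

```python
def to_text(content, encoding="utf-8"):
    """Return content as text, decoding it only if it is a byte string.

    Hypothetical helper, not part of Py++. Text that is already decoded
    is returned unchanged, so it is never decoded a second time.
    """
    if isinstance(content, bytes):
        return content.decode(encoding)
    return content


# Both inputs end up as the same text value.
from_bytes = to_text(b"This file uses \xc3\xa4")   # UTF-8 encoded bytes
from_text = to_text(u"This file uses \u00e4")      # already text
```

With both sides normalized this way, the comparison and the write can treat byte strings and unicode strings uniformly.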
Logged In: YES
user_id=1084190
Originator: NO
Thanks for the test. I will commit the changes this evening.
The license is not the only text that could contain Unicode: documentation strings, comments, and C++ strings can also contain it. It is very hard to track all the places where the user can insert Unicode.
The best, but most expensive, solution is to change Py++ to work with an encoding specified by the user, but time constraints do not allow me to do this. I think the solution in its current form is good enough and solves the problem.
Logged In: YES
user_id=1084190
Originator: NO
Hi. I committed the changes to SVN. You will have to check out revision 1069.
I also committed a small usage example: http://pygccxml.svn.sourceforge.net/viewvc/pygccxml/pyplusplus_dev/unittests/unicode_bug.py?revision=1069&view=markup
I will create a unit test from it later.
Please verify my changes and check whether they still work for you.
Thanks.
Logged In: YES
user_id=1084190
Originator: NO
Unicode support has been tested on a few different computers and environments, and it seems to work pretty well.