#8 Py++ code generation crashes at non-Ascii character

Status: closed-fixed

Owner: Roman

Labels: None

Priority: 5

Updated: 2008-02-06

Created: 2007-06-19

Creator: Pertti Kellomäki

Private: No

If the generated *.cc file contains non-Ascii characters (e.g. 0xe4 -- the Finnish a-umlaut), Py++ crashes with a UnicodeDecodeError. This also happens if the *previous* version of the file contains non-Ascii characters, presumably because of the comparison "fcontent == fcontent_new" in pyplusplus/file_writers/writer.py, line 109.
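
A minimal sketch of the failure mode described above, written for Python 3, where the ASCII decode has to be spelled out (Py++ at the time ran on Python 2, where mixing byte strings and Unicode strings could trigger it implicitly). The byte string is a made-up stand-in for the license text, and decode_ascii is a hypothetical helper, not part of Py++.

```python
# 0xe4 is the a-umlaut in Latin-1; it is outside the 7-bit ASCII range.
raw = b"// license text with a-umlaut: \xe4\n"


def decode_ascii(data):
    """Decode bytes as ASCII; return (text, error) instead of raising."""
    try:
        return data.decode("ascii"), None
    except UnicodeDecodeError as exc:
        return None, exc


text, error = decode_ascii(raw)
print(type(error).__name__, "at byte offset", error.start)

# With an explicit encoding that covers the byte, decoding succeeds.
print(raw.decode("latin-1"))
```

The point is only that the byte itself is fine; what fails is decoding it without knowing the encoding.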

Discussion

Roman - 2007-06-20

This is a known bug ( http://language-binding.net/pyplusplus/peps/peps_index.html#unicode-support ), and I think it is time to solve it.

Would you mind attaching a file with non-ASCII characters inside? Also, as I don't have a very good understanding of Unicode, it would be nice if you could create an example from which I could learn what I have to do.

Also, can you suggest where I should put the encoding identifier?

Thanks.

Roman - 2007-06-20

assigned_to: nobody --> roman_yakovenko

status: open --> open-accepted

Pertti Kellomäki - 2007-06-20

File Added: unicode_bug.hh

Pertti Kellomäki - 2007-06-20

Header file to reproduce the bug

unicode_bug.hh

Pertti Kellomäki - 2007-06-20

Python file to reproduce the bug

unicode_bug.py

Pertti Kellomäki - 2007-06-20

File Added: unicode_bug.py

Pertti Kellomäki - 2007-06-20

C++ file to reproduce the bug

unicode_bug.cc

Pertti Kellomäki - 2007-06-20

The file unicode_bug.cc contains a non-Ascii character in the license text. This causes a crash regardless of what the new contents of the file will be, because of the comparison.

The file unicode_bug.py contains the same text in the license, which causes a crash when the file contents are written out.

I have not dealt much with Unicode in Python, but I'll see if I can find the correct place where the encoding should be specified.

File Added: unicode_bug.cc

Roman - 2007-06-21

writer.py.patch

Roman - 2007-06-21

I could be wrong, but it seems I found the only place I need to change:
pyplusplus/file_writers/writer.py - writer_t.write_file

I have attached the patch file. Can you temporarily change the encoding to yours and test? If it works for you, then I think I will add the encoding information to the module_builder_t.__init__ method.

Thanks.

File Added: writer.py.patch

Pertti Kellomäki - 2007-06-21

Yes, this seems to work.

The file unicode_bug.py does not declare character encoding, for which Python 2.4 issues a deprecation warning. In Python 2.5 this has been changed to a syntax error, so the following line needs to be added to the beginning of the file:

# This file uses encoding: utf-8

With this change, code generation seems to work.
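
For reference, a small sketch of how such a declaration is picked up, using the standard library's tokenize.detect_encoding in Python 3 (which defaults to UTF-8 when no declaration is present, unlike the Python 2 versions discussed here). declared_encoding is a hypothetical helper and the source snippets are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python detects from the first two lines."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# The comment only needs to contain "coding:" followed by a codec name.
print(declared_encoding(b"# This file uses encoding: utf-8\nx = 1\n"))
print(declared_encoding(b"# This file uses encoding: latin-1\nx = 1\n"))
```

The declaration has to appear on the first or second line of the file to be recognized.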

In order to be really bullet proof, the code in writer.py should check the type of fcontent_new. If the license text in unicode_bug.py is given as u"This file..." instead of "This file...", then the type of fcontent_new becomes <type 'unicode'> instead of <type 'str'>, resulting in TypeError: decoding Unicode is not supported.
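
A rough sketch of the kind of type check meant here, written for Python 3, where the Python 2 str/unicode pair corresponds to bytes/str. to_text and content_changed are hypothetical names, not the actual code in writer.py, and the default encoding is an assumption.

```python
def to_text(content, encoding="utf-8"):
    """Return content as text, decoding only if it is still bytes."""
    if isinstance(content, bytes):
        return content.decode(encoding)
    if isinstance(content, str):
        return content  # already text; decoding again would be an error
    raise TypeError("expected bytes or str, got %s" % type(content).__name__)


def content_changed(old, new, encoding="utf-8"):
    """Compare old and new file contents after normalizing both to text."""
    return to_text(old, encoding) != to_text(new, encoding)


# Old contents as read from disk (bytes), new contents as generated text.
old_on_disk = "// a-umlaut: \u00e4\n".encode("utf-8")
print(content_changed(old_on_disk, "// a-umlaut: \u00e4\n"))
print(content_changed(old_on_disk, "// something else\n"))
```

Normalizing both sides before the comparison means it no longer matters whether the license text was given as a byte string or a Unicode string.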

Roman - 2007-06-21

Thanks for the test. I will commit the changes this evening.

The license is not the only text that could contain Unicode: documentation strings, comments, and C++ strings can also contain it. It is very hard to track all the places where the user can insert Unicode.

The best, but expensive, solution is to change Py++ to work with an encoding specified by the user, but time constraints do not allow me to do this. I think the solution in its current form is good enough, and it solves the problem.

Roman - 2007-06-21

Hi. I committed the changes to SVN. You will have to check out revision 1069.
I also committed a small usage example: http://pygccxml.svn.sourceforge.net/viewvc/pygccxml/pyplusplus_dev/unittests/unicode_bug.py?revision=1069&view=markup

I will create a unit test from it later.

Please verify that my changes still work for you.

Thanks.

Roman - 2007-08-20

Unicode support has been tested on a few different computers and environments, and it seems to work pretty well.

Roman - 2007-08-20

status: open-accepted --> open-fixed

Roman - 2008-02-06

status: open-fixed --> closed-fixed