#8 Py++ code generation crashes at non-Ascii character

closed-fixed
Roman
None
5
2008-02-06
2007-06-19
No

If the generated *.cc file contains non-ASCII characters (e.g. 0xe4 -- the Finnish a-umlaut), Py++ crashes with a UnicodeDecodeError. This also happens if the *previous* version of the file contains non-ASCII characters, presumably because of the comparison "fcontent == fcontent_new" in pyplusplus/file_writers/writer.py, line 109.
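
The failure mode described above can be reproduced in miniature. The sketch below is an illustration only, written in Python 3 syntax (the report concerns Python 2, where the ASCII decode is attempted implicitly when byte strings and unicode strings are mixed). The sample bytes and the helper name `try_decode` are assumptions made for this example and are not part of Py++.

```python
# Hypothetical illustration, not Py++ code.
# 0xe4 is the Latin-1 encoding of a-umlaut; it is not valid ASCII.
LICENSE_BYTES = b"// Copyright J\xe4rvinen\n"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes do not fit the codec."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decoding with the ASCII codec fails on the non-ASCII byte ...
ascii_result = try_decode(LICENSE_BYTES, "ascii")

# ... while decoding with an explicit, matching encoding succeeds.
latin1_result = try_decode(LICENSE_BYTES, "latin-1")
```

Here `ascii_result` is `None` and `latin1_result` holds the decoded text, which suggests why naming the encoding explicitly avoids the crash.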

Discussion

  • Roman
    2007-06-20

    Logged In: YES
    user_id=1084190
    Originator: NO

    This is a known bug (http://language-binding.net/pyplusplus/peps/peps_index.html#unicode-support) and I think it is time to solve it.

    Would you mind attaching a file with non-ASCII characters inside? Also, as I don't have a very good understanding of Unicode, it would be nice if you could create an example I could learn from, showing what I have to do.

    Also, can you suggest where I should put the encoding identifier?

    Thanks.

     
  • Roman
    2007-06-20

    • assigned_to: nobody --> roman_yakovenko
    • status: open --> open-accepted
     
  • Logged In: YES
    user_id=122400
    Originator: YES

    File Added: unicode_bug.hh

     
  • Header file to reproduce the bug

     
    Attachments
  • Python file to reproduce the bug

     
    Attachments
  • Logged In: YES
    user_id=122400
    Originator: YES

    File Added: unicode_bug.py

     
  • C++ file to reproduce the bug

     
    Attachments
  • Logged In: YES
    user_id=122400
    Originator: YES

    The file unicode_bug.cc contains a non-ASCII character in the license text. This causes a crash regardless of what the new contents of the file will be, because of the comparison.

    The file unicode_bug.py contains the same text in the license, which causes a crash when the file contents are written out.

    I have not dealt much with unicode in Python, but I'll see if I can find the correct place where the encoding should be specified.
    File Added: unicode_bug.cc

     
  • Roman
    2007-06-21

     
    Attachments
  • Roman
    2007-06-21

    Logged In: YES
    user_id=1084190
    Originator: NO

    I could be wrong, but it seems I have found the only place I need to change:
    pyplusplus/file_writers/writer.py - writer_t.write_file

    I have attached the patch file. Can you temporarily change the encoding to yours and test? If it works for you, then I think I will add the encoding information to the module_builder_t.__init__ method.

    Thanks.
    File Added: writer.py.patch

     
  • Logged In: YES
    user_id=122400
    Originator: YES

    Yes, this seems to work.

    The file unicode_bug.py does not declare a character encoding, for which Python 2.4 issues a deprecation warning. In Python 2.5 this has been changed to a syntax error, so the following line needs to be added to the beginning of the file:

    # This file uses encoding: utf-8

    With this change, code generation seems to work.

    To be really bulletproof, the code in writer.py should check the type of fcontent_new. If the license text in unicode_bug.py is given as u"This file..." instead of "This file...", then the type of fcontent_new becomes <type 'unicode'> instead of <type 'str'>, resulting in TypeError: decoding Unicode is not supported.

     
  • Roman
    2007-06-21

    Logged In: YES
    user_id=1084190
    Originator: NO

    Thanks for the test. I will commit the changes this evening.

    The license is not the only text that could contain Unicode: documentation strings, comments and C++ strings can also contain it. It is very hard to track all the places where the user can insert Unicode.

    The best, but expensive, solution is to change Py++ to work with an encoding specified by the user, but time constraints do not allow me to do this. I think the solution in its current form is good enough and it solves the problem.

     
  • Roman
    2007-08-20

    Logged In: YES
    user_id=1084190
    Originator: NO

    Unicode support has been tested on a few different computers and environments, and it seems to work pretty well.

     
  • Roman
    2007-08-20

    • status: open-accepted --> open-fixed
     
  • Roman
    2008-02-06

    • status: open-fixed --> closed-fixed
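
The approach discussed in this thread (decode with an explicit encoding before comparing old and new file contents, and accept both byte strings and text) can be sketched roughly as follows. This is not the attached writer.py.patch, whose contents are not shown on this page. It is a minimal Python 3 sketch under stated assumptions: the function names `to_text` and `write_file`, the `encoding` parameter, and its default are all hypothetical and are not taken from Py++.

```python
import os
import tempfile


def to_text(content, encoding):
    """Accept either bytes or str and always return str (hypothetical helper)."""
    if isinstance(content, bytes):
        return content.decode(encoding)
    return content


def write_file(path, content, encoding="utf-8"):
    """Write content to path unless the file already holds the same text.

    Returns True if the file was (re)written, False if it was left untouched.
    Both the old and the new contents are handled as text in one explicit
    encoding, so the comparison never mixes bytes and str.
    """
    new_text = to_text(content, encoding)
    if os.path.exists(path):
        # newline="" disables newline translation so the round trip is exact.
        with open(path, "r", encoding=encoding, newline="") as existing:
            if existing.read() == new_text:
                return False
    with open(path, "w", encoding=encoding, newline="") as out:
        out.write(new_text)
    return True


# Usage: the second call passes the same content as bytes and is a no-op.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "unicode_bug.cc")
    first = write_file(target, "// J\u00e4rvinen\n", encoding="latin-1")
    second = write_file(target, b"// J\xe4rvinen\n", encoding="latin-1")
```

Normalizing the type up front addresses the point raised earlier in the thread about fcontent_new being either a byte string or a unicode string. Where the encoding should come from (for example, a constructor argument) is left open here.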