
#25 UnicodeEncodeError

closed
nobody
None
5
2012-10-26
2011-07-31
Blackwell
No

Entering a German umlaut (for example "ä") causes this error for me:

Traceback (most recent call last):
  File "C:\Users\me\pycmd\PyCmd.py", line 661, in <module>
    main()
  File "C:\Users\me\pycmd\PyCmd.py", line 153, in main
    sys.stdout.write(line[:sel_start])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

These may be related:

http://stackoverflow.com/questions/1473577/writing-unicode-strings-via-sys-stdout-in-python
http://bugs.python.org/issue4947

This naive patch seems to fix the immediate problem, but it is surely incomplete, because it does not cover the other places where sys.stdout.write() is used:

diff --git a/PyCmd.py b/PyCmd.py
index bc178f5..a8ed7d0 100644
--- a/PyCmd.py
+++ b/PyCmd.py
@@ -150,11 +150,11 @@ def main():
cursor = len(state.before_cursor)
sel_start, sel_end = state.get_selection_range()
set_text_attributes(orig_attr)
- sys.stdout.write(line[:sel_start])
+ sys.stdout.write(line[:sel_start].encode(sys.stdout.encoding))
set_text_attributes(orig_attr ^ FOREGROUND_WHITE ^ BACKGROUND_WHITE)
- sys.stdout.write(line[sel_start: sel_end])
+ sys.stdout.write(line[sel_start: sel_end].encode(sys.stdout.encoding))
set_text_attributes(orig_attr)
- sys.stdout.write(line[sel_end:])
+ sys.stdout.write(line[sel_end:].encode(sys.stdout.encoding))
else:
# print '\n\n', (before_cursor + after_cursor).split(history_filter), '\n\n'
(chunks, seps) = split_nocase(state.before_cursor + state.after_cursor, state.history_filter)

Discussion

  • Blackwell

    Blackwell - 2011-07-31

    I forgot to add relevant details, sorry:

    • Windows 7, German
    • Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)] on win32
    • cmd.exe codepage is 850.
    • The error does not occur when starting cmd.exe with "cmd.exe /u ...", but then I have other problems in Console2.
    • In Python sys.getdefaultencoding() returns "ascii".
    • In Python sys.stdout.encoding is "cp850".
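A minimal sketch of the codec difference behind the two values listed above, assuming nothing about PyCmd's internals: the 'ascii' codec cannot represent u'\xe4', while 'cp850' maps it to a single byte. The helper name try_encode is hypothetical and only used for illustration; the sketch does not reproduce PyCmd's exact failure path.

```python
text = u"\xe4"  # the a-umlaut character from the report

def try_encode(value, encoding):
    """Return the encoded bytes, or None if the codec cannot represent value."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None

# 'ascii' only covers code points 0..127, so the umlaut is rejected.
ascii_result = try_encode(text, "ascii")

# Code page 850 has a slot for the umlaut, so encoding succeeds.
cp850_result = try_encode(text, "cp850")
```

Which of the two codecs Python ends up using for a given write depends on how the output stream was set up, which is why encoding explicitly avoids the surprise.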
     
  • Horea Haitonic

    Horea Haitonic - 2011-08-04

    The report is correct, PyCmd cannot currently handle non-English characters. The fix you suggest looks good, I can try to apply it to all the places where sys.stdout.write is used. Thank you very much!

    p.s. 1. If you feel like it and have time to spend, you can implement this yourself and send me a patch to apply on your behalf (this shouldn't be too hard, I am thinking of a write_str() function in console.py to replace all invocations of sys.stdout.write())

    p.s. 2. Odds are that PyCmd will remain i18n-challenged in other respects; we will fix those along the way...
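A rough sketch of what the write_str() idea from p.s. 1 might look like. The signature, the default arguments, and the "replace" error policy are all assumptions made for illustration; this is not the function that ended up in console.py.

```python
import io
import sys

def write_str(text, stream=None, encoding=None, errors="replace"):
    """Hypothetical helper: encode `text` explicitly, then write the bytes."""
    if stream is None:
        # Python 3 exposes the byte stream as sys.stdout.buffer;
        # on Python 2, sys.stdout itself accepts byte strings.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    if encoding is None:
        # sys.stdout.encoding can be None when output is redirected.
        encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    stream.write(text.encode(encoding, errors))

# Usage against an in-memory byte stream instead of the real console.
buf = io.BytesIO()
write_str(u"\xe4bc", stream=buf, encoding="cp850")
```

With errors="replace", characters the console code page cannot represent are written as "?" instead of raising UnicodeEncodeError; whether that trade-off is acceptable is a design choice.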

     
  • Blackwell

    Blackwell - 2011-09-04

    Hello.

    I have uploaded an updated patch for this. (Again, this patch just fixes the places where I ran into problems while trying a few simple things. It is not the product of a thorough code analysis.)

    The patch attempts to use Unicode everywhere, but I certainly missed a few string literals ('foo' instead of u'foo').

    What worries me about this problem area is that getting the encodings right seems like a little intellectual task of its own, so my fear is that any future change may easily break things again (a string literal instead of a Unicode string literal in the wrong place would be enough).

    For this reason I think corresponding tests should make sure that future changes don't break things too easily.

    For better overview, here is a list of the encodings and in which places they are being used currently (with the latest patch applied):

    The encoding sys.stdout.encoding is used by/for these:

    • sys.stdout
    • The file tmpfile.
    • The file tmpfile_errorlevel.
    • Values in environment variables (inside the PyCmd process itself).

    The encoding sys.getfilesystemencoding() is used by these:

    • Functions in os.* and their return values.
    • Functions in os.path.* and their return values.

    The encoding UTF-8 is used by these:

    • The command history file.
    • The directory history file.
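To illustrate the UTF-8 handling of the history files listed above, here is a small sketch that round-trips Unicode lines through a UTF-8 file using io.open (available since Python 2.6). The function names save_lines and load_lines and the file name are hypothetical; they are not PyCmd's actual functions.

```python
import io
import os
import tempfile

def save_lines(path, lines):
    """Write unicode lines to `path` as UTF-8, one entry per line."""
    with io.open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + u"\n")

def load_lines(path):
    """Read UTF-8 lines from `path` as unicode strings without newlines."""
    with io.open(path, "r", encoding="utf-8") as f:
        return [line.rstrip(u"\r\n") for line in f]

# Round trip through a temporary file (hypothetical file name).
tmp_dir = tempfile.mkdtemp()
history_path = os.path.join(tmp_dir, "history")
save_lines(history_path, [u"cd \xc4pfel", u"dir"])
roundtrip = load_lines(history_path)
```

Keeping the file encoding fixed at UTF-8 makes the history independent of the console code page, at the cost of having to convert whenever text crosses into sys.stdout or os.* calls.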
     
  • Horea Haitonic

    Horea Haitonic - 2011-09-08

    On my machine, the patched version ruined the dir_history file with respect to newlines; this makes PyCmd crash when starting in certain locations. Looking through the patch that you sent, I found that it adds a suspicious line after line PyCmd:664 that might be the culprit. Did I miss something? Is the history_file.writelines(lines) intended?

     
  • Blackwell

    Blackwell - 2011-09-08

    Corrected issue raised in comment 2011-09-08 16:03:26 CEST

     
  • Blackwell

    Blackwell - 2011-09-08

    Hello Horea,

    thank you for looking at the patch.

    That particular line was indeed an unwanted left-over from previous code changes.

    I uploaded an updated patch that does not contain this line anymore.

    (While testing I also ran into the problem of the history being odd, but I ignored the issue, assuming that it was related to the switch from platform encoding to UTF-8 in the file itself.)

    Kind regards

    Clemens Anhuth

     
  • Horea Haitonic

    Horea Haitonic - 2011-09-13

    This works much better. There are still some migration problems: when the old history contains non-ASCII characters (based on the platform encoding), PyCmd crashes. But I plan to fix those in a separate commit, after creating some unit tests. Speaking of which: if I apply this patch, will the unit tests for ERRORLEVEL still apply?

    Thanks!
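One possible way to soften the migration problem described above is to try UTF-8 first and fall back to the old platform encoding when decoding fails. This is only a sketch under assumptions: the function name decode_history_bytes and the "cp850" default are hypothetical, and this is not necessarily how the fix was implemented.

```python
def decode_history_bytes(raw, legacy_encoding="cp850"):
    """Decode raw history bytes as UTF-8, falling back to a legacy code page."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Old history files were written in the platform encoding.
        return raw.decode(legacy_encoding, "replace")

new_style = u"echo \xe4".encode("utf-8")   # written by the patched version
old_style = u"echo \xe4".encode("cp850")   # written before the switch

decoded_new = decode_history_bytes(new_style)
decoded_old = decode_history_bytes(old_style)
```

The fallback is a heuristic: a legacy byte sequence can occasionally be valid UTF-8 by accident, in which case it would be decoded to the wrong characters without any error.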

     
  • Nobody/Anonymous

    Hello Horea,

    I did not try because I want to keep the complexity of every step low and I cannot foresee which patches will be approved by you and which won't. (Another way would be to say I am just lazy.)

    I would prefer that the Unicode patch only introduces Unicode and nothing else. Subsequent patches should fix any problems that arise from the (admittedly far reaching) Unicode patch to keep each patch as small, logical and independent as possible.

    Do you agree with this or would you rather have me adjust the ERRORLEVEL tests for the Unicode patch?

    Kind regards

    Clemens Anhuth

     
  • Horea Haitonic

    Horea Haitonic - 2011-09-14

    I totally agree, this was my plan exactly; I applied the UNICODE patch, then the ERRORLEVEL tests (they apply and work fine), and I'm now working on fixing the migration issues that I can find (I have a few old history files that I want to try). When I'm done, everything gets pushed to SF.

     
  • Blackwell

    Blackwell - 2011-09-14

    Add some tests for Unicode. Just a starting point, more tests are required.

     
  • Blackwell

    Blackwell - 2011-09-14

    Hello Horea,

    great.

    I have uploaded a few Unicode tests that I started to write. Initially I wanted the tests to push key presses into the input buffer of the console. This can probably be achieved, but because it would require using several Win32 functions (key code to scan code, etc.), I refrained and instead went with calling the input-processing function directly.

    Perhaps you have a better idea how to do tests for this.

    What I was missing when writing the tests was a way to query or specify the paths for the dir_history and history files (to fake different history content and encoding). Perhaps it would be best if one could set these file names, because then one could set them to temp files and not be concerned with backing up, altering and restoring the original files in the tests.

    Kind regards

    Clemens Anhuth

     
  • Horea Haitonic

    Horea Haitonic - 2011-09-14

    What you are doing here is system testing, as opposed to unit testing (i.e. you run PyCmd's main()). Is there a reason for this? Why not test the individual pieces, e.g. test the history reading/writing by calling read_history() and write_history() with certain tailored history files?
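A sketch of what such a finer-grained test could look like, using a temporary directory so the real history files are never touched. The signatures of PyCmd's read_history() and write_history() are not shown in this thread, so fake_read_history and fake_write_history below are hypothetical stand-ins, not the real functions.

```python
import io
import os
import shutil
import tempfile
import unittest

def fake_write_history(path, entries):
    """Stand-in (assumption): store entries as UTF-8, one per line."""
    with io.open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(entry + u"\n")

def fake_read_history(path):
    """Stand-in (assumption): load UTF-8 entries without trailing newlines."""
    with io.open(path, "r", encoding="utf-8") as f:
        return [line.rstrip(u"\r\n") for line in f]

class HistoryRoundTripTest(unittest.TestCase):
    def setUp(self):
        # Each test gets its own scratch directory.
        self.tmp_dir = tempfile.mkdtemp()
        self.path = os.path.join(self.tmp_dir, "history")

    def tearDown(self):
        shutil.rmtree(self.tmp_dir)

    def test_non_ascii_round_trip(self):
        entries = [u"cd \xdcbung", u"echo \xe4"]
        fake_write_history(self.path, entries)
        self.assertEqual(fake_read_history(self.path), entries)

def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(HistoryRoundTripTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_tests() executes the single round-trip test; tailored files with a legacy encoding or broken newlines could be added as further test methods in the same class.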

     
  • Horea Haitonic

    Horea Haitonic - 2011-09-16

    Hi Clemens,

    Should I be waiting for feedback here? I have applied the UNICODE patches quite a while ago on my machine, and everything works very well. I also have some very small fixes to smooth out migration for existing users; I would like to create tests for these, but I don't want to generate conflicts with your tree. The tests that you uploaded here are of some value, but I think we need some finer-grained tests as well. Are you writing any, or should I do this?

     
  • Nobody/Anonymous

    Hello Horea,

    first of all: you are right, they are system tests, not unit tests (it didn't occur to me to write unit tests; perhaps laziness got the better of me). If they are too broad and we can come up with finer-grained tests, just go ahead. (I guess I wanted something that tests a lot as quickly as possible, to safeguard us from sneaking non-Unicode strings into the wrong places again.)

    I have not found the time to do anything again, and this weekend does not look any better. But eventually your commits are the ones that matter, so don't be too careful about screwing up my local patches. :-)

    Clemens

     
  • Horea Haitonic

    Horea Haitonic - 2011-09-20

    I pushed all the changes to the SF repo.

    I am not 100% satisfied with the tests (as stated in my previous comment), but they are better than no tests at all; we will improve them when we have the chance.

    Thank you for this excellent contribution, Clemens!

     
  • Nobody/Anonymous

    Hello Horea,

    thank you for pushing the changes.

    With regard to unit tests - I guess writing them becomes harder the later one starts with them. So I hope to get to writing some for at least my patches, as this is something that has struck me as absolutely logical (and necessary) when I read that this is how Python itself is being developed, too (no new feature without unit tests).

     
