[Boa Constr] adventures with unicode -- chapter II
From: spir <den...@fr...> - 2008-10-02 15:17:02
Hello,

As I said in a previous message, I am running into strange problems with unicode. When I start writing a new program with utf-8 encoding, everything works fine for a while. Then unicode errors happen, seemingly at random. I tried to understand for myself what is going on with unicode, so I ran a kind of diagnostic. I want to report here everything I know about the problem, so this will be a long message.

First, some more details. I have the latest versions of python and wxPython installed on an XP machine, and 3 IDEs which are all written in python and built on wx: drPython, boa and SPE. I also sometimes use notepad++.

The problems I'm talking about do not concern
* the processing of unicode data (python's unicode text type)
* programming with words from other languages (having french or german variable names)
but only the editing and running of a source code file encoded in utf-8. I don't actually need utf-8, since I only use french characters, which latin-1 handles properly, but I wanted to try playing with unicode again.

The same codec errors happen with the 3 IDEs named above, but not with notepad++. Note that N++ is itself written in C++. When the problems happen, I can still load the source in N++, change the encoding to latin-1 (iso-8859-1) and read it: everything is all right. The same procedure in any of the three other IDEs leads to other problems, and even with Python set to latin-1, the program won't run.

So I decided to analyse the source file to try and find where the problem is. I wrote a script that does the following:

[Note: characters with ordinals between 128 and 255, thus encoded in a single byte in latin-1, are encoded in 2 bytes in utf-8 -- see http://en.wikipedia.org/wiki/Utf-8]

-1- Read the source
-2- Make a list of all bytes > 127
-3- Write these byte numbers and matching characters (like #195:Ã)
-4- Look up in the source where these characters happen to be, and what should be there instead.
    There's always a pair of strange characters in place of a single 'normal' (for me) one. For instance, I may find "biÃ¨re" instead of "bière".
-5- Replace all of these pairs of bugs with the expected characters.

Then the source text should be clean, properly encoded for e.g. latin-1, and acceptable to python. This process is a kind of ad hoc transcoding from utf-8 to latin-1.

But it still does not work! Which is to be expected: otherwise, why did my IDEs (and python too) refuse the file when they were set to utf-8? This refusal shows that something was wrong in the utf-8 encoding itself. Indeed, by looking through the text after the 'transcoding', I found a couple of remaining bugs, each made of a sequence of 3 bytes, and each in the place of an ordinary 'é' (ordinal #233) letter in the middle of a word. This is very strange, as
* This is the most common accented letter in french, and all its other occurrences were properly processed by the transcoding procedure.
* All ordinary french characters, especially those on the keyboard, are encoded in 2 bytes in utf-8, so I can't have typed a 3-byte sequence by mistake.

So how did these weird byte sequences end up in my source code file? This is the point, I guess.

I searched further, first by checking that everything was solved once I corrected the errors. All right, everything works fine again, both in the IDE and at run time (my program works! only python does not want it in utf-8).

Digging further, I went back to the buggy version in order to follow the error traceback given by python. I had to swim a bit in the standard library, but finally found the source of the message in utf_8.py, which you should find in the /Lib/encodings directory.
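The diagnostic described above (steps 1 to 5) could be sketched roughly as follows. This is only a minimal illustration, not the actual script: the function names `high_bytes` and `utf8_to_latin1` are made up here, the code is written for a current Python 3 rather than the Python 2 of this message, and the "damaged" sample is a hypothetical case (a lone latin-1 byte E9 inside otherwise valid utf-8), not data from the real file.

```python
def high_bytes(data):
    """Steps 1-3: sorted distinct byte values above 127 found in data."""
    return sorted(set(b for b in bytearray(data) if b > 127))


def utf8_to_latin1(data):
    """Step 5 in one go: decode as utf-8, then re-encode as latin-1.

    Raises UnicodeDecodeError if data is not valid utf-8, and
    UnicodeEncodeError if a character does not exist in latin-1.
    """
    return data.decode("utf-8").encode("latin-1")


good = b"bi\xc3\xa8re"        # "bière" in utf-8: the letter è is the pair C3 A8
print(high_bytes(good))       # [168, 195]
print(utf8_to_latin1(good))   # b'bi\xe8re', a single byte for è

# Hypothetical damage: a lone latin-1 byte E9 ('é', #233) in a utf-8 file.
bad = b"bi\xc3\xa8re caf\xe9 ok"
try:
    utf8_to_latin1(bad)
except UnicodeDecodeError as err:
    print("not valid utf-8 at byte offset", err.start)
```

In utf-8, a byte in the range E0 to EF announces a 3-byte sequence, so a stray latin-1 'é' (byte E9) would be read as the start of one. Whether that is what happened in this file is only a guess.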
The following function raises the error:

    def decode(input, errors='strict'):
        return codecs.utf_8_decode(input, errors, True)

I tried to get some information about the arguments with:

    import sys  # sys.exit() below needs this import

    def decode(input, errors='strict'):
        try:
            return codecs.utf_8_decode(input, errors, True)
        except UnicodeDecodeError:
            # dump the raw input that failed to decode, then stop
            print "### input :###"
            print input
            print "##############"
            sys.exit()

But for some reason, I got no output (perhaps because standard output also has to pass through the utf-8 encoding?). So I'm stuck.

Denis
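The debugging attempt above patches the standard library and prints the raw input. A gentler variant, sketched here under assumptions, would run the same decoding on the file's bytes from a separate script and build an ASCII-only report, so that the report itself cannot trip over an encoding. The names `describe_decode_failure` and `report_file` are hypothetical, and the code targets a current Python 3 rather than the Python 2 used in this message.

```python
import sys


def describe_decode_failure(data, context=10):
    """Return None if data is valid utf-8, else an ASCII-only description.

    The description gives the byte offset of the first failure, the
    decoder's reason, and the repr() of a few bytes around it.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        lo = max(err.start - context, 0)
        hi = err.end + context
        return "offset %d: %s; nearby bytes: %r" % (
            err.start, err.reason, data[lo:hi])
    return None


def report_file(path):
    """Read a file as raw bytes and write the diagnosis to stderr."""
    with open(path, "rb") as f:
        report = describe_decode_failure(f.read())
    sys.stderr.write((report or "valid utf-8") + "\n")
```

Calling `report_file` with the path of the suspect source file should then point at the first offending byte. Using `repr()` keeps every non-ASCII byte visible as an escape such as `\xe9` instead of sending it raw to the console.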