[Boa Constr] adventures with unicode -- chapter II

Hello,

As I said in a previous message, I am having strange problems with 
unicode. When I start writing a new program with utf-8 encoding, 
everything works fine for a while. Then unicode errors happen randomly.

I tried to understand for myself what's going on with unicode, so I ran 
a kind of diagnosis. I want to report here everything I know about the 
problem, so this will be a long message.

First, some more details. I have the latest versions of python and 
wxPython installed on an XP machine, and 3 IDEs which are all written in 
python and built on wx: drPython, boa and SPE. Also, I sometimes use 
notepad++.
The problems I'm talking about are not
* the processing of unicode data (python's unicode text type)
* programming with words from other languages (having french or german 
variable names)
but only editing and running a source code file encoded in utf-8. I 
don't really need utf-8: I only use french characters, which are 
properly handled by latin-1, but I wanted to try playing with unicode 
again.
The same codec errors happen with the 3 IDEs named above, but not with 
notepad++. Note that N++ is itself written in C++. When the problems 
happen, I'm still able to load the source in N++, change the codec to 
latin-1 (iso-8859-1) and read it, and everything's all right. The same 
procedure in any of the three other IDEs leads to other problems, and 
even with Python set to latin-1, the program won't run.

So I decided to analyse the source file to try and find where the 
problem is. I wrote a script that does the following:
[Note: characters with ordinals between 128 and 255, thus encoded in a 
single byte in latin-1, are coded in 2 bytes in utf-8 -- see 
http://en.wikipedia.org/wiki/Utf-8].
-1- Read the source
-2- Make a list of all bytes > 127
-3- Write these byte numbers and matching characters (like #195:Ã)
-4- Look up in the source where these characters happen to be, and what 
should be there instead. There's always a pair of strange characters in 
place of a single 'normal' (for me) one. For instance, I may find 
"biÃ¨re" instead of "bière".
-5- Replace all of these buggy pairs with the expected characters.
Then, the source text should be clean, properly encoded for e.g. 
latin-1, and acceptable to python. This process is a kind of ad hoc 
transcoding from utf-8 to latin-1.
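
A rough sketch of these steps might look like the following. It is not 
the actual script: the function names are made up for illustration, the 
sample bytes stand in for the real source file, and it is written for 
Python 3 (where bytes and text are separate types) rather than for the 
Python 2 used in this message. Step 5 is done here with the standard 
codecs instead of pair-by-pair replacement.

```python
def high_bytes(data):
    """Return the sorted, distinct byte values above 127 found in data (step 2)."""
    return sorted({b for b in data if b > 127})


def describe(data):
    """Pair each high byte with the character it maps to in latin-1 (step 3)."""
    return ["#%d:%s" % (b, bytes([b]).decode("latin-1")) for b in high_bytes(data)]


def utf8_to_latin1(data):
    """Transcode utf-8 bytes to latin-1 bytes (step 5, using the codecs)."""
    return data.decode("utf-8").encode("latin-1")


# Stand-in for the bytes read from the source file (step 1).
sample = "bi\u00e8re".encode("utf-8")
print(describe(sample))
print(utf8_to_latin1(sample))
```

Note that utf8_to_latin1() raises UnicodeDecodeError as soon as the 
input is not valid utf-8, so it only helps on a file that is otherwise 
well formed.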

But it still does not work! Which is to be expected: otherwise, why 
would my IDEs (and python too) refuse the file when they were set to 
utf-8? This refusal shows that something was wrong in the utf-8 encoding 
itself.
Actually, by looking through the text after the 'transcoding', I found a 
couple of remaining bugs, each made of a sequence of 3 bytes, and each 
in place of an ordinary 'é' (ordinal #233) letter in the middle of a 
word. This is very strange, as
* This letter is the most common accented letter in french, and all 
other occurrences were properly processed by the transcoding procedure.
* No ordinary french letter, especially none on the keyboard, is 
encoded on 3 bytes in utf-8; accented letters take 2. So I can't have 
entered such a sequence through a typo.
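
One possible way to pin down exactly where such sequences sit is to let 
the utf-8 codec report them: UnicodeDecodeError carries the start and 
end offsets of the bytes it rejected. The sketch below assumes Python 3; 
find_invalid_utf8 is a made-up name and the sample bytes are invented, 
not taken from the real file.

```python
def find_invalid_utf8(data):
    """Return (offset, offending_bytes) for every span that is not valid utf-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            # err.start/err.end are relative to the slice that was decoded.
            start, end = pos + err.start, pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume right after the rejected bytes
    return problems


# Invented sample: a lone latin-1 'é' (0xE9) inside otherwise valid utf-8.
sample = b"caf\xe9 " + "bi\u00e8re".encode("utf-8")
for offset, raw in find_invalid_utf8(sample):
    print(offset, repr(raw))
```

The decoder may split one damaged sequence into several reported spans, 
so neighbouring offsets are worth reading together.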

So how did these weird byte sequences happen to be in my source code 
file? This is the point, I guess. I searched further, first by checking 
that everything was solved once I corrected the errors. All right, 
everything works fine again, both in the IDE and at run time (my program 
works! only python does not want it in utf-8).
Digging further, I went back to the buggy version in order to follow the 
error traceback given by python. I had to swim a bit in the standard 
library, but finally found the source of the message in the utf_8.py 
module that you should find in the /Lib/encodings directory. The 
following function raises the error:
def decode(input, errors='strict'):
    return codecs.utf_8_decode(input, errors, True)
I tried to get some information about the arguments with:
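import sys  # sys.exit() below needs this; the stock module only imports codecs

def decode(input, errors='strict'):
    try:
        return codecs.utf_8_decode(input, errors, True)
    except UnicodeDecodeError:
        # dump the raw bytes that failed to decode, then stop
        print "### input :###"
        print input
        print "##############"
        sys.exit()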
But for some reason, I got no output (maybe because standard output 
also goes through the utf-8 encoding?). So I'm stuck.
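
Whether standard output is really the cause is not established here. 
One way to sidestep the question would be to record the failing bytes in 
a file instead of printing them. The sketch below is an assumption, not 
a tested fix for this setup: decode_with_log and the log file name are 
hypothetical, and it is written for Python 3 (under Python 2, io.open 
would be needed for the encoding argument).

```python
import codecs


def decode_with_log(data, errors="strict", log_path="utf8_debug.log"):
    """Decode like utf_8.decode, but log the rejected bytes before re-raising."""
    try:
        return codecs.utf_8_decode(data, errors, True)
    except UnicodeDecodeError as err:
        with open(log_path, "a", encoding="ascii") as log:
            # repr() of bytes is pure ASCII, so writing the record cannot
            # itself trigger another encoding error.
            log.write("bad bytes at %d-%d: %r\n"
                      % (err.start, err.end, data[err.start:err.end]))
        raise
```

Re-raising keeps the original traceback, so the caller still sees the 
same error as before; the log file only adds the offsets and raw bytes.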

Denis