#62 libgmail marks UTF-8 strings as Unicode

open
nobody
None
5
2009-06-11
2009-06-11
Anonymous
No

Hi :)

libgmail's "_parsePage" function (even in the latest CVS) marks ALL strings coming from GMail as Unicode by prepending a "u" to the string, and it does so even for strings that are encoded in UTF-8.

For example, my name ("Raúl") comes from GMail as "Ra\xc3\xbal", that is, encoded in UTF-8. This becomes u'Ra\xc3\xbal' in "_parsePage()", which means that each byte is treated as a separate code point and the string has effectively been converted to u'RaÃºl'.

Later, libgmail does things like "whatever.decode('utf-8')", but that "whatever" is already a Unicode string like the u'RaÃºl' above, so the "decode()" call will raise a UnicodeDecodeError exception.
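
To make the mislabeling concrete, here is a minimal sketch. It uses Python 3 syntax (the report is about Python 2), and decoding with "latin-1" stands in for what a u'' prefix on raw bytes does in Python 2, since both map each byte to the code point with the same number. The variable names are only illustrative and are not part of libgmail.

```python
# Bytes as they arrive from GMail: "Raúl" encoded in UTF-8.
raw = b"Ra\xc3\xbal"

# Correct handling: decode the bytes as UTF-8.
correct = raw.decode("utf-8")

# Mislabeling: every byte becomes its own code point, so the two-byte
# sequence for "ú" turns into two separate characters.
mislabeled = raw.decode("latin-1")

# The damage is reversible as long as no byte was lost: turn the code
# points back into bytes and decode them properly.
repaired = mislabeled.encode("latin-1").decode("utf-8")

print(repr(correct), len(correct))
print(repr(mislabeled), len(mislabeled))
print(repaired == correct)
```

The mislabeled string is one character longer than the correct one, which is an easy way to spot this kind of bug in a debugger.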

I don't know if GMail *always* returns its strings as UTF-8, so maybe the only fix is to get the encoding from the HTML headers sent by GMail :?
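
A rough sketch of that idea, in Python 3 syntax: read the charset from a Content-Type header value and fall back to UTF-8 when none is declared. The helper names "charset_from_content_type" and "decode_body" are hypothetical and do not exist in libgmail; whether GMail always sends a charset, and where libgmail would get the header from, are assumptions.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode raw bytes using the charset declared in the header."""
    charset = charset_from_content_type(content_type)
    # errors="replace" keeps one bad byte from aborting the whole page.
    return raw.decode(charset, errors="replace")


print(charset_from_content_type("text/html; charset=UTF-8"))
print(repr(decode_body(b"Ra\xc3\xbal", "text/html; charset=UTF-8")))
```

The point of the design is that the bytes stay bytes until the declared encoding is known, and only then become a Unicode string, exactly once.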

Raúl

Discussion

  • I'm not sure if my problem is caused by this but I guess it is. I'm trying demos/archive.py on my inbox and I get this error message:

    124f492701e039cd 3 mail
    124e6075c93834c1 1 mail

    Traceback (most recent call last):
      File "demos/archive.py", line 82, in <module>
        source = msg.source.replace("\r","").lstrip()
      File "/home/lukas/libgmail/libgmail.py", line 1499, in _getSource
        return to_unicode(self._source)
      File "/home/lukas/libgmail/libgmail.py", line 76, in to_unicode
        return xstr.decode('utf-8')
      File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1741-1743: invalid data

    I'm no Python programmer, but I've tried to dig a bit and found out that it's caused by this message (showing only part of it):

    User-Agent: Thunderbird 2.0.0.23 (X11/20090817)
    MIME-Version: 1.0
    To: =?windows-1252?Q?Luk=E1=9A_Jirkovsk=FD?= <nospam@gmail.com>
    Subject: mail
    Content-Type: text/plain; charset=windows-1252; format=flowed
    Content-Transfer-Encoding: 8bit

    Hi Luk��

    Note: Luk�� should be Lukáš

    I guess the problem is that the message body is encoded in windows-1252 but gets decoded as UTF-8, and those bytes don't form a valid UTF-8 character.

     
  • I forgot one thing: I'm using libgmail from a recent CVS tree with Python 2.6.4 on a Linux system using UTF-8 encoding.
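
The failure in the first comment above can be reproduced in isolation. This is a minimal sketch in Python 3 syntax, assuming the body bytes are exactly what the windows-1252 header declares. The byte values are taken from the "To:" header in the quoted message, and the variable names are illustrative only.

```python
from email.header import decode_header

# "Hi Lukáš" as windows-1252 bytes: á is 0xE1, š is 0x9A.
body = b"Hi Luk\xe1\x9a"

# Decoding as UTF-8 fails: 0xE1 starts a three-byte sequence that is
# never completed.
try:
    body.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the charset declared in the message works.
text = body.decode("windows-1252")

# The encoded "To:" header names its own charset, so the standard
# library can recover the display name.
to_header = "=?windows-1252?Q?Luk=E1=9A_Jirkovsk=FD?= <nospam@gmail.com>"
name_bytes, name_charset = decode_header(to_header)[0]
name = name_bytes.decode(name_charset)

print(utf8_failed)
print(text)
print(name)
```

So the message itself is fine. The error only appears because the raw source is decoded as UTF-8 regardless of the charset the message declares.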