From: Ron G. <ro...@fl...> - 2009-02-26 19:32:14
On Feb 26, 2009, at 11:17 AM, tchomby wrote:
> Thanks.
>
> I don't know what encoding the files are in. They're just files that I
> created myself with a text editor, but often text has been copy-pasted
> into them from various sources, e.g. websites, and that's where the
> decoding problems occur. Presumably some non-utf8 characters get
> pasted in.
>
> I used the codecs.open trick when reading files and again when writing
> the HTML from python-markdown, wherever I was using open I replaced it
> with codecs.open. This works for most of my files but for some I get:
>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position
> 2551: unexpected code byte
This is because the file contains bytes in some encoding other than
utf-8, most likely latin-1. The byte 0xa2 on its own is not a valid
utf-8 sequence, but in latin-1 it is the cent sign.
You might find this example enlightening:
>>> unicode('\xa2', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 0: unexpected code byte
>>> unicode('\xa2', 'latin-1')
u'\xa2'
>>> print _
¢
>>> unicode('\xa2', 'latin-1').encode('utf-8')
'\xc2\xa2'
>>> unicode('\xc2\xa2', 'utf-8')
u'\xa2'
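
If you can't tell in advance which files are latin-1, one possible
approach is to try utf-8 first and fall back to latin-1 when decoding
fails. This is only a sketch: the name decode_with_fallback is made up
for illustration, it is not part of python-markdown or the codecs
module, and latin-1 as the fallback is a guess about your files. It
works on raw bytes, so read the file in binary mode first.

```python
def decode_with_fallback(raw, fallback='latin-1'):
    """Decode bytes as utf-8, else with a fallback encoding (assumed latin-1)."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # latin-1 maps every possible byte to a character, so this cannot
        # fail, but the result is wrong if the real encoding is something else.
        return raw.decode(fallback)


# The lone byte from the traceback is not valid utf-8, so latin-1 is used.
from_latin1 = decode_with_fallback(b'\xa2')
# The two-byte utf-8 form of the same character decodes on the first try.
from_utf8 = decode_with_fallback(b'\xc2\xa2')
```

Both calls give the same unicode character, which you can then encode
as utf-8 when writing the HTML out.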
rg