Re: [Python-markdown-discuss] UnicodeDecodeError


On Feb 26, 2009, at 11:17 AM, tchomby wrote:

> Thanks.
>
> I don't know what encoding the files are in. They're just files that I
> created myself with a text editor, but often text has been copy-pasted
> into them from various sources, e.g. websites, and that's where the
> decoding problems occur. Presumably some non-utf8 characters get
> pasted in.
>
> I used the codecs.open trick when reading files and again when writing
> the HTML from python-markdown: wherever I was using open, I replaced it
> with codecs.open. This works for most of my files, but for some I get:
>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position
> 2551: unexpected code byte

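This is because the file contains bytes in some encoding other than
utf-8, most likely latin-1. In utf-8, a byte like 0xa2 can only appear
as the second or later byte of a multi-byte character, so on its own it
is invalid.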

You might find this example enlightening:

>>> unicode('\xa2', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 0: unexpected code byte
>>> unicode('\xa2', 'latin-1')
u'\xa2'
>>> print _
¢
>>> unicode('\xa2', 'latin-1').encode('utf-8')
'\xc2\xa2'
>>> unicode('\xc2\xa2', 'utf-8')
u'\xa2'
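
If you can't tell in advance which of your files are utf-8 and which
aren't, one option is to read the raw bytes, try utf-8 first, and fall
back to latin-1 only when that fails. A rough sketch follows. The name
decode_with_fallback is made up for illustration (it is not part of
python-markdown or codecs), and treating the leftover files as latin-1
is only a guess about your data:

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Decode a byte string with the first candidate encoding that works.

    Returns a (text, encoding) pair. The candidate list is an assumption:
    utf-8 first, then latin-1 as a catch-all.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# The lone 0xa2 byte is rejected by utf-8 and picked up by latin-1.
text, used = decode_with_fallback(b"5\xa2")

# Re-encoding gives clean utf-8 that can be written back out.
clean = text.encode("utf-8")
```

Keep in mind that latin-1 assigns a character to every possible byte, so
it never raises an error. It has to come last in the list, and if the
file was really in some other encoding you will get wrong characters
rather than an exception.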

rg