From: Ron G. <ro...@fl...> - 2009-02-26 19:32:14
On Feb 26, 2009, at 11:17 AM, tchomby wrote:

> Thanks.
>
> I don't know what encoding the files are in. They're just files that I
> created myself with a text editor, but often text has been copy-pasted
> into them from various sources, e.g. websites, and that's where the
> decoding problems occur. Presumably some non-utf8 characters get
> pasted in.
>
> I used the codecs.open trick when reading files and again when writing
> the HTML from python-markdown, wherever I was using open I replaced it
> with codecs.open. This works for most of my files but for some I get:
>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position
> 2551: unexpected code byte

This is because you have a string that is encoded using some encoding
other than utf-8, most likely latin-1. You might find this example
enlightening:

>>> unicode('\xa2', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 0: unexpected code byte
>>> unicode('\xa2', 'latin-1')
u'\xa2'
>>> print _
¢
>>> unicode('\xa2', 'latin-1').encode('utf-8')
'\xc2\xa2'
>>> unicode('\xc2\xa2', 'utf-8')
u'\xa2'

rg
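
The session above could be wrapped in a small helper. This is only a sketch, written for Python 3, where bytes.decode takes the place of unicode(...). The name decode_with_fallback is hypothetical, and falling back to latin-1 is an assumption based on the guess above. Latin-1 accepts every byte, so the fallback never raises, but it will silently mis-render text that is really in some other encoding.

```python
def decode_with_fallback(raw, primary="utf-8", fallback="latin-1"):
    """Decode bytes as `primary`; if that fails, decode as `fallback`.

    Hypothetical helper for illustration; the latin-1 default is a guess.
    """
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        # Latin-1 maps each byte 0x00-0xff to one code point, so this
        # second decode cannot raise UnicodeDecodeError.
        return raw.decode(fallback)


# A lone 0xa2 byte is not valid UTF-8, so the fallback path is taken.
text = decode_with_fallback(b"10\xa2")

# Re-encode as UTF-8 before writing the text back out.
utf8_bytes = text.encode("utf-8")
```

Input that is already valid UTF-8 goes through the first branch unchanged, so the helper can be applied to every file, not only the ones that fail.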