From: Kent J. <ke...@td...> - 2007-09-12 13:18:50
Hi,
Markdown 1.6b doesn't work with UTF-8-encoded text. It fails with a
UnicodeDecodeError in removeBOM():
In [3]: import markdown
In [4]: text = u'\xe2'.encode('utf-8')
In [6]: print text
â
In [7]: print markdown.markdown(text)
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1722, in markdown
    return md.convert(text)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1614, in convert
    self.source = removeBOM(self.source, self.encoding)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 74, in removeBOM
    if text.startswith(bom):
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
The problem is that the BOM being tested is a unicode object, so to execute
    text.startswith(bom)
Python first tries to convert text to Unicode using the default encoding
(ascii). This fails because the text is not ASCII.
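A minimal sketch of that failure, with the implicit conversion written out as an explicit decode so that it behaves the same on Python 2 and Python 3. The variable names `error` and `has_bom` are only for illustration; `codecs.BOM_UTF8` is the standard library's byte-string BOM.

```python
import codecs

# Same bytes as in the report: U+00E2 encoded as UTF-8 is 0xC3 0xA2.
text = u'\xe2'.encode('utf-8')

# Python 2 performs this ASCII decode implicitly when a byte string is
# compared against a unicode argument; here it is spelled out.
try:
    text.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Testing against a BOM that is itself a byte string needs no decoding.
has_bom = text.startswith(codecs.BOM_UTF8)
```

The decode fails on the first byte (0xc3), which matches the position reported in the traceback, while the byte-to-byte comparison simply returns False.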
I'm trying to understand what the encoding parameter is for; it doesn't
seem to do much. There also seems to be some confusion about the use of
encoding in markdownFromFile() vs. markdown(): the file is converted to
Unicode on input, so I don't understand why the same encoding parameter
is then passed to markdown().
ISTM the encoding passed to markdown() should match the encoding of the
text passed to markdown(), and the values in the BOMS table should be in
the encoding of the key, not in Unicode. Then the __unicode__() method
should actually decode. Or is the intent that the text passed to
markdown() should always be ASCII or Unicode?
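A rough sketch of the first option, assuming a BOMS table keyed by encoding name whose values are byte strings. The table contents and the helpers `remove_bom()` and `to_unicode()` are hypothetical and are not taken from markdown.py.

```python
import codecs

# Hypothetical table: each BOM is stored as bytes, in the encoding
# named by its key, rather than as a unicode object.
BOMS = {
    'utf-8': codecs.BOM_UTF8,
    'utf-16-le': codecs.BOM_UTF16_LE,
    'utf-16-be': codecs.BOM_UTF16_BE,
}

def remove_bom(data, encoding):
    """Strip a leading BOM from encoded bytes, if one is present."""
    bom = BOMS.get(encoding.lower())
    if bom and data.startswith(bom):
        return data[len(bom):]
    return data

def to_unicode(data, encoding):
    """Strip the BOM, then decode using the caller's stated encoding."""
    return remove_bom(data, encoding).decode(encoding)

# Bytes in, unicode out; the BOM never meets a unicode object.
result = to_unicode(codecs.BOM_UTF8 + u'\xe2'.encode('utf-8'), 'utf-8')
```

With this arrangement the encoding argument has a clear job: it describes the bytes handed in, and both the BOM test and the decode use it.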
I can put together a patch if you like but I wanted to make sure that I
am not missing some grand plan...
Kent