From: Kent J. <ke...@td...> - 2007-09-12 13:18:50
Hi,

Markdown 1.6b doesn't work with UTF-8-encoded text. It fails with a UnicodeDecodeError in removeBOM():

    In [3]: import markdown

    In [4]: text = u'\xe2'.encode('utf-8')

    In [6]: print text
    â

    In [7]: print markdown.markdown(text)
    ------------------------------------------------------------
    Traceback (most recent call last):
      File "<ipython console>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1722, in markdown
        return md.convert(text)
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1614, in convert
        self.source = removeBOM(self.source, self.encoding)
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 74, in removeBOM
        if text.startswith(bom):
    <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

The problem is that the BOM being tested is unicode, so to execute text.startswith(bom) Python tries to convert text to Unicode using the default encoding (ascii). This fails because the text is not ascii.

I'm trying to understand what the encoding parameter is for; it doesn't seem to do much. There also seems to be some confusion with the use of encoding in markdownFromFile() vs markdown(); the file is converted to Unicode on input so I don't understand why the same encoding parameter is passed to markdown()?

ISTM the encoding passed to markdown should match the encoding of the text passed to markdown, and the values in the BOMS table should be in the encoding of the key, not in unicode. Then the __unicode__() method should actually decode.

Or is the intent that the text passed to markdown() should always be ascii or unicode?

I can put together a patch if you like but I wanted to make sure that I am not missing some grand plan...

Kent
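
To make the suggestion concrete, here is a minimal sketch of what I have in mind. It is not a patch against markdown.py: the names remove_bom(), to_unicode() and this particular BOMS table are made up for illustration, and the only library pieces it relies on are the BOM constants in the standard codecs module. The idea is that each BOM is stored as bytes in the encoding of its key, so startswith() compares bytes with bytes and no implicit ascii decoding happens; the text is decoded explicitly afterwards.

```python
import codecs

# Hypothetical BOMS table: each value is the BOM as *bytes* in the
# encoding named by its key (not a unicode string).
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def remove_bom(data, encoding):
    """Strip a leading BOM from encoded bytes, comparing bytes to bytes."""
    bom = BOMS.get(encoding.lower())
    if bom and data.startswith(bom):
        return data[len(bom):]
    return data


def to_unicode(data, encoding="utf-8"):
    """Strip the BOM first, then decode explicitly with the given encoding."""
    return remove_bom(data, encoding).decode(encoding)


# The text from the session above, this time with a BOM in front of it.
raw = codecs.BOM_UTF8 + u"\xe2".encode("utf-8")
text = to_unicode(raw, "utf-8")
```

With something like this, the encoding argument has one clear meaning: it describes the bytes handed in, and everything after to_unicode() works on unicode only.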