[Python-markdown-discuss] Markdown encoding

Hi,

Markdown 1.6b doesn't work with UTF-8-encoded text. It fails with a 
UnicodeDecodeError in removeBOM():

In [3]: import markdown
In [4]: text = u'\xe2'.encode('utf-8')
In [6]: print text
â
In [7]: print markdown.markdown(text)
------------------------------------------------------------
Traceback (most recent call last):
   File "<ipython console>", line 1, in <module>
   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1722, in markdown
     return md.convert(text)
   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1614, in convert
     self.source = removeBOM(self.source, self.encoding)
   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 74, in removeBOM
     if text.startswith(bom):
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

The problem is that the BOM being tested is a unicode object, so to evaluate
   text.startswith(bom)
Python first tries to convert text to unicode using the default encoding 
(ascii). This fails because the text contains non-ASCII bytes.
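
To make the failing step concrete, here is a minimal sketch of what I believe 
happens inside text.startswith(bom). Python 2 does the ASCII decode 
implicitly; the sketch spells it out so that it also runs on Python 3. The 
helper name implicit_startswith is mine, not something in markdown.py.

```python
import codecs

# UTF-8 bytes for U+00E2, the same input as in the session above.
text = u'\xe2'.encode('utf-8')

# A BOM held as a unicode object (U+FEFF), which is what I assume the
# table in markdown.py contains.
bom = codecs.BOM_UTF8.decode('utf-8')


def implicit_startswith(byte_text, unicode_prefix):
    """Mimic Python 2's str.startswith(unicode): the byte string is
    first decoded with the default (ascii) codec, then compared."""
    return byte_text.decode('ascii').startswith(unicode_prefix)


try:
    implicit_startswith(text, bom)
    error = None
except UnicodeDecodeError as exc:
    # Byte 0xc3 at position 0 is not valid ASCII.
    error = exc

print(error)
```

The decode fails on the first byte, which matches the position reported in 
the traceback.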

I'm trying to understand what the encoding parameter is for; it doesn't 
seem to do much. There also seems to be some confusion between the use of 
encoding in markdownFromFile() and in markdown(): the file is converted to 
Unicode on input, so I don't understand why the same encoding parameter 
is then passed to markdown().

ISTM the encoding passed to markdown() should match the encoding of the 
text passed to markdown(), and the values in the BOMS table should be byte 
strings in the encoding named by the key, not unicode. Then the 
__unicode__() method should actually decode. Or is the intent that the text 
passed to markdown() should always be ascii or unicode?
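
Roughly what I have in mind, as a sketch only: the BOM table holds byte 
strings, the BOM is stripped from the bytes, and only then is the text 
decoded. The names BYTE_BOMS and remove_bom_and_decode are hypothetical and 
the set of encodings is just an illustration; this is not the current code 
in markdown.py.

```python
import codecs

# Hypothetical table: each BOM is stored as bytes in the encoding
# named by its key, so it can be compared against undecoded input.
BYTE_BOMS = {
    'utf-8': codecs.BOM_UTF8,
    'utf-16-le': codecs.BOM_UTF16_LE,
    'utf-16-be': codecs.BOM_UTF16_BE,
}


def remove_bom_and_decode(data, encoding):
    """Strip a leading BOM (if any) from a byte string, then decode it.

    Both sides of startswith() are bytes, so no implicit ASCII
    decoding takes place.
    """
    bom = BYTE_BOMS.get(encoding.lower())
    if bom and data.startswith(bom):
        data = data[len(bom):]
    return data.decode(encoding)


with_bom = codecs.BOM_UTF8 + u'\xe2'.encode('utf-8')
without_bom = u'\xe2'.encode('utf-8')

decoded_with_bom = remove_bom_and_decode(with_bom, 'utf-8')
decoded_without_bom = remove_bom_and_decode(without_bom, 'utf-8')
```

With this arrangement the encoding argument has a clear meaning: it 
describes the bytes handed to markdown(), and everything after the decode 
works on unicode.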

I can put together a patch if you like but I wanted to make sure that I 
am not missing some grand plan...

Kent