Re: [Python-markdown-discuss] Markdown encoding


Yuri Takhteyev wrote:

> The Markdown class is unicode-in-unicode-out.   It can take a simple
> string as input, but one should never pass an encoded string to it, be
> it utf8 or whatever.

> Since removeBOM() should never get encoded
> strings, we should _assume_ that the input is unicode, so presumably it
> should suffice to have:
> 
>     def removeBOM(text, encoding):
>         return text.lstrip(u'\ufeff')

Sounds good to me.
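
For illustration, here is a minimal sketch of the simplified function. It is written for Python 3, where `str` plays the role of Python 2's `unicode` type. The name `remove_bom` is hypothetical, not the module's own, and the unused encoding argument is dropped on the assumption that callers only ever pass decoded text.

```python
def remove_bom(text):
    """Strip a leading byte order mark from already-decoded text."""
    # lstrip removes every leading U+FEFF, not only the first one,
    # and leaves any U+FEFF later in the text untouched.
    return text.lstrip('\ufeff')
```

Because the argument is assumed to be decoded already, the function needs no knowledge of the original encoding.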

> In fact, we should just get rid of this function and put
> text.lstrip(u'\ufeff') in the place where it is called.  (BTW, should
> we put it back into the output?)

Yes, and get rid of the encoding parameter to markdown() and 
Markdown.__init__() which then will not be used at all. That will reduce 
the confusion; as the code is written, it is not at all clear that it 
expects unicode text only (e.g. the comment mentions "The character 
encoding of <text>" which has no meaning if <text> is unicode).

> Perhaps we should raise an error if we get an encoded string?  I.e.,
> check that either the string is of type unicode _or_ it has no special
> characters.

Easy to do - just put
   self.source = unicode(source)
in Markdown.__init__()
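
This works because Python 2's unicode() decodes a byte string with the default ASCII codec, so plain ASCII input passes through and anything encoded raises UnicodeDecodeError. A rough Python 3 equivalent might look like the sketch below; the helper name `to_text` is hypothetical and is not part of the Markdown module.

```python
def to_text(source):
    """Return source as text, rejecting encoded non-ASCII byte strings."""
    if isinstance(source, str):
        # Already decoded text: use it unchanged.
        return source
    if isinstance(source, bytes):
        # Mirrors Python 2's unicode(source): pure ASCII is accepted,
        # any other byte raises UnicodeDecodeError.
        return source.decode('ascii')
    raise TypeError('expected text, got %s' % type(source).__name__)
```

With this check a UTF-8 encoded byte string fails at construction time instead of producing garbled output later.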

> Markdown.markdown does have an obvious bug in that it accepts an
> encoding argument and doesn't pass it to Markdown.__init__.  I suppose
> we should just get rid of this parameter altogether.

Yes please!

Kent