From: Kent J. <ke...@td...> - 2007-10-30 19:53:18
Yuri Takhteyev wrote:
> The Markdown class is unicode-in-unicode-out. It can take a simple
> string as input, but one should never pass an encoded string to it, be
> it utf8 or whatever.
>
> Since removeBOM() should never get encoded
> strings, it should _assume_ that the input is unicode, so presumably it
> should suffice to have:
>
>     def removeBOM(text, encoding):
>         return text.lstrip(u'\ufeff')

Sounds good to me.

> In fact, we should just get rid of this function and put
> text.lstrip(u'\ufeff') in the place where it is called. (BTW, should
> we put it back into the output?)

Yes, and get rid of the encoding parameter to markdown() and
Markdown.__init__(), which then will not be used at all. That will reduce
the confusion; as the code is written, it is not at all clear that it
expects unicode text only (e.g. the comment mentions "The character
encoding of <text>", which has no meaning if <text> is unicode).

> Perhaps we should raise an error if we get an encoded string? I.e.,
> check that either the string is of type unicode _or_ it has no special
> characters.

Easy to do - just put
    self.source = unicode(source)
in Markdown.__init__()

> Markdown.markdown does have an obvious bug in that it accepts an
> encoding argument and doesn't pass it to Markdown.__init__. I suppose
> we should just get rid of this parameter altogether.

Yes please!

Kent
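
For illustration, here is a rough sketch of the two ideas above taken together: reject encoded non-ASCII input, then strip a leading BOM. It is written for Python 3 as an assumption, where str is the unicode type and bytes is the encoded one; on Python 2 the decode step corresponds to unicode(source), which raises UnicodeDecodeError for a byte string containing non-ASCII characters. The helper name to_text is hypothetical and is not part of the Markdown module.

```python
# Hypothetical helper, not part of the Markdown module.
# Assumes Python 3: str is unicode text, bytes is an encoded string.

BOM = "\ufeff"


def to_text(source):
    """Return source as unicode text with any leading BOM removed.

    Plain ASCII byte strings are accepted; encoded strings containing
    non-ASCII bytes raise UnicodeDecodeError, which is what Python 2's
    unicode(source) does with the default ASCII codec.
    """
    if isinstance(source, bytes):
        source = source.decode("ascii")
    # lstrip removes every leading BOM character, not just the first.
    return source.lstrip(BOM)
```

With this, to_text("\ufeffhello") gives "hello", to_text(b"plain") gives "plain", and passing UTF-8 encoded bytes with non-ASCII content fails loudly instead of being silently mangled.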