From: Waylan L. <wa...@gm...> - 2007-10-30 20:05:19
On 10/30/07, Waylan Limberg <wa...@gm...> wrote:
> On 10/30/07, Yuri Takhteyev <qar...@gm...> wrote:
> > I haven't had a chance to look at the specific problem, but in
> > general, here is how it is _supposed_ to work.
>
> Ahh, well that clears a few things up for me. Thanks for the explanation.
>
> Now for my proposal:
>
> Let's leave md.convert the way it is. The user has to convert to
> unicode first and then gets unicode back in his code, which he can do
> with as he pleases.
>
> However, I see markdown.markdown() as a shortcut for the common case.
> So maybe we could add some basic encoding/decoding for common cases.
> The user must pass in the encoding (and perhaps an optional output
> encoding) and, assuming the encoding actually matches (the user's
> responsibility - otherwise it fails gracefully), things work fine. If
> the user has a situation that doesn't fit the common case, then we
> would expect that the encoding/decoding will be done manually with
> md.convert. As long as the differences are clearly documented, that
> should work fine. Of course, the more I think about this, the more it
> feels like extra work I don't want to do. Any input?

I should mention that I see all this happening in the markdown()
function itself, not as part of Markdown. Markdown.__init__ or
Markdown.convert will always get unicode.

>
> Obviously, markdownFromFile is a different animal and should work as
> you proposed.
>
> >
> > The Markdown class is unicode-in-unicode-out. It can take a simple
> > string as input, but one should never pass an encoded string to it, be
> > it utf8 or whatever.
> > It's the caller's responsibility to decode their text into unicode from
> > utf8 or whatever it is that they have it encoded as, and they can then
> > encode the output into whatever encoding they want. Then I got a
> > patch for removing BOM and integrated it without thinking, which
> > required passing "encoding" to it. Looking at it now I realize that
> > that was quite stupid.
> > Since removeBOM() should never get encoded
> > strings, it should _assume_ that the input is unicode, so presumably it
> > should suffice to have:
> >
> >     def removeBOM(text, encoding):
> >         return text.lstrip(u'\ufeff')
> >
> > In fact, we should just get rid of this function and put
> > text.lstrip(u'\ufeff') in the place where it is called. (BTW, should
> > we put it back into the output?)
> >
> > Again, if you are using markdown as a module, you should decode your
> > content yourself, run it through md.convert(), and then use the
> > resulting unicode as you wish:
> >
> >     input_file = codecs.open("test.txt", mode="r", encoding="utf16")
> >     text = input_file.read()
> >     html_unicode = Markdown.markdown(text, extensions)
> >     output_file = codecs.open("test.html", "w", encoding="utf8")
> >     output_file.write(html_unicode)
> >
> > Perhaps we should raise an error if we get an encoded string? I.e.,
> > check that either the string is of type unicode _or_ it has no special
> > characters.
> >
> > Markdown.markdown does have an obvious bug in that it accepts an
> > encoding argument and doesn't pass it to Markdown.__init__. I suppose
> > we should just get rid of this parameter altogether.
> >
> > There is also another utility function - markdownFromFile. This one
> > does the encoding and decoding for you. For simplicity, it uses only
> > one encoding argument, which is used for both decoding the input and
> > encoding output. I suppose that this might be confusing. Should we
> > add an extra argument "output_encoding" making it optional?
> > I.e.:
> >
> >     def markdownFromFile(input = None,
> >                          output = None,
> >                          extensions = [],
> >                          encoding = None,
> >                          output_encoding = None,
> >                          message_threshold = CRITICAL,
> >                          safe = False) :
> >         if not output_encoding:
> >             output_encoding = encoding
> >
> > I must admit here that I just went to look at the documentation on the
> > wiki and am realizing that that's what is responsible for much of the
> > confusion. We have a new wiki at http://markdown.freewisdom.org/ and
> > I am slowly moving content there. In particular, I copied over the
> > content of http://markdown.freewisdom.org/Using_as_a_Module and
> > updated it with the example above.
> >
> > We should perhaps create a page called "BOMs" to archive there the
> > design decisions related to BOM removal, etc.
> >
> > - yuri
> >
> >
> > On 10/30/07, Waylan Limberg <wa...@gm...> wrote:
> > > Kent, thanks for the info. We'll look at this further.
> > >
> > > On 10/30/07, Kent Johnson <ke...@td...> wrote:
> > > > Waylan Limberg wrote:
> > > > > Kent,
> > > > >
> > > > > Could you verify that revision 46 fixes the problem for you?
> > > >
> > > > It will fix my problem but it won't work correctly with all unicode
> > > > text. For example if the original text contains a BOM and it is
> > > > converted with utf-16be or utf-16le encoding then the unicode string
> > > > still contains a BOM which will not be removed by this patch.
> > >
> > > My testing shows this works with utf-16. Could you provide a simple
> > > test case?
> > >
> > > >
> > > > Also it still seems a bit strange that the encoding argument to
> > > > markdown() is not used at all and the encoding argument to
> > > > Markdown.__init__() is the encoding that the data was in *before* it was
> > > > converted to unicode.
> > > >
> > > > I would write removeBOM() as
> > > >
> > > >     def removeBOM(text, encoding):
> > > >         if isinstance(text, unicode):
> > > >             boms = [u'\ufeff']
> > > >         else:
> > > >             boms = BOMS[encoding]
> > > >         for bom in boms:
> > > >             if text.startswith(bom):
> > > >                 return text.lstrip(bom)
> > > >         return text
> > > >
> > > > and I would change the rest of the code to use encoding=None when the
> > > > text is actually unicode.
> > > >
> > > > Kent
> > > >
> > > > >
> > > > > We can thank the very smart Malcolm Tredinnick for providing a patch.
> > > > > See bug report [1817528] for more.
> > > > >
> > > > > On 9/12/07, Kent Johnson <ke...@td...> wrote:
> > > > >> Hi,
> > > > >>
> > > > >> Markdown 1.6b doesn't work with UTF-8-encoded text. It fails with a
> > > > >> UnicodeDecodeError in removeBOM():
> > > > >>
> > > > >> In [3]: import markdown
> > > > >> In [4]: text = u'\xe2'.encode('utf-8')
> > > > >> In [6]: print text
> > > > >> â
> > > > >> In [7]: print markdown.markdown(text)
> > > > >> ------------------------------------------------------------
> > > > >> Traceback (most recent call last):
> > > > >>   File "<ipython console>", line 1, in <module>
> > > > >>   File
> > > > >> "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py",
> > > > >> line 1722, in markdown
> > > > >>     return md.convert(text)
> > > > >>   File
> > > > >> "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py",
> > > > >> line 1614, in convert
> > > > >>     self.source = removeBOM(self.source, self.encoding)
> > > > >>   File
> > > > >> "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py",
> > > > >> line 74, in removeBOM
> > > > >>     if text.startswith(bom):
> > > > >> <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte
> > > > >> 0xc3 in position 0: ordinal not in range(128)
> > > > >>
> > > > >> The problem is that the BOM being tested is
> > > > >> unicode so to execute
> > > > >>     text.startswith(bom)
> > > > >> Python tries to convert text to Unicode using the default encoding
> > > > >> (ascii). This fails because the text is not ascii.
> > > > >>
> > > > >> I'm trying to understand what the encoding parameter is for; it doesn't
> > > > >> seem to do much. There also seems to be some confusion with the use of
> > > > >> encoding in markdownFromFile() vs markdown(); the file is converted to
> > > > >> Unicode on input so I don't understand why the same encoding parameter
> > > > >> is passed to markdown()?
> > > > >>
> > > > >> ISTM the encoding passed to markdown should match the encoding of the
> > > > >> text passed to markdown, and the values in the BOMS table should be in
> > > > >> the encoding of the key, not in unicode. Then the __unicode__() method
> > > > >> should actually decode. Or is the intent that the text passed to
> > > > >> markdown() should always be ascii or unicode?
> > > > >>
> > > > >> I can put together a patch if you like but I wanted to make sure that I
> > > > >> am not missing some grand plan...
> > > > >>
> > > > >> Kent
> > --
> > Yuri Takhteyev
> > Ph.D. Candidate, UC Berkeley School of Information
> > http://takhteyev.org/, http://www.freewisdom.org/

-- 
----
Waylan Limberg
wa...@gm...
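
For anyone reading this thread later, here is a rough sketch of the split discussed above: the core stays unicode-in-unicode-out, and a thin shortcut does the decoding and encoding around it. It is written in Python 3 terms (where `str` is what the thread calls unicode), and `remove_bom`, `convert_encoded` and the `convert` callable are hypothetical names made up for illustration, not functions in markdown.py. It slices off a single leading BOM rather than calling `lstrip`, because `lstrip` treats its argument as a set of characters to strip, which only happens to be safe for a one-character BOM.

```python
import codecs

BOM = "\ufeff"  # what a byte order mark looks like once text is decoded


def remove_bom(text):
    """Drop one leading U+FEFF from already-decoded text (hypothetical helper)."""
    if not isinstance(text, str):
        # Encoded input is the caller's mistake; refuse it instead of guessing.
        raise TypeError("expected decoded text, got %s" % type(text).__name__)
    return text[len(BOM):] if text.startswith(BOM) else text


def convert_encoded(raw, convert, encoding, output_encoding=None):
    """Shortcut for the common case: decode, convert, encode (hypothetical).

    `convert` stands in for a text-to-text converter such as md.convert.
    """
    text = remove_bom(raw.decode(encoding))
    html = convert(text)
    # Fall back to the input encoding when no output encoding is given.
    return html.encode(output_encoding or encoding)


# The utf-16le case from the thread: the "utf-16" codec consumes the BOM,
# while "utf-16-le" leaves it in the decoded text.
raw = codecs.BOM_UTF16_LE + "# hi".encode("utf-16-le")
print(repr(raw.decode("utf-16")))
print(repr(raw.decode("utf-16-le")))
print(repr(remove_bom(raw.decode("utf-16-le"))))
```

The last three lines are meant to show why a BOM can survive decoding with utf-16le or utf-16be, so stripping it from the decoded text covers both cases without needing an encoding argument.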