From: Kent J. <ke...@td...> - 2007-09-12 13:18:50
Hi,

Markdown 1.6b doesn't work with UTF-8-encoded text. It fails with a
UnicodeDecodeError in removeBOM():

    In [3]: import markdown
    In [4]: text = u'\xe2'.encode('utf-8')
    In [6]: print text
    â
    In [7]: print markdown.markdown(text)
    ------------------------------------------------------------
    Traceback (most recent call last):
      File "<ipython console>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1722, in markdown
        return md.convert(text)
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1614, in convert
        self.source = removeBOM(self.source, self.encoding)
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 74, in removeBOM
        if text.startswith(bom):
    <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

The problem is that the BOM being tested is unicode, so to execute
text.startswith(bom) Python tries to convert text to Unicode using the
default encoding (ascii). This fails because the text is not ascii.

I'm trying to understand what the encoding parameter is for; it doesn't
seem to do much. There also seems to be some confusion between the use of
encoding in markdownFromFile() and markdown(); the file is converted to
Unicode on input, so I don't understand why the same encoding parameter
is passed on to markdown().

ISTM the encoding passed to markdown() should match the encoding of the
text passed to markdown(), and the values in the BOMS table should be in
the encoding of the key, not in unicode. Then the __unicode__() method
should actually decode. Or is the intent that the text passed to
markdown() should always be ascii or unicode?

I can put together a patch if you like, but I wanted to make sure that I
am not missing some grand plan...

Kent
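To make the failure above concrete, here is a minimal reproduction of the
implicit coercion (not from the original message; Python 2 semantics, names
are illustrative):

    # str.startswith(unicode) makes Python 2 decode the byte string with the
    # default 'ascii' codec before comparing, which fails on the 0xc3 byte.
    text = u'\xe2'.encode('utf-8')   # the utf-8 bytes '\xc3\xa2'
    bom = u'\ufeff'                  # a unicode BOM, as used in removeBOM()
    try:
        text.startswith(bom)
    except UnicodeDecodeError, err:
        print err   # 'ascii' codec can't decode byte 0xc3 in position 0 ...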
From: Yuri T. <qar...@gm...> - 2007-09-16 10:10:49
Thanks for reporting this. I will look into it.

- yuri
From: Waylan L. <wa...@gm...> - 2007-10-30 16:51:33
Kent, thanks for the info. We'll look at this further.

On 10/30/07, Kent Johnson <ke...@td...> wrote:
> Waylan Limberg wrote:
> > Kent,
> >
> > Could you verify that revision 46 fixes the problem for you?
>
> It will fix my problem but it won't work correctly with all unicode
> text. For example if the original text contains a BOM and it is
> converted with utf-16be or utf-16le encoding then the unicode string
> still contains a BOM which will not be removed by this patch.

My testing shows this works with utf-16. Could you provide a simple test case?

> Also it still seems a bit strange that the encoding argument to
> markdown() is not used at all and the encoding argument to
> Markdown.__init__() is the encoding that the data was in *before* it was
> converted to unicode.
>
> I would write removeBOM() as
>
>     def removeBOM(text, encoding):
>         if isinstance(text, unicode):
>             boms = [u'\ufeff']
>         else:
>             boms = BOMS[encoding]
>         for bom in boms:
>             if text.startswith(bom):
>                 return text.lstrip(bom)
>         return text
>
> and I would change the rest of the code to use encoding=None when the
> text is actually unicode.
>
> Kent
>
> > We can thank the very smart Malcolm Tredinnick for providing a patch.
> > See bug report [1817528] for more.
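One way to see the distinction being drawn here (an illustrative Python 2
snippet, not part of the thread): the generic utf-16 codec consumes a
leading BOM, while the endian-specific codecs leave it in the decoded
string.

    data = u'\ufeffhello'.encode('utf-16be')
    print repr(data.decode('utf-16'))    # u'hello'        -- codec eats the BOM
    print repr(data.decode('utf-16be'))  # u'\ufeffhello'  -- BOM survives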
From: Yuri T. <qar...@gm...> - 2007-10-30 19:28:35
I haven't had a chance to look at the specific problem, but in general,
here is how it is _supposed_ to work.

The Markdown class is unicode-in-unicode-out. It can take a simple string
as input, but one should never pass an encoded string to it, be it utf8 or
whatever. It's the caller's responsibility to decode their text into
unicode from utf8 or whatever it is that they have it encoded as, and they
can then encode the output into whatever encoding they want.

Then I got a patch for removing BOM and integrated it without thinking,
which required passing "encoding" to it. Looking at it now I realize that
that was quite stupid. Since removeBOM() should never get encoded strings,
it should _assume_ that the input is unicode, so presumably it should
suffice to have:

    def removeBOM(text, encoding):
        return text.lstrip(u'\ufeff')

In fact, we should just get rid of this function and put
text.lstrip(u'\ufeff') in the place where it is called. (BTW, should we
put it back into the output?)

Again, if you are using markdown as a module, you should decode your
content yourself, run it through md.convert(), and then use the resulting
unicode as you wish:

    input_file = codecs.open("test.txt", mode="r", encoding="utf16")
    text = input_file.read()
    html_unicode = Markdown.markdown(text, extensions)
    output_file = codecs.open("test.html", "w", encoding="utf8")
    output_file.write(html_unicode)

Perhaps we should raise an error if we get an encoded string? I.e., check
that either the string is of type unicode _or_ it has no special
characters.

Markdown.markdown does have an obvious bug in that it accepts an encoding
argument and doesn't pass it to Markdown.__init__. I suppose we should
just get rid of this parameter altogether.

There is also another utility function - markdownFromFile. This one does
the encoding and decoding for you. For simplicity, it uses only one
encoding argument, which is used both for decoding the input and for
encoding the output. I suppose that this might be confusing. Should we add
an extra argument, "output_encoding", making it optional? I.e.:

    def markdownFromFile(input = None,
                         output = None,
                         extensions = [],
                         encoding = None,
                         output_encoding = None,
                         message_threshold = CRITICAL,
                         safe = False):
        if not output_encoding:
            output_encoding = encoding

I must admit here that I just went to look at the documentation on the
wiki and am realizing that that's what is responsible for much of the
confusion. We have a new wiki at http://markdown.freewisdom.org/ and I am
slowly moving content there. In particular, I copied over the content of
http://markdown.freewisdom.org/Using_as_a_Module and updated it with the
example above.

We should perhaps create a page called "BOMs" to archive the design
decisions related to BOM removal there, etc.

- yuri
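For illustration, here is one way the guard Yuri floats above ("either the
string is of type unicode _or_ it has no special characters") might look;
the function name and error message are hypothetical, not from the markdown
source (Python 2):

    # Accept unicode, or byte strings that are pure ascii; reject anything
    # that looks like an already-encoded string.
    def ensure_unicode(source):
        if isinstance(source, unicode):
            return source
        try:
            return source.decode("ascii")
        except UnicodeDecodeError:
            raise TypeError("markdown expects unicode input; "
                            "decode your text before converting")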
From: Kent J. <ke...@td...> - 2007-10-30 19:53:18
Yuri Takhteyev wrote:
> The Markdown class is unicode-in-unicode-out. It can take a simple
> string as input, but one should never pass an encoded string to it, be
> it utf8 or whatever.

> Since removeBOM() should never get encoded strings, it should _assume_
> that the input is unicode, so presumably it should suffice to have:
>
>     def removeBOM(text, encoding):
>         return text.lstrip(u'\ufeff')

Sounds good to me.

> In fact, we should just get rid of this function and put
> text.lstrip(u'\ufeff') in the place where it is called. (BTW, should
> we put it back into the output?)

Yes, and get rid of the encoding parameter to markdown() and
Markdown.__init__(), which then will not be used at all. That will reduce
the confusion; as the code is written, it is not at all clear that it
expects unicode text only (e.g. the comment mentions "The character
encoding of <text>", which has no meaning if <text> is unicode).

> Perhaps we should raise an error if we get an encoded string? I.e., check
> that either the string is of type unicode _or_ it has no special
> characters.

Easy to do - just put

    self.source = unicode(source)

in Markdown.__init__().

> Markdown.markdown does have an obvious bug in that it accepts an encoding
> argument and doesn't pass it to Markdown.__init__. I suppose we should
> just get rid of this parameter altogether.

Yes please!

Kent
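For what it's worth, this is how that one-line guard behaves in Python 2
(an illustrative aside; the sample strings are made up):

    # Behaviour of the proposed `self.source = unicode(source)` guard:
    print repr(unicode(u'caf\xe9'))          # already unicode: passed through
    print repr(unicode('plain ascii text'))  # ascii bytes: decoded silently
    try:
        unicode(u'caf\xe9'.encode('utf-8'))  # non-ascii bytes
    except UnicodeDecodeError:
        print 'rejected - the caller must decode first'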
From: Waylan L. <wa...@gm...> - 2007-10-30 20:01:43
On 10/30/07, Yuri Takhteyev <qar...@gm...> wrote:
> I haven't had a chance to look at the specific problem, but in general,
> here is how it is _supposed_ to work.

Ahh, well that clears a few things up for me. Thanks for the explanation.

Now for my proposal:

Let's leave md.convert the way it is. The user has to convert to unicode
first and then gets unicode back in his code, which he can do with as he
pleases.

However, I see markdown.markdown() as a shortcut for the common case. So
maybe we could add some basic encoding/decoding for common cases. The user
must pass in the encoding (and perhaps an optional output encoding) and,
assuming the encoding actually matches (the user's responsibility -
otherwise it fails gracefully), things work fine. If the user has a
situation that doesn't fit the common case, then we would expect that the
encoding/decoding will be done manually with md.convert. As long as the
differences are clearly documented, that should work fine. Of course, the
more I think about this, the more it feels like extra work I don't want to
do. Any input?

Obviously, markdownFromFile is a different animal and should work as you
proposed.
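To illustrate the split being proposed here (a sketch only; the behaviour
of the encoding argument to markdown.markdown() shown below is the proposed
one, not what the released code does, and the sample text is made up):

    import markdown

    raw_bytes = u'Caf\xe9 *au lait*'.encode('utf-8')

    # Common case (proposed): hand the shortcut encoded bytes plus their
    # encoding and let it decode/encode around the conversion.
    html_bytes = markdown.markdown(raw_bytes, encoding='utf-8')

    # Anything unusual: decode yourself and work purely in unicode,
    # as with md.convert() elsewhere in this thread.
    html_unicode = markdown.markdown(raw_bytes.decode('utf-8'))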
From: Waylan L. <wa...@gm...> - 2007-10-30 20:05:19
On 10/30/07, Waylan Limberg <wa...@gm...> wrote:
> However, I see markdown.markdown() as a shortcut for the common case.
> So maybe we could add some basic encoding/decoding for common cases.

I should mention that I see all this happening in the markdown() function
itself, not as part of Markdown. Markdown.__init__ or Markdown.convert
will always get unicode.
From: Kent J. <ke...@td...> - 2007-10-30 20:11:43
Waylan Limberg wrote:
> However, I see markdown.markdown() as a shortcut for the common case.
> So maybe we could add some basic encoding/decoding for common cases.

Seems reasonable.

> The user must pass in the encoding (and perhaps an optional output
> encoding)

I think *two* encodings is overkill for both markdown() and
markdownFromFile(). In the common case they will likely be the same, and
it is so easy to do the conversion yourself if you want them to be
different.

> and, assuming the encoding actually matches (the user's responsibility -
> otherwise it fails gracefully)

I hope by 'fails gracefully' you mean 'raises UnicodeDecodeError'. What
else could you do? Start guessing encodings?

> things work fine. If the user has a situation that doesn't fit the
> common case, then we would expect that the encoding/decoding will be
> done manually with md.convert. As long as the differences are clearly
> documented, that should work fine. Of course, the more I think about
> this, the more it feels like extra work I don't want to do. Any input?

It's easy. In markdown() change

    return md.convert(text)

to

    if encoding is not None:
        text = text.decode(encoding)
    converted = md.convert(text)
    if encoding is not None:
        converted = converted.encode(encoding)
    return converted

Kent
From: Yuri T. <qar...@gm...> - 2007-10-30 20:45:23
> > However, I see markdown.markdown() as a shortcut for the common case.
> > So maybe we could add some basic encoding/decoding for common cases.
>
> Seems reasonable.

Well, except that what I learned the hard way while adding unicode support
to MD is that there seems to be only one "right" way to work with unicode
in Python: decode when you read the file and encode when you write. Once
you've got encoded strings flying around, it's a recipe for problems. So,
I don't want to endorse passing encoded strings as "the common case." In
most cases, reading the content of a file without decoding is a bad idea
and I don't want to encourage people to do that. Instead, I want to stick
with a simple rule: if it's a string, then it's unicode.

So, I think we should offer the following functions:

1. unicode text -> unicode html
2. file path for input, encoding -> unicode html
3. file path for input, encoding, file path for output -> (writes to file)

I see markdown.markdown() as doing #1. markdown.markdownFromFile() now
does #3. We _could_ change it to also do #2. We could make it always
return the unicode string, and also write encoded output to "output" if
that argument is set. (We should probably accept either a file name or a
stream as that parameter.)

Now, if people feel that there is a common (if ungodly) case where the
user needs to deal with incoming encoded strings, I suggest we add a new
method for that: markdownFromEncodedString(), which will do the decoding
and return unicode. Though, in that case it should really be enough to
write

    markdown.markdown(unicode(my_ungodly_string, "utf8"))

so I am not sure such a method is really needed.

> I think *two* encodings is overkill for both markdown() and
> markdownFromFile(). In the common case they will likely be the same, and
> it is so easy to do the conversion yourself if you want them to be
> different.

Again, markdown() will no longer have encoding. As to the second, I tend
to agree, especially if markdownFromFile could return the unicode instead
of writing it to a file.

> I hope by 'fails gracefully' you mean 'raises UnicodeDecodeError'. What
> else could you do? Start guessing encodings?

I think we should raise an error. The only question is: should we return a
better error message?

>     if encoding is not None:
>         text = text.decode(encoding)
>     converted = md.convert(text)
>     if encoding is not None:
>         converted = converted.encode(encoding)
>     return converted

Again, I would really rather stick with a simple rule of "files are
encoded, strings are unicode" and banish encoded strings completely.
Otherwise keeping track of what is and what is not unicode becomes a huge
headache. It also becomes hard to explain to other people what exactly we
are doing. The only place where .encode() appears now is in

    sys.stdout.write(new_text.encode(encoding))

Note that in this case I do the conversion without saving the encoded
string on purpose. If sys.stdout.write wants an encoded string, that's
fine - I'll give it to it, but I don't want to have any encoded strings
sticking around. If I had to keep them for any reason, I would make sure
to prefix them with "encoded_".

- yuri
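A sketch of what functions #2 and #3 above might look like if they were
folded into one helper (the signature, defaults, and return-unicode
behaviour are guesses for illustration, not the actual markdownFromFile
implementation):

    import codecs
    import markdown

    def markdownFromFile(input, encoding, output=None, extensions=[]):
        # Case 2: decode the input file and return unicode html.
        text = codecs.open(input, mode="r", encoding=encoding).read()
        html = markdown.markdown(text, extensions)
        # Case 3: optionally also write the html back out, encoded.
        if output is not None:
            codecs.open(output, mode="w", encoding=encoding).write(html)
        return html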
From: Kent J. <ke...@td...> - 2007-10-30 22:05:12
Yuri Takhteyev wrote:
> I want to stick with a simple rule: if it's a string, then it's unicode.
>
> So, I think we should offer the following functions:
>
> 1. unicode text -> unicode html

Hmm...one problem with this (and Waylan's suggestion of making the
encoding parameter to markdown() do something useful) is that until 1.6b,
markdown() did in fact work perfectly well with encoded text, and it was
not at all clear that this was not the intended usage. When 1.6b came out
I just commented out the call to removeBOM(), complained to the list, and
continued on my way.

I use markdown from Django with the markdown support included with
Django; presumably many other people do as well. For example:
http://www.freewisdom.org/projects/python-markdown/Django

which is based on this post by Waylan:
http://achinghead.com/archive/70/django-blog-and-markdown/

which is pretty close to the current form of the Django markdown filter.

Kent
From: Yuri T. <qar...@gm...> - 2007-10-30 22:34:57
> Hmm...one problem with this (and Waylan's suggestion of making the
> encoding parameter to markdown() do something useful) is that until 1.6b,
> markdown() did in fact work perfectly well with encoded text, and it was
> not at all clear that this was not the intended usage. When 1.6b came out
> I just commented out the call to removeBOM(), complained to the list, and
> continued on my way.

Good point... But I think that was a mistake, which needs to be corrected.
At the very least, I don't want any new users to use it that way. So, the
question is: what would be a good balance between fixing this problem and
not screwing existing users?

I suggest releasing 1.7 with all of Waylan's recent fixes and this change,
and putting a clear message in the release notes that in 1.7
markdown.markdown() expects unicode, and that if you've got utf8-encoded
strings, then you should call it with

    markdown.markdown(input.decode("utf8"))

> I use markdown from Django with the markdown support included with
> Django; presumably many other people do as well. For example:
> http://www.freewisdom.org/projects/python-markdown/Django

I haven't touched Django for some time, so I am not sure what it does with
unicode today. I remember that in December 2006 it was a mess. At that
time they did pass encoded bytestrings around. At this point they seem to
give you an option of either using bytestrings or unicode
(http://www.djangoproject.com/documentation/unicode/).

I think the thing to do here is to write a new version of the plugin,
which would check if the input is unicode, and if not would decode it from
utf8 before sending it to markdown. People who update to 1.7 will also
need to update to the new plugin, which doesn't seem so bad. We should
probably also send the new plugin to the Django team and ask them to
include it instead of the old one. (BTW, does Django actually include
markdown or just the plugin?) Perhaps the django plugin should be included
with the markdown release? We should probably also put it in SVN.

- yuri
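A rough sketch of the plugin behaviour described above (the filter name and
the assumption that bytestrings are utf8 are illustrative; this is not the
actual Django filter code):

    import markdown

    def markdown_filter(value):
        # Django may hand us either a bytestring or unicode; as of 1.7,
        # markdown.markdown() expects unicode, so decode bytestrings first.
        if not isinstance(value, unicode):
            value = value.decode("utf8")   # assumption: bytestrings are utf8
        return markdown.markdown(value)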
From: Waylan L. <wa...@gm...> - 2007-10-31 01:36:18
On 10/30/07, Kent Johnson <ke...@td...> wrote:
> Hmm...one problem with this (and Waylan's suggestion of making the
> encoding parameter to markdown() do something useful) is that until 1.6b,
> markdown() did in fact work perfectly well with encoded text, and it was
> not at all clear that this was not the intended usage.
>
> I use markdown from Django with the markdown support included with
> Django; presumably many other people do as well.

You almost have a point. In fact, I was about to make the same argument.
Then I remembered that that was before the unicode branch was merged in
Django. Ticket 2910 [1] needs to be updated for this and hasn't been.
Well, the latest patch does try to address it, but I was never convinced
it was right. Yuri's clarifications make it clear what needs to happen in
that patch. We make sure we have unicode to pass in (Django has the
mechanisms to force the issue), and so we should always get unicode out.

[1]: http://code.djangoproject.com/attachment/ticket/2910/2910-2.diff

BTW, I consider ticket 2910 the most up-to-date approach to Django
integration. The Markdown docs should probably be updated.

I suppose those still using pre-unicode versions of Django could have
issues. But if you're not updating Django, then I wouldn't expect you to
update its dependencies either.