From: Waylan L. <wa...@gm...> - 2007-10-30 16:51:33
Kent, thanks for the info. We'll look at this further.
On 10/30/07, Kent Johnson <ke...@td...> wrote:
> Waylan Limberg wrote:
> > Kent,
> >
> > Could you verify that revision 46 fixes the problem for you?
>
> It will fix my problem but it won't work correctly with all unicode
> text. For example if the original text contains a BOM and it is
> converted with utf-16be or utf-16le encoding then the unicode string
> still contains a BOM which will not be removed by this patch.
My testing shows this works with utf-16. Could you provide a simple test case?
>
> Also it still seems a bit strange that the encoding argument to
> markdown() is not used at all and the encoding argument to
> Markdown.__init__() is the encoding that the data was in *before* it was
> converted to unicode.
>
> I would write removeBOM() as
>
> def removeBOM(text, encoding):
>     if isinstance(text, unicode):
>         boms = [u'\ufeff']
>     else:
>         boms = BOMS[encoding]
>     for bom in boms:
>         if text.startswith(bom):
>             return text.lstrip(bom)
>     return text
>
> and I would change the rest of the code to use encoding=None when the
> text is actually unicode.
>
> Kent
>
> >
> > We can thank the very smart Malcolm Tredinnick for providing a patch.
> > See bug report [1817528] for more.
> >
> > On 9/12/07, Kent Johnson <ke...@td...> wrote:
> >> Hi,
> >>
> >> Markdown 1.6b doesn't work with UTF-8-encoded text. It fails with a
> >> UnicodeDecodeError in removeBOM():
> >>
> >> In [3]: import markdown
> >> In [4]: text = u'\xe2'.encode('utf-8')
> >> In [6]: print text
> >> â
> >> In [7]: print markdown.markdown(text)
> >> ------------------------------------------------------------
> >> Traceback (most recent call last):
> >>   File "<ipython console>", line 1, in <module>
> >>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1722, in markdown
> >>     return md.convert(text)
> >>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1614, in convert
> >>     self.source = removeBOM(self.source, self.encoding)
> >>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 74, in removeBOM
> >>     if text.startswith(bom):
> >> <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
> >>
> >> The problem is that the BOM being tested is unicode so to execute
> >> text.startswith(bom)
> >> Python tries to convert text to Unicode using the default encoding
> >> (ascii). This fails because the text is not ascii.
> >>
> >> I'm trying to understand what the encoding parameter is for; it doesn't
> >> seem to do much. There also seems to be some confusion with the use of
> >> encoding in markdownFromFile() vs markdown(); the file is converted to
> >> Unicode on input so I don't understand why the same encoding parameter
> >> is passed to markdown()?
> >>
> >> ISTM the encoding passed to markdown should match the encoding of the
> >> text passed to markdown, and the values in the BOMS table should be in
> >> the encoding of the key, not in unicode. Then the __unicode__() method
> >> should actually decode. Or is the intent that the text passed to
> >> markdown() should always be ascii or unicode?
> >>
> >> I can put together a patch if you like but I wanted to make sure that I
> >> am not missing some grand plan...
> >>
> >> Kent
> >>
> >> _______________________________________________
> >> Python-markdown-discuss mailing list
> >> Pyt...@li...
> >> https://lists.sourceforge.net/lists/listinfo/python-markdown-discuss
> >>
> >
> >
>
>
-- 
----
Waylan Limberg
wa...@gm...
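
The utf-16 versus utf-16be/utf-16le question raised in the thread may come down to which codec consumes the BOM. What follows is a minimal sketch, written in Python 3 syntax (the thread itself uses Python 2) and not taken from markdown.py. The names `remove_bom` and `BYTE_BOMS` are hypothetical, and the table is only an assumed stand-in for the BOMS table mentioned above.

```python
import codecs

# Hypothetical table for illustration: encoded BOMs keyed by codec name.
# Keys must match the caller's spelling exactly; no alias normalization is done.
BYTE_BOMS = {
    "utf-8": (codecs.BOM_UTF8,),
    "utf-16": (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE),
    "utf-16-le": (codecs.BOM_UTF16_LE,),
    "utf-16-be": (codecs.BOM_UTF16_BE,),
}


def remove_bom(text, encoding=None):
    """Strip one leading BOM from decoded text (str) or encoded data (bytes)."""
    if isinstance(text, str):
        # Already decoded: the BOM is the single code point U+FEFF.
        boms = ("\ufeff",)
    else:
        # Still encoded: compare against the byte form for that encoding.
        boms = BYTE_BOMS.get(encoding, ())
    for bom in boms:
        if text.startswith(bom):
            # Slice off exactly one BOM instead of using lstrip(),
            # which treats its argument as a set of characters.
            return text[len(bom):]
    return text


# Big-endian UTF-16 data that starts with a BOM.
raw = codecs.BOM_UTF16_BE + "hello".encode("utf-16-be")

# The generic "utf-16" codec consumes the BOM while decoding...
decoded_generic = raw.decode("utf-16")      # 'hello'

# ...but the explicit-endian codec leaves it in the result as U+FEFF.
decoded_explicit = raw.decode("utf-16-be")  # '\ufeffhello'

cleaned = remove_bom(decoded_explicit)      # 'hello'
```

With the text already decoded, the `encoding` argument is not needed, which is in line with the suggestion above to pass encoding=None for unicode input.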