From: Yuri T. <qar...@gm...> - 2007-10-30 19:28:35
I haven't had a chance to look at the specific problem, but in
general, here is how it is _supposed_ to work.
The Markdown class is unicode-in-unicode-out. It can take a simple
string as input, but one should never pass an encoded string to it, be
it utf8 or whatever.
It's the caller's responsibility to decode their text into unicode from
utf8 or whatever it is that they have it encoded as, and they can then
encode the output into whatever encoding they want. Then I got a
patch for removing the BOM and integrated it without thinking, which
required passing "encoding" to it. Looking at it now I realize that
that was quite stupid. Since removeBOM() should never get encoded
strings, it should _assume_ that the input is unicode, so presumably it
should suffice to have:
def removeBOM(text, encoding):
    return text.lstrip(u'\ufeff')
In fact, we should just get rid of this function and put
text.lstrip(u'\ufeff') in the place where it is called. (BTW, should
we put it back into the output?)
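To illustrate the point in modern Python 3 terms (where str *is* the
unicode type) -- this is a sketch of the idea, not the library's code:

```python
# Once the input has been decoded, a BOM survives only as the single
# code point U+FEFF at the start of the string, so a plain lstrip is
# enough and no encoding argument is needed.
text = u"\ufefftitle\n=====\n"   # decoded text that still carries a BOM
clean = text.lstrip(u"\ufeff")
assert clean == u"title\n=====\n"
```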
Again, if you are using markdown as a module, you should decode your
content yourself, run it through md.convert(), and then use the
resulting unicode as you wish:
input_file = codecs.open("test.txt", mode="r", encoding="utf16")
text = input_file.read()
html_unicode = Markdown.markdown(text, extensions)
output_file = codecs.open("test.html", "w", encoding="utf8")
output_file.write(html_unicode)
Perhaps we should raise an error if we get an encoded string? I.e.,
check that either the string is of type unicode _or_ it has no special
characters.
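Such a guard might look like the following sketch in modern Python 3
(where the unicode/encoded split is str vs bytes); `check_input` is a
hypothetical name, not part of the actual library:

```python
def check_input(text):
    # Accept unicode outright; tolerate a byte string only if it is
    # pure ASCII, i.e. it contains no special characters.
    if isinstance(text, str):        # Python 3: str *is* unicode
        return text
    try:
        return text.decode("ascii")  # ASCII bytes are unambiguous
    except UnicodeDecodeError:
        raise TypeError("markdown expects unicode input; "
                        "decode your byte string first")
```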
Markdown.markdown does have an obvious bug in that it accepts an
encoding argument and doesn't pass it to Markdown.__init__. I suppose
we should just get rid of this parameter altogether.
There is also another utility function - markdownFromFile. This one
does the encoding and decoding for you. For simplicity, it uses only
one encoding argument, which is used for both decoding the input and
encoding output. I suppose that this might be confusing. Should we
add an extra argument "output_encoding" making it optional? I.e.:
def markdownFromFile(input = None,
                     output = None,
                     extensions = [],
                     encoding = None,
                     output_encoding = None,
                     message_threshold = CRITICAL,
                     safe = False):
    if not output_encoding:
        output_encoding = encoding
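To make the proposed behaviour concrete, here is a self-contained
sketch using only the stdlib; `markdown_from_file_sketch` and its
`convert` callback are illustrative stand-ins for markdownFromFile and
Markdown.markdown, not the actual API:

```python
import codecs

def markdown_from_file_sketch(src, dst, convert,
                              encoding="utf-8", output_encoding=None):
    # The two-encoding fallback under discussion: if no separate
    # output encoding is given, reuse the input encoding.
    if not output_encoding:
        output_encoding = encoding
    with codecs.open(src, "r", encoding=encoding) as f:
        text = f.read().lstrip(u"\ufeff")  # decode first, then drop any BOM
    with codecs.open(dst, "w", encoding=output_encoding) as f:
        f.write(convert(text))
```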
I must admit here that I just went to look at the documentation on the
wiki and am realizing that that's what is responsible for much of the
confusion. We have a new wiki at http://markdown.freewisdom.org/ and
I am slowly moving content there. In particular, I copied over the
content of http://markdown.freewisdom.org/Using_as_a_Module and
updated it with the example above.
We should perhaps create a page called "BOMs" to archive the design
decisions related to BOM removal, etc., there.
- yuri
On 10/30/07, Waylan Limberg <wa...@gm...> wrote:
> Kent, thanks for the info. We'll look at this further.
>
> On 10/30/07, Kent Johnson <ke...@td...> wrote:
> > Waylan Limberg wrote:
> > > Kent,
> > >
> > > Could you verify that revision 46 fixes the problem for you?
> >
> > It will fix my problem but it won't work correctly with all unicode
> > text. For example if the original text contains a BOM and it is
> > converted with utf-16be or utf-16le encoding then the unicode string
> > still contains a BOM which will not be removed by this patch.
>
> My testing shows this works with utf-16. Could you provide a simple
> test case?
>
> >
> > Also it still seems a bit strange that the encoding argument to
> > markdown() is not used at all and the encoding argument to
> > Markdown.__init__() is the encoding that the data was in *before* it was
> > converted to unicode.
> >
> > I would write removeBOM() as
> >
> > def removeBOM(text, encoding):
> >     if isinstance(text, unicode):
> >         boms = [u'\ufeff']
> >     else:
> >         boms = BOMS[encoding]
> >     for bom in boms:
> >         if text.startswith(bom):
> >             return text.lstrip(bom)
> >     return text
> >
> > and I would change the rest of the code to use encoding=None when the
> > text is actually unicode.
> >
> > Kent
> >
> > >
> > > We can thank the very smart Malcolm Tredinnick for providing a patch.
> > > See bug report [1817528] for more.
> > >
> > > On 9/12/07, Kent Johnson <ke...@td...> wrote:
> > >> Hi,
> > >>
> > >> Markdown 1.6b doesn't work with UTF-8-encoded text. It fails with a
> > >> UnicodeDecodeError in removeBOM():
> > >>
> > >> In [3]: import markdown
> > >> In [4]: text = u'\xe2'.encode('utf-8')
> > >> In [6]: print text
> > >> â
> > >> In [7]: print markdown.markdown(text)
> > >> ------------------------------------------------------------
> > >> Traceback (most recent call last):
> > >>   File "<ipython console>", line 1, in <module>
> > >>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1722, in markdown
> > >>     return md.convert(text)
> > >>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1614, in convert
> > >>     self.source = removeBOM(self.source, self.encoding)
> > >>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 74, in removeBOM
> > >>     if text.startswith(bom):
> > >> <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte
> > >> 0xc3 in position 0: ordinal not in range(128)
> > >>
> > >> The problem is that the BOM being tested is unicode so to execute
> > >> text.startswith(bom)
> > >> Python tries to convert text to Unicode using the default encoding
> > >> (ascii). This fails because the text is not ascii.
> > >>
> > >> I'm trying to understand what the encoding parameter is for; it doesn't
> > >> seem to do much. There also seems to be some confusion with the use of
> > >> encoding in markdownFromFile() vs markdown(); the file is converted to
> > >> Unicode on input so I don't understand why the same encoding parameter
> > >> is passed to markdown()?
> > >>
> > >> ISTM the encoding passed to markdown should match the encoding of the
> > >> text passed to markdown, and the values in the BOMS table should be in
> > >> the encoding of the key, not in unicode. Then the __unicode__() method
> > >> should actually decode. Or is the intent that the text passed to
> > >> markdown() should always be ascii or unicode?
> > >>
> > >> I can put together a patch if you like but I wanted to make sure that I
> > >> am not missing some grand plan...
> > >>
> > >> Kent
> > >>
> > >> -------------------------------------------------------------------------
> > >> This SF.net email is sponsored by: Microsoft
> > >> Defy all challenges. Microsoft(R) Visual Studio 2005.
> > >> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> > >> _______________________________________________
> > >> Python-markdown-discuss mailing list
> > >> Pyt...@li...
> > >> https://lists.sourceforge.net/lists/listinfo/python-markdown-discuss
> > >>
> > >
> > >
> >
> >
>
>
> --
> ----
> Waylan Limberg
> wa...@gm...
>
>
-- 
Yuri Takhteyev
Ph.D. Candidate, UC Berkeley School of Information
http://takhteyev.org/, http://www.freewisdom.org/