Re: [Python-markdown-discuss] Markdown encoding

On 10/30/07, Waylan Limberg <wa...@gm...> wrote:
> On 10/30/07, Yuri Takhteyev <qar...@gm...> wrote:
> > I haven't had a chance to look at the specific problem, but in
> > general, here is how it is _supposed_ to work.
>
> Ahh, well that clears a few things up for me. Thanks for the explanation.
>
> Now for my proposal:
>
> Let's leave md.convert the way it is. The user has to convert to
> unicode first and then gets unicode back in his code, which he can do
> with as he pleases.
>
> However, I see markdown.markdown() as a shortcut for the common case.
> So maybe we could add some basic encoding/decoding for common cases.
> The user must pass in the encoding (and perhaps an optional output
> encoding) and, assuming the encoding actually matches (the user's
> responsibility - otherwise it fails gracefully), things work fine. If
> the user has a situation that doesn't fit the common case, then we
> would expect that the encoding/decoding will be done manually with
> md.convert. As long as the differences are clearly documented, that
> should work fine. Of course, the more I think about this, the more it
> feels like extra work I don't want to do. Any input?

I should mention that I see all this happening in the markdown()
function itself, not as part of Markdown. Markdown.__init__ or
Markdown.convert will always get unicode.
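
For illustration, a minimal sketch of what such a shortcut could look like. It is written for Python 3, where str plays the role of unicode and bytes the role of an encoded string. The names markdown_shortcut and convert_unicode are hypothetical; convert_unicode is only a stand-in for the real unicode-in-unicode-out converter. Here a mismatched encoding simply surfaces as UnicodeDecodeError; what "fails gracefully" should mean is left open.

```python
def convert_unicode(text):
    # Hypothetical stand-in for the real converter: text in, text out.
    return "<p>%s</p>" % text


def markdown_shortcut(source, encoding=None, output_encoding=None):
    """Decode the input if needed, convert it, and optionally encode the result."""
    if isinstance(source, bytes):
        if encoding is None:
            raise TypeError("encoded input requires an explicit encoding")
        # Raises UnicodeDecodeError if the bytes do not match the encoding.
        source = source.decode(encoding)
    html = convert_unicode(source)
    # Assumption: the output encoding defaults to the input encoding.
    if output_encoding is None:
        output_encoding = encoding
    if output_encoding is None:
        return html  # no encoding involved: text in, text out
    return html.encode(output_encoding)
```

With no encoding arguments the function stays text-in-text-out, so all byte handling lives in the wrapper and never reaches the converter.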

>
> Obviously, markdownFromFile is a different animal and should work as
> you proposed.
>
>
> >
> > The Markdown class is unicode-in-unicode-out.  It can take a simple
> > string as input, but one should never pass an encoded string to it, be
> > it utf8 or whatever.
> > It's the caller's responsibility to decode their text into unicode from
> > utf8 or whatever it is that they have it encoded as, and they can then
> > encode the output into whatever encoding they want.  Then I got a
> > patch for removing BOM and integrated it without thinking, which
> > required passing "encoding" to it.  Looking at it now I realize that
> > that was quite stupid.  Since removeBOM() should never get encoded
> > strings, we should _assume_ that the input is unicode, so presumably it
> > should suffice to have:
> >
> >     def removeBOM(text, encoding):
> >          return text.lstrip(u'\ufeff')
> >
> > In fact, we should just get rid of this function and put
> > text.lstrip(u'\ufeff') in the place where it is called.  (BTW, should
> > we put it back into the output?)
> >
> > Again, if you are using markdown as a module, you should decode your
> > content yourself, run it through md.convert(), and then use the
> > resulting unicode as you wish:
> >
> >      input_file = codecs.open("test.txt", mode="r", encoding="utf16")
> >      text = input_file.read()
> >      html_unicode = Markdown.markdown(text, extensions)
> >      output_file = codecs.open("test.html", "w", encoding="utf8")
> >      output_file.write(html_unicode)
> >
> > Perhaps we should raise an error if we get an encoded string?  I.e.,
> > check that either the string is of type unicode _or_ it has no special
> > characters.
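
A minimal sketch of the check described above, in Python 3 terms (str for unicode, bytes for an encoded string). The name ensure_text and the choice of TypeError are assumptions for illustration, not anything the library defines.

```python
def ensure_text(source):
    """Return `source` as text, rejecting byte strings that are not plain ASCII."""
    if isinstance(source, str):
        return source
    if isinstance(source, bytes):
        try:
            # Pure-ASCII bytes are unambiguous, so they are accepted.
            return source.decode("ascii")
        except UnicodeDecodeError:
            raise TypeError(
                "expected text; decode encoded input before converting"
            ) from None
    raise TypeError("expected str or bytes, got %s" % type(source).__name__)
```

Failing early like this would turn a confusing error deep inside the converter into a clear message at the entry point.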
> >
> > Markdown.markdown does have an obvious bug in that it accepts an
> > encoding argument and doesn't pass it to Markdown.__init__.  I suppose
> > we should just get rid of this parameter altogether.
> >
> > There is also another utility function - markdownFromFile.  This one
> > does the encoding and decoding for you.  For simplicity, it uses only
> > one encoding argument, which is used for both decoding the input and
> > encoding output.  I suppose that this might be confusing.  Should we
> > add an extra argument "output_encoding" making it optional?  I.e.:
> >
> >     def markdownFromFile(input = None,
> >                          output = None,
> >                          extensions = [],
> >                          encoding = None,
> >                          output_encoding = None,
> >                          message_threshold = CRITICAL,
> >                          safe = False) :
> >         if not output_encoding:
> >             output_encoding = encoding
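
To make the fallback concrete, here is a self-contained sketch in Python 3. It is not the library's markdownFromFile: convert_file and convert_unicode are hypothetical names, and convert_unicode merely stands in for the real converter.

```python
def convert_unicode(text):
    # Hypothetical stand-in for the real converter: text in, text out.
    return "<p>%s</p>" % text.strip()


def convert_file(input_path, output_path, encoding, output_encoding=None):
    """Read with `encoding`, convert, and write with `output_encoding`."""
    if not output_encoding:
        # Same fallback as in the signature above: reuse the input encoding.
        output_encoding = encoding
    with open(input_path, "r", encoding=encoding) as src:
        text = src.read()
    html = convert_unicode(text)
    with open(output_path, "w", encoding=output_encoding) as dst:
        dst.write(html)
```

Calling convert_file("test.txt", "test.html", "utf-16", "utf-8") would read UTF-16 and write UTF-8, while omitting the last argument keeps the current single-encoding behaviour.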
> >
> > I must admit here that I just went to look at the documentation on the
> > wiki and am realizing that that's what is responsible for much of the
> > confusion.  We have a new wiki at http://markdown.freewisdom.org/ and
> > I am slowly moving content there.  In particular, I copied over the
> > content of http://markdown.freewisdom.org/Using_as_a_Module and
> > updated it with the example above.
> >
> > We should perhaps create a page called "BOMs" to archive there the
> > design decisions related to BOM removal, etc.
> >
> >   - yuri
> >
> >
> > On 10/30/07, Waylan Limberg <wa...@gm...> wrote:
> > > Kent, thanks for the info. We'll look at this further.
> > >
> > > On 10/30/07, Kent Johnson <ke...@td...> wrote:
> > > > Waylan Limberg wrote:
> > > > > Kent,
> > > > >
> > > > > Could you verify that revision 46 fixes the problem for you?
> > > >
> > > > It will fix my problem but it won't work correctly with all unicode
> > > > text. For example if the original text contains a BOM and it is
> > > > converted with utf-16be or utf-16le encoding then the unicode string
> > > > still contains a BOM which will not be removed by this patch.
> > >
> > > My testing shows this works with utf-16. Could you provide a simple
> > > test case?
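
One possible test case, sketched in Python 3 using only the standard codecs behaviour: the generic "utf-16" codec consumes a leading BOM, while the endian-specific codecs leave it in the decoded text as U+FEFF. The variable names are illustrative.

```python
import codecs

payload = "hi"
# Big-endian UTF-16 bytes with an explicit BOM in front.
raw = codecs.BOM_UTF16_BE + payload.encode("utf-16-be")

with_generic_codec = raw.decode("utf-16")      # BOM consumed by the codec
with_explicit_codec = raw.decode("utf-16-be")  # BOM survives as U+FEFF

# Removing U+FEFF from the decoded text handles the second case.
stripped = with_explicit_codec.lstrip("\ufeff")
```

This would explain why testing with "utf-16" alone looks fine: the difference only shows up with "utf-16-be" or "utf-16-le".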
> > >
> > > >
> > > > Also it still seems a bit strange that the encoding argument to
> > > > markdown() is not used at all and the encoding argument to
> > > > Markdown.__init__() is the encoding that the data was in *before* it was
> > > > converted to unicode.
> > > >
> > > > I would write removeBOM() as
> > > >
> > > >   def removeBOM(text, encoding):
> > > >       if isinstance(text, unicode):
> > > >           boms = [u'\ufeff']
> > > >       else:
> > > >           boms = BOMS[encoding]
> > > >       for bom in boms:
> > > >           if text.startswith(bom):
> > > >               return text.lstrip(bom)
> > > >       return text
> > > >
> > > > and I would change the rest of the code to use encoding=None when the
> > > > text is actually unicode.
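
One caveat if removeBOM() ever does see encoded strings: lstrip() treats its argument as a set of characters (or bytes), not as a prefix, so it can remove more than the BOM. A sketch in Python 3; strip_bom is a hypothetical helper, and codecs.BOM_UTF8 is used in place of the BOMS table, which is not shown in this thread.

```python
import codecs


def strip_bom(data, bom):
    """Remove `bom` only when it is an exact prefix of `data`."""
    if data.startswith(bom):
        return data[len(bom):]
    return data


# U+FF01 (fullwidth '!') encodes to EF BC 81 in UTF-8, so it shares
# its lead byte with the UTF-8 BOM (EF BB BF).
encoded = codecs.BOM_UTF8 + "\uff01".encode("utf-8")

damaged = encoded.lstrip(codecs.BOM_UTF8)     # also eats the EF lead byte
intact = strip_bom(encoded, codecs.BOM_UTF8)  # removes the BOM only
```

The same helper works on decoded text, e.g. strip_bom(text, "\ufeff"), as long as both arguments are of the same type.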
> > > >
> > > > Kent
> > > >
> > > > >
> > > > > We can thank the very smart Malcolm Tredinnick for providing a patch.
> > > > > See bug report [1817528] for more.
> > > > >
> > > > > On 9/12/07, Kent Johnson <ke...@td...> wrote:
> > > > >> Hi,
> > > > >>
> > > > > >> Markdown 1.6b doesn't work with UTF-8-encoded text. It fails with a
> > > > > >> UnicodeDecodeError in removeBOM():
> > > > >>
> > > > >> In [3]: import markdown
> > > > > >> In [4]: text = u'\xe2'.encode('utf-8')
> > > > > >> In [6]: print text
> > > > > >> â
> > > > >> In [7]: print markdown.markdown(text)
> > > > >> ------------------------------------------------------------
> > > > >> Traceback (most recent call last):
> > > > >>    File "<ipython console>", line 1, in <module>
> > > > > >>    File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1722, in markdown
> > > > > >>      return md.convert(text)
> > > > > >>    File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 1614, in convert
> > > > > >>      self.source = removeBOM(self.source, self.encoding)
> > > > > >>    File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py", line 74, in removeBOM
> > > > > >>      if text.startswith(bom):
> > > > > >> <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte
> > > > > >> 0xc3 in position 0: ordinal not in range(128)
> > > > >>
> > > > > >> The problem is that the BOM being tested is unicode so to execute
> > > > > >>    text.startswith(bom)
> > > > > >> Python tries to convert text to Unicode using the default encoding
> > > > > >> (ascii). This fails because the text is not ascii.
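
The implicit step can be made explicit. The sketch below is Python 3, which no longer converts between bytes and text silently, so the ASCII decode that Python 2 attempted behind the scenes is spelled out by hand. The variable names are illustrative.

```python
# The bytes from the report: U+00E2 encoded as UTF-8 is C3 A2.
text = "\xe2".encode("utf-8")

try:
    # What Python 2 attempted implicitly before comparing with the BOM.
    text.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding explicitly with the right codec first makes the check safe.
decoded = text.decode("utf-8")
has_bom = decoded.startswith("\ufeff")
```

The failure is therefore about mixing an encoded string with a unicode BOM, not about the BOM test itself.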
> > > > >>
> > > > > >> I'm trying to understand what the encoding parameter is for; it doesn't
> > > > > >> seem to do much. There also seems to be some confusion with the use of
> > > > > >> encoding in markdownFromFile() vs markdown(); the file is converted to
> > > > > >> Unicode on input so I don't understand why the same encoding parameter
> > > > > >> is passed to markdown()?
> > > > >>
> > > > > >> ISTM the encoding passed to markdown should match the encoding of the
> > > > > >> text passed to markdown, and the values in the BOMS table should be in
> > > > > >> the encoding of the key, not in unicode. Then the __unicode__() method
> > > > > >> should actually decode. Or is the intent that the text passed to
> > > > > >> markdown() should always be ascii or unicode?
> > > > >>
> > > > > >> I can put together a patch if you like but I wanted to make sure that I
> > > > > >> am not missing some grand plan...
> > > > >>
> > > > >> Kent
> > > > >>
> > > > >> ----------------------------------------------------------------=
---------
> > > > >> This SF.net email is sponsored by: Microsoft
> > > > >> Defy all challenges. Microsoft(R) Visual Studio 2005.
> > > > >> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> > > > >> _______________________________________________
> > > > >> Python-markdown-discuss mailing list
> > > > >> Pyt...@li...
> > > > > >> https://lists.sourceforge.net/lists/listinfo/python-markdown-discuss
> > > > >>
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > ----
> > > Waylan Limberg
> > > wa...@gm...
> > >
> > >
> >
> >
> > --
> > Yuri Takhteyev
> > Ph.D. Candidate, UC Berkeley School of Information
> > http://takhteyev.org/, http://www.freewisdom.org/
> >
>
>
> --
> ----
> Waylan Limberg
> wa...@gm...
>

--
----
Waylan Limberg
wa...@gm...