From: Waylan L. <wa...@gm...> - 2007-10-30 20:01:43
On 10/30/07, Yuri Takhteyev <qar...@gm...> wrote:
> I haven't had a chance to look at the specific problem, but in
> general, here is how it is _supposed_ to work.

Ahh, well that clears a few things up for me. Thanks for the
explanation.

Now for my proposal: let's leave md.convert the way it is. The user has
to convert to unicode first and then gets unicode back, which he can do
with as he pleases. However, I see markdown.markdown() as a shortcut
for the common case, so maybe we could add some basic encoding/decoding
for the common cases. The user must pass in the encoding (and perhaps
an optional output encoding) and, assuming the encoding actually
matches (the user's responsibility - otherwise it fails gracefully),
things work fine. If the user has a situation that doesn't fit the
common case, then we would expect the encoding/decoding to be done
manually with md.convert. As long as the differences are clearly
documented, that should work fine.
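Roughly, I'm picturing something like this living in markdown.py next
to the Markdown class (completely untested; I'm guessing at the
constructor arguments, and the "replace" / "xmlcharrefreplace" error
handlers are only one interpretation of "fails gracefully"):

    def markdown(text, extensions=[], encoding=None, output_encoding=None):
        # Decode on the user's behalf in the common case; unicode input
        # passes straight through.
        if encoding and not isinstance(text, unicode):
            text = text.decode(encoding, "replace")
        md = Markdown(extensions=extensions)
        html = md.convert(text)    # unicode in, unicode out
        # Only encode the result if the caller asked for an encoding.
        if output_encoding or encoding:
            return html.encode(output_encoding or encoding,
                               "xmlcharrefreplace")
        return html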
Of course, the more I think about this, the more it feels like extra
work I don't want to do. Any input? Obviously, markdownFromFile is a
different animal and should work as you proposed.

> The Markdown class is unicode-in-unicode-out. It can take a simple
> string as input, but one should never pass an encoded string to it, be
> it utf8 or whatever.
>
> It's the caller's responsibility to decode their text into unicode
> from utf8 or whatever it is that they have it encoded as, and they can
> then encode the output into whatever encoding they want. Then I got a
> patch for removing BOM and integrated it without thinking, which
> required passing "encoding" to it. Looking at it now I realize that
> that was quite stupid. Since removeBOM() should never get encoded
> strings, it should _assume_ that the input is unicode, so presumably
> it should suffice to have:
>
> def removeBOM(text, encoding):
>     return text.lstrip(u'\ufeff')
>
> In fact, we should just get rid of this function and put
> text.lstrip(u'\ufeff') in the place where it is called. (BTW, should
> we put it back into the output?)
>
> Again, if you are using markdown as a module, you should decode your
> content yourself, run it through md.convert(), and then use the
> resulting unicode as you wish:
>
> input_file = codecs.open("test.txt", mode="r", encoding="utf16")
> text = input_file.read()
> html_unicode = Markdown.markdown(text, extensions)
> output_file = codecs.open("test.html", "w", encoding="utf8")
> output_file.write(html_unicode)
>
> Perhaps we should raise an error if we get an encoded string? I.e.,
> check that either the string is of type unicode _or_ it has no special
> characters.
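A check like this at the top of convert() would probably cover that
(untested, and the exception type is just a placeholder for whatever we
decide to raise):

    if not isinstance(text, unicode):
        try:
            text = text.decode("ascii")   # a plain ascii str is harmless
        except UnicodeDecodeError:
            raise ValueError("Markdown only accepts unicode or ascii "
                             "input; decode your text before passing "
                             "it in.")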
> Markdown.markdown does have an obvious bug in that it accepts an
> encoding argument and doesn't pass it to Markdown.__init__. I suppose
> we should just get rid of this parameter altogether.
>
> There is also another utility function - markdownFromFile. This one
> does the encoding and decoding for you. For simplicity, it uses only
> one encoding argument, which is used for both decoding the input and
> encoding the output. I suppose that this might be confusing. Should we
> add an extra argument "output_encoding", making it optional? I.e.:
>
> def markdownFromFile(input = None,
>                      output = None,
>                      extensions = [],
>                      encoding = None,
>                      output_encoding = None,
>                      message_threshold = CRITICAL,
>                      safe = False):
>     if not output_encoding:
>         output_encoding = encoding
>
> I must admit here that I just went to look at the documentation on the
> wiki and am realizing that that's what is responsible for much of the
> confusion. We have a new wiki at http://markdown.freewisdom.org/ and
> I am slowly moving content there. In particular, I copied over the
> content of http://markdown.freewisdom.org/Using_as_a_Module and
> updated it with the example above.
>
> We should perhaps create a page called "BOMs" to archive there the
> design decisions related to BOM removal, etc.
>
> - yuri
>
> On 10/30/07, Waylan Limberg <wa...@gm...> wrote:
> > Kent, thanks for the info. We'll look at this further.
> >
> > On 10/30/07, Kent Johnson <ke...@td...> wrote:
> > > Waylan Limberg wrote:
> > > > Kent,
> > > >
> > > > Could you verify that revision 46 fixes the problem for you?
> > >
> > > It will fix my problem but it won't work correctly with all unicode
> > > text. For example, if the original text contains a BOM and it is
> > > converted with a utf-16be or utf-16le encoding, then the unicode
> > > string still contains a BOM which will not be removed by this patch.
> >
> > My testing shows this works with utf-16. Could you provide a simple
> > test case?
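Rereading Kent's description, I suspect the case he means looks
something like this (untested sketch):

    import codecs

    # A utf-16-be byte string with an explicit BOM in front of it.
    data = codecs.BOM_UTF16_BE + u"*hello*".encode("utf-16-be")

    # Decoding with "utf-16-be" (unlike plain "utf-16") does not strip
    # the BOM, so it survives as a character in the unicode string.
    text = data.decode("utf-16-be")
    assert text.startswith(u"\ufeff")

Decoding with plain "utf-16" consumes the BOM, which would explain why
my utf-16 test passed.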
> > > Also it still seems a bit strange that the encoding argument to
> > > markdown() is not used at all and the encoding argument to
> > > Markdown.__init__() is the encoding that the data was in *before*
> > > it was converted to unicode.
> > >
> > > I would write removeBOM() as
> > >
> > > def removeBOM(text, encoding):
> > >     if isinstance(text, unicode):
> > >         boms = [u'\ufeff']
> > >     else:
> > >         boms = BOMS[encoding]
> > >     for bom in boms:
> > >         if text.startswith(bom):
> > >             return text.lstrip(bom)
> > >     return text
> > >
> > > and I would change the rest of the code to use encoding=None when
> > > the text is actually unicode.
> > >
> > > Kent
> > >
> > > > We can thank the very smart Malcolm Tredinnick for providing a
> > > > patch. See bug report [1817528] for more.
> > > >
> > > > On 9/12/07, Kent Johnson <ke...@td...> wrote:
> > > >> Hi,
> > > >>
> > > >> Markdown 1.6b doesn't work with UTF-8-encoded text. It fails
> > > >> with a UnicodeDecodeError in removeBOM():
> > > >>
> > > >> In [3]: import markdown
> > > >> In [4]: text = u'\xe2'.encode('utf-8')
> > > >> In [6]: print text
> > > >> â
> > > >> In [7]: print markdown.markdown(text)
> > > >> ------------------------------------------------------------
> > > >> Traceback (most recent call last):
> > > >>   File "<ipython console>", line 1, in <module>
> > > >>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py",
> > > >> line 1722, in markdown
> > > >>     return md.convert(text)
> > > >>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py",
> > > >> line 1614, in convert
> > > >>     self.source = removeBOM(self.source, self.encoding)
> > > >>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/markdown.py",
> > > >> line 74, in removeBOM
> > > >>     if text.startswith(bom):
> > > >> <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't
> > > >> decode byte 0xc3 in position 0: ordinal not in range(128)
> > > >>
> > > >> The problem is that the BOM being tested is unicode, so to
> > > >> execute text.startswith(bom) Python tries to convert text to
> > > >> Unicode using the default encoding (ascii). This fails because
> > > >> the text is not ascii.
> > > >>
> > > >> I'm trying to understand what the encoding parameter is for; it
> > > >> doesn't seem to do much. There also seems to be some confusion
> > > >> with the use of encoding in markdownFromFile() vs markdown();
> > > >> the file is converted to Unicode on input, so I don't understand
> > > >> why the same encoding parameter is passed to markdown()?
> > > >>
> > > >> ISTM the encoding passed to markdown should match the encoding
> > > >> of the text passed to markdown, and the values in the BOMS table
> > > >> should be in the encoding of the key, not in unicode. Then the
> > > >> __unicode__() method should actually decode. Or is the intent
> > > >> that the text passed to markdown() should always be ascii or
> > > >> unicode?
> > > >>
> > > >> I can put together a patch if you like but I wanted to make sure
> > > >> that I am not missing some grand plan...
> > > >>
> > > >> Kent
> >
> > --
> > ----
> > Waylan Limberg
> > wa...@gm...
>
> --
> Yuri Takhteyev
> Ph.D. Candidate, UC Berkeley School of Information
> http://takhteyev.org/, http://www.freewisdom.org/

--
----
Waylan Limberg
wa...@gm...