Re: [Python-markdown-discuss] Markdown encoding

> > However, I see markdown.markdown() as a shortcut for the common case.
> > So maybe we could add some basic encoding/decoding for common cases.
>
> Seems reasonable.

Well, except that what I learned the hard way while adding unicode
support to MD is that there seems to be only one "right" way to work
with unicode in Python: decode when you read the file and encode when
you write.  Once you have encoded strings flying around, it's a recipe
for problems.  So, I don't want to endorse passing encoded strings as
"the common case."  In most cases, reading the content of a file
without decoding is a bad idea and I don't want to encourage people to
do that.
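
A minimal sketch of the "decode on read, encode on write" rule, assuming
Python 3 (where `str` is the type this thread calls unicode).  The helper
names `read_text` and `write_text` are made up for illustration and are
not part of Markdown:

```python
import io

def read_text(stream, encoding="utf-8"):
    """Decode bytes into text once, at the input boundary."""
    return stream.read().decode(encoding)

def write_text(stream, text, encoding="utf-8"):
    """Encode text into bytes once, at the output boundary."""
    stream.write(text.encode(encoding))

# In-memory byte streams stand in for real files here.
source = io.BytesIO("caf\u00e9 *menu*".encode("utf-8"))
text = read_text(source)      # text (unicode) from here on
processed = text.upper()      # all processing happens on text only
sink = io.BytesIO()
write_text(sink, processed)   # bytes exist only at the moment of writing
```

Everything between the two boundary calls sees text only, so no code in
the middle has to ask whether a given string is encoded.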

Instead, I want to stick with a simple rule: if it's a string, then
it's unicode.

So, I think we should offer the following functions:

1. unicode text -> unicode html
2. file path for input, encoding -> unicode html
3. file path for input, encoding, file path for output -> (writes to file)

I see markdown.markdown() as doing #1.  markdown.markdownFromFile()
now does #3.  We _could_ change it to also do #2.  We could make it
always return the unicode string, and also write encoded output to
"output" if that argument is set.  (We should probably accept either a
file name or a stream as that parameter.)
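
One possible shape for a function that covers both #2 and #3, as a rough
sketch only.  It assumes Python 3, a binary-mode stream when a stream is
passed, and the same encoding for input and output.  `convert` is a
hypothetical stand-in for the real conversion, and `markdown_from_file`
is not the actual markdownFromFile() signature:

```python
import io

def convert(text):
    # Hypothetical stand-in for markdown.markdown(): text in, text out.
    return "<p>%s</p>" % text.strip()

def markdown_from_file(input_path, encoding="utf-8", output=None):
    """Read and decode input_path, convert it, and return the HTML as text.

    If output is given (a file name or a binary stream), the encoded
    HTML is also written there.
    """
    with io.open(input_path, "r", encoding=encoding) as f:
        text = f.read()                          # decoded on read
    html = convert(text)
    if output is not None:
        if hasattr(output, "write"):             # looks like a stream
            output.write(html.encode(encoding))  # encoded on write
        else:                                    # treat as a file name
            with io.open(output, "wb") as out:
                out.write(html.encode(encoding))
    return html
```

The return value is always text, so a caller who only wants #2 simply
leaves `output` unset.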

Now, if people feel that there is a common (if ungodly) case when the
user needs to deal with incoming encoded strings, I suggest we add a
new method for that: markdownFromEncodedString() which will do
decoding and return unicode.  Though, in that case it should really be
enough to write

    markdown.markdown(unicode(my_ungodly_string, "utf8"))

So, I am not sure if such a method is really needed.

> I think *two* encodings is overkill for both markdown() and
> markdownFromFile(). In the common case they will likely be the same and
> it is so easy to do the conversion yourself if you want them to be
> different.

Again, markdown() will no longer have encoding.  As to the second, I
tend to agree, especially if markdownFromFile could return the unicode
instead of writing it to a file.

> I hope by 'fails gracefully' you mean 'raises UnicodeDecodeError'. What
> else could you do? Start guessing encodings?

I think we should raise an error.  The only question is: should we
provide a better error message?
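
For what a "better error message" might look like, here is a sketch,
assuming Python 3.  The function name `decode_input` and the wording of
the hint are invented for illustration.  It still raises
UnicodeDecodeError, just with a hint appended to the reason:

```python
def decode_input(data, encoding="utf-8"):
    """Decode bytes, re-raising decode failures with a more helpful reason."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Same exception type, same position info, extended reason text.
        hint = "%s (input is not valid %s; pass the correct encoding)" % (
            exc.reason, encoding)
        raise UnicodeDecodeError(
            exc.encoding, exc.object, exc.start, exc.end, hint) from exc
```

Callers that already catch UnicodeDecodeError keep working, and nobody
starts guessing encodings.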

>      if encoding is not None:
>          text = text.decode(encoding)
>      converted = md.convert(text)
>      if encoding is not None:
>          converted = converted.encode(encoding)
>      return converted

Again, I would really rather stick with a simple rule of "files are
encoded, strings are unicode" and banish encoded strings completely.
Otherwise keeping track of what is and what is not unicode becomes a
huge headache.  It also becomes hard to explain to other people what
exactly we are doing.  The only place where .encode() appears now is
in

    sys.stdout.write(new_text.encode(encoding))

Note that in this case I do the conversion without saving the encoded
string on purpose.  If sys.stdout.write wants an encoded string,
that's fine - I'll give it to it, but I don't want to have any encoded
strings sticking around.  If I had to keep them for any reason, I
would make sure to prefix them with "encoded_".

- yuri

-- 
Yuri Takhteyev
http://www.freewisdom.org/