From: Yuri T. <qar...@gm...> - 2007-10-30 20:45:23
> > However, I see markdown.markdown() as a shortcut for the common case.
> > So maybe we could add some basic encoding/decoding for common cases.
>
> Seems reasonable.
Well, except that what I learned the hard way while adding unicode
support to MD is that there seems to be only one "right" way to work
with unicode in Python: decode when you read the file and encode when
you write. Once you have encoded strings flying around, it's a recipe
for problems. So, I don't want to endorse passing encoded strings as
"the common case." In most cases, reading the content of a file
without decoding is a bad idea and I don't want to encourage people to
do that.
Instead, I want to stick with a simple rule: if it's a string, then
it's unicode.
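
The "decode when you read, encode when you write" rule can be sketched as follows. This is a minimal illustration, assuming Python 3 (where str is already unicode text and bytes holds encoded data; the thread itself predates it). The helper names read_text and write_text are hypothetical and not part of the markdown module.

```python
import io


def read_text(path, encoding="utf-8"):
    # Decode exactly once, at the point where bytes enter the program.
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


def write_text(path, text, encoding="utf-8"):
    # Encode exactly once, at the point where text leaves the program.
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)
```

Everything between those two calls then handles only unicode text, so there is never a question of which encoding a given string is in.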
So, I think we should offer the following functions:
1. unicode text -> unicode html
2. file path for input, encoding -> unicode html
3. file path for input, encoding, file path for output -> (writes to file)
I see markdown.markdown() as doing #1. markdown.markdownFromFile()
now does #3. We _could_ change it to also do #2. We could make it
always return the unicode string, and also write encoded output to
"output" if that argument is set. (We should probably accept either a
file name or a stream as that parameter.)
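
One way the three cases could fit into two functions is sketched below. This is only an illustration, assuming Python 3: convert() is a hypothetical stand-in for the real conversion, markdown_from_file is a hypothetical name rather than the module's actual markdownFromFile, and a stream passed as output is assumed to accept bytes.

```python
import io


def convert(text):
    # Hypothetical stand-in for the real Markdown conversion.
    return "<p>%s</p>" % text


def markdown(text):
    # Case 1: unicode text in, unicode HTML out. No encoding parameter.
    if not isinstance(text, str):
        raise TypeError("markdown() expects decoded text, not bytes")
    return convert(text)


def markdown_from_file(input_path, encoding="utf-8", output=None):
    # Case 2: decode the input file and return unicode HTML.
    with io.open(input_path, "r", encoding=encoding) as f:
        text = f.read()
    html = markdown(text)

    # Case 3: additionally write encoded output if `output` is set.
    if output is not None:
        if hasattr(output, "write"):
            # Assumed to be a binary stream that accepts bytes.
            output.write(html.encode(encoding))
        else:
            # Otherwise treat it as a file name.
            with io.open(output, "wb") as out:
                out.write(html.encode(encoding))

    # The unicode result is always returned, whether or not it was written.
    return html
```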
Now, if people feel that there is a common (if ungodly) case when the
user needs to deal with incoming encoded strings, I suggest we add a
new method for that: markdownFromEncodedString() which will do
decoding and return unicode. Though, in that case it should really be
enough to write
    markdown.markdown(unicode(my_ungodly_string, "utf8"))
So, I am not sure if such a method is really needed.
> I think *two* encodings is overkill for both markdown() and
> markdownFromFile(). In the common case they will likely be the same and
> it is so easy to do the conversion yourself if you want them to be
> different.
Again, markdown() will no longer have encoding. As to the second, I
tend to agree, especially if markdownFromFile could return the unicode
instead of writing it to a file.
> I hope by 'fails gracefully' you mean 'raises UnicodeDecodeError'. What
> else could you do? Start guessing encodings?
I think we should raise an error. The only question is whether we
should provide a better error message.
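
A better error message could, for instance, keep the exception type but say which input and encoding were involved. The sketch below assumes Python 3; decode_input and its source parameter are hypothetical names used only for illustration.

```python
def decode_input(data, encoding="utf-8", source="<input>"):
    # Decode bytes to text, failing loudly rather than guessing encodings.
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as e:
        # Re-raise the same exception type with a more helpful reason.
        reason = "%s (while reading %s as %s; is the encoding correct?)" % (
            e.reason, source, encoding)
        raise UnicodeDecodeError(e.encoding, e.object, e.start, e.end, reason)
```

Callers that already catch UnicodeDecodeError keep working, and the message now points at the likely cause.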
> if encoding is not None:
>     text = text.decode(encoding)
> converted = md.convert(text)
> if encoding is not None:
>     converted = converted.encode(encoding)
> return converted
Again, I would really rather stick with a simple rule of "files are
encoded, strings are unicode" and banish encoded strings completely.
Otherwise keeping track of what is and what is not unicode becomes a
huge headache. It also becomes hard to explain to other people what
exactly we are doing. The only place where .encode() appears now is
in
    sys.stdout.write(new_text.encode(encoding))
Note that in this case I do the conversion without saving the encoded
string on purpose. If sys.stdout.write wants an encoded string,
that's fine - I'll give it to it, but I don't want to have any encoded
strings sticking around. If I had to keep them for any reason, I
would make sure to prefix their names with "encoded_".
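
The habit of encoding only at the moment of writing might look like this. It is a sketch assuming Python 3, where encoded data goes to a binary stream such as sys.stdout.buffer; write_encoded is a hypothetical helper name.

```python
import io
import sys


def write_encoded(stream, text, encoding="utf-8"):
    # Encode inline, at the boundary, so that no encoded string is ever
    # bound to a name that could later be mistaken for unicode text.
    stream.write(text.encode(encoding))


if __name__ == "__main__":
    new_text = "<p>caf\u00e9</p>\n"
    write_encoded(sys.stdout.buffer, new_text, "utf-8")
```

If an encoded value did have to be kept, a name such as encoded_html would make its status obvious wherever it is used.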
- yuri
--
Yuri Takhteyev
http://www.freewisdom.org/