From: Yuri T. <qar...@gm...> - 2007-10-30 20:45:23
> > However, I see markdown.markdown() as a shortcut for the common case.
> > So maybe we could add some basic encoding/decoding for common cases.
>
> Seems reasonable.

Well, except that what I learned the hard way while adding unicode
support to MD is that there seems to be only one "right" way to work
with unicode in Python: decode when you read the file and encode when
you write. Once you have encoded strings flying around, it's a recipe
for problems.

So, I don't want to endorse passing encoded strings as "the common
case." In most cases, reading the content of a file without decoding
is a bad idea and I don't want to encourage people to do that.
Instead, I want to stick with a simple rule: if it's a string, then
it's unicode.

So, I think we should offer the following functions:

1. unicode text -> unicode html
2. file path for input, encoding -> unicode html
3. file path for input, encoding, file path for output -> (writes to file)

I see markdown.markdown() as doing #1. markdown.markdownFromFile() now
does #3. We _could_ change it to also do #2. We could make it always
return the unicode string, and also write encoded output to "output"
if that argument is set. (We should probably accept either a file
name or a stream as that parameter.)

Now, if people feel that there is a common (if ungodly) case when the
user needs to deal with incoming encoded strings, I suggest we add a
new method for that: markdownFromEncodedString(), which will do the
decoding and return unicode. Though, in that case it should really be
enough to write

    markdown.markdown(unicode(my_ungodly_string, "utf8"))

So, I am not sure if such a method is really needed.

> I think *two* encodings is overkill for both markdown() and
> markdownFromFile(). In the common case they will likely be the same and
> it is so easy to do the conversion yourself if you want them to be
> different.

Again, markdown() will no longer take an encoding.
As to the second, I tend to agree, especially if markdownFromFile could
return the unicode instead of writing it to a file.

> I hope by 'fails gracefully' you mean 'raises UnicodeDecodeError'. What
> else could you do? Start guessing encodings?

I think we should raise an error. The only question is: should we
return a better error message?

> if encoding is not None:
>     text = text.decode(encoding)
> converted = md.convert(text)
> if encoding is not None:
>     converted = converted.encode(encoding)
> return converted

Again, I would really rather stick with a simple rule of "files are
encoded, strings are unicode" and banish encoded strings completely.
Otherwise keeping track of what is and what is not unicode becomes a
huge headache. It also becomes hard to explain to other people what
exactly we are doing.

The only place where .encode() appears now is in

    sys.stdout.write(new_text.encode(encoding))

Note that in this case I do the conversion without saving the encoded
string on purpose. If sys.stdout.write wants an encoded string, that's
fine - I'll give it to it, but I don't want to have any encoded
strings sticking around. If I had to keep them for any reason, I
would make sure to prefix them with "encoded_".

- yuri

--
Yuri Takhteyev
http://www.freewisdom.org/
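
For illustration, the rule "files are encoded, strings are unicode" and the three function shapes listed in the message could be sketched roughly as follows. This is only a sketch under assumptions, not the library's actual code: it is written for Python 3, where `str` is the unicode type and `bytes` is the encoded one, and the names `to_html` and `markdown_from_file` are hypothetical stand-ins (the placeholder converter just wraps text in a paragraph tag).

```python
import io


def to_html(text):
    """Hypothetical stand-in for shape #1: unicode text -> unicode html."""
    if not isinstance(text, str):
        # Encoded input is rejected rather than guessed at.
        raise TypeError("expected unicode text, got %s" % type(text).__name__)
    return "<p>%s</p>" % text.strip()


def markdown_from_file(input_path, encoding="utf-8", output=None):
    """Hypothetical sketch of shapes #2 and #3.

    Decodes when reading, always returns unicode, and encodes only at the
    moment of writing. `output` may be a file path or a binary stream.
    A file that does not match `encoding` raises UnicodeDecodeError.
    """
    with io.open(input_path, "r", encoding=encoding) as source:
        text = source.read()  # decoded here, at the file boundary

    html = to_html(text)

    if output is not None:
        if hasattr(output, "write"):
            # Encode inline so no encoded string is kept in a variable.
            output.write(html.encode(encoding))
        else:
            with io.open(output, "wb") as target:
                target.write(html.encode(encoding))
    return html
```

With this shape, a caller holding an encoded string decodes it once at the edge, for example `to_html(raw_bytes.decode("utf8"))`, which is the Python 3 counterpart of the `unicode(my_ungodly_string, "utf8")` call in the message.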