From: tchomby <tc...@go...> - 2009-02-26 18:34:48
Can anyone tell me how to avoid this error from python-markdown?

    MARKDOWN-CRITICAL: "UnicodeDecodeError: Markdown only accepts unicode or ascii input."

I'm reading in text files and passing some of the contents to python-markdown. The file contents are read into a list of strings like this:

    f = open(path, "r")
    lines = f.readlines()

and this list of strings is later joined into one long string and passed to python-markdown like this:

    from markdown import Markdown
    md = Markdown()

    def markdown(text):
        return md.convert(text)

All this unicode stuff in Python is really confusing.
From: Yuri T. <qar...@gm...> - 2009-02-26 18:40:48
Assuming your file is encoded as UTF-8, you should open it with:

    f = codecs.open("test.txt", mode="r", encoding="utf8")

> All this unicode stuff in python is really confusing.

Yes, it is.

- yuri

--
http://spu.tnik.org/
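[Editor's note: a minimal self-contained sketch of Yuri's suggestion, written in Python 3 syntax. The file path and its contents are made up purely for illustration; any UTF-8 text file would do.]

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file to stand in for the poster's input
# (hypothetical path and contents, for illustration only).
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write(u"caf\u00e9 costs 2\u00a2".encode("utf-8"))

# codecs.open decodes while reading, so read() returns unicode text,
# which is what Markdown().convert() expects.
with codecs.open(path, mode="r", encoding="utf8") as f:
    text = f.read()

print(repr(text))
```

The point is that `text` comes back as decoded unicode, not raw bytes, so it can be handed straight to the converter.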
From: tchomby <tc...@go...> - 2009-02-26 19:17:06
Thanks.

I don't know what encoding the files are in. They're just files that I created myself with a text editor, but often text has been copy-pasted into them from various sources, e.g. websites, and that's where the decoding problems occur. Presumably some non-utf8 characters get pasted in.

I used the codecs.open trick when reading files and again when writing the HTML from python-markdown: wherever I was using open I replaced it with codecs.open. This works for most of my files, but for some I get:

    UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 2551: unexpected code byte

The error happens when I call template.substitute in this function:

    def render_template(template_filename, variables=None):
        if variables is None:
            variables = {}
        template_path = os.path.join('templates', template_filename)
        template_text = codecs.open(template_path, mode='r', encoding='utf8').read()
        template_obj = Template(template_text)
        return template_obj.substitute(variables)

So the error is no longer coming from python-markdown but from the standard library. There seems to be some conflict between using codecs.open to get a string and using Template.

Fortunately this happened in few enough files that I was able to find and remove the offending characters manually. Still, it would be good to be able to read and write text from files in a robust way.
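[Editor's note: a sketch of a more forgiving decoder than a plain utf-8 read, assuming (as the thread later confirms) that the stray bytes are latin-1. The function name `decode_text` and the sample bytes are hypothetical, not part of python-markdown; Python 3 syntax.]

```python
def decode_text(data, encodings=("utf-8", "latin-1")):
    # Try each candidate encoding in order. latin-1 maps every possible
    # byte to a character, so with it last this loop never falls through
    # in practice; the final "replace" decode is a belt-and-braces fallback.
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", "replace")

# b"\xa2" is the byte from the traceback: invalid utf-8, valid latin-1.
sample = b"price: \xa2"
text = decode_text(sample)
print(text)
```

The trade-off: if a file really is utf-8 except for a few pasted latin-1 bytes, this decodes the *whole* file as latin-1, which mangles any genuine multi-byte utf-8 sequences. It never raises, but it is a guess, not a fix.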
From: Ron G. <ro...@fl...> - 2009-02-26 19:32:14
On Feb 26, 2009, at 11:17 AM, tchomby wrote:

> Thanks.
>
> I don't know what encoding the files are in. They're just files that I
> created myself with a text editor, but often text has been copy-pasted
> into them from various sources, e.g. websites, and that's where the
> decoding problems occur. Presumably some non-utf8 characters get
> pasted in.
>
> I used the codecs.open trick when reading files and again when writing
> the HTML from python-markdown, wherever I was using open I replaced it
> with codecs.open. This works for most of my files but for some I get:
>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position
> 2551: unexpected code byte

This is because you have a string that is encoded using some encoding other than utf-8, most likely latin-1. You might find this example enlightening:

    >>> unicode('\xa2', 'utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 0: unexpected code byte
    >>> unicode('\xa2', 'latin-1')
    u'\xa2'
    >>> print _
    ¢
    >>> unicode('\xa2', 'latin-1').encode('utf-8')
    '\xc2\xa2'
    >>> unicode('\xc2\xa2', 'utf-8')
    u'\xa2'

rg
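[Editor's note: Ron's interpreter session uses Python 2's `unicode()` builtin. The same demonstration in Python 3 terms, where `bytes.decode` plays that role; a sketch, not from the original thread.]

```python
raw = b"\xa2"  # the byte from the poster's traceback

# As utf-8, 0xa2 is an invalid start byte, so decoding raises.
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# As latin-1, it decodes to U+00A2, the cent sign.
cent = raw.decode("latin-1")

# Re-encoding the cent sign as utf-8 yields the two-byte sequence
# that *would* have decoded cleanly in the first place.
utf8_bytes = cent.encode("utf-8")
print(decoded_ok, cent, utf8_bytes)
```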
From: Marco P. <mar...@gm...> - 2009-02-27 10:08:56
What about doing:

    fh = open(template_path, 'r')
    value = fh.read()
    fh.close()
    value = codecs.encode(value, 'utf-8', 'replace')

?

Cheers,
Marco

On Thu, Feb 26, 2009 at 8:17 PM, tchomby <tc...@go...> wrote:

> Thanks.
>
> I don't know what encoding the files are in. They're just files that I
> created myself with a text editor, but often text has been copy-pasted
> into them from various sources, e.g. websites, and that's where the
> decoding problems occur. Presumably some non-utf8 characters get
> pasted in.
>
> I used the codecs.open trick when reading files and again when writing
> the HTML from python-markdown, wherever I was using open I replaced it
> with codecs.open. This works for most of my files but for some I get:
>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position
> 2551: unexpected code byte
>
> The error happens when I call template.substitute in this function:
>
>     def render_template(template_filename, variables=None):
>         if variables is None:
>             variables = {}
>         template_path = os.path.join('templates', template_filename)
>         template_text = codecs.open(template_path, mode='r', encoding='utf8').read()
>         template_obj = Template(template_text)
>         return template_obj.substitute(variables)
>
> So the error is no longer coming from python-markdown but from the
> standard library. Seems to be some conflict between using codecs.open
> to get a string and using Template.
>
> Fortunately this happened in few enough files that I was able to find
> and remove the offending characters manually. Still, it would be good
> to be able to read and write text from files in a robust way.

--
Marco Pantaleoni
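[Editor's note: a caution on the snippet above. In Python 2, encoding a byte string implicitly decodes it as ascii first, so `codecs.encode(value, 'utf-8', 'replace')` can itself raise on non-ascii input; the "replace" idea is usually applied on the *decode* side. A sketch in Python 3 terms, with made-up sample bytes:]

```python
# Decode with errors="replace": any byte sequence that is not valid
# utf-8 becomes U+FFFD (the replacement character) instead of raising.
raw = b"mostly utf-8 \xc2\xa2 plus one stray latin-1 byte \xa2 here"
text = raw.decode("utf-8", errors="replace")
print(text)
```

This is lossy (the stray byte's identity is gone), but it guarantees you end up with valid unicode text to feed to python-markdown.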
From: Waylan L. <wa...@gm...> - 2009-02-26 19:57:28
On Thu, Feb 26, 2009 at 2:17 PM, tchomby <tc...@go...> wrote:
[snip]
>
> Fortunately this happened in few enough files that I was able to find
> and remove the offending characters manually. Still, it would be good
> to be able to read and write text from files in a robust way.

Keep in mind that you just explained the various sources of your files earlier in your message. That particular situation is unique to you. Someone else will have a different situation. It is impossible for the markdown library to anticipate every possible situation. Therefore, the most "robust way" is to leave the encoding/decoding to the end user, the only person in a position to properly address that specific situation.

In other words, you are in a better position to know and/or determine the encoding of your files than we or any code we write could ever guess. Python-Markdown's policy is therefore to work only with Unicode; any encoding and/or decoding is handled by the end user. That is the most "robust way" to handle it.

As an aside, I should note that there is an exception: we do handle some encoding/decoding for the command-line stuff. However, even then it is rather dumb and requires the user to specify the encoding for anything except utf-8 (which it expects by default).

--
----
\X/ /-\ `/ |_ /-\ |\| Waylan Limberg
From: tchomby <tc...@go...> - 2009-02-27 10:08:48
On Thu, Feb 26, 2009 at 02:57:16PM -0500, Waylan Limberg wrote:
>
> Keep in mind that you just explained the various sources of your files
> earlier in your message. That particular situation is unique to you.
> Someone else will have a different situation. It is impossible for the
> markdown library to be able to anticipate every possible situation.
> Therefore, the most "robust way" is to leave the encoding/decoding to
> the end user - the only person in a position to properly address that
> specific situation.

Yes, I think python-markdown made the right decision. Actually I wasn't complaining about python-markdown but about Python itself, although maybe there's nothing Python can do about it either. It would be nice if I could just use the standard open function and have it figure out what the encoding of the file was so I didn't have to.

I'm not sure what you can do if you have files like mine that apparently contain text in different encodings (I think Ron is exactly right that my file contained utf8 and some latin-1 characters). Can you decode that at all? You'd have to write code to decode it one character at a time (if that's possible) using utf8, and on each character catch the UnicodeDecodeError and try to decode the character with latin-1 instead. You'd have to have a list of all possible encodings in order of preference and try each encoding on each character in turn until you've decoded the whole file.
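[Editor's note: the per-offending-byte fallback imagined above can actually be expressed with the codecs error-handler machinery rather than a hand-written loop, *if* you commit to one fallback encoding such as latin-1, as Ron suggested. A sketch in Python 3 syntax; the handler name `latin1_fallback` is made up.]

```python
import codecs

def latin1_fallback(err):
    # Re-decode just the byte range the utf-8 decoder choked on
    # as latin-1, and tell the decoder where to resume.
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end

codecs.register_error("latin1_fallback", latin1_fallback)

# Mixed input: a valid utf-8 cent sign (\xc2\xa2) and a bare
# latin-1 cent sign (\xa2). Both come out as U+00A2.
raw = b"utf-8 cent \xc2\xa2 and latin-1 cent \xa2"
text = raw.decode("utf-8", errors="latin1_fallback")
print(text)
```

As Marco notes in his reply, though, this is still a guess: it only works because latin-1 happens to map every byte, and it cannot tell a pasted latin-1 byte apart from genuine corruption.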
From: Marco P. <mar...@gm...> - 2009-02-27 10:14:53
> Yes, I think python-markdown made the right decision. Actually I wasn't
> complaining about python-markdown but python, although maybe there's nothing
> python can do about it either, but I think it would be nice if I could just use
> the standard open function and it would figure out what the encoding of the
> file was so I didn't have to.

It's not entirely possible with perfect determinism. And even if it were, it wouldn't solve the problem when a file has mixed encodings.

> I'm not sure what you can do if you have files like mine that apparently
> contain text in different encodings (I think Ron is exactly right that my file
> contained utf8 and some latin-1 characters). Can you decode that at all?
[snip]

It's not possible, since in general a "character" doesn't correspond to the file's atomic unit, the byte. In utf-8, characters don't have a fixed length in terms of bytes, and to determine the encoding of a piece of text you need to look at many bytes. The only thing you can do is fix the encoding for the file (or try to guess it in some way), and ignore or replace substrings which don't map to that encoding.
Ciao,
Marco

--
Marco Pantaleoni