From: SourceForge.net <no...@so...> - 2011-08-21 22:41:19
|
Bugs item #3395948, was opened at 2011-08-22 00:41
Message generated for change (Tracker Item Submitted) made by ubanus
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=422030&aid=3395948&group_id=38414

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Jakub Wilk (ubanus)
Assigned to: Nobody/Anonymous (nobody)
Summary: C locale + Python 3 -> UnicodeDecodeError

Initial Comment:
When using the C locale and Python 3.X, I cannot convert reST documents containing non-ASCII characters. It works fine when using Python 2.X:

$ printf '\303\263' > test.xml
$ rst2xml.py --version
rst2xml.py (Docutils 0.8 [release], Python 3.2.2rc1, on linux2)
$ LC_ALL=C python /usr/local/bin/rst2xml.py test.xml > /dev/null && echo OK
OK
$ LC_ALL=C python3 /usr/local/bin/rst2xml.py --traceback test.xml > /dev/null
Traceback (most recent call last):
  File "/usr/local/bin/rst2xml.py", line 23, in <module>
    publish_cmdline(writer_name='xml', description=description)
  File "/usr/local/lib/python3.2/dist-packages/docutils/core.py", line 339, in publish_cmdline
    config_section=config_section, enable_exit_status=enable_exit_status)
  File "/usr/local/lib/python3.2/dist-packages/docutils/core.py", line 211, in publish
    self.settings)
  File "/usr/local/lib/python3.2/dist-packages/docutils/readers/__init__.py", line 68, in read
    self.input = self.source.read()
  File "/usr/local/lib/python3.2/dist-packages/docutils/io.py", line 238, in read
    data = self.source.read()
  File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

----------------------------------------------------------------------
|
From: SourceForge.net <no...@so...> - 2011-08-22 13:21:20
|
Bugs item #3395948, was opened at 2011-08-21 22:41
Message generated for change (Comment added) made by milde
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=422030&aid=3395948&group_id=38414

Summary: C locale + Python 3 -> UnicodeDecodeError

----------------------------------------------------------------------

>Comment By: Günter Milde (milde)
Date: 2011-08-22 13:21

Message:
Thanks for the bug report -- however, I am not sure the behaviour is a bug. It is the standard Python 3 response to non-ASCII characters when no encoding is specified.

With Python 2, Docutils does the input file decoding (including some guesswork); with Python 3, the standard file.read() method also decodes the result into a unicode string.

Using "binary" mode is no sensible option:

* rst files are text, not binary data
* we lose the universal newline support (NL vs. CR vs. CR/NL issue with different OS)

Specify the input encoding, e.g.

  rst2xml.py --input-encoding=utf8

We might consider catching the error and writing a more helpful message, but this should be discussed in the docutils-devel list.

----------------------------------------------------------------------
|
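The decoding behaviour described in the comment above can be reproduced without Docutils. The following is only a minimal sketch, assuming (as in the report) that the locale default encoding is ASCII; newer Python versions may pick a different default under the C locale. It decodes the same two bytes that `printf '\303\263'` wrote, first as ASCII and then as UTF-8, and shows that text mode with an explicit encoding still translates newlines. The variable names are illustrative and not taken from Docutils.

```python
import io

# The two bytes written by `printf '\303\263'` in the report: UTF-8 for "ó".
raw = b"\xc3\xb3"

# Assumption: the locale default is ASCII, as under LC_ALL=C with Python 3.2.
# An ASCII decode then fails on the first byte, as in the reported traceback.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as err:
    ascii_error = err

# Naming the encoding explicitly does not depend on the locale at all.
text = raw.decode("utf-8")

# Text mode with an explicit encoding still performs universal newline
# translation (newline=None), so that feature is not lost.
wrapped = io.TextIOWrapper(
    io.BytesIO(b"a\r\nb\rc\n"), encoding="utf-8", newline=None
)
lines = wrapped.read()
```

On the command line, the equivalent of the explicit decode is the `--input-encoding=utf8` option mentioned in the comment.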
From: SourceForge.net <no...@so...> - 2011-10-16 13:27:34
|
Bugs item #3395948, was opened at 2011-08-22 00:41
Message generated for change (Comment added) made by jakub-wilk
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=422030&aid=3395948&group_id=38414

Summary: C locale + Python 3 -> UnicodeDecodeError

----------------------------------------------------------------------

Comment By: Jakub Wilk (jakub-wilk)
Date: 2011-10-16 15:27

Message:
(FWIW, I upgraded to Docutils 0.8.1 in the meantime.)

I don't buy the "we can't guess encodings in Python 3" argument. In fact, Docutils is able to detect UTF-8 just fine when locale encoding is ISO-8859-n:

$ LC_ALL=en_US.ISO-8859-1 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
2dfeff49a2ce2aa24d6217e0160a8326 -
$ LC_ALL=pl_PL.ISO-8859-2 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
2dfeff49a2ce2aa24d6217e0160a8326 -
$ LC_ALL=en_US.UTF-8 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
2dfeff49a2ce2aa24d6217e0160a8326 -

Also, adding --input-encoding=utf8 doesn't really help (which might be another bug). rst2xml.py just dies with a very confusing error message:

$ LC_ALL=C python3 /usr/local/bin/rst2xml.py --input-encoding=utf8 test.xml
UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 263: ordinal not in range(128)

The specified output encoding (utf-8) cannot handle all of the output.
Try setting "--output-encoding-error-handler" to

* "xmlcharrefreplace" (for HTML & XML output);
  the output will contain "b'&#243;'" and should be usable.
* "backslashreplace" (for other output formats);
  look for "b'\\xf3'" in the output.
* "replace"; look for "?" in the output.

"--output-encoding-error-handler" is currently set to "xmlcharrefreplace".

Exiting due to error. Use "--traceback" to diagnose.
If the advice above doesn't eliminate the error, please report it to <doc...@li...>.
Include "--traceback" output, Docutils version (0.8.1), Python version (3.2.2rc1), your OS type & version, and the command line used.

----------------------------------------------------------------------
|
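The disagreement above is about whether input encodings can be guessed under Python 3. One common way to guess is to read the file as bytes and try a list of candidate encodings in order. The sketch below illustrates that general technique only; it is not Docutils's actual detection logic, and `decode_with_fallback` is a hypothetical helper name introduced here.

```python
import locale


def decode_with_fallback(data, candidates=None):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper for illustration; not part of Docutils.
    """
    if candidates is None:
        # Assumed order: UTF-8 first (strict, so false positives are rare),
        # then the locale's encoding, then latin-1, which accepts any byte.
        candidates = ["utf-8", locale.getpreferredencoding(False), "latin-1"]
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: move on to the next candidate.
            continue
    raise ValueError("no candidate encoding could decode the input")
```

With the test file from the report, `decode_with_fallback(b"\xc3\xb3")` succeeds with UTF-8 regardless of the locale, which matches the behaviour the submitter expected. Note that reading bytes and decoding by hand gives up the automatic newline translation of text mode, which is the trade-off raised earlier in the thread.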
From: SourceForge.net <no...@so...> - 2011-10-20 23:05:08
|
Bugs item #3395948, was opened at 2011-08-21 22:41
Message generated for change (Comment added) made by milde
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=422030&aid=3395948&group_id=38414

>Status: Closed
>Resolution: Fixed
Summary: C locale + Python 3 -> UnicodeDecodeError

----------------------------------------------------------------------

>Comment By: Günter Milde (milde)
Date: 2011-10-20 23:05

Message:
> Docutils is able to detect UTF-8 just fine when locale encoding is 8859-n:

I cannot reproduce this:

$ LC_ALL=en_US.ISO-8859-1 python3
Python 3.2.1rc1 (default, May 18 2011, 11:01:17)
[GCC 4.6.1 20110507 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('umlauts.txt')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

> Also, adding --input-encoding=utf8 doesn't really help (which might be
> another bug). rst2xml.py just dies with a very confusing error message:
>
> $ LC_ALL=C python3 /usr/local/bin/rst2xml.py --input-encoding=utf8
> test.xml
> UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in
> position 263: ordinal not in range(128)

Up to here, the error message is more than clear. The problem is that Docutils finds sys.stdout already open with the encoding and error handler set and hence ignores the settings reported in the remainder of the error message. This is indeed a bug. It can be worked around by

* specifying the expected input/output encoding in the LANG variable, or
* specifying --input-encoding and an output file (which is then opened with the given encoding).

Nonetheless, both problems should be solved with the latest SVN version.

----------------------------------------------------------------------
|
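The stdout problem described in the last comment arises because a text stream that is already open keeps its own encoding and error handler. A general way around this is to encode the text explicitly and write the resulting bytes to the underlying binary stream. The sketch below shows that idea under stated assumptions; it is not the fix that went into Docutils SVN, and `write_encoded` is a hypothetical helper name.

```python
import io


def write_encoded(text, stream, encoding="utf-8", errors="xmlcharrefreplace"):
    """Encode text explicitly and write bytes, bypassing the stream's codec.

    Hypothetical helper for illustration; not part of Docutils.
    """
    data = text.encode(encoding, errors)
    # Text streams such as sys.stdout expose their binary layer as .buffer;
    # a plain binary stream has no such attribute and is written to directly.
    target = getattr(stream, "buffer", stream)
    target.write(data)
    return data


# Demonstration on an in-memory binary stream: with an ASCII target and the
# "xmlcharrefreplace" handler, "ó" becomes a numeric character reference.
demo = io.BytesIO()
write_encoded("\u00f3", demo, encoding="ascii")
```

Passing `sys.stdout` as the stream would write UTF-8 bytes even when the locale encoding is ASCII, which corresponds to the situation in the `LC_ALL=C` command quoted above.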