
#170 C locale + Python 3 -> UnicodeDecodeError

closed-fixed
nobody
None
5
2011-10-20
2011-08-21
U
No

When using the C locale and Python 3.X, I cannot convert reST documents containing non-ASCII characters. It works fine when using Python 2.X:

$ printf '\303\263' > test.xml

$ rst2xml.py --version
rst2xml.py (Docutils 0.8 [release], Python 3.2.2rc1, on linux2)

$ LC_ALL=C python /usr/local/bin/rst2xml.py test.xml > /dev/null && echo OK
OK

$ LC_ALL=C python3 /usr/local/bin/rst2xml.py --traceback test.xml > /dev/null
Traceback (most recent call last):
File "/usr/local/bin/rst2xml.py", line 23, in <module>
publish_cmdline(writer_name='xml', description=description)
File "/usr/local/lib/python3.2/dist-packages/docutils/core.py", line 339, in publish_cmdline
config_section=config_section, enable_exit_status=enable_exit_status)
File "/usr/local/lib/python3.2/dist-packages/docutils/core.py", line 211, in publish
self.settings)
File "/usr/local/lib/python3.2/dist-packages/docutils/readers/__init__.py", line 68, in read
self.input = self.source.read()
File "/usr/local/lib/python3.2/dist-packages/docutils/io.py", line 238, in read
data = self.source.read()
File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Discussion

  • Günter Milde

    Günter Milde - 2011-08-22

    Thanks for the bug report -- however, I am not sure the behaviour is a bug.

    It is the standard Python 3 response to non-ASCII characters when no encoding is specified.
    With Python 2, Docutils does the input file decoding (including some guesswork),
    with Python 3 the standard file.read() method also decodes the result into a unicode string.
    Using "binary" mode is no sensible option:
    * rst files are text, not binary data
    * we lose the universal newline support (NL vs CR vs. CR/NL issue with different OS)

    Specify the input encoding, e.g. rst2xml.py --input-encoding=utf8

    We might consider catching the error and writing a more helpful message, but this should be discussed on the docutils-devel list.
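A minimal sketch of the behaviour described in this comment, not Docutils code. It assumes a temporary file holding the same two bytes as test.xml in the report, and it passes encoding="ascii" explicitly to stand in for the C locale, since newer Python versions may pick a different default encoding under that locale. The variable names are illustrative only.

```python
import os
import tempfile

# The two bytes from the report: UTF-8 for U+00F3 (o with acute accent).
data = b"\xc3\xb3"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.xml")
    with open(path, "wb") as f:
        f.write(data)

    # Text-mode read with an ASCII codec (assumed stand-in for the C locale):
    # this fails in the same way as the traceback in the report.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = exc

    # Naming the encoding explicitly makes the read independent of the locale
    # while keeping text mode (and therefore universal newline handling).
    with open(path, encoding="utf-8") as f:
        text = f.read()
```

The second read is roughly what specifying --input-encoding=utf8 is meant to achieve: the decoding no longer depends on the environment.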

     
  • Jakub Wilk

    Jakub Wilk - 2011-10-16

    (FWIW, I upgraded to Docutils 0.8.1 in the meantime.)

    I don't buy the "we can't guess encodings in Python 3" argument. In fact, Docutils is able to detect UTF-8 just fine when locale encoding is ISO-8859-n:

    $ LC_ALL=en_US.ISO-8859-1 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
    2dfeff49a2ce2aa24d6217e0160a8326 -

    $ LC_ALL=pl_PL.ISO-8859-2 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
    2dfeff49a2ce2aa24d6217e0160a8326 -

    $ LC_ALL=en_US.UTF-8 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
    2dfeff49a2ce2aa24d6217e0160a8326 -
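For readers unfamiliar with encoding guesswork, one common approach is to try a strict codec first and fall back to a permissive one. The sketch below illustrates that general idea only; decode_with_fallback and its candidate list are hypothetical and are not taken from Docutils.

```python
def decode_with_fallback(data, candidates=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes data.

    Hypothetical helper for illustration; not part of Docutils.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


# Valid UTF-8 is accepted by the first candidate.
text, used = decode_with_fallback(b"\xc3\xb3")

# A lone 0xf3 byte is not valid UTF-8, so the fallback codec is used.
latin_text, latin_used = decode_with_fallback(b"\xf3")
```

Trying UTF-8 first works reasonably well because arbitrary 8-bit text rarely happens to be valid UTF-8.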

    Also, adding --input-encoding=utf8 doesn't really help (which might be another bug). rst2xml.py just dies with a very confusing error message:

    $ LC_ALL=C python3 /usr/local/bin/rst2xml.py --input-encoding=utf8 test.xml
    UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 263: ordinal not in range(128)

    The specified output encoding (utf-8) cannot
    handle all of the output.
    Try setting "--output-encoding-error-handler" to

    * "xmlcharrefreplace" (for HTML & XML output);
    the output will contain "b'&#243;'" and should be usable.
    * "backslashreplace" (for other output formats);
    look for "b'\\xf3'" in the output.
    * "replace"; look for "?" in the output.

    "--output-encoding-error-handler" is currently set to "xmlcharrefreplace".

    Exiting due to error. Use "--traceback" to diagnose.
    If the advice above doesn't eliminate the error,
    please report it to <docutils-users@lists.sf.net>.
    Include "--traceback" output, Docutils version (0.8.1),
    Python version (3.2.2rc1), your OS type & version, and the
    command line used.

     
  • Günter Milde

    Günter Milde - 2011-10-20

    > Docutils is able to detect UTF-8 just fine when locale encoding is 8859-n:

    I cannot reproduce this:

    $ LC_ALL=en_US.ISO-8859-1 python3
    Python 3.2.1rc1 (default, May 18 2011, 11:01:17)
    [GCC 4.6.1 20110507 (prerelease)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> f = open('umlauts.txt')
    >>> f.read()
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

    > Also, adding --input-encoding=utf8 doesn't really help (which might be
    > another bug). rst2xml.py just dies with a very confusing error message:
    >
    > $ LC_ALL=C python3 /usr/local/bin/rst2xml.py --input-encoding=utf8
    > test.xml
    > UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in
    > position 263: ordinal not in range(128)

    Up to here, the error message is more than clear.

    The problem is that Docutils finds sys.stdout already open with the encoding
    and error handler set, and hence ignores the settings reported in the
    remainder of the error message. This is indeed a bug. It can be worked around
    by

    * specifying the expected input/output encoding in the LANG variable, or
    * specifying --input-encoding and an output file (which is then opened with the given encoding).

    Nonetheless, both problems should be solved with the latest SVN version.
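The stdout problem described in this comment can be illustrated with a small sketch. It assumes an in-memory text stream opened with the ASCII codec as a stand-in for sys.stdout under the C locale; it is not the fix that went into Docutils, and the variable names are illustrative only.

```python
import io

text = "\u00f3"  # the character from the report

# Assumed stand-in for a sys.stdout that is already open with the
# C locale's encoding and a strict error handler.
ascii_stdout = io.TextIOWrapper(io.BytesIO(), encoding="ascii", errors="strict")

try:
    ascii_stdout.write(text)
    ascii_stdout.flush()
    write_error = None
except UnicodeEncodeError as exc:
    # The stream's own settings win; encoding settings chosen elsewhere
    # are never consulted.
    write_error = exc

# Encoding explicitly and writing bytes bypasses the stream's text layer,
# so the requested encoding and error handler are actually applied.
raw = io.BytesIO()
raw.write(text.encode("ascii", errors="xmlcharrefreplace"))
charref_output = raw.getvalue()

# With UTF-8 as the output encoding, no error handler is needed at all.
raw_utf8 = text.encode("utf-8")
```

Writing to an output file opened with the given encoding, as suggested in the second workaround, avoids the pre-opened stream in the same way.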

     
  • Günter Milde

    Günter Milde - 2011-10-20
    • status: open --> closed-fixed
     
