Docutils: Documentation Utilities / Bugs / #170 C locale + Python 3 -> UnicodeDecodeError

Günter Milde - 2011-08-22

Thanks for the bug report -- however, I am not sure the behaviour is a bug.

It is the standard Python 3 response to non-ASCII characters when no encoding is specified.
With Python 2, Docutils does the input file decoding itself (including some guesswork);
with Python 3, the standard file.read() method already decodes the result into a unicode string, using the locale's encoding unless told otherwise.
Using "binary" mode is not a sensible option:
* rst files are text, not binary data
* we would lose the universal newline support (the NL vs. CR vs. CR/NL issue across operating systems)

Specify the input encoding, e.g. rst2xml.py --input-encoding=utf8
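
To illustrate the point about decoding, here is a minimal sketch of what happens in plain Python 3, independent of Docutils. The sample text, the temporary file, and the helper name read_text are made up for the illustration; passing encoding="ascii" stands in for what open() would pick by default under the C locale.

```python
import os
import tempfile

# Write a small UTF-8 file with non-ASCII characters (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("Gr\u00fc\u00dfe\n".encode("utf-8"))
    path = tmp.name


def read_text(path, encoding):
    """Read a file in text mode (universal newlines) with an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


# Decoding as ASCII fails on the first non-ASCII byte.
try:
    read_text(path, "ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the encoding explicitly works regardless of the locale.
text = read_text(path, "utf-8")
os.remove(path)
```

Naming the encoding keeps text mode and universal newline handling while removing the dependency on the locale.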

We might consider catching the error and writing a more helpful message, but this should be discussed on the docutils-devel list.

Jakub Wilk - 2011-10-16

(FWIW, I upgraded to Docutils 0.8.1 in the meantime.)

I don't buy the "we can't guess encodings in Python 3" argument. In fact, Docutils is able to detect UTF-8 just fine when locale encoding is ISO-8859-n:

$ LC_ALL=en_US.ISO-8859-1 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
2dfeff49a2ce2aa24d6217e0160a8326 -

$ LC_ALL=pl_PL.ISO-8859-2 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
2dfeff49a2ce2aa24d6217e0160a8326 -

$ LC_ALL=en_US.UTF-8 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
2dfeff49a2ce2aa24d6217e0160a8326 -

Also, adding --input-encoding=utf8 doesn't really help (which might be another bug). rst2xml.py just dies with a very confusing error message:

$ LC_ALL=C python3 /usr/local/bin/rst2xml.py --input-encoding=utf8 test.xml
UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 263: ordinal not in range(128)

The specified output encoding (utf-8) cannot
handle all of the output.
Try setting "--output-encoding-error-handler" to

* "xmlcharrefreplace" (for HTML & XML output);
the output will contain "b'&#243;'" and should be usable.
* "backslashreplace" (for other output formats);
look for "b'\\xf3'" in the output.
* "replace"; look for "?" in the output.

"--output-encoding-error-handler" is currently set to "xmlcharrefreplace".

Exiting due to error. Use "--traceback" to diagnose.
If the advice above doesn't eliminate the error,
please report it to <docutils-users@lists.sf.net>.
Include "--traceback" output, Docutils version (0.8.1),
Python version (3.2.2rc1), your OS type & version, and the
command line used.

Günter Milde - 2011-10-20

> Docutils is able to detect UTF-8 just fine when locale encoding is 8859-n:

I cannot reproduce this:

$ LC_ALL=en_US.ISO-8859-1 python3
Python 3.2.1rc1 (default, May 18 2011, 11:01:17)
[GCC 4.6.1 20110507 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('umlauts.txt')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

> Also, adding --input-encoding=utf8 doesn't really help (which might be
> another bug). rst2xml.py just dies with a very confusing error message:
>
> $ LC_ALL=C python3 /usr/local/bin/rst2xml.py --input-encoding=utf8
> test.xml
> UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in
> position 263: ordinal not in range(128)

Up to here, the error message is more than clear.

The problem is that Docutils finds sys.stdout already open with the encoding
and error handler set, and hence ignores the settings reported in the
remainder of the error message. This is indeed a bug. It can be worked around
by

* specifying the expected input/output encoding in the LANG variable, or
* specifying --input-encoding and an output file (which is then opened with the given encoding).
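
The stdout problem can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code Docutils uses: an in-memory io.BytesIO stands in for the terminal, ascii_out stands in for sys.stdout as opened under LC_ALL=C, and the helper rewrap is a hypothetical name.

```python
import io

# Stand-in for the byte stream behind sys.stdout (assumption: in-memory buffer).
raw = io.BytesIO()

# Stand-in for sys.stdout as Python 3 opens it under the C locale.
ascii_out = io.TextIOWrapper(raw, encoding="ascii", errors="strict")

# Writing a non-ASCII character fails, whatever the application's own settings say.
try:
    ascii_out.write("\u00f3")
    ascii_out.flush()
    failed = False
except UnicodeEncodeError:
    failed = True


def rewrap(stream, encoding, errors):
    """Return a new text wrapper around the byte buffer of `stream` (hypothetical helper)."""
    buffer = stream.detach()  # release the underlying buffer without closing it
    return io.TextIOWrapper(buffer, encoding=encoding, errors=errors)


# Re-wrapping the same byte stream applies the requested encoding and handler.
utf8_out = rewrap(ascii_out, "utf-8", "xmlcharrefreplace")
utf8_out.write("\u00f3")
utf8_out.flush()
data = raw.getvalue()

# For comparison: what "xmlcharrefreplace" produces when the target is ASCII.
charref = "\u00f3".encode("ascii", "xmlcharrefreplace")
```

Opening an output file with the given encoding, as in the second workaround, avoids the pre-configured stream altogether.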

Nonetheless, both problems should be solved with the latest SVN version.

Günter Milde - 2011-10-20

status: open --> closed-fixed
