Thread: [Docutils-develop] [ docutils-Bugs-3395948 ] C locale + Python 3 -> UnicodeDecodeError

[Docutils-develop] [ docutils-Bugs-3395948 ] C locale + Python 3 -> UnicodeDecodeError

From: SourceForge.net <no...@so...> - 2011-08-21 22:41:19

Bugs item #3395948, was opened at 2011-08-22 00:41
Message generated for change (Tracker Item Submitted) made by ubanus
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=422030&aid=3395948&group_id=38414

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Jakub Wilk (ubanus)
Assigned to: Nobody/Anonymous (nobody)
Summary: C locale + Python 3 -> UnicodeDecodeError

Initial Comment:
When using the C locale and Python 3.X, I cannot convert reST documents containing non-ASCII characters. It works fine when using Python 2.X:

$ printf '\303\263' > test.xml

$ rst2xml.py --version
rst2xml.py (Docutils 0.8 [release], Python 3.2.2rc1, on linux2)

$ LC_ALL=C python /usr/local/bin/rst2xml.py test.xml > /dev/null && echo OK
OK

$ LC_ALL=C python3 /usr/local/bin/rst2xml.py --traceback test.xml > /dev/null
Traceback (most recent call last):
  File "/usr/local/bin/rst2xml.py", line 23, in <module>
    publish_cmdline(writer_name='xml', description=description)
  File "/usr/local/lib/python3.2/dist-packages/docutils/core.py", line 339, in publish_cmdline
    config_section=config_section, enable_exit_status=enable_exit_status)
  File "/usr/local/lib/python3.2/dist-packages/docutils/core.py", line 211, in publish
    self.settings)
  File "/usr/local/lib/python3.2/dist-packages/docutils/readers/__init__.py", line 68, in read
    self.input = self.source.read()
  File "/usr/local/lib/python3.2/dist-packages/docutils/io.py", line 238, in read
    data = self.source.read()
  File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)


----------------------------------------------------------------------


[Docutils-develop] [ docutils-Bugs-3395948 ] C locale + Python 3 -> UnicodeDecodeError

From: SourceForge.net <no...@so...> - 2011-08-22 13:21:20

Bugs item #3395948, was opened at 2011-08-21 22:41
Message generated for change (Comment added) made by milde

----------------------------------------------------------------------

>Comment By: Günter Milde (milde)
Date: 2011-08-22 13:21

Message:
Thanks for the bug report. However, I am not sure the behaviour is a
bug.

It is the standard Python 3 response to non-ASCII characters when no
encoding is specified.
With Python 2, Docutils does the input file decoding (including some
guesswork); with Python 3, the standard file.read() method also decodes
the result into a unicode string.
Using "binary" mode is not a sensible option:
* rst files are text, not binary data
* we lose the universal newline support (NL vs. CR vs. CR/NL issue with
  different OS)

Specify the input encoding, e.g. rst2xml.py --input-encoding=utf8

We might consider catching the error and writing a more helpful message,
but this should be discussed on the docutils-devel list.
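
As an illustration of the behaviour described above, here is a minimal
Python 3 sketch (not Docutils code) that reproduces the decoding failure
and shows the effect of naming the encoding explicitly. The temporary file
name is made up. encoding="ascii" is passed explicitly as a stand-in for
the C-locale default, because newer Python versions may pick a different
default under that locale.

```python
import os
import tempfile

# The two bytes written by `printf '\303\263'` in the report: UTF-8 for "ó".
data = b"\xc3\xb3"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(data)

    # Text-mode read with an ASCII codec (assumed stand-in for LC_ALL=C).
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_result = "decoded without error"
    except UnicodeDecodeError as err:
        ascii_result = str(err)

    # Same file, same text mode, but with the encoding stated explicitly.
    with open(path, encoding="utf-8") as f:
        utf8_text = f.read()

print(ascii_result)
print(repr(utf8_text))
```

Text mode and universal newline handling are kept in both reads; only the
codec differs.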

----------------------------------------------------------------------


[Docutils-develop] [ docutils-Bugs-3395948 ] C locale + Python 3 -> UnicodeDecodeError

From: SourceForge.net <no...@so...> - 2011-10-16 13:27:34

Bugs item #3395948, was opened at 2011-08-22 00:41
Message generated for change (Comment added) made by jakub-wilk

----------------------------------------------------------------------

Comment By: Jakub Wilk (jakub-wilk)
Date: 2011-10-16 15:27

Message:
(FWIW, I upgraded to Docutils 0.8.1 in the meantime.)

I don't buy the "we can't guess encodings in Python 3" argument. In fact,
Docutils is able to detect UTF-8 just fine when the locale encoding is
ISO-8859-n:

$ LC_ALL=en_US.ISO-8859-1 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
2dfeff49a2ce2aa24d6217e0160a8326  -

$ LC_ALL=pl_PL.ISO-8859-2 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
2dfeff49a2ce2aa24d6217e0160a8326  -

$ LC_ALL=en_US.UTF-8 python3 /usr/local/bin/rst2xml.py test.xml | md5sum
2dfeff49a2ce2aa24d6217e0160a8326  -

Also, adding --input-encoding=utf8 doesn't really help (which might be
another bug). rst2xml.py just dies with a very confusing error message:

$ LC_ALL=C python3 /usr/local/bin/rst2xml.py --input-encoding=utf8 test.xml
UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 263: ordinal not in range(128)

The specified output encoding (utf-8) cannot
handle all of the output.
Try setting "--output-encoding-error-handler" to

* "xmlcharrefreplace" (for HTML & XML output);
  the output will contain "b'&#243;'" and should be usable.
* "backslashreplace" (for other output formats);
  look for "b'\\xf3'" in the output.
* "replace"; look for "?" in the output.

"--output-encoding-error-handler" is currently set to
"xmlcharrefreplace".

Exiting due to error.  Use "--traceback" to diagnose.
If the advice above doesn't eliminate the error,
please report it to <doc...@li...>.
Include "--traceback" output, Docutils version (0.8.1),
Python version (3.2.2rc1), your OS type & version, and the
command line used.
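
The three error handlers named in the message above can be tried directly
in Python. This is a minimal sketch, assuming the same character ('\xf3',
i.e. "ó") and an ASCII target encoding as in the failing run. It uses only
str.encode() and is not taken from Docutils.

```python
text = "\xf3"  # "ó", the character named in the UnicodeEncodeError

# With the default "strict" handler, ASCII cannot represent the character.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The handlers suggested by the error message, applied to the same text.
as_charref = text.encode("ascii", "xmlcharrefreplace")
as_backslash = text.encode("ascii", "backslashreplace")
as_question = text.encode("ascii", "replace")

print(strict_failed)
print(as_charref, as_backslash, as_question)
```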


----------------------------------------------------------------------



[Docutils-develop] [ docutils-Bugs-3395948 ] C locale + Python 3 -> UnicodeDecodeError

From: SourceForge.net <no...@so...> - 2011-10-20 23:05:08

Bugs item #3395948, was opened at 2011-08-21 22:41
Message generated for change (Comment added) made by milde
Category: None
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Jakub Wilk (ubanus)
Assigned to: Nobody/Anonymous (nobody)
Summary: C locale + Python 3 -> UnicodeDecodeError

----------------------------------------------------------------------

>Comment By: Günter Milde (milde)
Date: 2011-10-20 23:05

Message:
> Docutils is able to detect UTF-8 just fine when locale encoding is
> ISO-8859-n:

I cannot reproduce this:

$ LC_ALL=en_US.ISO-8859-1 python3
Python 3.2.1rc1 (default, May 18 2011, 11:01:17) 
[GCC 4.6.1 20110507 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('umlauts.txt')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

> Also, adding --input-encoding=utf8 doesn't really help (which might be
> another bug). rst2xml.py just dies with a very confusing error message:
> 
> $ LC_ALL=C python3 /usr/local/bin/rst2xml.py --input-encoding=utf8 test.xml
> UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 263: ordinal not in range(128)

Up to here, the error message is more than clear.

The problem is that Docutils finds sys.stdout already open with the
encoding and error handler set, and hence ignores the settings reported in
the remainder of the error message. This is indeed a bug. It can be worked
around by

* specifying the expected input/output encoding in the LANG variable, or
* specifying --input-encoding and an output file (which is then opened
  with the given encoding).

Nonetheless, both problems should be solved with the latest SVN version.
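
The stdout problem described above can be sketched in plain Python. This
is an illustration under stated assumptions, not the change made in
Docutils SVN: an io.TextIOWrapper over an in-memory buffer stands in for
sys.stdout as opened under the C locale, and writing pre-encoded bytes to
the underlying buffer is one possible way around the preset codec.

```python
import io

text = "\xf3"  # "ó"

# Stand-in for sys.stdout: a text stream whose encoding and error handler
# were fixed when it was opened (assumed ASCII/strict, as under LC_ALL=C).
raw = io.BytesIO()
ascii_stdout = io.TextIOWrapper(raw, encoding="ascii", errors="strict")

# Writing text goes through the stream's preset codec, whatever output
# encoding the application itself was asked to use.
try:
    ascii_stdout.write(text)
    ascii_stdout.flush()
    failed = False
except UnicodeEncodeError:
    failed = True

# Possible workaround: encode with the requested encoding and write the
# bytes to the underlying binary buffer (cf. sys.stdout.buffer).
ascii_stdout.buffer.write(text.encode("utf-8"))
written = raw.getvalue()

print(failed, written)
```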

----------------------------------------------------------------------

