Re: [Docutils-develop] [ docutils-Bugs-744982 ] Unicode error in utils if error message quotes non-ASCII


[Jens Quade]
>>> This could be another configuration option, with a default of
>>> UTF-8.

[David Goodger]
>> If stderr is ASCII-encoded, any character whose ord() > 127 will
>> cause a traceback.

[Jens Quade]
> A stream is a sequence of bytes, and has no inherent encoding. A
> non-Unicode-String in Python is also a sequence of bytes.  It is
> possible to write any character < 256 onto a stream, unless it's
> part of a Unicode string:

This is true, but beside the point (and unrelated to the text I
originally quoted).  The point is that if we have a configuration
option for the stderr stream encoding, and that option is set to
'ASCII', and some ord(character) > 127, we will get a traceback:

>>> settings.error_encoding
'ASCII'
>>> sys.stderr.write(u'\u00fc'.encode(settings.error_encoding))
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

But as you (seem to have) pointed out, we can use "errors='ignore'":

>>> u'\u00fc'.encode('ascii', 'ignore')
''

Not very useful though, since potentially important characters
disappear.  Better is 'replace':

>>> u'\u00fc'.encode('ascii', 'replace')
'?'

But not by much.  repr() isn't acceptable, because it does too much:

>>> repr(u'\u00fc')
"u'\\xfc'"

> PEP293 introduces encoding error callback functions into Python 2.3.

A little digging reveals the solution:

>>> u'\u00fc'.encode('ascii', 'backslashreplace')
'\\xfc'

This only works under Python 2.3 though.  I think the best solution
would be to establish a runtime setting with a version-agnostic
default set at startup::

     import codecs
     try:
         codecs.backslashreplace_errors
         settings.error_callback = 'backslashreplace'
     except AttributeError:
         settings.error_callback = 'replace'

And in docutils.utils.Reporter.system_message use::

     msgtext = unicode(msg.astext()).encode(
           settings.error_encoding, settings.error_callback)
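
Put together, the two snippets might look something like the sketch
below.  It is written for current Python 3, so ``str`` stands in for
``unicode`` and the result is ``bytes``.  The names
``pick_error_handler`` and ``encode_message`` are invented here for
illustration and are not Docutils functions; ``codecs.lookup_error`` is
used for the probe because it raises LookupError for an unregistered
handler name.

```python
import codecs


def pick_error_handler(preferred="backslashreplace", fallback="replace"):
    """Return `preferred` if this Python registers it, else `fallback`."""
    try:
        # lookup_error raises LookupError when the handler is unknown.
        codecs.lookup_error(preferred)
    except LookupError:
        return fallback
    return preferred


def encode_message(text, encoding="ascii", handler=None):
    """Encode an error message for a stream without raising UnicodeError."""
    if handler is None:
        handler = pick_error_handler()
    return text.encode(encoding, handler)


if __name__ == "__main__":
    # Hypothetical message text; any non-ASCII character would do.
    print(encode_message("Unknown directive: f\u00fcr"))
```

The handler is chosen once and passed around as a plain string, which
keeps the encoding call itself independent of the Python version.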

As for "--error-encoding", the default should be 'ASCII' as the lowest
common denominator.  Reporter objects don't know about runtime
settings now; either the settings object or the
settings.error_encoding and settings.error_callback values will have
to be passed to the constructor.  Or the "stream" object in each
ConditionSet could be wrapped by ``codecs.EncodedFile``.  Or something
like that; I don't have the will right now to figure out what's
correct.
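
To make the stream-wrapping idea slightly more concrete, here is a
rough sketch in Python 3 syntax.  It uses ``codecs.getwriter`` rather
than ``codecs.EncodedFile``, since a stream writer accepts text and
encodes it on the way out.  That substitution and the
``wrap_error_stream`` name are assumptions made for illustration, not
anything that exists in Docutils.

```python
import codecs
import io


def wrap_error_stream(stream, encoding="ascii", errors="backslashreplace"):
    """Wrap a byte stream so that text written to it is encoded safely."""
    writer_class = codecs.getwriter(encoding)
    return writer_class(stream, errors=errors)


if __name__ == "__main__":
    raw = io.BytesIO()  # stands in for a byte-oriented stderr
    wrapped = wrap_error_stream(raw)
    wrapped.write("Unknown directive: f\u00fcr\n")
    print(raw.getvalue())
```

With a wrapper like this the Reporter would not need to know the
encoding settings at all; whoever constructs the stream would apply
them once.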

>>> I'll add another short demo, containing some kanji characters.
>>
>> "Fireworks"!  How does that work with your patch?
> 
> It works fine. And with a UTF-8 xterm or Terminal, it is even
> readable in the error message.  With a latin-1 terminal, it's still
> printed, but not readable.

"Printed, but not readable" is not very useful.  "?" or "\u####" is
better than garbage.
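
A quick sketch of what that garbage looks like, in Python 3 syntax:
UTF-8 bytes shown on a Latin-1 terminal are in effect decoded as
Latin-1.  The function name ``as_seen_on_terminal`` is hypothetical.

```python
def as_seen_on_terminal(text, written_as="utf-8", terminal="latin-1"):
    """Simulate text encoded one way and displayed by a terminal another way."""
    return text.encode(written_as).decode(terminal, "replace")


if __name__ == "__main__":
    # U+00FC written as UTF-8 is two bytes, which a Latin-1 terminal
    # shows as two unrelated characters.
    print(ascii(as_seen_on_terminal("\u00fc")))
    # The escaped form stays unambiguous on any terminal.
    print("\u00fc".encode("ascii", "backslashreplace"))
```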

> Another thought: If the error messages only quote text from the
> original file, it would be possible to default to the encoding used
> for the source file.

I don't think we can safely assume that input encoding and terminal
encoding are related.  Better to be explicit.

-- 
David Goodger    http://starship.python.net/~goodger

Programmer/sysadmin for hire: http://starship.python.net/~goodger/cv
