From: David G. <go...@py...> - 2003-05-29 13:59:00
[Jens Quade]
>>> This could be another configuration option, with a default of
>>> UTF-8.

[David Goodger]
>> If stderr is ASCII-encoded, any character whose ord() > 127 will
>> cause a traceback.

[Jens Quade]
> A stream is a sequence of bytes, and has no inherent encoding.  A
> non-Unicode-String in Python is also a sequence of bytes.  It is
> possible to write any character < 256 onto a stream, unless it's
> part of a Unicode string:

This is true, but beside the point (and unrelated to the text I
originally quoted).  The point is that if we have a configuration
option for the stderr stream encoding, and that option is set to
'ASCII', and some ord(character) > 127, we will get a traceback:

    >>> settings.error_encoding
    'ASCII'
    >>> sys.stderr.write(u'\u00fc'.encode(settings.error_encoding))
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: ASCII encoding error: ordinal not in range(128)

But as you (seem to have) pointed out, we can use "errors='ignore'":

    >>> u'\u00fc'.encode('ascii', 'ignore')
    ''

Not very useful though, since potentially important characters
disappear.  Better is 'replace':

    >>> u'\u00fc'.encode('ascii', 'replace')
    '?'

But not by much.  repr() isn't acceptable, because it does too much:

    >>> repr(u'\u00fc')
    "u'\\xfc'"

> PEP293 introduces encoding error callback functions into Python 2.3.

A little digging reveals the solution:

    >>> u'\u00fc'.encode('ascii', 'backslashreplace')
    '\\xfc'

This only works under Python 2.3 though.  I think the best solution
would be to establish a runtime setting with a version-agnostic
default set at startup::

    import codecs
    try:
        codecs.backslashreplace_errors
        settings.error_callback = 'backslashreplace'
    except AttributeError:
        settings.error_callback = 'replace'

And in docutils.utils.Reporter.system_message use::

    msgtext = unicode(msg.astext()).encode(
        settings.error_encoding, settings.error_callback)

As for "--error-encoding", the default should be 'ASCII' as the
lowest common denominator.
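
For illustration only, here is a minimal sketch of the fallback idea
above, written for a current Python 3 interpreter rather than the 2.x
interpreters discussed in this thread.  The names pick_error_handler
and encode_message are hypothetical and are not part of Docutils.  The
sketch probes for the handler with codecs.lookup_error() instead of
testing for the codecs.backslashreplace_errors attribute; that is a
swapped-in technique, not the code proposed above.

```python
import codecs


def pick_error_handler(preferred="backslashreplace", fallback="replace"):
    """Return `preferred` if this interpreter registers it, else `fallback`.

    Hypothetical helper: codecs.lookup_error() raises LookupError when
    no error handler is registered under the given name.
    """
    try:
        codecs.lookup_error(preferred)
    except LookupError:
        return fallback
    return preferred


def encode_message(text, encoding="ascii", errors=None):
    """Encode `text` for an error stream without raising on odd characters.

    Hypothetical helper: `encoding` and `errors` stand in for the
    settings.error_encoding and settings.error_callback values
    discussed above.
    """
    if errors is None:
        errors = pick_error_handler()
    return text.encode(encoding, errors)


if __name__ == "__main__":
    # A message containing u-umlaut, encoded for an ASCII-only stream.
    print(encode_message("Unknown directive: \u00fcber"))
```

With 'backslashreplace' the offending character survives as an escape
sequence, so nothing is silently dropped from the message; with
'replace' it collapses to "?".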

Reporter objects don't know about runtime settings now; either the
settings object or the settings.error_encoding and
settings.error_callback values will have to be passed to the
constructor.  Or the "stream" object in each ConditionSet could be
wrapped by ``codecs.EncodedFile``.  Or something like that; I don't
have the will right now to figure out what's correct.

>>> I'll add another short demo, containing some kanji characters.
>>
>> "Fireworks"!  How does that work with your patch?
>
> It works fine.  And with a UTF-8 xterm or Terminal, it is even
> readable in the error message.  With a latin-1 terminal, it's still
> printed, but not readable.

"Printed, but not readable" is not very useful.  "?" or "\u####" is
better than garbage.

> Another thought: If the error messages only quote text from the
> original file, it would be possible to default to the encoding used
> for the source file.

I don't think we can safely assume that input encoding and terminal
encoding are related.  Better to be explicit.

-- 
David Goodger    http://starship.python.net/~goodger
Programmer/sysadmin for hire: http://starship.python.net/~goodger/cv