[Jython-dev] sys.setdefaultencoding and str to unicode coercion

Incidental to working on http://bugs.jython.org/issue2632, I noticed 
that mixed comparisons of unicode and str do not produce the same 
results in Jython as in CPython.

CPython:

>>> u = u"caf\xe9"
>>> u == u.encode('latin-1')
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

Jython:

>>> u = u"caf\xe9"
>>> u == u.encode('latin-1')
True

CPython decodes the str operand (on whichever side of the == it appears) 
to unicode using the default encoding, if it can. Jython just compares 
the internal Java string without reference to the default encoding. This 
is fairly minor when the default is ASCII but becomes quite significant 
when someone uses sys.setdefaultencoding('utf-8'), say, in site.py or 
with the reload(sys) trick.
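
A rough model of the two behaviours, written so that it runs on Python 3, 
where bytes stands in for the Python 2 str. The helper names are 
hypothetical, and treating the Jython comparison as equivalent to a 
Latin-1 decode is an assumption about how byte values sit in the Java 
string, not something taken from the Jython source:

```python
import warnings


def cpython_style_eq(text, raw, default_encoding="ascii"):
    """Sketch of CPython 2's unicode == str: decode with the default encoding."""
    try:
        decoded = raw.decode(default_encoding)
    except UnicodeDecodeError:
        # CPython 2 warns and treats the operands as unequal.
        warnings.warn("failed to convert both arguments to Unicode",
                      UnicodeWarning)
        return False
    return text == decoded


def jython_style_eq(text, raw):
    """Sketch of Jython's comparison, ignoring the default encoding.

    Assumption: each byte value maps to the char with the same code
    point, which is what a Latin-1 decode does.
    """
    return text == raw.decode("latin-1")


u = u"caf\xe9"
latin1 = u.encode("latin-1")
utf8 = u.encode("utf-8")

print(cpython_style_eq(u, latin1))         # False: 0xe9 is not ASCII
print(jython_style_eq(u, latin1))          # True
print(cpython_style_eq(u, utf8, "utf-8"))  # True with a utf-8 default
print(jython_style_eq(u, utf8))            # False
```

With an ASCII default the two only disagree on non-ASCII bytes, but with 
a utf-8 default they can disagree in both directions, as the last two 
lines suggest.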

This trick is unreliable and I think we would not recommend it to 
anyone. Nevertheless, some people find it the only way to use Python 2 
libraries that have not thoroughly provided for Unicode. Also, it makes 
a test I devised for the csv module work in CPython and fail in Jython. 
I got the impression you couldn't reload sys satisfactorily in Jython, 
but it seems to work.

If someone does use this trick, do we intend to approximate CPython 
behaviour as closely as we can?

Jeff

-- 
Jeff Allen