[Jython-dev] Unicode user and file names (and v2.7.1)

I've been working on http://bugs.jython.org/issue2356 which I'd like to 
get in 2.7.1 -- it seems rather poor that Jython simply does not run for 
users whose names have an un-American character ;). I know this issue is 
not a blocker in most minds.

I've made pretty good progress by allowing file names to be unicode 
objects more often than they would be in CPython 2, which usually 
returns them as bytes in some encoding that we may not know. I've got 
the launcher to work properly, and straightened the logic in our 
printing of trace-backs and exceptions from Java. Unicode file names 
seem the way to go for Jython because:

 1. Java gives us competently decoded unicode file names, from
    java.io.File, etc. Re-encoding the result will be a pain (and
    easily overlooked).
 2. We appear not to have the codec we need ('mbcs'), which CPython
    reports on Windows via sys.getfilesystemencoding().
 3. We do this already. In 2.7.0, os.getcwd() returns unicode if necessary.

Most regression tests pass. However, I'm struggling with test_doctest. 
Problems arise when mixing unicode and bytes when any byte is 128 or 
over. This happens in ''.join(list) and formatted output like "%s %s" % 
(ustr, bstr). The behaviour of these is identical with CPython: they 
raise UnicodeDecodeError because the bytes are promoted to characters 
with a strict ascii interpretation. This happens a lot in doctest.py and 
traceback.py, for example, where file paths, and the stack dumps that 
include them, are now frequently unicode, while other inputs are byte 
data containing file paths presented in the console encoding.

I can beat this into submission with enough customisation of the stdlib 
modules, but that always makes me uncomfortable. I usually see that as a 
hint that user code might also need to change. This may be unfounded. I 
can probably ensure no impact to users of only ascii paths, and the 
others seem unable to run Jython at all (in the scope of this issue). 
However, I'm seriously wondering if I should pursue the approach where 
file names from Java are re-encoded to bytes (maybe as utf-8 
everywhere), but that's grim.
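
For comparison, the kind of stdlib customisation I mean looks roughly like the sketch below: decode any bytes explicitly before they meet unicode. The names to_unicode and join_mixed are hypothetical, and falling back to sys.stdout.encoding (then utf-8) is an assumption about where the byte data came from, not a settled design.

```python
import sys

def to_unicode(s, encoding=None):
    """Return s as unicode text, decoding bytes with the given encoding."""
    if isinstance(s, bytes):
        enc = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
        # 'replace' keeps a traceback printable even if the guess is wrong.
        return s.decode(enc, "replace")
    return s

def join_mixed(parts, encoding=None):
    """Join a mix of unicode and byte strings without implicit ascii promotion."""
    return u"".join(to_unicode(p, encoding) for p in parts)

# Example: a unicode path from Java plus byte data in an assumed cp1252 console encoding.
line = join_mixed([u"File ", u"Andr\u00e9.py".encode("cp1252"), u", line 1"], "cp1252")
```

The alternative of re-encoding everything from Java to bytes would move the same guesswork to the point where the names enter Python.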

Thoughts?

-- 
Jeff Allen