Re: [Jython-dev] Unicode user and file names (and v2.7.1)

I went for sys.getfilesystemencoding() == 'utf-8' and it works pretty 
well. Rather than just push directly I have published to here:

https://bitbucket.org/tournesol/jython-utf8

I write to ask for a second or third pair of eyes on it. Please tell me 
you can see it and whether it breaks things you care about.

I touched a lot of files in the core and import system: quite a lot of 
tricky stuff with loaders and search paths has been adjusted. I think it 
is a good sign that I changed hardly anything in the standard library we 
inherit from CPython that we hadn't already specialised.

By "works pretty well" above, I mean that the regression tests run 
cleanly for me when my user name is "Épreuve", where previously Jython 
died horribly. The launcher works from a Chinese user name too, as long 
as I localise Windows to China (CPython 2.7 feature). I can use the 
prompt and run some tests with that setup, but I can't run the 
regression tests yet, and printing a stack dump is fatal, so there's a 
bit more to do for Chinese.

I think this means we have solid support for "latin-1" languages, but 
there are still places where we fatally assume bytes are Unicode code 
points.
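
A minimal sketch of why "latin-1" names survive that assumption while
others do not. It is illustrative only and not taken from the Jython
source: the helper names are hypothetical, and decoding as latin-1
stands in for "treat each byte value as a code point", since the two
are equivalent.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only; helper names are hypothetical, not Jython code.

name = u"\u00c9preuve"                 # the user name "Epreuve" with E-acute
latin1_bytes = name.encode("latin-1")  # one byte per character
utf8_bytes = name.encode("utf-8")      # E-acute becomes two bytes: C3 89


def bytes_as_code_points(data):
    """Treat each byte value as a Unicode code point (the faulty assumption).

    Decoding as latin-1 does exactly this: byte N becomes code point N.
    """
    return data.decode("latin-1")


def fits_in_latin1(text):
    """Return True if every character of text has a code point below 256."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# latin-1 bytes survive the assumption unchanged ...
survives = bytes_as_code_points(latin1_bytes) == name
# ... but UTF-8 bytes come back as two wrong characters plus the rest.
garbled = bytes_as_code_points(utf8_bytes)
# A Chinese name cannot be represented this way at all.
chinese_fits = fits_in_latin1(u"\u6d4b\u8bd5")
```

Here `survives` is true and `garbled` differs from `name`, which is
consistent with latin-1 user names working while other encodings
still hit trouble.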

Jeff Allen

On 05/04/2017 08:57, Jeff Allen wrote:
> I've been working on http://bugs.jython.org/issue2356 which I'd like to
> get in 2.7.1 -- it seems rather poor that Jython simply does not run for
> users whose names have an un-American character ;). I know this issue is
> not a blocker in most minds.
>
> I've made pretty good progress by allowing file names to be unicode
> objects more often than they would be in CPython 2, which usually
> returns them as bytes in some encoding that we may not know. I've got
> the launcher to work properly, and straightened the logic in our
> printing of trace-backs and exceptions from Java. Unicode file names
> seem the way to go for Jython because:
>
>   1. Java gives us competently decoded unicode file names, from
>      java.io.File, etc. Re-encoding the result will be a pain (and
>      overlooked).
>   2. We appear not to have the codec we need ('mbcs'), that CPython
>      reports on Windows via sys.getfilesystemencoding().
>   3. We do this already. In 2.7.0, os.getcwd() returns unicode if necessary.
>
> Most regression tests pass. However, I'm struggling with test_doctest.
> Problems arise when mixing unicode and bytes when any byte is 128 or
> over. This happens in ''.join(list) and formatted output like "%s %s" %
> (ustr, bstr). The behaviour of these is identical with CPython: they
> raise UnicodeDecodeError because the bytes are promoted to characters
> with a strict ascii interpretation. This happens a lot in doctest.py and
> traceback.py, for example, where file paths, and stack dumps that include
> them, are now frequently unicode, while other inputs are byte data
> containing file paths presented in the console encoding.
>
> I can beat this into submission with enough customisation of the stdlib
> modules, but that always makes me uncomfortable. I usually see that as a
> hint that user code might also need to change. This may be unfounded. I
> can probably ensure no impact to users of only ascii paths, and the
> others seem unable to run Jython at all (in the scope of this issue).
> However, I'm seriously wondering if I should pursue the approach where
> file names from Java are re-encoded to bytes (maybe as utf-8
> everywhere), but that's grim.
>
> Thoughts?
>
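
The mixing failure described in the quoted message can be sketched as
follows. This is a hedged illustration, not code from doctest.py or
traceback.py: `join_mixed` is a hypothetical helper, the explicit
strict-ascii decode mimics the implicit promotion Python 2 performs,
and utf-8 is only an assumed console encoding.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only; join_mixed is a hypothetical helper.


def join_mixed(parts, encoding="ascii"):
    """Join unicode and byte strings into one unicode string.

    Byte strings are decoded with `encoding`. The default, strict ascii,
    mimics the implicit promotion Python 2 applies when bytes meet unicode.
    """
    out = []
    for part in parts:
        if isinstance(part, bytes):
            part = part.decode(encoding)  # strict: bad bytes raise
        out.append(part)
    return u"".join(out)


upath = u"C:\\Users\\\u00c9preuve\\test.py"  # file name as unicode
bpath = upath.encode("utf-8")                # same path as console bytes

# Strict ascii promotion fails as soon as a byte is 128 or over.
try:
    join_mixed([u"File ", bpath])
    promotion_failed = False
except UnicodeDecodeError:
    promotion_failed = True

# Decoding with the (assumed) known encoding avoids the error.
fixed = join_mixed([u"File ", bpath], encoding="utf-8")

# Pure-ascii paths are unaffected either way.
ascii_ok = join_mixed([u"File ", b"C:\\Users\\jeff\\test.py"])
```

The last case matches the expectation that users of only ascii paths
see no change; the difficulty is that the right encoding for the byte
inputs is not always known.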