From: Jeff A. <ja...@fa...> - 2017-05-01 13:33:51
I went for sys.getfilesystemencoding() == 'utf-8' and it works pretty well. Rather than just push directly, I have published it here:

https://bitbucket.org/tournesol/jython-utf8

I write to ask for a second or third pair of eyes on it. Please tell me you can see it and whether it breaks things you care about.

I touched a lot of files in the core and import system: quite a lot of tricky stuff with loaders and search paths has been adjusted. I think it a good sign that I changed hardly anything in the standard library we inherit from CPython that we hadn't already specialised.

By "works pretty well" above, I mean that the regression tests run cleanly for me when my user name is "Épreuve", where previously Jython died horribly. The launcher works from a Chinese user name too, as long as I localise Windows to China (CPython 2.7 feature). I can use the prompt and run some tests with that setup, but I can't run the regression test yet, and printing a stack dump is fatal, so there's a bit more to do for Chinese.

I think this means we have solid support for "latin-1" languages, but there are still places where we fatally assume bytes are Unicode code points.

Jeff Allen

On 05/04/2017 08:57, Jeff Allen wrote:
> I've been working on http://bugs.jython.org/issue2356 which I'd like to
> get in 2.7.1 -- it seems rather poor that Jython simply does not run for
> users whose names have an un-American character ;). I know this issue is
> not a blocker in most minds.
>
> I've made pretty good progress by allowing file names to be unicode
> objects more often than they would be in CPython 2, which usually
> returns them as bytes in some encoding that we may not know. I've got
> the launcher to work properly, and straightened the logic in our
> printing of trace-backs and exceptions from Java. Unicode file names
> seem the way to go for Jython because:
>
> 1. Java gives us competently decoded unicode file names, from
>    java.io.File, etc. Re-encoding the result will be a pain (and
>    overlooked).
> 2. We appear not to have the codec we need ('mbcs'), that CPython
>    reports on Windows via sys.getfilesystemencoding().
> 3. We do this already. In 2.7.0, os.getcwd() returns unicode if necessary.
>
> Most regression tests pass. However, I'm struggling with test_doctest.
> Problems arise when mixing unicode and bytes when a byte is 128 or
> over. This happens in ''.join(list) and formatted output like "%s %s" %
> (ustr, bstr). The behaviour of these is identical with CPython: they
> raise UnicodeDecodeError because the bytes are promoted to characters
> with a strict ascii interpretation. This happens a lot in doctest.py and
> traceback.py, for example, where file paths, and stack dumps that include
> them, are now frequently unicode, while other inputs are byte data
> containing file paths presented in the console encoding.
>
> I can beat this into submission with enough customisation of the stdlib
> modules, but that always makes me uncomfortable. I usually see that as a
> hint that user code might also need to change. This may be unfounded. I
> can probably ensure no impact to users of only ascii paths, and the
> others seem unable to run Jython at all (in the scope of this issue).
> However, I'm seriously wondering if I should pursue the approach where
> file names from Java are re-encoded to bytes (maybe as utf-8
> everywhere), but that's grim.
>
> Thoughts?
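
To make the two failure modes in this thread concrete, here is a minimal sketch. It is not code from the jython-utf8 repository. The user names and the helper functions (promote_like_python2, join_mixed, bytes_as_code_points) are hypothetical, and the encodings (cp1252 for the console, utf-8 for the Chinese name) are assumptions chosen for illustration. The strict ascii promotion that Python 2 performs implicitly is written out as an explicit decode so the sketch behaves the same on Python 2 and 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Hypothetical user names, chosen only for illustration.
LATIN_NAME = "\u00c9preuve"   # "Epreuve" with E-acute, as in the message
CJK_NAME = "\u5f20\u4f1f"     # two Chinese characters


def promote_like_python2(data):
    """Decode bytes with the strict 'ascii' codec, which is what Python 2
    does implicitly when a byte string meets a unicode string."""
    return data.decode("ascii")


def join_mixed(text, data):
    """Mimic u"%s %s" % (text, data) on Python 2 by promoting data first."""
    return "%s %s" % (text, promote_like_python2(data))


def bytes_as_code_points(data):
    """Treat each byte as one code point; latin-1 decoding does exactly that."""
    return data.decode("latin-1")


# Pure ascii bytes promote cleanly, so ascii-only paths are unaffected.
ascii_ok = join_mixed(LATIN_NAME, b"ok")

# A byte of 128 or over (0xC9 here, assuming a cp1252 console) cannot be
# promoted under the strict ascii interpretation.
try:
    join_mixed(LATIN_NAME, LATIN_NAME.encode("cp1252"))
    mixed_failed = False
except UnicodeDecodeError:
    mixed_failed = True

# Reading bytes as code points happens to work for latin-1 text ...
latin_round_trip = bytes_as_code_points(LATIN_NAME.encode("latin-1")) == LATIN_NAME
# ... but not for a multi-byte encoding (utf-8 is assumed here).
cjk_round_trip = bytes_as_code_points(CJK_NAME.encode("utf-8")) == CJK_NAME

print(mixed_failed, latin_round_trip, cjk_round_trip)  # True True False
```

The last two results suggest why a name like "Épreuve" can work while a Chinese name still fails wherever bytes are taken to be code points: for latin-1 text the byte values and the code points coincide, and for multi-byte encodings they do not.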