From: Jeff A. <ja...@fa...> - 2017-04-05 07:58:07
I've been working on http://bugs.jython.org/issue2356, which I'd like to get into 2.7.1 -- it seems rather poor that Jython simply does not run for users whose names have an un-American character ;). I know this issue is not a blocker in most minds.

I've made pretty good progress by allowing file names to be unicode objects more often than they would be in CPython 2, which usually returns them as bytes in some encoding that we may not know. I've got the launcher to work properly, and straightened out the logic in our printing of trace-backs and exceptions from Java.

Unicode file names seem the way to go for Jython because:

1. Java gives us competently decoded unicode file names, from java.io.File, etc. Re-encoding the result will be a pain (and easily overlooked).
2. We appear not to have the codec we need ('mbcs'), which CPython reports on Windows via sys.getfilesystemencoding().
3. We do this already. In 2.7.0, os.getcwd() returns unicode if necessary.

Most regression tests pass. However, I'm struggling with test_doctest. Problems arise when mixing unicode and bytes where any byte is 128 or over. This happens in ''.join(list) and in formatted output like "%s %s" % (ustr, bstr). The behaviour of these is identical to CPython's: they raise UnicodeDecodeError because the bytes are promoted to characters with a strict ascii interpretation. This happens a lot in doctest.py and traceback.py, for example, where file paths, and the stack dumps that include them, are now frequently unicode, while other inputs are byte data containing file paths presented in the console encoding.

I can beat this into submission with enough customisation of the stdlib modules, but that always makes me uncomfortable. I usually see that as a hint that user code might also need to change. This may be unfounded. I can probably ensure no impact on users with only ascii paths, and the others seem unable to run Jython at all (in the scope of this issue).
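The failure described above can be sketched in a few lines. This is only an illustration, not code from Jython or the stdlib: promote() and join_mixed() are hypothetical helpers that make Python 2's implicit bytes-to-unicode promotion explicit (so the sketch behaves the same on Python 2 and 3), and cp850 stands in for the console encoding as an assumption.

```python
# Sketch only: promote() and join_mixed() are hypothetical helpers, not
# part of Jython or the standard library.

def promote(value, encoding="ascii"):
    """Decode bytes strictly, as Python 2 does implicitly when mixing types."""
    if isinstance(value, bytes):
        # Strict decoding: raises UnicodeDecodeError if the bytes are not
        # valid in the given encoding (for ascii, any byte >= 128).
        return value.decode(encoding)
    return value


def join_mixed(parts, encoding="ascii"):
    """Join unicode and byte strings after promoting the bytes to unicode."""
    return u"".join(promote(p, encoding) for p in parts)


# A unicode path, as Java would supply it (illustrative value).
upath = u"C:\\Users\\Andr\u00e9\\test.py"

# The same path as bytes in an assumed console encoding (cp850 here).
bpath = upath.encode("cp850")

if __name__ == "__main__":
    try:
        join_mixed([upath, bpath])            # strict ascii promotion fails
    except UnicodeDecodeError as exc:
        print("promotion failed: %s" % exc.reason)
    # Decoding with the encoding the bytes were actually written in works.
    print(join_mixed([upath, bpath], "cp850") == upath + upath)
```

Pure ascii byte strings pass through the same promotion unharmed, which is why users with only ascii paths should see no difference.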

However, I'm seriously wondering if I should pursue the approach where file names from Java are re-encoded to bytes (maybe as utf-8 everywhere), but that's grim.

Thoughts?

-- 
Jeff Allen