Re: [Jython-dev] Unicode user and file names (and v2.7.1)

Hi Darjus.

On inclusion, I'm happy to go with the community view, as always. On one 
of the related tickets (http://bugs.jython.org/issue1839), Jim said we'd 
get it in if timing allowed and there was some user support.

I'm very keen to see a 2.7.1 too. The last (soft) RC was unsuccessful, 
and we're still making changes, so I assume we're talking about another 
RC first rather than a release?

The UTF-8 work is nearly there, but not quite: one Linux defect to fix, 
as noted on the same issue by James against the "latin-1" version. After 
all the additions in the last couple of weeks (to get full BMP support), 
I'm happy to find from my Linux laptop that it is still the only thing I 
have to do. It looks trivial. I've been unable to code at all for a few 
days, so haven't looked into a solution, but now I'm back I expect to 
nail it for us today or tomorrow.

I can, of course, merge all this myself and will. I shared your 
hesitancy initially, hence the fork repository, but it's turned out so 
well I feel it's now low risk, as long as we still have a few days.

I will now dive under the desk and wire up my Linux dev box.

Jeff Allen

On 16/05/2017 21:46, Darjus Loktevic wrote:
> Hey Jeff,
>
> It seems your last commit to this branch was three days ago. Is this 
> ready for review? BTW, your changes look good to me.
> I'm a little hesitant to merge this since we've had an RC and REALLY 
> have to release 2.7.1. It's miles better than 2.7.0.
>
> Cheers,
> Darjus
>
> On Mon, May 1, 2017 at 6:34 AM Jeff Allen <ja...@fa...> wrote:
>
>     I went for sys.getfilesystemencoding() == 'utf-8' and it works pretty
>     well. Rather than just push directly I have published to here:
>
>     https://bitbucket.org/tournesol/jython-utf8
>
>     I write to ask for a second or third pair of eyes on it. Please tell me
>     you can see it and whether it breaks things you care about.
>
>     I touched a lot of files in the core and import system: quite a lot of
>     tricky stuff with loaders and search paths has been adjusted. I think it
>     a good sign that I changed hardly anything in the standard library we
>     inherit from CPython, that we hadn't already specialised.
>
>     By "works pretty well" above, I mean that the regression tests run
>     cleanly for me when my user name is "Épreuve", where previously Jython
>     died horribly. The launcher works from a Chinese user name too, as long
>     as I localise Windows to China (CPython 2.7 feature). I can use the
>     prompt and run some tests with that setup, but I can't run the
>     regression tests yet, and printing a stack dump is fatal, so there's a
>     bit more to do for Chinese.
>
>     I think this means we have solid support for "latin-1" languages, but
>     there are still places where we fatally assume bytes are Unicode code
>     points.
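
To illustrate why "latin-1" names survive that assumption and others do not, here is a minimal sketch (not code from the branch; the helper names are made up for illustration). Decoding as latin-1 is exactly "one byte equals one code point", so it stands in for the faulty assumption:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only: helper names are hypothetical, not Jython APIs.

def bytes_as_code_points(data):
    """Treat every byte as a code point, which is what latin-1 decoding does."""
    return data.decode("latin-1")

def fits_latin1(text):
    """Return True if every character of text is below U+0100."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

name = u"\u00c9preuve"           # "Épreuve"
latin1 = name.encode("latin-1")  # one byte per character
utf8 = name.encode("utf-8")      # "É" becomes the two bytes C3 89

# The assumption is harmless for latin-1 bytes ...
assert bytes_as_code_points(latin1) == name
# ... but it mangles UTF-8 bytes into two wrong characters.
assert bytes_as_code_points(utf8) != name
# A Chinese name cannot be carried one byte per character at all.
assert not fits_latin1(u"\u4e2d\u6587")
```

This runs unchanged on Python 2.7 and Python 3; it says nothing about where in Jython the assumption is actually made.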
>
>     Jeff Allen
>
>     On 05/04/2017 08:57, Jeff Allen wrote:
>     > I've been working on http://bugs.jython.org/issue2356 which I'd like to
>     > get in 2.7.1 -- it seems rather poor that Jython simply does not run for
>     > users whose names have an un-American character ;). I know this issue is
>     > not a blocker in most minds.
>     >
>     > I've made pretty good progress by allowing file names to be unicode
>     > objects more often than they would be in CPython 2, which usually
>     > returns them as bytes in some encoding that we may not know. I've got
>     > the launcher to work properly, and straightened the logic in our
>     > printing of trace-backs and exceptions from Java. Unicode file names
>     > seem the way to go for Jython because:
>     >
>     >   1. Java gives us competently decoded unicode file names, from
>     >      java.io.File, etc. Re-encoding the result will be a pain (and
>     >      overlooked).
>     >   2. We appear not to have the codec we need ('mbcs'), that CPython
>     >      reports on Windows via sys.getfilesystemencoding().
>     >   3. We do this already. In 2.7.0, os.getcwd() returns unicode
>     >      if necessary.
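
Point 2 is easy to check on any interpreter. A small sketch, assuming nothing beyond the standard codecs and sys modules (codec_available is a hypothetical helper name, not an existing function):

```python
from __future__ import print_function

import codecs
import sys

def codec_available(name):
    """Return True if this interpreter can look up the named codec."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

if __name__ == "__main__":
    # What the interpreter claims file names are encoded in ...
    print("filesystem encoding:", sys.getfilesystemencoding())
    # ... and whether the codec CPython reports on Windows exists here.
    print("mbcs available:", codec_available("mbcs"))
    print("utf-8 available:", codec_available("utf-8"))
```

The output depends on the platform and interpreter, so none is shown.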
>     >
>     > Most regression tests pass. However, I'm struggling with test_doctest.
>     > Problems arise when mixing unicode and bytes when one byte is 128 or
>     > over. This happens in ''.join(list) and formatted output like
>     > "%s %s" % (ustr, bstr). The behaviour of these is identical with
>     > CPython: they raise UnicodeDecodeError because the bytes are promoted
>     > to characters with a strict ascii interpretation. This happens a lot in
>     > doctest.py and traceback.py, for example, where file paths, and stack
>     > dumps that include them, are now frequently unicode, while other inputs
>     > are byte data containing file paths presented in the console encoding.
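
The failure described above can be sketched as follows. Python 2 promotes the byte string implicitly; to keep the sketch runnable on Python 3 as well, the promotion is written out as an explicit strict-ASCII decode. The promote and join_mixed helpers are illustrative names, not functions in doctest.py or traceback.py:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: mimics Python 2's implicit bytes-to-unicode promotion.

def promote(value):
    """Decode byte strings as strict ASCII, as Python 2 does when mixing types."""
    if isinstance(value, bytes):
        return value.decode("ascii")  # raises UnicodeDecodeError for bytes >= 128
    return value

def join_mixed(parts):
    """Join unicode and byte strings the way u''.join() would in Python 2."""
    return u"".join(promote(p) for p in parts)

ustr = u"/home/\u00c9preuve"                       # path held as unicode
bstr_ascii = b"/test.py"                           # pure-ASCII bytes: harmless
bstr_high = u"/\u00c9t\u00e9.py".encode("latin-1")  # bytes in a console encoding

# Mixing with ASCII-only bytes works.
assert join_mixed([ustr, bstr_ascii]) == u"/home/\u00c9preuve/test.py"

# Mixing with a byte of 128 or over fails.
try:
    join_mixed([ustr, bstr_high])
    failed = False
except UnicodeDecodeError:
    failed = True
assert failed
```

So the unicode path is not itself the problem; the error only appears once it meets byte data that is not pure ASCII.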
>     >
>     > I can beat this into submission with enough customisation of the stdlib
>     > modules, but that always makes me uncomfortable. I usually see that as a
>     > hint that user code might also need to change. This may be unfounded. I
>     > can probably ensure no impact to users of only ascii paths, and the
>     > others seem unable to run Jython at all (in the scope of this issue).
>     > However, I'm seriously wondering if I should pursue the approach where
>     > file names from Java are re-encoded to bytes (maybe as utf-8
>     > everywhere), but that's grim.
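
For comparison, the re-encoding alternative would amount to something like the sketch below. It assumes a single fixed encoding, and FS_ENCODING, to_fs_bytes and from_fs_bytes are hypothetical names; it is not the code in the branch:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch of "file names as utf-8 bytes everywhere".

FS_ENCODING = "utf-8"  # assumption: one fixed encoding on every platform

def to_fs_bytes(name):
    """Encode a unicode file name (e.g. from java.io.File) to bytes."""
    if isinstance(name, bytes):
        return name
    return name.encode(FS_ENCODING)

def from_fs_bytes(data):
    """Decode a byte file name back to unicode before handing it to Java."""
    if isinstance(data, bytes):
        return data.decode(FS_ENCODING)
    return data

for name in (u"\u00c9preuve", u"\u4e2d\u6587"):
    # Every name round-trips, Chinese included ...
    assert from_fs_bytes(to_fs_bytes(name)) == name

# ... but a single "É" is now two bytes, so byte-level code sees C3 89.
assert to_fs_bytes(u"\u00c9") == b"\xc3\x89"
```

Every boundary with Java would need one of these two calls, which is where the "pain (and overlooked)" in point 1 comes from.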
>     >
>     > Thoughts?
>     >
>