Re: [Jython-dev] Unicode user and file names (and v2.7.1)

Hey Jeff,

Sounds good. Let's do another RC, but to be honest I'm not even sure the RC
matters much if nobody except us is trying it.

Thoughts?
Darjus

On Fri, May 19, 2017, 1:19 AM Jeff Allen <ja...@fa...> wrote:

> Hi Darjus.
>
> On inclusion, I'm happy to go with the community view, as always. On one
> of the related tickets (http://bugs.jython.org/issue1839), Jim said we'd
> get it in if timing allowed and there was some user support.
>
> I'm very keen to see a 2.7.1 too. The last (soft) RC was unsuccessful, and
> we're still making changes, so I assume we're talking about another RC
> first rather than a release?
>
> The UTF-8 work is nearly there, but not quite: one Linux defect to fix, as
> noted on the same issue by James against the "latin-1" version. After all
> the additions in the last couple of weeks (to get full BMP support), I'm
> happy to find from my Linux laptop that it is still the only thing I have
> to do. It looks trivial. I've been unable to code at all for a few days, so
> haven't looked into a solution, but now I'm back I expect to nail it for us
> today or tomorrow.
>
> I can, of course, merge all this myself and will. I shared your hesitancy
> initially, hence the fork repository, but it's turned out so well I feel
> it's now low risk, as long as we still have a few days.
>
> I will now dive under the desk and wire up my Linux dev box.
>
> Jeff Allen
>
> On 16/05/2017 21:46, Darjus Loktevic wrote:
>
> Hey Jeff,
>
> It seems your last commit to this branch was three days ago. Is this
> ready for review? BTW, your changes look good to me.
> I'm a little hesitant to merge this since we've had an RC and REALLY have
> to release 2.7.1. It's miles better than 2.7.0.
>
> Cheers,
> Darjus
>
> On Mon, May 1, 2017 at 6:34 AM Jeff Allen <ja...@fa...> wrote:
>
>> I went for sys.getfilesystemencoding() == 'utf-8' and it works pretty
>> well. Rather than just push directly I have published to here:
>>
>> https://bitbucket.org/tournesol/jython-utf8
>>
>> I write to ask for a second or third pair of eyes on it. Please tell me
>> you can see it and whether it breaks things you care about.
>>
>> I touched a lot of files in the core and import system: quite a lot of
>> tricky stuff with loaders and search paths has been adjusted. I think it
>> a good sign that I changed hardly anything in the standard library we
>> inherit from CPython, that we hadn't already specialised.
>>
>> By "works pretty well" above, I mean that the regression tests run
>> cleanly for me when my user name is "Épreuve", where previously Jython
>> died horribly. The launcher works from a Chinese user name too, as long
>> as I localise Windows to China (CPython 2.7 feature). I can use the
>> prompt and run some tests with that setup, but I can't run the
>> regression test yet, and printing a stack dump is fatal, so there's a
>> bit more to do for Chinese.
>>
>> I think this means we have solid support for "latin-1" languages, but
>> there are still places where we fatally assume bytes are Unicode code
>> points.
>>
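A minimal sketch of why "latin-1" names can work while others fail when bytes are treated as code points. It assumes the problem amounts to an implicit latin-1 style decoding; `bytes_as_code_points` and the two sample names are hypothetical and are not taken from the Jython source.

```python
def bytes_as_code_points(data):
    """Treat each byte as the code point of the same value (a latin-1 decode)."""
    return data.decode('latin-1')

latin_name = u'\xc9preuve'        # "Epreuve" with E-acute, as in the test above
chinese_name = u'\u6d4b\u8bd5'    # illustrative two-character name (assumption)

# Every character of a latin-1 name fits in one byte, so nothing is lost.
latin_round_trip = bytes_as_code_points(latin_name.encode('latin-1'))

# Characters outside latin-1 need several bytes each (here UTF-8); reading
# those bytes one per character produces a different, longer string.
garbled = bytes_as_code_points(chinese_name.encode('utf-8'))
```

With these inputs the latin-1 name survives unchanged, while the two-character name comes back as six unrelated characters.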
>> Jeff Allen
>>
>> On 05/04/2017 08:57, Jeff Allen wrote:
>> > I've been working on http://bugs.jython.org/issue2356 which I'd like to
>> > get in 2.7.1 -- it seems rather poor that Jython simply does not run for
>> > users whose names have an un-American character ;). I know this issue is
>> > not a blocker in most minds.
>> >
>> > I've made pretty good progress by allowing file names to be unicode
>> > objects more often than they would be in CPython 2, which usually
>> > returns them as bytes in some encoding that we may not know. I've got
>> > the launcher to work properly, and straightened the logic in our
>> > printing of trace-backs and exceptions from Java. Unicode file names
>> > seem the way to go for Jython because:
>> >
>> >   1. Java gives us competently decoded unicode file names, from
>> >      java.io.File, etc. Re-encoding the result will be a pain (and
>> >      overlooked).
>> >   2. We appear not to have the codec we need ('mbcs'), that CPython
>> >      reports on Windows via sys.getfilesystemencoding().
>> >   3. We do this already. In 2.7.0, os.getcwd() returns unicode if
>> >      necessary.
>> >
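Point 2 in the list above can be checked on any interpreter with a small probe. This is only a sketch: `has_codec` is a hypothetical helper, and which codecs are present depends on the platform and implementation.

```python
import codecs
import sys

def has_codec(name):
    """Return True if this interpreter can look up a codec by this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

# The encoding the interpreter reports for file names (may be None on some
# Python 2 platforms), and whether its codec can actually be found.
fs_encoding = sys.getfilesystemencoding()
fs_codec_available = fs_encoding is not None and has_codec(fs_encoding)
```

Calling `has_codec('mbcs')` on the interpreter in question shows whether re-encoding file names to the reported file system encoding is even possible there.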
>> > Most regression tests pass. However, I'm struggling with test_doctest.
>> > Problems arise when mixing unicode and bytes when any byte is 128 or
>> > over. This happens in ''.join(list) and formatted output like "%s %s" %
>> > (ustr, bstr). The behaviour of these is identical with CPython: they
>> > raise UnicodeDecodeError because the bytes are promoted to characters
>> > with a strict ascii interpretation. This happens a lot in doctest.py and
>> > traceback.py, for example, where file paths, and stack dumps that
>> > include them, are now frequently unicode, while other inputs are byte
>> > data containing file paths presented in the console encoding.
>> >
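A minimal sketch of the failure described above. In Python 2 the promotion happens implicitly inside ''.join() and "%s" formatting; here it is written as an explicit strict-ASCII decode so that the snippet behaves the same on Python 2 and 3. `promote` and the sample paths are hypothetical.

```python
def promote(bstr):
    """Explicit form of Python 2's implicit bytes-to-unicode promotion."""
    return bstr.decode('ascii')  # strict: any byte >= 128 is an error

ascii_path = b'C:\\Users\\test'
latin_path = b'C:\\Users\\\xc9preuve'  # byte 0xC9 is outside ASCII

promoted = promote(ascii_path)  # pure-ASCII bytes promote cleanly

try:
    promote(latin_path)
    failed = False
except UnicodeDecodeError:
    failed = True  # the error seen in doctest.py and traceback.py
```

Decoding the byte string explicitly with its real encoding before mixing it with unicode avoids the error, which is the kind of stdlib customisation discussed in the next paragraph.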
>> > I can beat this into submission with enough customisation of the stdlib
>> > modules, but that always makes me uncomfortable. I usually see that as a
>> > hint that user code might also need to change. This may be unfounded. I
>> > can probably ensure no impact to users of only ascii paths, and the
>> > others seem unable to run Jython at all (in the scope of this issue).
>> > However, I'm seriously wondering if I should pursue the approach where
>> > file names from Java are re-encoded to bytes (maybe as utf-8
>> > everywhere), but that's grim.
>> >
>> > Thoughts?
>> >