Re: [Jython-users] Unicode

Finn Bock wrote:
> 
>...
> 
> Is a no-op a normal use? Maybe when porting CPython applications?

It wasn't a porting project when Brian ran into it. But maybe he was
thinking in CPython terms as he was coding.

Is it accurate to say that either way, the single-arg unicode() call is
useless in a made-for-Jython application? If so, the question is how to
emulate CPython most directly. And then it is really a probabilistic
issue, because you can never get 100% compatibility if CPython has two
string types where Jython has only one.

If you turn it into a no-op, you will have the effect in some cases of
making Latin1 the default encoding (which is what I have proposed for
CPython). Consider:

If this is a no-op:

>>> x=u'\x81'
>>> x=unicode(x)

Then so is this:

>>> x='\x81'
>>> x=unicode(x)

I have no problem with that, myself, but it is precisely the proposal
that caused the heated i18n flamewars.
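
To make the Latin1 point concrete, here is a small sketch in Python 3
terms. It assumes that Python 3's bytes can stand in for CPython's 8-bit
string type; it is not Jython code. It only shows that "leave every
value as it is" and "decode as Latin1" give the same result:

```python
# Python 3 analogy (an assumption for illustration): bytes plays the role
# of CPython's 8-bit string, str plays the role of the Unicode string.
raw = b'\x81'

# Decoding as Latin-1 maps each byte value to the code point with the
# same number.
as_latin1 = raw.decode('latin-1')

# A "no-op" conversion keeps every value unchanged, one code point per byte.
as_noop = ''.join(chr(byte) for byte in raw)

# ASCII, by contrast, rejects any byte above 0x7F.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(as_latin1 == as_noop, ascii_ok)
```

So a no-op on an 8-bit string behaves like an implicit Latin1 default,
while an ASCII default would refuse the same input.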

>...
> 
> In jython the IMO obvious default would be the file.encoding property.
> Maybe I should have picked that default when I added unicode support,
> but after seeing the casualties of that discussion on python-dev, I
> didn't dare.

Still, that wouldn't be a cure-all. In CPython, this would always be a
no-op:

unicode(unicode(unicode(unicode(u"..."))))

In Jython, each nested call would decode according to file.encoding
again, potentially changing the string every time.
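
A rough model of that effect, again in Python 3 terms. The helper
decode_once is hypothetical, and cp437 is just an arbitrary stand-in for
whatever file.encoding happens to be; real Jython behaviour may differ
in detail. The model treats each character as a byte value and decodes
it again:

```python
def decode_once(text, encoding):
    """Hypothetical model of a one-type interpreter decoding a string.

    Each character is taken as a byte value (this fails for characters
    above U+00FF), and the resulting bytes are decoded with `encoding`.
    """
    raw = text.encode('latin-1')
    return raw.decode(encoding)

original = '\x81'
once = decode_once(original, 'cp437')   # stand-in for file.encoding
twice = decode_once(once, 'cp437')

# Each pass yields a different string.
print(repr(original), repr(once), repr(twice))
```

In this model the nested call is not idempotent: the second pass
reinterprets the output of the first.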

Perhaps the one-arg version of the unicode function should simply be
illegal (deprecated) when applied to strings. You would then have to
name the encoding explicitly. If you want to decode according to
file.encoding, you could. If you want to decode according to
sys.getdefaultencoding, you could. If you want to hard-code ASCII,
you could. It is probably an illusion to think that people can work with
Unicode without thinking about encodings anyhow.
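
As a sketch of what "always name the encoding" could look like, in
Python 3 terms: decode_explicit is a hypothetical helper, and
locale.getpreferredencoding is used only as a rough analogue of Java's
file.encoding property, which is an assumption on my part.

```python
import locale
import sys

def decode_explicit(data, encoding):
    """Hypothetical helper: decoding always requires a named encoding."""
    return data.decode(encoding)

data = b'caf\xe9'  # "cafe" with an e-acute, encoded as Latin-1

# The three choices mentioned above, each spelled out by the caller.
candidates = {
    'platform': locale.getpreferredencoding(False),  # file.encoding analogue
    'interpreter': sys.getdefaultencoding(),
    'ascii': 'ascii',
}

results = {}
for label, encoding in candidates.items():
    try:
        results[label] = decode_explicit(data, encoding)
    except (UnicodeDecodeError, LookupError):
        # The chosen encoding cannot represent this data (or is unknown).
        results[label] = None

print(results)
```

Which of these succeeds depends on the machine, which is rather the
point: only the caller knows what the bytes really are.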

That's another reason I don't think that file.encoding and
sys.getdefaultencoding() are particularly useful. Ten years ago it made
sense to guess at the file encoding based on the user's machine locale
and OS. Today, I don't think it does. When push comes to shove, the
end-user must specify the encoding of the data rather than expecting
operating systems or interpreters to guess based on the locale.

 Paul Prescod