Re: [Jython-dev] Bytes, characters and codecs

On 21/12/2012 13:12, Alan Kennedy wrote:
> [Jeff]
> > Would it be fair to say that the character codes in a Jython
> > PyString.string member should always be in the range 0..255 inclusive?
>
> If the string contains an encoded string, i.e. a string that has been 
> encoded into a series of bytes for storage or some other form of IO, 
> then yes, the values will all be in the range 0..255.
>
> You may find this email that I wrote back in the WSGI days to be useful.
>
> http://mail.python.org/pipermail/web-sig/2004-September/000858.html
>
> [Jeff]
> > Apart from having to forego Java's lovely String methods, we wish we'd
> > used an array of bytes implementation for PyString: right?
>
> Right.
>
> Jython's use of a java.lang.String to contain bytes is a hangover from 
> emulating cpython 1.x and 2.x, where strings have a dual nature and 
> can contain characters or bytes.
>
> Since this was a great source of confusion for users, cpython 3.x did 
> away with the dual nature and changed to have separate string and 
> bytes types, which can only be transformed into the other with an 
> encode or decode operation.
>
> http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
>
> So when jython moves to 3.x, we'll have to do the same.
>
I found that post but the WSGI part was over my head and I wasn't sure 
quite what version of Jython we might be discussing. The situation is 
much clearer in Python 3.x, as in your link, and if I want to understand 
codecs I've learned I must read the 3.x docs then shift into the demotic.

I believe I've tracked my current test failure down to the rest of the 
string being discarded after a decoding error is caught, but I'm not 
sure where that happens, or rather where proper track of the position 
isn't being kept. A second factor is that my default encoding happens to 
be cp1252, which is quite happy with "\xc3\xa9": you would think a test 
suite would pin down any configuration that can vary between machines.
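
To make the cp1252 point concrete, here is a minimal sketch in CPython 
3.x terms, where the bytes/text split is explicit. It is only an 
illustration of the behaviour I would expect, not the Jython test 
itself; the sample byte strings and variable names are my own.

```python
# The two bytes C3 A9 are the UTF-8 encoding of U+00E9 (e-acute).
data = b"\xc3\xa9"

as_utf8 = data.decode("utf-8")     # one character: U+00E9
as_cp1252 = data.decode("cp1252")  # two characters, and no error raised

# A strict ASCII decode rejects the same bytes outright.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# With an error handler, decoding should carry on past the bad byte
# instead of dropping the remainder of the input.
recovered = b"ab\xffcd".decode("ascii", "replace")
```

So a test that relies on "\xc3\xa9" being rejected only fails as 
intended when the codec is named explicitly rather than taken from the 
platform default.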

The books are clear that a Jython 2.5 str is 8-bit data "as in CPython" 
(2.x, that is), but I was wondering in what other corners Unicode use of 
PyString might still lurk. Does it follow from that and what you said 
that any such use ought to be carefully swapped for PyUnicode? Have I 
guessed correctly that PyUnicode uses UTF-16 internally (with surrogate 
pairs) while pretending externally to be full-width?
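
By way of illustrating what I mean by that guess, here is a sketch in 
CPython 3.x of one code point outside the BMP and the two UTF-16 code 
units that would represent it. It says nothing about what PyUnicode 
actually does; the helper name to_surrogates is hypothetical and only 
shows the standard surrogate arithmetic.

```python
import struct

def to_surrogates(code_point):
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

ch = "\U0001F600"  # a single code point beyond U+FFFF

# Seen from outside, this is one character ...
external_length = len(ch)

# ... but its UTF-16 form occupies two 16-bit code units.
units = struct.unpack(">2H", ch.encode("utf-16-be"))
```

Here units and to_surrogates(0x1F600) agree, which is the sense in 
which a UTF-16 store would have to count pairs to report a length of 1.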

Jeff