Re: [Jython-dev] Bytes, characters and codecs

On 21/12/2012 20:15, Jeff Allen wrote:
> On 21/12/2012 13:12, Alan Kennedy wrote:
>> [Jeff]
>>> Would it be fair to say that the character codes in a Jython
>>> PyString.string member should always be in the range 0..255 inclusive?
>> If the string contains an encoded string, i.e. a string that has been
>> encoded into a series of bytes for storage or some other form of IO,
>> then yes, the values will all be in the range 0..255.
>>
>> You may find this email that I wrote back in the WSGI days to be useful.
>>
>> http://mail.python.org/pipermail/web-sig/2004-September/000858.html
>>
>> [Jeff]
>>> Apart from having to forgo Java's lovely String methods, we wish we'd
>>> used an array of bytes implementation for PyString: right?
>> Right.
>>
>> Jython's use of a java.lang.String to contain bytes is a hangover from
>> emulating CPython 1.x and 2.x, where strings have a dual nature and
>> can contain characters or bytes.
>>
>> Since this was a great source of confusion for users, CPython 3.x did
>> away with the dual nature and changed to have separate string and
>> bytes types, which can only be transformed into the other with an
>> encode or decode operation.
>>
>> http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
>>
>> So when Jython moves to 3.x, we'll have to do the same.
>>
> I found that post but the WSGI part was over my head and I wasn't sure
> quite what version of Jython we might be discussing. The situation is
> much clearer in Python 3.x, as in your link, and if I want to understand
> codecs I've learned I must read the 3.x docs then shift into the demotic.
>
> I believe I've tracked my current test failure down to discarding the
> rest of the string being converted after a decoding error is caught, but
> I'm not sure where that happens, or rather where keeping proper track
> isn't happening. That and the fact that my default encoding happens to
> be cp1252, hence quite happy with "\xc3\xa9": you would think a test
> suite would do something to define any variable configuration.
>
> The books are clear that Jython 2.5 str is 8-bit data "as in CPython",
> 2.x that is, but I was wondering in what other corners Unicode use of
> PyString might still lurk. Does it follow from that and what you said
> that any such use ought to be carefully swapped for PyUnicode? Have I
> guessed correctly that PyUnicode uses UTF-16 internally (with surrogate
> pairs) while pretending externally to be full-width?
>
> Jeff
Aaaaargh! It's got nothing to do with codecs. It's testing that a badly 
specified call to TextIOWrapper.__init__ leaves the object unusable. In 
the C implementation it does; in the pure Python stand-in it doesn't. 
I'll add a note ...
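
For anyone puzzled by the cp1252 remark quoted above, here is a minimal
sketch in Python 3 syntax (not taken from the failing test, and not
Jython-specific) of why the two bytes "\xc3\xa9" pass quietly under
cp1252: they decode without error under both UTF-8 and cp1252, just to
different text, whereas ASCII rejects them. The variable names are mine,
purely for illustration.

```python
# Illustrative only: the same two bytes decoded under three codecs.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one character: U+00E9 (e acute)
as_cp1252 = raw.decode("cp1252")   # two characters: U+00C3, U+00A9

try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    # Both bytes are above 0x7F, so the ASCII codec refuses them.
    ascii_failed = True

print(len(as_utf8), len(as_cp1252), ascii_failed)
```

A test that expects a decoding error for this input would therefore see
one only when the codec in effect rejects these bytes, ASCII being one
example.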

Jeff