From: Jeff A. <ja...@fa...> - 2012-12-21 20:16:24
On 21/12/2012 13:12, Alan Kennedy wrote:
> [Jeff]
> > Would it be fair to say that the character codes in a Jython
> > PyString.string member should always be in the range 0..255 inclusive?
>
> If the string contains an encoded string, i.e. a string that has been
> encoded into a series of bytes for storage or some other form of IO,
> then yes, the values will all be in the range 0..255.
>
> You may find this email that I wrote back in the WSGI days to be useful.
>
> http://mail.python.org/pipermail/web-sig/2004-September/000858.html
>
> [Jeff]
> > Apart from having to forego Java's lovely String methods, we wish we'd
> > used an array of bytes implementation for PyString: right?
>
> Right.
>
> Jython's use of a java.lang.String to contain bytes is a hangover from
> emulating cpython 1.x and 2.x, where strings have a dual nature and
> can contain characters or bytes.
>
> Since this was a great source of confusion for users, cpython 3.x did
> away with the dual nature and changed to have separate string and
> bytes types, which can only be transformed into the other with an
> encode or decode operation.
>
> http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
>
> So when jython moves to 3.x, we'll have to do the same.
>

I found that post but the WSGI part was over my head and I wasn't sure
quite what version of Jython we might be discussing. The situation is
much clearer in Python 3.x, as in your link, and if I want to understand
codecs I've learned I must read the 3.x docs then shift into the demotic.

I believe I've tracked my current test failure down to discarding the
rest of the string being converted after a decoding error is caught, but
I'm not sure where that happens, or rather where keeping proper track
isn't happening. That and the fact that my default encoding happens to be
cp1252, hence quite happy with "\xc3\xa9": you would think a test suite
would do something to define any variable configuration.

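
To make the cp1252 point concrete, here is a minimal sketch. It is written
for CPython 3 rather than Jython 2.5 (an assumption made purely so the
bytes/text split is explicit), and it says nothing about where the Jython
test actually goes wrong. It shows that the two bytes "\xc3\xa9" decode
without complaint under cp1252, just to the wrong text, and that a decoder
using the "replace" error handler is expected to carry on past a bad byte
rather than drop the rest of the input.

```python
# CPython 3 sketch; in Python 2.x / Jython 2.5 these bytes would sit in a str.
data = b"\xc3\xa9"  # the UTF-8 encoding of U+00E9 (e with acute accent)

as_utf8 = data.decode("utf-8")     # one character, U+00E9
as_cp1252 = data.decode("cp1252")  # two characters, U+00C3 U+00A9, no error

# A strict ASCII decode is the case that actually raises.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# With errors="replace" the bad byte becomes U+FFFD and decoding continues,
# so the text after the error is kept.
recovered = b"ab\xffcd".decode("ascii", "replace")

print(repr(as_utf8), repr(as_cp1252), ascii_failed, repr(recovered))
```

A test that relies on "\xc3\xa9" failing to decode would therefore behave
differently depending on the platform default, which is why naming the
codec explicitly in the test is the safer choice.
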
The books are clear that Jython 2.5 str is 8-bit data "as in CPython",
2.x that is, but I was wondering in what other corners Unicode use of
PyString might still lurk. Does it follow from that and what you said
that any such use ought to be carefully swapped for PyUnicode? Have I
guessed correctly that PyUnicode uses UTF-16 internally (with surrogate
pairs) while pretending externally to be full-width?

Jeff
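
For reference, a minimal sketch of what "UTF-16 internally, full-width
externally" would mean for a single character. It uses CPython 3 (3.3 or
later is assumed, so that len() counts code points) and does not confirm
anything about PyUnicode's actual implementation. A code point above
U+FFFF is one character to Python, but needs two 16-bit code units, a
surrogate pair, when stored as UTF-16, which is how a java.lang.String
holds it.

```python
ch = "\U00010400"  # a code point outside the Basic Multilingual Plane

units = ch.encode("utf-16-be")            # big-endian UTF-16, no BOM
high = int.from_bytes(units[0:2], "big")  # first 16-bit code unit
low = int.from_bytes(units[2:4], "big")   # second 16-bit code unit

print(len(ch))              # code points seen by Python
print(len(units) // 2)      # UTF-16 code units needed to store it
print(hex(high), hex(low))  # the surrogate pair
```

An implementation backed by UTF-16 has to translate between these two
counts whenever it reports a length or an index.
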