From: Jeff A. <ja...@fa...> - 2012-12-22 00:06:58
On 21/12/2012 20:15, Jeff Allen wrote:
> On 21/12/2012 13:12, Alan Kennedy wrote:
>> [Jeff]
>>> Would it be fair to say that the character codes in a Jython
>>> PyString.string member should always be in the range 0..255 inclusive?
>> If the string contains an encoded string, i.e. a string that has been
>> encoded into a series of bytes for storage or some other form of IO,
>> then yes, the values will all be in the range 0..255.
>>
>> You may find this email that I wrote back in the WSGI days to be useful.
>>
>> http://mail.python.org/pipermail/web-sig/2004-September/000858.html
>>
>> [Jeff]
>>> Apart from having to forego Java's lovely String methods, we wish we'd
>>> used an array of bytes implementation for PyString: right?
>> Right.
>>
>> Jython's use of a java.lang.String to contain bytes is a hangover from
>> emulating cpython 1.x and 2.x, where strings have a dual nature and
>> can contain characters or bytes.
>>
>> Since this was a great source of confusion for users, cpython 3.x did
>> away with the dual nature and changed to have separate string and
>> bytes types, which can only be transformed into the other with an
>> encode or decode operation.
>>
>> http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
>>
>> So when jython moves to 3.x, we'll have to do the same.
>>
> I found that post but the WSGI part was over my head and I wasn't sure
> quite what version of Jython we might be discussing. The situation is
> much clearer in Python 3.x, as in your link, and if I want to understand
> codecs I've learned I must read the 3.x docs then shift into the demotic.
>
> I believe I've tracked my current test failure down to discarding the
> rest of the string being converted after a decoding error is caught, but
> I'm not sure where that happens, or rather where keeping proper track
> isn't happening. That and the fact that my default encoding happens to
> be cp1252, hence quite happy with "\xc3\xa9": you would think a test
> suite would do something to define any variable configuration.
>
> The books are clear that Jython 2.5 str is 8-bit data "as in CPython",
> 2.x that is, but I was wondering in what other corners Unicode use of
> PyString might still lurk. Does it follow from that and what you said
> that any such use ought to be carefully swapped for PyUnicode? Have I
> guessed correctly that PyUnicode uses UTF-16 internally (with surrogate
> pairs) while pretending externally to be full-width?
>
> Jeff

Aaaaargh! It's got nothing to do with codecs. It's testing that a badly
specified call to TextIOWrapper.__init__ leaves the object unusable. In
the C implementation it does; in the pure Python stand-in it doesn't.
I'll add a note ...

Jeff
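
For anyone puzzling over the cp1252 remark quoted above, here is a minimal Python 3 sketch (not code from this thread or from the Jython test suite) of why a cp1252 default encoding is "quite happy" with the bytes "\xc3\xa9". The variable names are illustrative only.

```python
# The two bytes that UTF-8 uses to encode U+00E9 (e with acute accent).
raw = b"\xc3\xa9"

# Decoded as UTF-8, the pair forms a single character.
as_utf8 = raw.decode("utf-8")

# Decoded as cp1252, each byte is a valid character on its own,
# so no error is raised and the result is two characters.
as_cp1252 = raw.decode("cp1252")

# A lone lead byte is an error for UTF-8 but still fine for cp1252.
try:
    b"\xc3".decode("utf-8")
    utf8_rejected_lone_byte = False
except UnicodeDecodeError:
    utf8_rejected_lone_byte = True

lone_cp1252 = b"\xc3".decode("cp1252")
```

A test that relies on a decoding error therefore behaves differently depending on the platform default, unless the encoding is named explicitly.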