From: Jeff A. <ja...@fa...> - 2012-12-21 09:59:51
|
Would it be fair to say that the character codes in a Jython PyString.string member should always be in the range 0..255 inclusive?

I'm working on failures and skips in the test_io, and the failure in CTextIOWrapperTest.test_initialization() looks like it is to do with codecs, both the choice of a default one on my machine and the behaviour of the "ascii" codec once I force it in.

Codecs confuse me, and more in Jython's innards than in Python code (Python 3 anyway), because it is less clear when some String/str is supposed to hold bytes and when real characters. I came across this in PyByteArray and breathed a huge sigh of relief once the tests passed, without ever being really sure I'd done it right.

Apart from having to forego Java's lovely String methods, we wish we'd used an array of bytes implementation for PyString: right?

Jeff |
From: Alan K. <jyt...@xh...> - 2012-12-21 13:12:51
|
[Jeff]
> Would it be fair to say that the character codes in a Jython
> PyString.string member should always be in the range 0..255 inclusive?

If the string contains an encoded string, i.e. a string that has been encoded into a series of bytes for storage or some other form of IO, then yes, the values will all be in the range 0..255.

You may find this email that I wrote back in the WSGI days to be useful.

http://mail.python.org/pipermail/web-sig/2004-September/000858.html

[Jeff]
> Apart from having to forego Java's lovely String methods, we wish we'd
> used an array of bytes implementation for PyString: right?

Right.

Jython's use of a java.lang.String to contain bytes is a hangover from emulating cpython 1.x and 2.x, where strings have a dual nature and can contain characters or bytes.

Since this was a great source of confusion for users, cpython 3.x did away with the dual nature and changed to have separate string and bytes types, which can only be transformed into the other with an encode or decode operation.

http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

So when jython moves to 3.x, we'll have to do the same.

Alan. |
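As a rough illustration of the two points in the message above, here is a minimal sketch in Python 3 (an assumption: the thread concerns Jython 2.x, and this is not Jython's own code). It shows text and bytes as separate types linked only by encode and decode, and how a run of bytes can sit in a character string with every code in 0..255 when ISO-8859-1 (latin-1) is used as the mapping. The variable names are made up for the example.

```python
# Python 3 sketch: text (str) and data (bytes) are distinct types.
text = "caf\u00e9"                   # four characters, the last is e-acute
data = text.encode("utf-8")          # five bytes: e-acute takes two in UTF-8

assert isinstance(data, bytes)
assert len(text) == 4 and len(data) == 5
assert data.decode("utf-8") == text  # decode reverses encode

# A "byte bucket" held in a character string: latin-1 maps each byte
# value 0..255 to the code point with the same number.
bucket = data.decode("latin-1")
assert all(ord(ch) <= 255 for ch in bucket)
assert bucket.encode("latin-1") == data  # lossless round trip
```

The latin-1 step is only an analogy for a string used as a byte container; how PyString actually fills its java.lang.String is not shown here.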
From: Jeff A. <ja...@fa...> - 2012-12-21 20:16:24
|
On 21/12/2012 13:12, Alan Kennedy wrote:
> [Jeff]
> > Would it be fair to say that the character codes in a Jython
> > PyString.string member should always be in the range 0..255 inclusive?
>
> If the string contains an encoded string, i.e. a string that has been
> encoded into a series of bytes for storage or some other form of IO,
> then yes, the values will all be in the range 0..255.
>
> You may find this email that I wrote back in the WSGI days to be useful.
>
> http://mail.python.org/pipermail/web-sig/2004-September/000858.html
>
> [Jeff]
> > Apart from having to forego Java's lovely String methods, we wish we'd
> > used an array of bytes implementation for PyString: right?
>
> Right.
>
> Jython's use of a java.lang.String to contain bytes is a hangover from
> emulating cpython 1.x and 2.x, where strings have a dual nature and
> can contain characters or bytes.
>
> Since this was a great source of confusion for users, cpython 3.x did
> away with the dual nature and changed to have separate string and
> bytes types, which can only be transformed into the other with an
> encode or decode operation.
>
> http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
>
> So when jython moves to 3.x, we'll have to do the same.
>

I found that post but the WSGI part was over my head and I wasn't sure quite what version of Jython we might be discussing. The situation is much clearer in Python 3.x, as in your link, and if I want to understand codecs I've learned I must read the 3.x docs then shift into the demotic.

I believe I've tracked my current test failure down to discarding the rest of the string being converted after a decoding error is caught, but I'm not sure where that happens, or rather where keeping proper track isn't happening. That and the fact that my default encoding happens to be cp1252, hence quite happy with "\xc3\xa9": you would think a test suite would do something to define any variable configuration.

The books are clear that Jython 2.5 str is 8-bit data "as in CPython", 2.x that is, but I was wondering in what other corners Unicode use of PyString might still lurk. Does it follow from that and what you said that any such use ought to be carefully swapped for PyUnicode? Have I guessed correctly that PyUnicode uses UTF-16 internally (with surrogate pairs) while pretending externally to be full-width?

Jeff |
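To make the cp1252 remark and the surrogate-pair question concrete, here is a small sketch in Python 3 (again an assumption, since the message is about Jython 2.x internals). It decodes the two bytes "\xc3\xa9" three ways and counts the UTF-16 code units needed for one character outside the Basic Multilingual Plane. The names `raw`, `astral` and `utf16_units` are made up for the example, and nothing here claims to describe how PyUnicode is implemented.

```python
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one character: e-acute
as_cp1252 = raw.decode("cp1252")   # two characters, and no error is raised

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # Both bytes are above 0x7F, so the ascii codec rejects them.
    ascii_ok = False

# A code point above U+FFFF needs two 16-bit units (a surrogate pair)
# in UTF-16, although it is a single character to the programmer.
astral = "\U00010400"
utf16_units = len(astral.encode("utf-16-be")) // 2

assert len(as_utf8) == 1 and len(as_cp1252) == 2
assert ascii_ok is False
assert len(astral) == 1 and utf16_units == 2
```

This is why a machine whose default encoding is cp1252 accepts the test bytes silently, while a forced "ascii" codec does not.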
From: Jeff A. <ja...@fa...> - 2012-12-22 00:06:58
|
On 21/12/2012 20:15, Jeff Allen wrote:
> On 21/12/2012 13:12, Alan Kennedy wrote:
>> [Jeff]
>>> Would it be fair to say that the character codes in a Jython
>>> PyString.string member should always be in the range 0..255 inclusive?
>> If the string contains an encoded string, i.e. a string that has been
>> encoded into a series of bytes for storage or some other form of IO,
>> then yes, the values will all be in the range 0..255.
>>
>> You may find this email that I wrote back in the WSGI days to be useful.
>>
>> http://mail.python.org/pipermail/web-sig/2004-September/000858.html
>>
>> [Jeff]
>>> Apart from having to forego Java's lovely String methods, we wish we'd
>>> used an array of bytes implementation for PyString: right?
>> Right.
>>
>> Jython's use of a java.lang.String to contain bytes is a hangover from
>> emulating cpython 1.x and 2.x, where strings have a dual nature and
>> can contain characters or bytes.
>>
>> Since this was a great source of confusion for users, cpython 3.x did
>> away with the dual nature and changed to have separate string and
>> bytes types, which can only be transformed into the other with an
>> encode or decode operation.
>>
>> http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
>>
>> So when jython moves to 3.x, we'll have to do the same.
>>
> I found that post but the WSGI part was over my head and I wasn't sure
> quite what version of Jython we might be discussing. The situation is
> much clearer in Python 3.x, as in your link, and if I want to understand
> codecs I've learned I must read the 3.x docs then shift into the demotic.
>
> I believe I've tracked my current test failure down to discarding the
> rest of the string being converted after a decoding error is caught, but
> I'm not sure where that happens, or rather where keeping proper track
> isn't happening. That and the fact that my default encoding happens to
> be cp1252, hence quite happy with "\xc3\xa9": you would think a test
> suite would do something to define any variable configuration.
>
> The books are clear that Jython 2.5 str is 8-bit data "as in CPython",
> 2.x that is, but I was wondering in what other corners Unicode use of
> PyString might still lurk. Does it follow from that and what you said
> that any such use ought to be carefully swapped for PyUnicode? Have I
> guessed correctly that PyUnicode uses UTF-16 internally (with surrogate
> pairs) while pretending externally to be full-width?
>
> Jeff

Aaaaargh! It's got nothing to do with codecs. It's testing that a badly specified call to TextIOWrapper.__init__ leaves the object unusable. In the C implementation it does; in the pure Python stand-in it doesn't. I'll add a note ...

Jeff |
From: Philip J. <pj...@un...> - 2012-12-22 23:05:16
|
On Dec 21, 2012, at 1:57 AM, Jeff Allen wrote:
> Would it be fair to say that the character codes in a Jython
> PyString.string member should always be in the range 0..255 inclusive?
>
> I'm working on failures and skips in the test_io, and the failure in
> CTextIOWrapperTest.test_initialization() looks like it is to do with
> codecs, both the choice of a default one on my machine and the behaviour
> of the "ascii" codec once I force it in.
>
> Codecs confuse me, and more in Jython's innards than in Python code
> (Python 3 anyway), because it is less clear when some String/str is
> supposed to hold bytes and when real characters. I came across this in
> PyByteArray and breathed a huge sigh of relief once the tests passed,
> without ever being really sure I'd done it right.
>
> Apart from having to forego Java's lovely String methods, we wish we'd
> used an array of bytes implementation for PyString: right?

Yup. Though it wasn't always like this in case you were wondering.

Before 2.5 Jython PyString and PyUnicode were basically the same: both backed by java.lang.String and both supporting unicode. That's because Jython's PyString had always provided the unicode support of Java to Python code, dating back to before CPython had even added the unicode type =]

Then CPython added it, so Jython added PyUnicode as simply a subclass (internally) of PyString. Jython 2.5 finally made PyString act like a byte bucket for better CPython compatibility, but we never actually switched its backing from java.lang.String. If you're seeing something different it's a bug.

Indeed for 2.7, ideally it would become an array of bytes, mostly to save memory.

--
Philip Jenvey |