From: Jeff A. <ja...@fa...> - 2013-10-26 18:07:04
In adding buffer API arguments to methods in PyString, it is also necessary to preserve what happens when a PyUnicode is supplied. As I study this, I conclude that our support for the basic multilingual plane (BMP) is good, whether in PyUnicode directly or in the PyString mixed case. In the mixed case, e.g. "**hello**".strip(u'*'), the target string mostly delegates to a unicode copy of itself, effectively "**hello**".decode('ascii').strip(u'*'), and I continue to use this idiom.

Beyond the BMP, the behaviour of PyUnicode is often incorrect, where it uses an implementation shared with PyString. I can see separate arrangements sometimes made in PyUnicode, but for the most part we just treat the UTF-16 code units as characters. Is that the accepted state?

As I work on the buffer interface, I make various incidental improvements, but it would be wrong to tackle this one as an 'incidental'. Should I, however, attempt to trap non-BMP strings (e.g. assert(s.isBasicPlane()))? Or just continue to quietly get it wrong, as in the following current behaviour:

>>> s = u"\U00010000a"
>>> len(s)
2    # good
>>> s[1]
u'a'    # good
>>> s.index('a')
2    # oops

Jeff
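To make the mismatch in the session above concrete, here is a minimal sketch that models a string as an explicit list of UTF-16 code units and translates a code-unit offset back to a code-point offset. It is not Jython's implementation: the names utf16_units, is_basic_plane and code_point_index are hypothetical, chosen only for illustration, and the code assumes well-formed UTF-16 (no unpaired surrogates).

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_basic_plane(units):
    """True if no code unit is a surrogate, i.e. every character is in the BMP."""
    return not any(0xD800 <= u <= 0xDFFF for u in units)


def code_point_index(units, unit_index):
    """Translate a code-unit offset into a code-point offset.

    Each low (trailing) surrogate before the offset is the second half of a
    pair that counts as one character, so subtract one for each.
    """
    low_surrogates = sum(1 for u in units[:unit_index] if 0xDC00 <= u <= 0xDFFF)
    return unit_index - low_surrogates


units = utf16_units(u"\U00010000a")       # one supplementary character, then 'a'
unit_index = units.index(ord("a"))        # naive search over code units
print(unit_index)                         # offset in code units
print(code_point_index(units, unit_index))  # offset in characters
print(is_basic_plane(units))
```

The naive search reports the offset in code units, which is what index() returns in the session above; the translated offset agrees with len() and indexing, which already count code points. A check like is_basic_plane would let a method choose the cheap shared path only when the string has no surrogate pairs.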
From: Jeff A. <ja...@fa...> - 2013-10-27 09:27:09
OK, so it's not an accepted situation. Thanks for the swift reply.

I have logged a bug so that we can come back to it after the buffer work is done, rather than tangle with it now. I'll continue with the buffer API changes within the BMP limitations noted. We could pull in more of the relevant 2.7 tests (or even bits of Py3k), avoiding those things that are just artefacts of the implementation differences.

Jeff

Jeff Allen

On 27/10/2013 03:00, Jim Baker wrote:
> ...
>
> On Sat, Oct 26, 2013 at 12:06 PM, Jeff Allen <ja...@fa...> wrote:
>
>     ...
>
>     As I work on the buffer interface, I make various incidental
>     improvements, but it would be wrong to tackle this one as an
>     'incidental'. Should I however attempt to trap non-BMP strings (e.g.
>     assert(s.isBasicPlane()))? Or just continue to quietly get it wrong,
>     as in the following current behaviour:
>     >>> s = u"\U00010000a"
>     >>> len(s)
>     2 # good
>     >>> s[1]
>     u'a' # good
>     >>> s.index('a')
>     2 # oops
>
>
> That's a bug in the index method! So please fix as you have a chance.
> ...
>
> - Jim