From: Jeff Allen <ja.py@fa...> - 2013-10-26 18:07:04
In adding buffer API arguments to methods in PyString, it is also
necessary to preserve what happens when a PyUnicode is supplied.
As I study this, I conclude that our support for the basic multilingual
plane is good, whether in PyUnicode directly or in the PyString mixed
case. In the mixed case, the target string mostly delegates to a unicode
copy of itself, and I continue to use this idiom.
Beyond the BMP, the behaviour of PyUnicode is often incorrect, using an
implementation shared with PyString. I can see separate arrangements
sometimes made in PyUnicode, but for the most part we just treat the
UTF-16 code units as characters. Is that the accepted state?
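
To illustrate what treating UTF-16 units as characters means, here is a
minimal sketch in CPython 3 (not Jython code). The helper name
utf16_code_units is made up for the illustration. It shows that a single
code point beyond the BMP occupies two 16-bit units (a surrogate pair).

```python
import struct


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    # Each code unit is one unsigned 16-bit little-endian value.
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


s = "\U00010000a"
units = utf16_code_units(s)

print([hex(u) for u in units])  # ['0xd800', '0xdc00', '0x61']
print(len(s))                   # 2 code points
print(len(units))               # 3 code units
```

Any method that counts or indexes the units directly will be off by one
for every non-BMP character that precedes the position in question.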
As I work on the buffer interface, I make various incidental
improvements, but it would be wrong to tackle this one as an
'incidental'. Should I however attempt to trap non-BMP strings (e.g.
assert(s.isBasicPlane()))? Or just continue to quietly get it wrong, as in
the following current behaviour:
>>> s = u"\U00010000a"
>>> len(s)
2 # good
>>> s
u'a' # good
>>> s.index('a')
2 # oops
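
The two options can be sketched as follows, in CPython 3 rather than in
the Jython source. All names here (to_units, is_basic_plane,
unit_index_to_code_point_index) are hypothetical, and the list of code
units only models the UTF-16 storage; it is an assumption about the shape
of a fix, not the actual implementation.

```python
def to_units(s):
    """Model UTF-16 storage: return s as a list of 16-bit code units."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_basic_plane(s):
    """True if every code point of s lies in the BMP (at most U+FFFF)."""
    return all(ord(c) <= 0xFFFF for c in s)


def unit_index_to_code_point_index(units, unit_index):
    """Translate an index counted in code units into one counted in code points."""
    # A low (trailing) surrogate is the second half of a pair, so skip it.
    return sum(1 for u in units[:unit_index] if not 0xDC00 <= u <= 0xDFFF)


s = "\U00010000a"
units = to_units(s)

raw = units.index(ord("a"))                          # index in code units
fixed = unit_index_to_code_point_index(units, raw)   # index in code points

print(is_basic_plane(s))  # False, so a trap would fire here
print(raw)                # 2, the "oops" result
print(fixed)              # 1, the index the caller expects
```

A check like is_basic_plane only detects the problem, whereas the index
translation is the kind of adjustment a real fix would need.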
Ok, so it's not an accepted situation. Thanks for the swift reply.
I have logged a bug so that we come back to it after the buffer work is
done, rather than tangle with it now. I'll continue with the buffer API
changes within the BMP limitations noted. We could pull in more of the
relevant 2.7 tests (or even bits of Py3k) avoiding those things that are
just artefacts of the implementation differences.
On 27/10/2013 03:00, Jim Baker wrote:
> On Sat, Oct 26, 2013 at 12:06 PM, Jeff Allen <ja.py@...
> <mailto:ja.py@...>> wrote:
> As I work on the buffer interface, I make various incidental
> improvements, but it would be wrong to tackle this one as an
> 'incidental'. Should I however attempt to trap non-BMP strings (e.g.
> assert(s.isBasicPlane())? Or just continue quietly get it wrong, as in
> the following current behaviour:
> >>> s = u"\U00010000a"
> >>> len(s)
> 2 # good
> >>> s
> u'a' # good
> >>> s.index('a')
> 2 # oops
> That's a bug in the index method! So please fix as you have a chance.
> - Jim