Thread: [Jython-dev] PyString and PyUnicode beyond the BMP

In adding buffer API arguments to methods in PyString, it is also 
necessary to preserve what happens when a PyUnicode is supplied.

As I study this, I conclude that our support for the basic multilingual 
plane is good, whether in PyUnicode directly or in the PyString mixed 
case. In the mixed case, e.g.
     "**hello**".strip(u'*')
the target string mostly delegates to a unicode copy of itself, effectively
     "**hello**".decode('ascii').strip(u'*')
and I continue to use this idiom.
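
A minimal sketch of that delegation idiom, written in Python 3 rather than
Jython's Java: here bytes stands in for the byte-oriented str (PyString)
and str for unicode (PyUnicode). The helper name strip_mixed is
hypothetical and is not part of Jython.

```python
def strip_mixed(target, chars):
    """Strip chars from a byte string, delegating to text when chars is text."""
    if isinstance(chars, str):
        # Mixed case: work on a unicode copy of the target, as in
        # "**hello**".decode('ascii').strip(u'*')
        return target.decode("ascii").strip(chars)
    # Pure byte-string case: no delegation needed.
    return target.strip(chars)


mixed = strip_mixed(b"**hello**", "*")    # text result
plain = strip_mixed(b"**hello**", b"*")   # bytes result
```

As in the idiom above, the mixed call returns a unicode result, while the
pure byte-string call stays in bytes.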

Beyond the BMP, the behaviour of PyUnicode is often incorrect, because it
uses an implementation shared with PyString. I can see separate
arrangements sometimes made in PyUnicode, but for the most part we just
treat the UTF-16 code units as characters. Is that the accepted state?
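
To make the distinction concrete, here is a small Python 3 sketch (not
Jython code) that lists the UTF-16 code units of a string. The helper
name utf16_units is hypothetical. A character beyond the BMP occupies two
code units (a surrogate pair), which is why counting code units and
counting characters can disagree.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


s = "\U00010000a"
units = utf16_units(s)
# U+10000 becomes the surrogate pair 0xD800, 0xDC00; 'a' is 0x61.
# The string has two characters but three code units.
```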

As I work on the buffer interface, I make various incidental
improvements, but it would be wrong to tackle this one as an
'incidental'. Should I, however, attempt to trap non-BMP strings (e.g.
assert(s.isBasicPlane()))? Or just continue to quietly get it wrong, as
in the following current behaviour:
 >>> s = u"\U00010000a"
 >>> len(s)
2            # good
 >>> s[1]
u'a'        # good
 >>> s.index('a')
2            # oops
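
For illustration only, a Python 3 sketch of the two options: a check in
the spirit of the isBasicPlane() idea above, and a translation from a
code-unit index to a character index. Both function names are
hypothetical, and the translation assumes well-formed UTF-16 (no unpaired
surrogates).

```python
def is_basic_plane(s):
    """True if every character of s lies in the basic multilingual plane."""
    return all(ord(c) <= 0xFFFF for c in s)


def code_point_index(units, unit_index):
    """Translate an index counted in UTF-16 code units into a character index."""
    # Each surrogate pair contributes one character, so skip low surrogates
    # (0xDC00..0xDFFF) when counting the units before unit_index.
    return sum(1 for u in units[:unit_index] if not 0xDC00 <= u <= 0xDFFF)


units = [0xD800, 0xDC00, 0x61]        # UTF-16 code units of u"\U00010000a"
fixed = code_point_index(units, 2)    # code-unit index 2 maps to character index 1
```

With such a translation, the index reported for 'a' would agree with
s[1] instead of pointing one position too far.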

Jeff
