From: Travis O. <oli...@ie...> - 2006-02-07 19:26:17
Gerard Vermeulen wrote:

>>While I agree that this solution is more consistent, I must say that
>>I'm not very comfortable with having to deal with two different widths
>>for unicode characters.
>>

Python itself hands us this difference. Is it really so different from
the fact that Python integers are either 32-bit or 64-bit depending on
the platform?

Perhaps what this is telling us is that we do indeed need another
data-type for 4-byte unicode. It's how we solve the problem of 32-bit
or 64-bit integers (we have a 64-bit integer on all platforms).

Then in NumPy we can support going back and forth between UCS-2 (which
we can then say is UTF-16) and UCS-4.

The issue with saving to disk is really one of encoding anyway. So, if
PyTables wants to do this correctly, then it should be using a
particular encoding anyway. The internal representation of Unicode
should not technically matter, as it's only input and output that is
important.

I won't support requiring a UCS-4 build of Python, though. That's too
stringent. Most characters are contained within the 0th plane, which
UCS-2 covers. For the additional characters (only up to 0x0010FFFF are
defined), the surrogate pairs can be used.

I think the best solution is to define separate UCS4 and UCS2
data-types and handle conversion between them using the casting
functions. This is a bit of work to implement, but not too bad...

>Wouldn't it be possible that numpy takes care of the "surrogate pairs"
>when transferring unicode strings from UCS2-interpreters to UCS4-ndarrays
>and vice-versa?
>
>It would be nice to be able to cast explicitly between UCS2- and UCS4- arrays,
>too.
>
>Requesting users to recompile their Python is a rather brutal solution :-)
>
>

I agree. I much prefer an additional data-type since that is, after
all, what UCS2 and UCS4 are... different data-types.

-Travis
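
For reference, the surrogate-pair handling that such casting functions would need can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not NumPy's casting code; the function names ucs4_to_utf16 and utf16_to_ucs4 are hypothetical, and plain lists of integers stand in for the 4-byte and 2-byte array buffers.

```python
# Hypothetical helpers (not part of NumPy): convert between 4-byte code
# points (UCS-4) and 2-byte code units (UTF-16 with surrogate pairs).

def ucs4_to_utf16(code_points):
    """Encode a sequence of UCS-4 code points as UTF-16 code units."""
    units = []
    for cp in code_points:
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a valid Unicode scalar value: %r" % (cp,))
        if cp < 0x10000:
            # Plane 0: one 16-bit unit holds the code point directly.
            units.append(cp)
        else:
            # Planes 1-16: split the 20-bit offset into two 10-bit halves.
            offset = cp - 0x10000
            units.append(0xD800 | (offset >> 10))    # high surrogate
            units.append(0xDC00 | (offset & 0x3FF))  # low surrogate
    return units


def utf16_to_ucs4(units):
    """Decode a sequence of UTF-16 code units into UCS-4 code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate at index %d" % i)
            low = units[i + 1]
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate at index %d" % i)
        else:
            code_points.append(unit)
            i += 1
    return code_points
```

Note that the UCS-2 form of a string can be longer (in items) than its UCS-4 form, so a cast between the two data-types would have to allow for up to two 2-byte units per 4-byte character.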