Re: [Jython-dev] Bytes, characters and codecs


On Dec 21, 2012, at 1:57 AM, Jeff Allen wrote:

> Would it be fair to say that the character codes in a Jython 
> PyString.string member should always be in the range 0..255 inclusive?
> 
> I'm working on failures and skips in the test_io, and the failure in 
> CTextIOWrapperTest.test_initialization() looks like it is to do with 
> codecs, both the choice of a default one on my machine and the behaviour 
> of the "ascii" codec once I force it in.
> 
> Codecs confuse me, and more in Jython's innards than in Python code 
> (Python 3 anyway), because it is less clear when some String/str is 
> supposed to hold bytes and when real characters. I came across this in 
> PyByteArray and breathed a huge sigh of relief once the tests passed, 
> without ever being really sure I'd done it right.
> 
> Apart from having to forgo Java's lovely String methods, we wish we'd 
> used an array of bytes implementation for PyString: right?

Yup.

It wasn't always like this, in case you were wondering. Before 2.5, Jython's PyString and PyUnicode were basically the same: both backed by java.lang.String and both supporting unicode. That's because Jython's PyString had always exposed Java's unicode support to Python code, dating back to before CPython had even added the unicode type =]

Then CPython added it, so Jython added PyUnicode as simply a subclass (internally) of PyString.

Jython 2.5 finally made PyString act like a byte bucket for better CPython compatibility, but we never actually switched its backing from java.lang.String. So yes, the character codes in PyString.string should always be in the range 0..255; if you're seeing anything outside that range, it's a bug.
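
To make that invariant concrete, here is a rough model in plain Python 3. It is only a sketch of the idea, not Jython's actual code, and the helper names are made up for illustration. It stores a byte bucket in a text string by mapping each byte to the character with the same code (the Latin-1 mapping) and checks that nothing above 255 sneaks in:

```python
def bytes_to_bucket(data: bytes) -> str:
    """Hypothetical helper: store bytes in a text string, one char per byte."""
    # Latin-1 maps byte value n to code point n, so every char is in 0..255.
    return data.decode("latin-1")


def bucket_to_bytes(bucket: str) -> bytes:
    """Hypothetical helper: recover the original bytes from the string."""
    # Raises UnicodeEncodeError if any char is outside 0..255.
    return bucket.encode("latin-1")


def is_byte_bucket(s: str) -> bool:
    """True if every character code is in the range 0..255 inclusive."""
    return all(ord(c) <= 0xFF for c in s)


# The UTF-8 encoding of "é" is two bytes, so the bucket holds two chars.
raw = "\u00e9".encode("utf-8")
bucket = bytes_to_bucket(raw)
codes = [ord(c) for c in bucket]
```

A string holding real characters, such as one containing the euro sign (U+20AC), fails the is_byte_bucket check, which is the kind of value that would indicate a bug if it turned up in a PyString.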

Indeed for 2.7, ideally it would become an array of bytes, mostly to save memory.

--
Philip Jenvey