From: Paul D. F. <pdf...@ku...> - 2005-11-18 16:35:53
Frank - Thanks for the definitive answers.

I tested with that bug fix (and without my hack to pickle.py), and here are
some results. Essentially, with the fix, pickle now seems to work for the
str type with (or without) unicode characters, but strings declared as
unicode using u"string" don't unpickle correctly (unless the file is opened
as utf-8 using the codecs module).

For str types, pickle just saves and loads the string as binary data (not
sure how portable that really is?), whereas for unicode strings pickle does
something a bit more complicated with 'raw-unicode-escape' (which I don't
fully understand yet).

Anyway, so there remains an inconsistency: with the bug fix, pickle on a
regularly opened file will now work for strings with unicode characters --
as long as they are not PyUnicode strings (which are garbled unless the
reading file is opened using codecs).

My testing code:

    import types
    print dir(types)
    print "StringType == UnicodeType", types.StringType == types.UnicodeType
    print "1", type("hello")
    print "2", type(u"hello")
    print "3", type(u"\xE1hello")
    print "4", type("\xE1hello")
    print "5", type(unicode("hello"))

    import pickle
    print "pickle test with explicit unicode"
    pickledData = pickle.dumps(u"áéíóú")
    print pickle.loads(pickledData)
    print "pickle test with implicit unicode"
    pickledData = pickle.dumps("áéíóú")
    print pickle.loads(pickledData)

    #
    print "trying tests with str type with unicode data"
    test_unicode_object = "áéíóú"
    test_filename = "pickle_test.dat"
    print "testing regular read (should work)"
    f = open(test_filename, "w")
    pickle.dump(test_unicode_object, f)
    f.close()
    f = open(test_filename)
    result = pickle.load(f)
    f.close()
    print test_unicode_object == result, test_unicode_object, result
    print "testing read using codecs and utf-8 (should work)"
    import codecs
    f = codecs.open(test_filename, "r", "utf-8")
    result = pickle.load(f)
    f.close()
    print test_unicode_object == result, test_unicode_object, result

    #
    print "trying tests with unicode type with unicode data"
    test_unicode_object = u"áéíóú"
    test_filename = "pickle_test.dat"
    print "testing regular read (will not work properly)"
    f = open(test_filename, "w")
    pickle.dump(test_unicode_object, f)
    f.close()
    f = open(test_filename)
    result = pickle.load(f)
    f.close()
    print test_unicode_object == result, test_unicode_object, result
    print "testing read using codecs and utf-8 (should work)"
    f = codecs.open(test_filename, "r", "utf-8")
    result = pickle.load(f)
    f.close()
    print test_unicode_object == result, test_unicode_object, result

The results (with the bug fix in place, and not my hack):

    ['ArrayType', 'BuiltinFunctionType', 'BuiltinMethodType', 'ClassType',
    'CodeType', 'ComplexType', 'DictType', 'DictionaryType', 'EllipsisType',
    'FileType', 'FloatType', 'FrameType', 'FunctionType', 'GeneratorType',
    'InstanceType', 'IntType', 'LambdaType', 'ListType', 'LongType',
    'MethodType', 'ModuleType', 'NoneType', 'SliceType', 'StringType',
    'StringTypes', 'TracebackType', 'TupleType', 'TypeType',
    'UnboundMethodType', 'UnicodeType', 'XRangeType', '__doc__', 'classDictInit']
    StringType == UnicodeType 0
    1 <type 'str'>
    2 <type 'unicode'>
    3 <type 'unicode'>
    4 <type 'str'>
    5 <type 'unicode'>
    pickle test with explicit unicode
    áéíóú
    pickle test with implicit unicode
    áéíóú
    trying tests with str type with unicode data
    testing regular read (should work)
    1 áéíóú áéíóú
    testing read using codecs and utf-8 (should work)
    1 áéíóú áéíóú
    trying tests with unicode type with unicode data
    testing regular read (will not work properly)
    0 áéíóú Ã¡Ã©Ã­Ã³Ãº
    testing read using codecs and utf-8 (should work)
    1 áéíóú áéíóú

[By the way, for reference, even with the fix, jythonc still produces an
"Exception: Unhandled node Unicode[s=hello]" if there is a u"hello" in the
source code when I try it. And when I package the bug-fixed version (not
using codecs), I now get a "LookupError: unknown encoding
raw-unicode-escape" related to pickling, so something funky is still going
on there with jythonc not pulling in unicode-related stuff in my setup.]

Thanks.

--Paul Fernhout

Frank Wierzbicki wrote:
>> In "types.java", here is related code:
>>     dict.__setitem__("StringType", PyType.fromClass(PyString.class));
>>     ...
>>     dict.__setitem__("UnicodeType", PyType.fromClass(PyString.class));
>> So they are indeed set to be the same.
>
> This is a bug, it should be PyUnicode.class
>
>> Anyway, is there a definitive answer on whether Jython 2.2 is supposed
>> to have different types for Unicode and ASCII strings, or are they still
>> supposed to be the same?
>
> PyUnicode is a recent addition -- it is being included to support
> better CPython compatibility. It is trickery though, since internally
> all strings *are* unicode, but <type 'unicode'> is showing up in a lot
> of CPython lately, so it is probably a necessary evil until CPython
> starts internally storing strings as unicode. The BDFL wants to do
> this eventually, but I don't think it will happen anytime soon.
>
> -Frank
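P.S. The 'raw-unicode-escape' step and the garbling pattern in the failing
test can be sketched independently of Jython. A minimal sketch (modern
Python 3 syntax for brevity; the codec names are the same, and the exact
role these codecs play inside Jython's pickle is my reading, not confirmed):

```python
# -*- coding: utf-8 -*-
# 1. pickle's text protocol stores unicode via 'raw-unicode-escape':
#    code points below 256 come out as their raw Latin-1 bytes, anything
#    above as \uXXXX escapes, and the codec round-trips cleanly.
u = u"áéíóú\u20ac"
raw = u.encode("raw_unicode_escape")
print(raw)  # b'\xe1\xe9\xed\xf3\xfa\\u20ac'
assert raw.decode("raw_unicode_escape") == u

# 2. The garbled result ("0 áéíóú Ã¡Ã©Ã­Ã³Ãº") is the classic pattern of
#    UTF-8 bytes being decoded one byte per character (i.e. as Latin-1):
original = u"áéíóú"
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # Ã¡Ã©Ã­Ã³Ãº -- same shape as the failing test's output
# The data itself survives; only the read-side decode was wrong:
assert garbled.encode("latin-1").decode("utf-8") == original
```

That second round-trip is consistent with codecs.open(..., "utf-8") fixing
the read: the bytes on disk are fine, and the plain-file path is just
decoding them with the wrong codec.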