From: Paul D. F. <pdf...@ku...> - 2005-11-18 16:35:53
Frank - Thanks for the definitive answers.

I tested with that bug fix (and without my hack to pickle.py), and here are
some results. Essentially, with the fix, pickle now seems to work for the
str type with (or without) unicode characters, but strings declared as
unicode using u"string" don't unpickle correctly (unless the file is opened
as utf-8 using the codecs module).

For str types, pickle just saves and loads the string as binary data (not
sure how portable that really is?), whereas for unicode strings pickle does
something a bit more complicated with 'raw-unicode-escape' (which I don't
fully understand yet).

Anyway, so there remains an inconsistency: with the bug fix, pickle on a
regularly opened file will now work for strings with unicode characters --
as long as they are not PyUnicode strings (which are garbled unless the
reading file is opened using codecs).

My testing code:

    import types
    print dir(types)
    print "StringType == UnicodeType", types.StringType == types.UnicodeType
    print "1", type("hello")
    print "2", type(u"hello")
    print "3", type(u"\xE1hello")
    print "4", type("\xE1hello")
    print "5", type(unicode("hello"))

    import pickle
    print "pickle test with explicit unicode"
    pickledData = pickle.dumps(u"áéíóú")
    print pickle.loads(pickledData)
    print "pickle test with implicit unicode"
    pickledData = pickle.dumps("áéíóú")
    print pickle.loads(pickledData)

    #
    print "trying tests with str type with unicode data"
    test_unicode_object = "áéíóú"
    test_filename = "pickle_test.dat"
    print "testing regular read (should work)"
    f = open(test_filename, "w")
    pickle.dump(test_unicode_object, f)
    f.close()
    f = open(test_filename)
    result = pickle.load(f)
    f.close()
    print test_unicode_object == result, test_unicode_object, result
    print "testing read using codecs and utf-8 (should work)"
    import codecs
    f = codecs.open(test_filename, "r", "utf-8")
    result = pickle.load(f)
    f.close()
    print test_unicode_object == result, test_unicode_object, result

    #
    print "trying tests with unicode type with unicode data"
    test_unicode_object = u"áéíóú"
    test_filename = "pickle_test.dat"
    print "testing regular read (will not work properly)"
    f = open(test_filename, "w")
    pickle.dump(test_unicode_object, f)
    f.close()
    f = open(test_filename)
    result = pickle.load(f)
    f.close()
    print test_unicode_object == result, test_unicode_object, result
    print "testing read using codecs and utf-8 (should work)"
    f = codecs.open(test_filename, "r", "utf-8")
    result = pickle.load(f)
    f.close()
    print test_unicode_object == result, test_unicode_object, result

The results (with the bug fix in place, and not my hack):

    ['ArrayType', 'BuiltinFunctionType', 'BuiltinMethodType', 'ClassType',
    'CodeType', 'ComplexType', 'DictType', 'DictionaryType', 'EllipsisType',
    'FileType', 'FloatType', 'FrameType', 'FunctionType', 'GeneratorType',
    'InstanceType', 'IntType', 'LambdaType', 'ListType', 'LongType',
    'MethodType', 'ModuleType', 'NoneType', 'SliceType', 'StringType',
    'StringTypes', 'TracebackType', 'TupleType', 'TypeType',
    'UnboundMethodType', 'UnicodeType', 'XRangeType', '__doc__', 'classDictInit']
    StringType == UnicodeType 0
    1 <type 'str'>
    2 <type 'unicode'>
    3 <type 'unicode'>
    4 <type 'str'>
    5 <type 'unicode'>
    pickle test with explicit unicode
    áéíóú
    pickle test with implicit unicode
    áéíóú
    trying tests with str type with unicode data
    testing regular read (should work)
    1 áéíóú áéíóú
    testing read using codecs and utf-8 (should work)
    1 áéíóú áéíóú
    trying tests with unicode type with unicode data
    testing regular read (will not work properly)
    0 áéíóú Ã¡Ã©Ã­Ã³Ãº
    testing read using codecs and utf-8 (should work)
    1 áéíóú áéíóú

[By the way, for reference, even with the fix, jythonc still produces an
"Exception: Unhandled node Unicode[s=hello]" if there is a u"hello" in the
source code when I try it. And when I package the bug-fixed version (not
using codecs), I now get a "LookupError: unknown encoding
raw-unicode-escape" related to pickling, so something funky is still going
on there with jythonc not pulling in unicode-related stuff in my setup.]

Thanks.

--Paul Fernhout

Frank Wierzbicki wrote:
>> In "types.java", here is related code:
>>     dict.__setitem__("StringType", PyType.fromClass(PyString.class));
>>     ...
>>     dict.__setitem__("UnicodeType", PyType.fromClass(PyString.class));
>> So they are indeed set to be the same.
>
> This is a bug, it should be PyUnicode.class
>
>> Anyway, is there a definitive answer on whether Jython 2.2 is supposed
>> to have different types for Unicode and ASCII strings, or are they still
>> supposed to be the same?
>
> PyUnicode is a recent addition -- it is being included to support
> better CPython compatibility. It is trickery though, since internally
> all strings *are* unicode, but <type 'unicode'> is showing up in a lot
> of CPython lately, so it is probably a necessary evil until CPython
> starts internally storing strings as unicode. The BDFL wants to do
> this eventually, but I don't think it will happen anytime soon.
>
> -Frank
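P.S. The 'raw-unicode-escape' step and the garbling pattern in the failing
test can be sketched independently of Jython. A minimal sketch (modern
Python 3 syntax for brevity; the codec names are the same, and the exact
role these codecs play inside Jython's pickle is my reading, not confirmed):

```python
# -*- coding: utf-8 -*-
# 1. pickle's text protocol stores unicode via 'raw-unicode-escape':
#    code points below 256 come out as their raw Latin-1 bytes, anything
#    above as \uXXXX escapes, and the codec round-trips cleanly.
u = u"áéíóú\u20ac"
raw = u.encode("raw_unicode_escape")
print(raw)  # b'\xe1\xe9\xed\xf3\xfa\\u20ac'
assert raw.decode("raw_unicode_escape") == u

# 2. The garbled result ("0 áéíóú Ã¡Ã©Ã­Ã³Ãº") is the classic pattern of
#    UTF-8 bytes being decoded one byte per character (i.e. as Latin-1):
original = u"áéíóú"
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # Ã¡Ã©Ã­Ã³Ãº -- same shape as the failing test's output
# The data itself survives; only the read-side decode was wrong:
assert garbled.encode("latin-1").decode("utf-8") == original
```

That second round-trip is consistent with codecs.open(..., "utf-8") fixing
the read: the bytes on disk are fine, and the plain-file path is just
decoding them with the wrong codec.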