From: Arye <ar...@bi...> - 2007-06-29 15:53:31
Hello all!

The little program below behaves differently when run by jython and Python:
I am trying to encode in utf-8 a unicode string with 3 characters in it:
u"(r)™°"  The "Registered", "Trade Mark", and "Degrees" characters.

**************simple_encoding_test_.py start**********
import sys
sys.defaultencoding  = 'latin-1'

myfunnychars = u"(r)™°"

print "sys.getdefaultencoding()=",sys.getdefaultencoding()

my_utf8 = myfunnychars.encode("utf-8")
print "len(my_utf8)=",len(my_utf8)
print "my_utf8=",my_utf8

for c in my_utf8:
    print ord(c)
**************simple_encoding_test_.py end**********

I notice that CPython encodes this character as 6 characters in utf-8:
C:\AH\WORK\UTIL>python simple_encoding_test.py
sys.getdefaultencoding()= latin-1
len(my_utf8)= 6
my_utf8= ┬«┬Ö┬░
194
174
194
153
194
176

On the other hand, jython converts this in 7 character (some are the same,
some are different):
C:\AH\WORK\UTIL>jython simple_encoding_test.py
sys.getdefaultencoding()= latin-1
len(my_utf8)= 7
my_utf8= ┬«Ô?ó┬░
194
174
226
132
162
194
176

Any explanation on why the output of jython is different would be
greatly appreciated !!
Arye.
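The two byte listings above can be decoded to see which code points each interpreter actually encoded. This is only a sketch, written in Python 3 syntax rather than the Python 2 used in the thread; the variable and function names are made up for illustration, and the byte values are copied from the two runs.

```python
# Byte values copied from the CPython and jython runs reported above.
cpython_bytes = bytes([194, 174, 194, 153, 194, 176])
jython_bytes = bytes([194, 174, 226, 132, 162, 194, 176])


def code_points(utf8_bytes):
    """Decode UTF-8 bytes and return each character's code point as U+XXXX."""
    return ["U+%04X" % ord(ch) for ch in utf8_bytes.decode("utf-8")]


cpython_points = code_points(cpython_bytes)
jython_points = code_points(jython_bytes)

print(cpython_points)  # ['U+00AE', 'U+0099', 'U+00B0']
print(jython_points)   # ['U+00AE', 'U+2122', 'U+00B0']
```

Both runs agree on the first and last characters (U+00AE and U+00B0) and differ only on the middle one. U+0099 is what the single byte 0x99 becomes when read as Latin-1, while U+2122 (the trade mark sign) is what the same byte means in Windows-1252. That suggests, without proving it, that the source file stored the trade mark sign as one byte and the two interpreters decoded that byte differently.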
From: Alan K. <jyt...@xh...> - 2007-06-30 16:27:28
[Arye]
> The little program below behaves differently when run by jython and Python:
> I am trying to encode in utf-8 a unicode string with 3 characters in it:
> u"(r)™°" The "Registered", "Trade Mark", and "Degrees" characters.

The fundamental problem here is that jython does not support PEP 263,
which permits you to declare the encoding of your source module. Because
you have embedded your funny characters directly in your source, jython
does not interpret them correctly.

CPython will only interpret them correctly if you have an encoding
declaration at the top of your source file, like so

# -*- coding: utf-8 -*-

(Change the encoding to whatever your text editor produces)

Defining Python Source Code Encodings
http://www.python.org/dev/peps/pep-0263/

There are two solutions to your problem

1. Do not embed the raw characters directly in your source file.
   Instead, declare them with unicode escapes, like so

myfunnychars = u"\u00b0\u00ae\u2122"

2. Keep all unicode strings in separate files to your source, and read
   them with codecs.open.

Try running this code in both CPython and jython; you should get
identical results from both

# -=-=-=-=-=-=-=-=-=
import sys
print "sys.getdefaultencoding()=", sys.getdefaultencoding()
myfunnychars = u"\u00b0\u00ae\u2122"
my_utf8 = myfunnychars.encode("utf-8")
print "len(my_utf8)=",len(my_utf8)
print "my_utf8=",my_utf8
for c in my_utf8:
    print ord(c)
# -=-=-=-=-=-=-=-=-=

Regards,

Alan.
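The second solution (keeping the strings in a separate file and reading them with codecs.open) could be sketched roughly as follows. This is an illustration only, written in Python 3 syntax rather than the Python 2 of the thread; the file name funnychars.txt and the temporary directory are hypothetical, and the file is assumed to be saved as UTF-8.

```python
import codecs
import os
import tempfile

# Hypothetical file name and location; the thread does not name one.
path = os.path.join(tempfile.mkdtemp(), "funnychars.txt")

# Create the data file. Escapes are used here so this script stays pure ASCII;
# in practice the file would be written by a text editor.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(u"\u00b0\u00ae\u2122")

# Read the characters back, stating the file's encoding explicitly
# instead of relying on any interpreter or platform default.
with codecs.open(path, "r", encoding="utf-8") as f:
    myfunnychars = f.read()

my_utf8 = myfunnychars.encode("utf-8")
utf8_values = list(bytearray(my_utf8))

print(len(my_utf8))   # 7
print(utf8_values)    # [194, 176, 194, 174, 226, 132, 162]
```

Because the encoding is named at the point where the file is opened, the result no longer depends on how the interpreter decodes the source module. The seven byte values are the UTF-8 encoding of the same three escapes used in the first solution.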