From: Charlie G. <cha...@gm...> - 2009-01-10 12:31:12
On Thu, Jan 8, 2009 at 12:39 PM, Peter Bower <pet...@or...> wrote:
> Given the following scenario:
>
> 1) assign a Japanese literal to a variable (in console or in py file)
>
> 2) print the variable
>
> 3) pass the variable to a java method

The following Python module:

    j = u'\u521d\u671f'
    # show the code point of each character
    for c in j:
        print ord(c)
    import sys
    # encode unicode as utf-8 when printing
    sys.setdefaultencoding('utf-8')
    print j
    import Test
    Test.print(j)

and Java class:

    import java.io.PrintStream;
    import java.io.UnsupportedEncodingException;

    public class Test {
        public static void print(String val) throws UnsupportedEncodingException {
            // show the code point of each char received from Jython
            for (int i = 0; i < val.length(); i++) {
                System.out.println((int) val.charAt(i));
            }
            // wrap System.out so the text is written as UTF-8
            PrintStream utf8Stream = new PrintStream(System.out, true, "UTF-8");
            utf8Stream.println(val);
        }
    }

prints

    21021
    26399
    初期
    21021
    26399
    初期

on my terminal in Mac OS X (the third and sixth lines may be garbled in this email, but they actually print out as the characters represented by \u521d\u671f, I swear :).

> In Jython 2.1, it was very simple
>
> name = "<Japanese characters>"
>
> print name
>
> test.create(name)
>
> Everything works, it prints correctly, and the Java method gets the expected
> string. The model appears to be:
>
> - Literals are read with the default character set (or that of the
> console encoding)
>
> - Strings can flow from Jython to Java and back without requiring
> conversion
>
> - String are printed using the String.getBytes() method which encodes
> using default character set

This actually doesn't work in all cases, and is one of the reasons this was changed for 2.2. Java's default character encoding doesn't always match the encoding of the console it's using. For example, the default encoding is MacRoman on Mac OS X, but the console uses utf-8 by default. That's why my Java source above makes its own PrintStream: System.out uses MacRoman and doesn't print properly to the console.
This was particularly troublesome as Jython would read source files in the default encoding on one system, and if that source file was used on a system with a different default encoding, it would either explode or produce gibberish when the differences in encoding were encountered.

The bigger reason for the change was to better conform to Python's Unicode model. Python has two "String" types, str and unicode. str is a byte string and is created by unadorned quotes. unicode is a sequence of unicode characters, like Java's String, and is created by prepending a u to the quotes. Allowing unicode characters in str, as Jython 2.1 did, led to mismatches between CPython's and Jython's models, and caused the unicode values in the strings to be truncated when various str operations were performed. Whenever you have character data, you want unicode objects and strings created with u''.

> The 2.2.1 model appears to be
>
> - literals are read with the ISO-8859-1 character set from .py files and
> by default in the console.
>
> - they flow from Jython to Java as is
>
> - strings are printed using the raw bytes (PyString.to_bytes())

This is correct. The encoding used to read from the interpreter is controlled with python.console.encoding, but otherwise things are assumed to be raw byte values. There's no way to have encoded unicode values in source files in Python 2.2. That was added by PEP 263 in Python 2.3. The only way to write characters outside the ascii character set in a unicode literal in 2.2 is with \u escapes.

> - is u' required? Does Jython 2.2.1 continue to support the non u' format?
> Or should unicode("japanese characters", "jp charset") be used instead
> (if a jp charset was available)?

Either u or calling unicode will work. If you have a large body of existing source, you can use something like the native2ascii tool that comes with Java to convert the encoded Japanese values into unicode escapes.
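As a rough sketch of what that kind of conversion amounts to, here is a small Python function that rewrites every non-ASCII character as a \uXXXX escape. The name to_unicode_escapes is made up for illustration and is not part of Jython or the JDK, and I'm assuming the text has already been decoded to unicode from whatever encoding the source file uses. It is written to run on a current CPython rather than Jython 2.2 specifically.

```python
# Sketch of a native2ascii-style conversion: every non-ASCII character
# becomes a \uXXXX escape. to_unicode_escapes is a hypothetical helper,
# not something shipped with Jython or Java.

def to_unicode_escapes(text):
    """Return text with each non-ASCII character written as a \\uXXXX escape."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # plain ASCII passes through untouched
            pieces.append(ch)
        elif code > 0xFFFF:
            # outside the BMP: emit a UTF-16 surrogate pair, as Java would
            code -= 0x10000
            high = 0xD800 + (code >> 10)
            low = 0xDC00 + (code & 0x3FF)
            pieces.append('\\u%04x\\u%04x' % (high, low))
        else:
            pieces.append('\\u%04x' % code)
    return ''.join(pieces)


literal = to_unicode_escapes(u'name = \u521d\u671f')
print(literal)  # prints: name = \u521d\u671f
```

The output is pure ASCII, so it reads the same no matter which default encoding the system uses.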
If you need to do it dynamically, something like http://www.google.com/codesearch/p?hl=en#MzR-vajYaSo/kaffe-1.1.5/libraries/javalib/gnu/classpath/tools/native2ascii/Native2ASCII.java would work.

> - should print <unicode variable> work out of the box? Or do we [and
> customers] need to set the default encoding?

Yes, you'll need to set Python's default encoding to the encoding that the console uses. I don't know of a way to do this across the Java platform. System.getProperty("file.encoding") returns Java's default encoding, but that doesn't always line up with what the console expects.

> - what character set should Java methods expect the string to be in:
> ("ISO-8859-1", the default character set, or something else)?

If you've got a unicode value in Python, the String will consist of the same unicode characters and no encoding is needed. If you have a str of encoded characters, the String will consist of the same number of chars as the str, still in whatever encoding the str came in as.

I'm sorry this transition is proving to be so painful; Jython's support for unicode was pretty broken in 2.1, and it'll finally work decently in 2.5 with the addition of PEP 263.
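To make the difference between passing a unicode value and passing an encoded str more concrete, here is a minimal sketch in plain Python. It only models the behaviour described above and does not call into Java. Decoding the bytes as ISO-8859-1 is my stand-in for "one char per byte", and utf-8 is just an assumed example of the encoding the str came in as.

```python
# Model of what a Java method would see for a unicode value versus an
# encoded byte string. This is an illustration only; no Java is involved.

text = u'\u521d\u671f'            # unicode: two characters
encoded = text.encode('utf-8')    # byte string: six bytes (utf-8 assumed)

# A unicode value arrives with the same code points it had in Python.
print([ord(c) for c in text])     # [21021, 26399]

# An encoded str arrives as one char per byte; ISO-8859-1 maps each
# byte value to the char with the same number, which models that.
as_chars = encoded.decode('iso-8859-1')
print([ord(c) for c in as_chars])  # [229, 136, 157, 230, 156, 159]

# The receiver has to undo the encoding itself to get the text back.
round_trip = as_chars.encode('iso-8859-1').decode('utf-8')
print(round_trip == text)         # True
```

In the first case the receiver needs no extra work. In the second it has to know which encoding was used, which is why unicode values are the safer thing to pass around.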