From: Rose P. <ros...@or...> - 2008-12-18 00:37:31
|
Hi, Jython gurus:

I need some help running Jython 2.2.1 with multi-byte strings.

Jython 2.2.1 cannot pass a unicode String correctly to a function defined in a py script. The value of the parameter is converted to a different \x format. This does not happen in Jython 2.1.

To reproduce it, define a py script, test.py. The test.py file defines a function called create() which simply returns the value of its parameter:

======= start of test.py ======
def create(name):
    return name
======= end of test.py =====

Then start Jython 2.1 and run the function create() from the py file:

java -classpath jython.jar.2.1 org.python.util.jython
Jython 2.1 on java1.6.0_05 (JIT: null)

execfile("test.py")
create('\u4f7f\u7528')   <-- input Japanese characters
u'\u4F7F\u7528'          <-- returns the same unicode representing the Japanese characters, with length 2

We can see that the create function returns a unicode string of length 2, which can be displayed correctly by the Java System.out.println() method.

Then we try Jython 2.2.1 with the same steps:

java -classpath jython.jar.2.2.1 org.python.util.jython
Jython 2.2.1 on java1.6.0_05

execfile("test.py")
create('\u4f7f\u7528')   <-- input Japanese characters
'\xBB\xC8\xCD\xD1'       <-- returns different values, with length 4

The \xBB\xC8\xCD\xD1 values are not recognized by java, so we always get "????" if we use System.out.println() to print them.

This is a regression in Jython 2.2.1, and it is going to affect all the py files written by our customers. Is there a workaround for this in Jython? Jython 2.5 seems to have the same issue.

Thanks,
Rose
|
From: Charlie G. <cha...@gm...> - 2008-12-22 19:57:14
|
Hi Rose,

On Wed, Dec 17, 2008 at 4:37 PM, Rose Pan <ros...@or...> wrote:
> Jython 2.2.1 cannot pass a unicode String correctly to a function
> defined in a py script. The value of the parameter is converted to
> different \x format.

You're actually not passing unicode strings to the function. To create a unicode string in Python, you prepend u to it.

> create('\u4f7f\u7528') <-- input Japanese characters

This should be changed to create(u'\u4f7f\u7528'). If I do that, I get a unicode string out of the create function using Jython 2.2.

Jython 2.1's implementation had a bug in it that allowed unicode strings to be passed around in a byte string, e.g. those created with str or quotes with no u, and it had undefined behavior when converting into real bytes. This was fixed in 2.2.

Charlie
|
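The difference between the two kinds of string can be seen in a minimal sketch. It is written in plain Python rather than against any particular Jython release, and it assumes EUC-JP as the console encoding, which is what Rose reports using later in the thread; the variable names are illustrative only.

```python
# Sketch: a unicode string holds characters, an encoded string holds bytes.
# Assumption: EUC-JP is the console encoding; substitute your own.

text = u'\u4f7f\u7528'            # unicode literal: two characters
encoded = text.encode('euc_jp')   # byte string: two bytes per character here

print(len(text))      # 2 (characters)
print(len(encoded))   # 4 (bytes)

# Decoding with the same codec recovers the original characters.
assert encoded.decode('euc_jp') == text
```

The length of 4 matches the '\xBB\xC8\xCD\xD1' value reported in the first message, which suggests that 2.2.1 was holding the encoded bytes rather than the characters.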
From: Rose P. <ros...@or...> - 2009-01-11 08:49:51
|
Hi, Charlie,

Thanks for the info. I understand the sample here is for passing a Japanese literal from a py file to a java method. This makes sense to me.

I have a further question on this. Could you please show us an example where InteractiveInterpreter is embedded in Java? I am more interested in the following case:

1. Read a string from Java. The string can represent either a variable setting (like a="multi-byte string") or a call to a function defined in a py file (like print("multi-byte string")).
2. Pass the string to the InteractiveInterpreter.runsource(string) method.
3. Jython invokes the function in the py file, which simply passes the parameter (the Japanese literal, for example) to a java method.

With the current model in Jython 2.2.1, the string read from java needs to have a 'u' prepended before it is passed to InteractiveInterpreter.runsource(), so that the string can be passed correctly to the function in the py file and then to the java method. This is already shown in the sample.

But the issue here is that a string representing a variable setting can also be passed to InteractiveInterpreter.runsource() with the 'u' prepended. When we then try to show the value of the variable using the Jython print command, it gives an error:

Jython 2.2.1 on java1.6.0_05
Type "copyright", "credits" or "license" for more information.
>>> execfile("test.py")
>>> testdo(u'\u4F7F\u7528')
u'\u4F7F\u7528'
>>> a=u'\u4F7F\u7528'
>>> print a
Traceback (innermost last):
  File "<console>", line 1, in ?
UnicodeError: ascii encoding error: ordinal not in range(128)

The print command in Jython only prints the string in a format like "'\xBB\xC8\xCD\xD1'".

So I have two more questions:

1. Is there a way to handle both cases (setting variables and calling functions) when embedding InteractiveInterpreter in java?
2. Since the unicode characters read from java cannot be passed directly to InteractiveInterpreter.runsource(), they have to be converted to a jython unicode string. Is there a convenient method in Jython, callable from java code, that converts a java string into a jython unicode string, so that the 'u' can be prepended at the beginning of the multi-byte string rather than at the beginning of the whole string?

If there is a sample for both cases, that would be great. Really appreciate your help.

Thanks,
Rose
|
From: Charlie G. <cha...@gm...> - 2009-01-12 17:17:54
|
On Sun, Jan 11, 2009 at 12:49 AM, Rose Pan <ros...@or...> wrote:
> Jython 2.2.1 on java1.6.0_05
> Type "copyright", "credits" or "license" for more information.
> >>> execfile("test.py")
> >>> testdo(u'\u4F7F\u7528')
> u'\u4F7F\u7528'
> >>> a=u'\u4F7F\u7528'
> >>> print a
> Traceback (innermost last):
>   File "<console>", line 1, in ?
> UnicodeError: ascii encoding error: ordinal not in range(128)
>
> The print command in Jython only print the string with the format like
> "'\xBB\xC8\xCD\xD1'".

This happens because Jython's default encoding is ascii, and that's what it uses to encode things through print. If you call sys.setdefaultencoding(<your console's encoding>) before this, Jython will print properly.

> 1. Is there a way to handle both cases (setting variables and calling
> functions) when embedding InteractiveInterpreter in java?

I'm not sure what you're asking here. The cases are setting a variable with a unicode string and calling a function with that string from an embedded InteractiveInterpreter? I don't understand how that's different from running a script directly or using jython at the console.

> 2. Since the unicode characters read from java can not be directly
> passed to InteractiveInterpreter.runsource(), it has to convert to
> jython unicode string. Is there a convenient method in Jython to convert
> java string into jython unicode string that we can call in java code, so
> the 'u' can be prepended at the beginning of the multi-byte string, not
> the beginning of the whole string?

Jython has no built-in way to convert str literals to unicode literals. However, you can encode the Java String source you're passing in to the interpreter, and then decode the Strings that come out of Jython into your Java code. As long as your users aren't writing the Java themselves, nothing on their end will need to change. Here's an example of that:

import java.io.PrintStream;
import java.nio.charset.Charset;

import org.python.util.InteractiveInterpreter;

public class Test
{
    // Encoding the console expects; also used to encode source sent to Jython.
    static String encoding = "UTF-8";

    public static void main (String[] args)
        throws Exception
    {
        String unicode = "a = '\u4F7F\u7528'";
        new PrintStream(System.out, true, encoding).println("From Java: " + unicode);
        InteractiveInterpreter interp = new InteractiveInterpreter();
        String source = unicode + "; print 'From Jython: %s' % a; import Test; Test.print(a)";
        // Encode the source, then wrap each byte in a char via ISO-8859-1.
        byte[] encoded = source.getBytes(encoding);
        String encodedSourceInString = new String(encoded, Charset.forName("ISO-8859-1"));
        interp.runsource(encodedSourceInString);
    }

    public static void print (String encodedStringFromPython)
        throws Exception
    {
        // Reverse the trick: chars back to bytes, then decode them for real.
        byte[] encoded = encodedStringFromPython.getBytes("ISO-8859-1");
        String realString = new String(encoded, encoding);
        new PrintStream(System.out, true, encoding).println("From Java From Jython: " + realString);
    }
}

which prints out the Japanese String directly from Java, from Jython, and then in Java again in a call from Jython. There's some weird stuff going on in there, so it's probably worth examining a few of the bits more closely.

First, I set an encoding I'm going to use for printing to the console from Java and for sending Strings into Jython. On my Mac, the console uses UTF-8, so I use that as the encoding, but you'll need to find out which encoding your terminal expects and use that instead.

With that encoding, I print a Japanese String to the console from Java just to make sure things are hunky-dory at a base level. I then make an InteractiveInterpreter and some Python source to run in it. The Python source runs the assignment, prints the assigned variable and then calls back into the Test class with that variable. I encode the String into bytes using the console encoding, and then I turn it back into a String for use in InteractiveInterpreter.runsource. This is a slightly bizarre use of Strings and Charsets. It uses the fact that ISO-8859-1 is a direct mapping between its byte and char representation to make a String out of the encoded bytes. This lets the encoded representation pass into the interpreter unmolested. With that encoded string, the Python source's print of the variable works properly as it's a str already encoded in the console's encoding, and doesn't pass through Jython's default encoding.

Finally, Jython calls back into Test.print with the value from a in the Python source. This is still an encoded Python str, so I use the same ISO-8859-1 trick in reverse to get the encoded bytes out, and turn those bytes back into a String with its constructor that takes an encoding. With a real Java String again, I'm able to print the value from Java.

This isn't the prettiest of solutions, but it's the only way I can think of to make this work without changing the underlying source to use unicode literals. If you do have some leeway on that, I'd recommend going that way, but if you're stuck with the encoded source, I believe this will work.
|
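The ISO-8859-1 round trip described above can also be sketched in a few lines of plain Python, outside any interpreter embedding. The helper names to_byte_chars and from_byte_chars are hypothetical, and UTF-8 is assumed as the console encoding only for illustration.

```python
# Sketch of the ISO-8859-1 round trip, assuming a UTF-8 console.
CONSOLE_ENCODING = 'utf-8'   # assumption: replace with your terminal's encoding


def to_byte_chars(text, encoding=CONSOLE_ENCODING):
    """Encode text, then map each byte to the character with the same value."""
    return text.encode(encoding).decode('iso-8859-1')


def from_byte_chars(carried, encoding=CONSOLE_ENCODING):
    """Undo to_byte_chars: characters back to bytes, then a real decode."""
    return carried.encode('iso-8859-1').decode(encoding)


original = u'\u4f7f\u7528'
carried = to_byte_chars(original)

print(len(original))   # 2 characters
print(len(carried))    # 6: one character per UTF-8 byte

# Every carried character fits in a single byte, so nothing is lost in transit.
assert all(ord(ch) < 256 for ch in carried)
assert from_byte_chars(carried) == original
```

The round trip is only lossless when the same encoding is used in both directions, which is why the Java example keeps it in one shared field.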
From: Rose P. <ros...@or...> - 2009-01-12 22:27:21
|
Hi, Charlie,

Thanks for the detailed explanation. I changed the encoding to "euc_jp", which my terminal is using, and tried out the sample again. It works when printing from java directly, but it does not work when printing from Jython or from java in the call from Jython. Here is the result:

From Java: a = '\u4f7f\u7528'
From Jython: (
From Java From Jython: ??

Thanks,
Rose
|
From: Charlie G. <cha...@gm...> - 2009-01-13 17:56:29
|
On Mon, Jan 12, 2009 at 1:45 PM, Rose Pan <ros...@or...> wrote:
> Hi, Charlie,
>
> Thanks for the detail explanation. I replaced the encoding to "euc_jp" which
> my terminal is using and tried out the sample again. It works when printing
> out in java directly. But it does not work when printing out in Jython and
> the java in the call from Jython. It works Here is the result:
>
> From Java: a = '\u4f7f\u7528'
> From Jython: (
> From Java From Jython: ??

I'm not sure what's going on without being able to reproduce your terminal setup. It works for me if I set my console's encoding to euc_jp along with the encoding in the test Java file. Is your result really printing the escaped unicode values instead of rendering single characters? That seems like it's broken in Java before it even gets to Jython.
|
From: Rose P. <ros...@or...> - 2009-01-12 22:45:30
|
Hi, Charlie,

Also, the encoding "euc_jp" is not supported in Jython 2.2.1 yet. We can't run sys.setdefaultencoding("euc_jp") in a py file for now, and this leaves us unable to print u'\u4F7F\u7528' on the console correctly.

Any other solution?

Thanks,
Rose
|
From: Charlie G. <cha...@gm...> - 2009-01-13 18:00:14
|
On Mon, Jan 12, 2009 at 2:45 PM, Rose Pan <ros...@or...> wrote:
> Hi, Charlie,
>
> Also the encoding "euc_jp" is not supported in Jython 2.2.1 yet. We can't
> run the sys.setdefaultencoding("euc_jp") in a py file for now and this
> causes us not able to print out u'\u4F7F\u7528' on the console correctly.
>
> Any other solution?

It's available from the 2.2 branch in subversion. You'll need to add two files to your Lib/encodings directory in Jython:

http://fisheye3.atlassian.com/browse/~raw,r=3747/jython/branches/Release_2_2maint/jython/Lib/encodings/euc_jp.py

and

http://fisheye3.atlassian.com/browse/~raw,r=3747/jython/branches/Release_2_2maint/jython/Lib/encodings/java_charset_wrapper.py
|
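After copying the codec files into place, one way to confirm that the interpreter can actually find the codec is to ask the standard codecs module. This is a small sketch using codecs.lookup from the Python standard library; the helper name codec_available is hypothetical, and whether 'euc_jp' resolves depends on the interpreter and on the files installed.

```python
import codecs


def codec_available(name):
    """Return True if this interpreter can locate a codec called name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        # Raised when no registered search function knows the name.
        return False


# Result depends on the interpreter and its Lib/encodings contents.
print(codec_available('euc_jp'))
```

If this prints a false value, calls that name the codec will fail with a LookupError as well, so it is worth checking before changing the default encoding.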
From: Rose P. <ros...@or...> - 2008-12-18 15:28:45
|
Hi,

More info on Jython 2.2.1. Setting the property -Dpython.console.encoding=EUC_JP_LINUX does not help to get the correct unicode.

In our java code, if we use the following method to copy the string "\xBB\xC8\xCD\xD1" into a byte array after Jython has returned the value from running the py file, then the java System.out.println() can print the correct multi-byte string on the console.

public static byte[] to_bytes(String s) {
    int len = s.length();
    byte[] b = new byte[len];
    // Copies characters from this string into the destination byte array.
    // Each byte receives the 8 low-order bits of the corresponding character.
    // The eight high-order bits of each character are not copied and do not
    // participate in the transfer in any way.
    s.getBytes(0, len, b, 0);
    return b;
}

But with this workaround, we have to convert every String returned from the py files to a byte array using the method above. This is not acceptable, as we have more than 100 functions defined in the py files and each function has multiple parameters of type String.

Has anybody encountered the same issue? I think this is a very common problem, as Jython is now used worldwide.

Any help / comments would be really appreciated.

Thanks,
Rose
|
From: Charlie G. <cha...@gm...> - 2009-01-10 12:31:12
|
On Thu, Jan 8, 2009 at 12:39 PM, Peter Bower <pet...@or...> wrote:
> Given the following scenario:
>
> 1) assign a Japanese literal to a variable (in console or in py file)
> 2) print the variable
> 3) pass the variable to a java method

The following Python module:

    j = u'\u521d\u671f'
    for c in j:
        print ord(c)
    import sys
    sys.setdefaultencoding('utf-8')
    print j
    import Test
    Test.print(j)

and Java class:

    import java.io.PrintStream;
    import java.io.UnsupportedEncodingException;

    public class Test {
        public static void print(String val) throws UnsupportedEncodingException {
            for (int i = 0; i < val.length(); i++) {
                System.out.println((int) val.charAt(i));
            }
            PrintStream utf8Stream = new PrintStream(System.out, true, "UTF-8");
            utf8Stream.println(val);
        }
    }

prints

    21021
    26399
    初期
    21021
    26399
    初期

on my terminal in Mac OS X (the third and sixth lines may be garbled in this email, but they actually print out as the characters represented by \u521d\u671f, I swear :).

> In Jython 2.1, it was very simple
>
> name = "<Japanese characters>"
> print name
> test.create(name)
>
> Everything works, it prints correctly, and the Java method gets the expected
> string. The model appears to be:
>
> - Literals are read with the default character set (or that of the
>   console encoding)
> - Strings can flow from Jython to Java and back without requiring
>   conversion
> - String are printed using the String.getBytes() method which encodes
>   using default character set

This actually doesn't work in all cases, and is one of the reasons this was changed for 2.2. Java's default character set doesn't always match the encoding of the console it's using, e.g. the default encoding is MacRoman on Mac OS X, but the console uses utf-8 by default. That's why my Java source above makes its own PrintStream. System.out uses MacRoman and doesn't print properly to the console.
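To make the mismatch concrete, here is a small sketch that sends the same two characters through PrintStreams with different encodings and captures the bytes each one would hand to the console. The class and method names are invented for this example, and ISO-8859-1 stands in for any default charset that cannot represent the characters; it is not the MacRoman case itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class ConsoleEncodingDemo {
    /** Prints text through a PrintStream with an explicit encoding and returns the bytes written. */
    public static byte[] printedBytes(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown encoding: " + encoding, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u521d\u671f";
        // What System.out would use unless told otherwise.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        // Three bytes per character in UTF-8.
        System.out.println("UTF-8 byte count:      " + printedBytes(text, "UTF-8").length);
        // Unmappable characters are replaced with '?', one byte each.
        System.out.println("ISO-8859-1 byte count: " + printedBytes(text, "ISO-8859-1").length);
    }
}
```

If the bytes a stream produces are not in the encoding the terminal decodes with, the output is garbled or shows question marks, whichever layer did the encoding.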
This was particularly troublesome as Jython would read source files in the default encoding on one system, and if that source file was used on a system with a different default encoding, it would either explode or produce gibberish when the differences in encoding were encountered.

The bigger reason for the change was to better conform to Python's Unicode model. Python has two "String" types, str and unicode. str is a byte string and is created by unadorned quotes. unicode is a sequence of unicode characters like Java's String and is created by prepending a u to the quotes. Allowing unicode characters in str, as Jython 2.1 did, led to mismatches between CPython's and Jython's models, and caused the unicode values in the strings to be truncated when various str operations were performed. Whenever you have character data, you want unicode objects and strings created with u''.

> The 2.2.1 model appears to be
>
> - literals are read with the ISO-8859-1 character set from .py files and
>   by default in the console.
> - they flow from Jython to Java as is
> - strings are printed using the raw bytes (PyString.to_bytes())

This is correct. The encoding used to read from the interpreter is controlled with python.console.encoding, but otherwise things are assumed to be raw byte values. There's no way to have encoded unicode values in source files in Python 2.2. That was added by PEP 263 in Python 2.3. The only way to make unicode literals in 2.2 is with \u values for characters outside the ascii character set.

> - is u' required? Does Jython 2.2.1 continue to support the non u' format?
> Or should unicode("japanese characters", "jp charset") be used instead
> (if a jp charset was available)?

Either u or calling unicode will work. If you have a large body of existing source, you can use something like the native2ascii tool that comes with Java to convert the encoded Japanese values into unicode escapes.
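The kind of conversion meant here can be approximated in a few lines of Java. This is only a sketch, not native2ascii itself: the class and method names are invented, and it assumes the source text has already been read into a String using the encoding the file was actually saved in.

```java
public class UnicodeEscaper {
    /** Replaces every char outside 7-bit ASCII with a backslash-u escape of four hex digits. */
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                // Plain ASCII passes through unchanged.
                out.append(c);
            } else {
                // Everything else becomes an escape built from the UTF-16 code unit.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The literal below holds the two Japanese characters from the first message.
        System.out.println(escapeNonAscii("name = u'\u4f7f\u7528'"));
    }
}
```

The escapes are only interpreted by Python inside u'' literals, so the quotes in the py files still need the u prefix after the conversion.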
If you need to do it dynamically, something like http://www.google.com/codesearch/p?hl=en#MzR-vajYaSo/kaffe-1.1.5/libraries/javalib/gnu/classpath/tools/native2ascii/Native2ASCII.java would work.

> - should print <unicode variable> work out of the box? Or do we [and
> customers] need to set the default encoding?

Yes, you'll need to set Python's default encoding to the encoding that the console uses. I don't know of a way to do this across the Java platform. System.getProperty("file.encoding") returns Java's default encoding, but that doesn't always line up with what the console expects.

> - what character set should Java methods expect the string to be in:
> ("ISO-8859-1", the default character set, or something else)?

If you've got a unicode value in Python, the String will consist of the same unicode characters and no encoding is needed. If you have a str of encoded characters, the String will consist of one char per byte of the str, so it has the same length as the str, in whatever encoding the str came in as.

I'm sorry this transition is proving to be so painful; Jython's support for unicode was pretty broken in 2.1, and it'll finally work decently in 2.5 with the addition of PEP 263.