Re: [Jython-users] Help for issue 1183 Jython 2.2.1 cannot pass unicode to a func in a py file

On Thu, Jan 8, 2009 at 12:39 PM, Peter Bower <pet...@or...> wrote:
> Given the following scenario:
>
> 1) assign a Japanese literal to a variable (in console or in py file)
>
> 2) print the variable
>
> 3) pass the variable to a java method

The following Python module:

j = u'\u521d\u671f'
for c in j:
    print ord(c)
import sys
sys.setdefaultencoding('utf-8')
print j
import Test
Test.print(j)

and Java class:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Test
{
    public static void print (String val) throws UnsupportedEncodingException
    {
        for (int i = 0; i < val.length(); i++) {
            System.out.println((int)val.charAt(i));
        }
        PrintStream utf8Stream = new PrintStream(System.out, true, "UTF-8");
        utf8Stream.println(val);
    }
}

prints

21021
26399
初期
21021
26399
初期

on my terminal in Mac OS X (the third and sixth lines may be garbled in
this email, but they actually print out as the characters represented
by \u521d\u671f, I swear :).

> In Jython 2.1, it was very simple
>
>     name = "<Japanese characters>"
>
>     print name
>
>     test.create(name)
>
> Everything works, it prints correctly, and the Java method gets the expected
> string. The model appears
> to be:
>
>     - Literals are read with the default character set (or that of the
> console encoding)
>
>     - Strings can flow from Jython to Java and back without requiring
> conversion
>
>     - String are printed using the String.getBytes() method which encodes
> using default character set

This actually doesn't work in all cases, which is one of the reasons
it was changed for 2.2.  Java's default character encoding doesn't
always match the encoding of the console it's using.  For example, the
default encoding is MacRoman on Mac OS X, but the console uses utf-8
by default.  That's why my Java source above makes its own
PrintStream: System.out uses MacRoman and doesn't print properly to
the console.  This was particularly troublesome because Jython would
read source files in the default encoding of the system it ran on, and
if a source file was then used on a system with a different default
encoding, it would either explode or produce gibberish wherever the
encodings differed.
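
To make the "explode or produce gibberish" part concrete, here is a
minimal sketch in plain Python rather than in Jython 2.1 itself.  It
decodes the same bytes under three assumed encodings.  The codec names
are the standard CPython ones; whether a given Jython release ships
all of them is something I'm assuming, not something I've checked.

```python
# Bytes as they would sit in a source file saved as UTF-8.
raw = u'\u521d\u671f'.encode('utf-8')

# Decoding with the encoding the file was written in round-trips cleanly.
right = raw.decode('utf-8')

# Decoding with a different single-byte encoding does not fail, but each
# byte turns into its own (wrong) character: gibberish.
wrong = raw.decode('mac_roman')

# Decoding with an encoding that cannot represent these bytes blows up.
try:
    raw.decode('ascii')
    exploded = False
except UnicodeDecodeError:
    exploded = True

print(len(right))   # 2 characters
print(len(wrong))   # 6 characters, one per byte
print(exploded)     # True
```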

The bigger reason for the change was to better conform to Python's
Unicode model.  Python has two string types, str and unicode.  str
is a byte string and is created by unadorned quotes.  unicode is a
sequence of unicode characters, like Java's String, and is created by
prepending a u to the quotes.  Allowing unicode characters in str, as
Jython 2.1 did, led to mismatches between CPython's and Jython's
models, and caused the unicode values in the strings to be truncated
when various str operations were performed.  Whenever you have
character data, you want unicode objects and strings created with u''.
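
A small sketch of the difference between the two types, written so it
runs on either CPython 2 or 3 (on Python 3 the byte string type is
called bytes and the character type is called str, but the behaviour
shown is the same).  It illustrates the general model only; it does
not reproduce the Jython 2.1 truncation itself.

```python
text = u'\u521d\u671f'         # unicode: a sequence of characters
data = text.encode('utf-8')    # byte string: the same text, encoded

print(len(text))   # 2
print(len(data))   # 6

# Slicing the unicode object always keeps whole characters ...
first_char = text[:1]

# ... while slicing the byte string can cut a character in half.
broken = data[:1]
try:
    broken.decode('utf-8')
    cut_ok = True
except UnicodeDecodeError:
    cut_ok = False

print(cut_ok)      # False
```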

> The 2.2.1 model appears to be
>
>     - literals are read with the ISO-8859-1 character set from .py files and
> by default in the console.
>
>     - they flow from Jython to Java as is
>
>     - strings are printed using the raw bytes (PyString.to_bytes())

This is correct.  The encoding used to read from the interpreter is
controlled with python.console.encoding, but otherwise things are
assumed to be raw byte values.  There's no way to have encoded unicode
values in source files in Python 2.2.  That was added by PEP 263 in
Python 2.3.  The only way to get characters outside the ASCII
character set into unicode literals in 2.2 is with \u escapes.

> - is u' required? Does Jython 2.2.1 continue to support the non u' format?
> Or should
>   unicode("japanese characters", "jp charset") be used instead (if a jp
> charset was available)?

Either a u prefix or a call to unicode() will work.  If you have a
large body of existing source, you can use something like the
native2ascii tool that comes with Java to convert the encoded Japanese
values into unicode escapes.  If you need to do it dynamically,
something like
http://www.google.com/codesearch/p?hl=en#MzR-vajYaSo/kaffe-1.1.5/libraries/javalib/gnu/classpath/tools/native2ascii/Native2ASCII.java
would work.
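
For a rough idea of what that conversion does, here is a sketch in
Python.  to_unicode_escapes is a made-up helper name, and it leans on
CPython's 'unicode_escape' and 'shift_jis' codecs, which I'm assuming
are available wherever you run it.  It is not a drop-in replacement
for native2ascii: 'unicode_escape' also escapes backslashes and
control characters, and writes Latin-1 characters as \xNN rather than
\u00NN.

```python
def to_unicode_escapes(raw_bytes, source_encoding):
    """Decode encoded bytes and return ASCII text with unicode escapes."""
    text = raw_bytes.decode(source_encoding)
    # 'unicode_escape' yields ASCII-only bytes; decode them back to text.
    return text.encode('unicode_escape').decode('ascii')

# Stand-in for Japanese text read from a Shift_JIS encoded source file.
sjis_bytes = u'\u521d\u671f'.encode('shift_jis')

print(to_unicode_escapes(sjis_bytes, 'shift_jis'))   # \u521d\u671f
```

The result can be pasted between u'' quotes in a source file that
contains nothing but ASCII.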

> - should print <unicode variable> work out of the box? Or do we [and
> customers] need to set the default encoding?

Yes, you'll need to set Python's default encoding to the encoding that
the console uses.  I don't know of a way to do this across the Java
platform.  System.getProperty("file.encoding") returns Java's default
encoding, but that doesn't always line up with what the console
expects.
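
If changing the default encoding isn't an option, one alternative is
to encode explicitly before writing.  This is only a sketch:
encode_for_console is a hypothetical helper, and the caller still has
to supply the console's encoding, which is exactly the part that is
hard to discover reliably.

```python
def encode_for_console(text, console_encoding):
    # 'replace' substitutes '?' for anything the console encoding cannot
    # represent, instead of raising UnicodeEncodeError.
    return text.encode(console_encoding, 'replace')

j = u'\u521d\u671f'
utf8_out = encode_for_console(j, 'utf-8')    # 6 bytes of UTF-8
ascii_out = encode_for_console(j, 'ascii')   # two '?' placeholders

print(len(utf8_out))    # 6
print(len(ascii_out))   # 2
```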

> - what character set should Java methods expect the string to be in:
> ("ISO-8859-1", the
>   default character set, or something else)?

If you've got a unicode value in Python, the String will consist of
the same unicode characters and no encoding is needed.  If you have a
str of encoded characters, the String will consist of chars, one per
byte, so it has the same length as the str and is still in whatever
encoding the str came in as.
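
The second case can be simulated in plain Python.  This sketch assumes
the one-char-per-byte mapping just described, which is the same thing
decoding as ISO-8859-1 does.  It shows what the Java side would be
holding and how the real text could be recovered if the original
encoding is known.

```python
text = u'\u521d\u671f'
encoded = text.encode('utf-8')   # what a str of encoded characters holds

# Assumption: one char per byte, which is what ISO-8859-1 decoding gives.
as_java_sees_it = encoded.decode('iso-8859-1')
print(len(as_java_sees_it))      # 6, not 2

# Undo it: back to bytes, then decode with the encoding the str was in.
recovered = as_java_sees_it.encode('iso-8859-1').decode('utf-8')
print(recovered == text)         # True
```

On the Java side the equivalent recovery would be something like
new String(val.getBytes("ISO-8859-1"), "UTF-8"), again assuming the
str really was UTF-8.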

I'm sorry this transition is proving to be so painful; Jython's
support for unicode was pretty broken in 2.1, and it'll finally work
decently in 2.5 with the addition of PEP 263.