I appreciate your reply. Here is my response to your analysis.
1) I tried iconv -f utf-8 -t utf-8 filename on a Linux system and there
was no error. The file contained an umlaut.
2) In Windows XP, file.encoding is in fact Cp1252. I guess this one is not
lossy for umlauts and other Latin-1 characters.
3) Since my database is in UTF-8, I want to keep the JVM on UTF-8 as well.
Still, I am going to try using UTF-16 for the JVM. I have not tried it yet.
4) On Linux, when I pass the script to the Jython interpreter in
"ISO-8859-1" encoding, it works fine.
For example, the following code works fine:
String utfScript = "a=\"ü\"";
// Encode the script as UTF-8 bytes.
byte[] bytes = utfScript.getBytes("UTF-8");
// Re-decode those bytes as ISO-8859-1, one char per byte.
String isoScript = new String(bytes, "ISO-8859-1");
interpreter.exec(isoScript);
Please note that the JVM system property file.encoding is set to UTF-8.
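
To make the byte-level effect of that re-decoding step visible, here is a minimal sketch in plain Java, without Jython. It assumes Java 7 or later (for StandardCharsets); the class and method names are made up for illustration, and the umlaut is written as a \u escape so the result does not depend on the source file's encoding. It only shows what the conversion does to the string, not why the interpreter accepts the result.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Jython or the original code.
class RedecodeDemo {
    // Encode the text as UTF-8, then reinterpret every byte as one
    // ISO-8859-1 character.
    static String utf8AsLatin1(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String utfScript = "a=\"\u00fc\"";   // a="ü" with the umlaut escaped
        String isoScript = utf8AsLatin1(utfScript);

        System.out.println(utfScript.length());  // 5 chars
        System.out.println(isoScript.length());  // 6 chars: the umlaut became two
        // The two chars are the UTF-8 bytes of the umlaut, 0xC3 and 0xBC.
        System.out.println(Integer.toHexString(isoScript.charAt(3)));  // c3
        System.out.println(Integer.toHexString(isoScript.charAt(4)));  // bc
    }
}
```

So the string handed to exec() no longer contains the umlaut itself, only characters below 0x100 that mirror its UTF-8 bytes.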
Thanks
Dipankar
----- Original Message -----
From: "Ed Korthof" <ed@...>
To: "Dipankar Das" <ddas@...>
Cc: <jython-users@...>
Sent: Thursday, January 01, 2004 4:04 PM
Subject: Re: [Jython-users] How to make jython work with UTF-8
On Tue, Dec 30, 2003 at 01:47:55PM -0800, Dipankar Das wrote:
> > From a Java class I call the following line of code.
> interpreter.exec(a='ü')
>
> Same code works fine on WindowsXP. I cannot run it on Linux.
> "file.encoding" property for JVM has been set to "UTF-8". On Linux I
> get the following error.
>
> SyntaxError: LexicalError at line 1, column 5. Encountered: <EOF> after: ""
[snip]
> > What are the necessary steps to configure jython to accept and return
> > UTF-8 characters.
[snip]
I haven't done I18N work in Jython, except with libraries that were
already doing the conversion. Jython certainly handles unicode data
fine if other code has done the conversion from UTF-8 to the internal
unicode representation (in Java, it's a UTF-16 variant). And Java has a
number of idioms for working with such ... anyway, while I can't answer
the above directly, I can talk more generally about I18N and Java.
Is the text in question actually UTF-8? i.e. if you run it through
iconv -f utf-8 -t utf-8, do you get an error? If you're testing at the
command line, you can put it in a file and look to see what you get.
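
The same strict check can be sketched from inside Java. This is only an illustration, assuming Java 7 or later; the class and method names are hypothetical. It uses a CharsetDecoder configured to report errors, so malformed input raises an exception instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a rough in-JVM counterpart to the iconv check.
class Utf8Check {
    // Returns true only if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8Umlaut = {(byte) 0xC3, (byte) 0xBC};  // u-umlaut in UTF-8
        byte[] latin1Umlaut = {(byte) 0xFC};             // u-umlaut in ISO-8859-1
        System.out.println(isValidUtf8(utf8Umlaut));     // true
        System.out.println(isValidUtf8(latin1Umlaut));   // false
    }
}
```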
When doing I18N w/ Java, I've found that using UTF-8 as JVM-default
tends to cause problems, because converting bytes into strings and back
(which programmers tend to do without realizing this is a potentially
lossy operation) will mangle binary data which is not actually UTF-8 --
as would be the case w/ ISO-8859-1 (the charset in which this message
was sent). An umlaut is a single byte in ISO-8859-1 with the high bit set;
but on its own it's not a proper UTF-8 byte sequence, so it will cause
Linux's conversion utilities to halt at that point when in the strict
mode. (iconv will show an error unless -c is provided; if -c is
provided, it will just omit the characters. Java's readers tend to
throw exceptions, but string's constructor transforms unknown bytes or
byte sequences into question marks.)
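
A small sketch of that lossy round trip, assuming Java 7 or later; the class and method names are hypothetical. With the String constructor, an invalid byte is replaced by U+FFFD (which many terminals display as a question mark), so converting back to bytes does not restore the original data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class showing bytes -> String -> bytes losing data.
class LossyRoundTrip {
    // Decode the bytes as UTF-8 and immediately encode them again.
    static byte[] roundTrip(byte[] data) {
        String decoded = new String(data, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1Umlaut = {(byte) 0xFC};  // u-umlaut in ISO-8859-1
        byte[] result = roundTrip(latin1Umlaut);

        // The single byte came back as EF BF BD, the UTF-8 form of U+FFFD.
        System.out.println(Arrays.toString(result));               // [-17, -65, -67]
        System.out.println(Arrays.equals(latin1Umlaut, result));   // false
    }
}
```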
That might explain the EOF after "" (if the JVM did a conversion from
bytes to a String but ceased conversion with an internal error after the
first byte with the high bit set). In Windows, are you sure the
file.encoding property has taken effect, and that it is the default for
conversion of bytes to Strings?
cheers --
Ed