Thread: [java-gnome-hackers] Java JNI functions' UTF-8 not valid
From: Andrew C. <an...@op...> - 2009-07-28 07:42:32
Tests this week have uncovered serious problems when you try to use a supplementary character (a Unicode character whose code point is > 0xFFFF and which therefore needs more than one 2-byte char to represent it). The case I've been working with is U+1D45B, the 𝑛 character from the Mathematical Alphanumeric Symbols block:

http://www.fileformat.info/info/unicode/char/1d45b/index.htm

It doesn't matter to us what Java does internally, because when we leave Java and pass string data to GNOME we have [of course] used the JNI function GetStringUTFChars(), which takes a jstring [ie, java.lang.String] and returns a char* [ie, const gchar*] pointing to a UTF-8 representation of the string.

I'd tested this heavily in the past; ValidatePangoTextRendering and others. All good, especially with characters > 0x7F and > 0x7FF, ie, two or three bytes in UTF-8.

But when I tried entering 𝑛 and 𝕊 [which have to be 2 Java chars wide (UTF-16) and, more to the point, are *four* UTF-8 bytes wide] into a TextView, the resulting call through gtk_text_buffer_insert() ended in a g_utf8_validate() assertion failure. Oh shit. Not valid UTF-8! What the heck?

The obvious conclusion is that GetStringUTFChars() doesn't return valid UTF-8 after all. Serkan tells me that this is a fairly well known problem; he pointed me to

http://www.ingrid.org/java/i18n/utf-16/

(frankly that page makes it even _more_ confusing), but digging around the JNI spec, it vaguely talks about the UTF-8 support being limited to 3 bytes. So, oh shit, confirmed.

Yes, I could live without the 𝑛 character, but the work I'm doing is aimed at (among other things) mathematical and physics manuscripts, and so I know I'm going to have people hitting this problem. More to the point, it just doesn't do for a valid character that the rest of GNOME handles fine to not work when pasted or typed into a java-gnome Widget. And I get the impression that there's a lot of CJK and Indic activity up in the supplementary range, so this needs addressing.

What to do about this?
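For a concrete picture of the widths involved, here is a minimal Java sketch. The class and method names are made up for illustration; only standard JDK calls are used. It builds U+1D45B from its code point and reports how many Java chars and how many standard UTF-8 bytes it occupies.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of java-gnome.
class SupplementaryWidth {
    // U+1D45B MATHEMATICAL ITALIC SMALL N, built from its code point
    static final String N = new String(Character.toChars(0x1D45B));

    /** Number of UTF-16 code units (Java chars) in the string. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Length in bytes of the standard (not modified) UTF-8 encoding. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("code points:  " + codePoints(N));
        System.out.println("Java chars:   " + utf16Units(N));
        System.out.println("UTF-8 bytes:  " + utf8Bytes(N));
        System.out.printf("surrogates:   %04X %04X%n", (int) N.charAt(0), (int) N.charAt(1));
    }
}
```

One code point, two Java chars (the surrogate pair D835 DC5B), four bytes in real UTF-8: exactly the case the two- and three-byte tests never exercised.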
I poked around a bit, and it didn't take long to find that GLib has a set of conversion functions. g_utf16_to_utf8() seems promising, right?

Anyone who read 1b-Homework knows that JNI has two sets of string functions: GetStringChars() and GetStringUTFChars(); NewString() and NewStringUTF(). I'd long wondered about the existence of the first set; their documentation talks about "getting real unicode characters" instead of UTF-8 encoded ones.

[Ah, such hubris. Remember when everyone laughed at Java because we all thought a 2-byte char was hardly necessary? Amazing that 10 years later 2 bytes is not enough; makes the UTF-8 choice back around the time of GLib and XML look like a pretty good call]

So that got me wondering just what Java's (well, JNI's) idea of a "unicode character" (sic) actually is. It took a bit of digging, but it seems they actually mean UTF-16 encoding. See

http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#core-textrep

which seems to say fairly authoritatively (nothing like a FAQ, heh) that Java uses UTF-16 in the char[] that backs String objects.

Ok. If GetStringChars()'s "unicode characters" (sic) are valid UTF-16, then we can use g_utf16_to_utf8() to get *real* UTF-8 to feed to GTK's and other GNOME libraries' functions. And g_utf8_to_utf16() + NewString() to go the other way.

++

I prototyped it and it stopped my apps from thundering in. Good sign. So I implemented it for real. Doing the code generator was a snap (yay); it only took a few hours to convert all the hand-written code as well. See branch 'hackers/andrew/unicode' at

bzr://research.operationaldynamics.com/bzr/java-gnome/hackers/andrew/unicode/

New functions:

    bindings_java_getString();
    bindings_java_newString();

There's also a

    bindings_java_releaseString()

which mirrors JNI's ReleaseStringUTFChars() and should be called on strings returned by bindings_java_getString() above.

I was actually able to use JNI's GetStringCritical() and ReleaseStringCritical() functions. These are "dangerous" but give us a jchar* pointer to the actual char[] in the VM's managed heap space, which saves us a copy. I think they're being used correctly. Did I use const correctly?

++

As far as I can tell this all works and is safe. Performance of bindings_java_getString() seems comparable to that of JNI's GetStringUTFChars(), so that's good.

Please PLEASE test this branch with your apps. It is a hugely invasive change to java-gnome's internal workings, so it's not just a case of "oh, I don't use high-range characters". This impacts *all* string handling, so I'd appreciate some testing before we decide to accept this approach.

If you've got a better idea for an implementation, we now have all Java -> C string conversion abstracted into one place, so you can experiment there if you want to.

AfC
Sydney

--
Andrew Frederick Cowie

Operational Dynamics is an operations and engineering consultancy focusing on IT strategy, organizational architecture, systems review, and effective procedures for change management: enabling successful deployment of mission critical information technology in enterprises, worldwide.

http://www.operationaldynamics.com/

Sydney   New York   Toronto   London
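The conversion described in this message, UTF-16 code units in and real UTF-8 out, can be sketched in plain Java. This is not the code from the branch and not GLib's implementation of g_utf16_to_utf8(); it is a hypothetical stand-in (the class name Utf16ToUtf8Sketch and its convert() method are invented here) that shows the surrogate-pair handling any such conversion has to do. Rejecting unpaired surrogates with an exception is a choice made for this sketch.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: a hand-rolled UTF-16 to UTF-8 conversion.
class Utf16ToUtf8Sketch {
    static byte[] convert(char[] units) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < units.length) {
            char c = units[i++];
            int cp;
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed by a low surrogate;
                // together they encode one supplementary code point.
                if (i >= units.length || !Character.isLowSurrogate(units[i])) {
                    throw new IllegalArgumentException("unpaired high surrogate at index " + (i - 1));
                }
                cp = Character.toCodePoint(c, units[i++]);
            } else if (Character.isLowSurrogate(c)) {
                throw new IllegalArgumentException("unpaired low surrogate at index " + (i - 1));
            } else {
                cp = c;
            }
            if (cp < 0x80) {                 // 1 byte
                out.write(cp);
            } else if (cp < 0x800) {         // 2 bytes
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {       // 3 bytes
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {                         // 4 bytes, the supplementary range
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 'a', e-acute, euro sign, U+1D45B: one character of each UTF-8 width
        String s = "a" + (char) 0xE9 + (char) 0x20AC + new String(Character.toChars(0x1D45B));
        for (byte b : convert(s.toCharArray())) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

The key step is combining the surrogate pair into one code point before encoding; encoding each surrogate separately as a three-byte sequence is what produces output that a strict UTF-8 validator rejects.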
From: Serkan K. <se...@ge...> - 2009-07-31 08:34:25
Although I don't work with those math symbols or weird characters, thanks for discovering and fixing the issue. I tested libnotify and my last.fm tutorial against it and they work just fine.

Sincerely,
Serkan KABA
From: Andrew C. <an...@op...> - 2009-07-31 14:12:49
On Fri, 2009-07-31 at 11:34 +0300, Serkan Kaba wrote:
> Although I don't work with those math symbols or weird characters,
> thanks for discovering and fixing the issue. I tested libnotify and my
> last.fm tutorial against it and they work just fine.

Thanks Serkan. That makes three people who have checked in and said things were ok. I think we'll go with this; merged to 'mainline'.

If anyone encounters a problem, *please* send a TestCase subclass [sure, go ahead and just add a fixture to ValidateUnicode] that fails so we can debug with your use case.

Thanks!

AfC
Sydney
From: Serkan K. <se...@ge...> - 2009-08-01 07:59:32
Just to point out, I found the original documents I intended to mention.

http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/types.html
section on "Modified UTF-8" (actually I spotted this somewhere in the API docs, but anyway it explains the exact same stuff)

http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

and also the javadoc of java.lang.Character

Sincerely,
Serkan KABA
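As a rough illustration of the Modified UTF-8 behaviour those documents describe, the sketch below uses DataOutputStream.writeUTF(), which the JDK documents as writing modified UTF-8, as a stand-in for GetStringUTFChars(); it does not call JNI. The JDK's strict UTF-8 decoder plays the role that g_utf8_validate() plays on the C side. That pairing is an assumption made to get a runnable example, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: compares modified UTF-8 with standard UTF-8.
class ModifiedUtf8Demo {
    /** Modified UTF-8 bytes for s, as produced by DataOutputStream.writeUTF(). */
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            // writeUTF() prefixes the data with a 2-byte length; drop it.
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True if a strict decoder accepts the bytes as standard UTF-8. */
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String n = new String(Character.toChars(0x1D45B));
        byte[] modified = modifiedUtf8(n);
        byte[] standard = n.getBytes(StandardCharsets.UTF_8);
        System.out.println("modified: " + hex(modified) + "  strict-valid: " + isStrictUtf8(modified));
        System.out.println("standard: " + hex(standard) + "  strict-valid: " + isStrictUtf8(standard));
    }
}
```

For U+1D45B the modified form is six bytes (ED A0 B5 ED B1 9B, each surrogate encoded separately as a three-byte sequence), while standard UTF-8 is the four bytes F0 9D 91 9B. A strict validator rejects the six-byte form, which matches the assertion failure reported at the start of the thread.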