From: George R. <gr...@us...> - 2003-07-29 20:06:42
The converters do not normalize the text data. For instance, if you convert from UTF-8 to UTF-16, you will get the same logical string (same code points, different encoding). It is usually a good idea to send NFC (precomposed) text data to the converters. You don't need to use the collation service for normalization; you can use the unorm_* API as described in the User's Guide and the API reference.

George Rhoten
IBM Globalization Center of Competency/ICU
San Jose, CA, USA

"Chew, Christopher" <Chr...@so...>
Sent by: icu...@ww...
07/29/2003 05:21 AM

To: "'icu...@os...'" <icu...@ww...>
cc:
Subject: Unicode Character Form Handling in Character Conversions

Hi,

Does the ICU character conversion service ensure that characters from various codepage encodings are converted to a consistent normalized form, so that binary (codepoint) comparisons on the data can be made consistently? Or is this not guaranteed, since it varies from one codepage to another as defined in the character mapping tables?

If the source data is already in Unicode (the same encoding as UChar) and I explicitly perform a toUnicode() conversion on it, is the target data just a copy of the source, or is some form of transcoding performed to ensure that the output Unicode data is always in a consistent form, regardless of whether the input contains normalized, un-normalized, or perhaps semi-normalized Unicode characters?

A similar question: if I were to perform a fromUnicode() conversion to some arbitrary codepage, will the result differ depending on which normalized form of Unicode data the source buffer contains?

If the ICU conversion module doesn't cater for all of that, do I then have to perform explicit normalization of the converted Unicode data, via ICU's collation service, if I want to deal with Unicode characters in a consistent manner? If I knew that all Unicode character data handled within my application were always in one consistent form, it would improve performance: I could do a codepoint-to-codepoint comparison on my Unicode strings instead of relying on ICU root-locale codepoint collation.

Best Wishes
Christopher

-----Original Message-----
From: Chew, Christopher
Sent: Tuesday, July 29, 2003 11:30 AM
To: 'icu...@os...'
Subject: Root Locale in Collation

Hi there,

My objective is to perform codepoint-to-codepoint comparison or sorting operations on Unicode strings that may not be normalized, so the first thing that comes to mind is to use the ICU collation service. But how should the ICU collator be instantiated for that purpose? If I specify an empty string for the locale via the ucol_open() function when creating the collator, the root locale is used. Is this what is known as the default "Unicode codepoint collation", or must the collation strength attribute be set to IDENTICAL to achieve that? Is the UCA employed here able to perform good enough collation for most Western languages? I suspect it handles invariant and Latin Extended characters quite well.

Best Regards
Christopher
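
To make George's advice concrete, here is a minimal ICU4C sketch of the two approaches discussed in the thread: normalizing to NFC with the unorm_* API and then comparing by codepoint, versus opening a root-locale collator and raising its strength to IDENTICAL. The sample strings, fixed buffer size, and error handling are simplified for illustration.

    #include <stdio.h>
    #include <unicode/utypes.h>
    #include <unicode/unorm.h>
    #include <unicode/ustring.h>
    #include <unicode/ucol.h>

    int main(void) {
        UErrorCode status = U_ZERO_ERROR;

        /* "e with acute" twice: decomposed (U+0065 U+0301)
           and precomposed (U+00E9). */
        UChar decomposed[]  = { 0x0065, 0x0301, 0 };
        UChar precomposed[] = { 0x00E9, 0 };
        UChar nfc[16];

        /* Approach 1: normalize to NFC, then compare by codepoint. */
        unorm_normalize(decomposed, -1, UNORM_NFC, 0, nfc, 16, &status);
        if (U_FAILURE(status)) {
            printf("normalization failed: %s\n", u_errorName(status));
            return 1;
        }
        if (u_strcmpCodePointOrder(nfc, precomposed) == 0) {
            printf("equal after NFC normalization\n");
        }

        /* Approach 2: a root-locale collator with IDENTICAL strength;
           with the normalization attribute on, the collator handles
           unnormalized input internally on every comparison. */
        UCollator *coll = ucol_open("", &status);
        ucol_setAttribute(coll, UCOL_STRENGTH, UCOL_IDENTICAL, &status);
        ucol_setAttribute(coll, UCOL_NORMALIZATION_MODE, UCOL_ON, &status);
        if (U_SUCCESS(status) &&
            ucol_strcoll(coll, decomposed, -1,
                         precomposed, -1) == UCOL_EQUAL) {
            printf("equal under the root-locale IDENTICAL collator\n");
        }
        ucol_close(coll);
        return 0;
    }

The two approaches answer different needs: normalize once and then do cheap binary comparisons when all data in the application is kept in one consistent form, or let the collator absorb the normalization cost on every comparison when the input cannot be guaranteed to be normalized.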