From: George R. <gr...@us...> - 2003-07-29 20:06:42
The converters do not normalize the text data. For instance, if you convert from UTF-8 to UTF-16, you will get the same logical string (same code points, different encoding). It is usually a good idea to send NFC (precomposed) text data to the converters. You don't need to use the collation service for normalization; you can use the unorm_* API as described in the User's Guide and the API reference.

George Rhoten
IBM Globalization Center of Competency/ICU
San Jose, CA, USA

"Chew, Christopher" <Chr...@so...>
Sent by: icu...@ww...
07/29/2003 05:21 AM

To: "'icu...@os...'" <icu...@ww...>
cc:
Subject: Unicode Character Form Handling in Character Conversions

Hi,

Does the ICU character conversion service ensure that characters from various codepage encodings are converted to a consistent normalized form, so that binary (codepoint) comparisons on the data can be made consistently? Or is this not guaranteed, since it varies from one codepage to another as defined in the character mapping tables?

If the source data is already in Unicode (the same encoding as UChar) and I explicitly perform a toUnicode() conversion on it, is the target data just a copy of the source, or is some form of transcoding performed to ensure that the output Unicode data is always in a consistent form, regardless of whether the input contains normalized, un-normalized, or perhaps semi-normalized Unicode characters?

A similar question: if I were to perform a fromUnicode() conversion to some arbitrary codepage, will the result differ depending on which normalized form of Unicode data the source buffer contains?

If the ICU conversion module doesn't cater for all of that, do I then have to perform explicit normalization of the converted Unicode data, via ICU's collation service, if I want to deal with Unicode characters in a consistent manner? If I knew that all Unicode character data handled within my application were always in one consistent form, it would improve performance: I could do a codepoint-to-codepoint comparison on my Unicode strings instead of relying on ICU root-locale codepoint collation.

Best Wishes
Christopher

-----Original Message-----
From: Chew, Christopher
Sent: Tuesday, July 29, 2003 11:30 AM
To: 'icu...@os...'
Subject: Root Locale in Collation

Hi there,

My objective is to perform codepoint-to-codepoint comparison or sorting operations on Unicode strings that may not be normalized, so the first thing that comes to mind is to use the ICU collation service. But how should the ICU collator be instantiated for that purpose? If I specify an empty string for the locale via the ucol_open() function when creating the collator, the root locale is used. Is this what is known as the default "Unicode codepoint collation", or must the collation strength attribute be set to IDENTICAL to achieve that? Is the UCA employed here able to perform good enough collation for most Western languages? I suspect it handles invariant and Latin Extended characters quite well.

Best Regards
Christopher
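
To make George's advice concrete, here is a minimal ICU4C sketch of the two approaches discussed in the thread: normalizing to NFC with the unorm_* API and then comparing by codepoint, versus opening a root-locale collator and raising its strength to IDENTICAL. The sample strings, fixed buffer size, and error handling are simplified for illustration.

    #include <stdio.h>
    #include <unicode/utypes.h>
    #include <unicode/unorm.h>
    #include <unicode/ustring.h>
    #include <unicode/ucol.h>

    int main(void) {
        UErrorCode status = U_ZERO_ERROR;

        /* "e with acute" twice: decomposed (U+0065 U+0301)
           and precomposed (U+00E9). */
        UChar decomposed[]  = { 0x0065, 0x0301, 0 };
        UChar precomposed[] = { 0x00E9, 0 };
        UChar nfc[16];

        /* Approach 1: normalize to NFC, then compare by codepoint. */
        unorm_normalize(decomposed, -1, UNORM_NFC, 0, nfc, 16, &status);
        if (U_FAILURE(status)) {
            printf("normalization failed: %s\n", u_errorName(status));
            return 1;
        }
        if (u_strcmpCodePointOrder(nfc, precomposed) == 0) {
            printf("equal after NFC normalization\n");
        }

        /* Approach 2: a root-locale collator with IDENTICAL strength;
           with the normalization attribute on, the collator handles
           unnormalized input internally on every comparison. */
        UCollator *coll = ucol_open("", &status);
        ucol_setAttribute(coll, UCOL_STRENGTH, UCOL_IDENTICAL, &status);
        ucol_setAttribute(coll, UCOL_NORMALIZATION_MODE, UCOL_ON, &status);
        if (U_SUCCESS(status) &&
            ucol_strcoll(coll, decomposed, -1,
                         precomposed, -1) == UCOL_EQUAL) {
            printf("equal under the root-locale IDENTICAL collator\n");
        }
        ucol_close(coll);
        return 0;
    }

The two approaches answer different needs: normalize once and then do cheap binary comparisons when all data in the application is kept in one consistent form, or let the collator absorb the normalization cost on every comparison when the input cannot be guaranteed to be normalized.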