RE: Root Locale in Collation
No, you don't always want the same comparison. Sometimes you want binary comparison, sometimes you want language-sensitive comparison. In a file system, for example, you need an invariant collation when you compose the B-tree structure; that is typically a binary comparison, sometimes a binary caseless comparison. But in a user interface, when you sort files in a folder, you need a language-sensitive comparison so that each end user sees the order s/he expects.
The reason these use different APIs in ICU is that (a) each has very different possible options, and (b) binary comparison is much simpler and can be done with a simpler API and code.
Some of the options for binary comparison are whether it is normalized or not, and whether it is case-sensitive or not. Collation has a large raft of other options.
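To make the two binary-comparison options concrete, here is a minimal sketch in Python using only the standard library. It is not ICU code: the function name `binary_compare` and its `normalize` and `caseless` flags are hypothetical, and the NFD / case-fold / NFD sequence is just one reasonable way to combine the two options.

```python
import unicodedata


def binary_compare(a: str, b: str, *, normalize: bool = False,
                   caseless: bool = False) -> int:
    """Compare two strings by code point value; return -1, 0, or 1.

    normalize: compare canonically equivalent strings as equal (via NFD).
    caseless:  ignore case differences (via Unicode case folding).
    Both flags are illustrative names, not part of any ICU API.
    """
    def prepare(s: str) -> str:
        if normalize:
            s = unicodedata.normalize("NFD", s)
        if caseless:
            s = s.casefold()
            if normalize:
                # Case folding can undo normalization, so normalize again.
                s = unicodedata.normalize("NFD", s)
        return s

    x, y = prepare(a), prepare(b)
    # Python compares str objects code point by code point.
    return (x > y) - (x < y)


decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

raw_result = binary_compare(decomposed, precomposed)
normalized_result = binary_compare(decomposed, precomposed, normalize=True)
case_result = binary_compare("ABC", "abc")
caseless_result = binary_compare("ABC", "abc", caseless=True)
```

Without the flags, the decomposed and precomposed forms of the same letter compare as different, and "ABC" sorts before "abc". With the flags set, each pair compares as equal.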
Your breakdown is along the wrong axis, since string comparison and string sorting (and, for that matter, string searching) should align. It should really be:
a) binary comparison and/or sorting and/or searching
b) language-sensitive comparison and/or sorting and/or searching
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄
----- Original Message -----
From: Chew, Christopher
To: 'Vladimir Weinstein' ; 'mark.davis@...
Cc: 'icu4c-support@...
Sent: Tuesday, July 29, 2003 11:47
Subject: RE: Root Locale in Collation
Hi Vladimir and Mark,
thanks for the insights.
I'm basically just confused over the concept of collation and string comparisons. I thought they should be addressed in the same manner. If I want to know whether one string is larger than, smaller than, or equal to another, the context of this comparison operation can take many forms. Does this mean bitwise, codepoint, or locale-specific (language sensitive)?
Is there some generally accepted rule that one has to adopt when it comes to comparing Unicode strings, or does it depend on the application requirement or context of the operation?
I think it would make sense that all Unicode string comparison be done in a consistent manner, so that it can perform standalone stringA vs stringB comparison as well as serve as the foundation for ordering strings correctly.
That's why I pondered whether to use the ICU collation service for all string comparison operations as well as sorting, since it is able to handle various normalized forms of the Unicode text. I thought that perhaps if I specified the root locale, I would achieve string comparison by Unicode codepoint value (about which I was wrong, as Mark has pointed out: it is the *last* tie-breaker). Thus the term "Unicode codepoint collation" ;-P
I guess I have to distinguish sorting and string comparisons completely:
- string sorting to be language-sensitive or default root locale (pure UCA);
- standalone string comparisons to be switchable between codepoint and language-sensitive (via the ICU collation service string comparison functions), depending on how my application uses the comparison result.
What do you think?
Best Regards
Christopher
-----Original Message 1-----
From: Mark Davis [mailto:mark.davis@...
Sent: Tuesday, July 29, 2003 8:24 PM
To: Chew, Christopher; icu4c-support@...
Subject: Re: Root Locale in Collation
I'm not quite sure what you want.
ICU collation is designed for language-sensitive collation. Because human languages have features like strength levels, contractions and expansions (see the ICU User Guide and http://www.unicode.org/reports/tr10/tr10-10.html for background), comparing strings by doing a codepoint by codepoint comparison will give you the wrong answer almost always.
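As a small illustration of that point, the following Python sketch sorts a hypothetical word list purely by code point value, which is what Python's built-in string comparison does. It does not call ICU; it only shows what a binary ordering produces.

```python
# A hypothetical word list; "\u00e9" is LATIN SMALL LETTER E WITH ACUTE.
words = ["apple", "Zebra", "\u00e9clair", "banana"]

# Python's default str ordering compares code point by code point.
code_point_order = sorted(words)

# 'Z' (U+005A) precedes every lowercase ASCII letter, and
# U+00E9 follows all of ASCII, so the result is:
# ['Zebra', 'apple', 'banana', 'éclair']
```

Most readers would expect "apple, banana, éclair, Zebra" here; getting that order requires a language-sensitive collator rather than a code point comparison.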
It sounds more like what you want is a binary comparison, which doesn't even attempt to provide language-sensitive comparisons. ICU does offer that, but it is not in the Collator. The IDENTICAL level is *not* binary comparison; instead, it uses a (normalized) binary comparison as the *last*, tie-breaking level in language-sensitive comparison.
The ICU root collator is the UCA ordering, in UTS #10 (link above). It is not a binary comparison. I don't know of anything called "Unicode codepoint collation"; I haven't heard that term used before.
The UCA typically needs to be tailored for given languages; we currently have 55 localized tailoring rule sets. If you want to see the specific languages that are tailored in ICU, take a look at the locale resource bundles or Locale Explorer.
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄
-----Original Message 2-----
From: Vladimir Weinstein [mailto:weiv@...
Sent: Tuesday, July 29, 2003 8:00 PM
To: Chew, Christopher
Cc: 'icu4c-support@...
Subject: Re: Root Locale in Collation
Hi,
I'm not entirely sure what you mean by codepoint-to-codepoint comparison. If you are trying to sort strings in Unicode codepoint order, then you should normalize them and use u_strcmp.
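The "normalize, then compare in binary order" approach can be sketched in Python with the standard library alone; this is not ICU code, and the key function names are hypothetical. One detail to be aware of: as far as I know, u_strcmp compares UTF-16 code units, and ICU provides u_strcmpCodePointOrder for true code point order. The two orders differ only when supplementary characters (above U+FFFF) are involved, which the second key below imitates by comparing big-endian UTF-16 bytes.

```python
import unicodedata


def code_point_key(s: str) -> str:
    """Sort key: NFC-normalize, then rely on Python's code point ordering."""
    return unicodedata.normalize("NFC", s)


def utf16_code_unit_key(s: str) -> bytes:
    """Sort key: NFC-normalize, then order by UTF-16 code units.

    Comparing big-endian UTF-16 bytes lexicographically gives the same
    result as comparing 16-bit code units one by one.
    """
    return unicodedata.normalize("NFC", s).encode("utf-16-be")


# Canonically equivalent spellings get the same key after normalization.
same_key = code_point_key("e\u0301") == code_point_key("\u00e9")

# U+FF5E is in the BMP; U+10000 is encoded in UTF-16 as D800 DC00.
items = ["\U00010000", "\uff5e"]
by_code_point = sorted(items, key=code_point_key)
by_code_unit = sorted(items, key=utf16_code_unit_key)
```

In code point order U+FF5E comes first; in UTF-16 code unit order U+10000 comes first, because its lead surrogate D800 is smaller than FF5E.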
UCA does not sort strings in codepoint order, but rather in the order defined by the Unicode Collation Algorithm. You can see that order if you take a look at icu/source/data/unidata/FractionalUCA.txt
This is the default order, which works for some Latin-based languages, but not for all. You can take a look at the txt files in icu/source/data/locales. If a file contains a "CollationElements" resource with a tailoring, that means that the order for that locale is not the pure UCA order.
If you are using UCA, you should not have to use more than tertiary strength.
Please let me know if you need more information.
Hope this helps.
Regards,
v.
Chew, Christopher wrote:
> Hi there,
>
> My objective is to perform codepoint-to-codepoint comparison or sorting
> operations on Unicode strings that may not be normalized, so the first
> thing that comes to my mind is to use the ICU collation service. But how
> should the ICU collator be instantiated for such purpose?
>
> If I am to specify an empty string for the locale string via the
> ucol_open() function to create the ICU collator, the root locale is
> used. Is this also known as the default "Unicode codepoint collation" or
> must the strength collation attribute be set to IDENTICAL to achieve that?
>
> Is this UCA being employed able to perform a good enough collation for
> most Western languages, since I suspect that it is able to handle
> invariant and Latin extended characters quite well?
>
> Best Regards
> Christopher
>
--
Vladimir Weinstein, IBM GCoC-Unicode/ICU San Jose, CA weiv@...