From: Wong, A. <Alb...@we...> - 2002-08-29 05:55:33
Hi Mark,

I understand that the standard way to compare strings using a collator
is:

    Collator myCollator = Collator.getInstance(myLocale);
    if (myCollator.compare(string1, string2) == 0) {
        doSomething(...);
    }

The above code indicates that, in theory, the same code can be used for
various countries and still make the correct string comparisons; all I
need to provide is the locale.

In my (extremely limited) experience with the collator so far, there
seem to be specific customizations required for certain cases. E.g.,
the above code would fail to recognize that German's "ss" and "ß"
characters should be treated as the same.

The way to solve these cases was to create a custom RuleBasedCollator
with the locale-specific rules, e.g.:

    if (dealing with Germany) {
        // create a RuleBasedCollator with additional German-specific
        // collation rules (like ss == ß, or ue == ü (u with two dots
        // on top), etc.) not in the UCA
        RuleBasedCollator rbc = new RuleBasedCollator(additionalRules);
    } else if (dealing with China) {
        // create a RuleBasedCollator with additional China-specific
        // collation rules not in the UCA
        RuleBasedCollator rbc = new RuleBasedCollator(additionalRules);
    } else if (...) {
        // etc. for all countries supported
    }

I'm a bit worried that the character rules not in the UCA (for example,
the UCA considers German's ss != ß, ue != ü, etc.) are going to have to
be added by the developer using the collator. I understand custom rules
are needed for circumstances where the desired string comparisons are
NOT the norm for a particular country, but ss/ß, ue/ü, etc. seem to be
standard in Germany. Using this as a comparison, I'm assuming there are
going to be cases for other countries where standard collation rules
are excluded from the UCA.

This would also indicate to me that I would have to do this for every
country I plan to support.
That might not be too bad, but I would have to find out which rules are
customary/standard for a particular country and NOT in the UCA. Once I
have that info, I can create a custom collator with those additional
rules tacked on.

So, rather than having code like:

    Collator myCollator = Collator.getInstance(myLocale);
    if (myCollator.compare(string1, string2) == 0) {
        doSomething(...);
    }

I'd have a bunch of if/else checks to create a collator with the
additional rules not covered by the UCA.

What I'd like to do is search through a text file for all occurrences
of a given word. That word might be spelled using different characters
(like heissen and heißen), but I'd still like to be able to find it.
Initially it seems like the only thing to do is grab a collator for a
given locale and everything works out. But based on these emails, it
seems I'd need to find out the rules not covered by the UCA for each
country I plan to support and add them myself using a custom
RuleBasedCollator.

Is this correct?

Thanks for the help!
Albert

> So far all my reading (O'Reilly internationalization book, various
> web sites) seems to indicate that various languages do consider
> various characters equal, so it is a big problem in
> internationalization. But the code examples I've seen with ICU4J seem
> to require custom Collators or Locales.

The Collator *is* the mechanism by which you compare characters
according to the conventions of a particular locale. You normally just
create a collator based on a locale, like this:

    Collator myCollator = Collator.getInstance(myLocale);

You then use that collator to compare strings, e.g.:

    if (myCollator.compare(string1, string2) == 0) {
        doSomething(...);
    }

The code you cite will not do anything useful, since PHONEBOOK is only
a defined variant in some locales (German ones) where it makes sense.
It is very unclear exactly what you think you want to do and why.
I also suggest that you first read over Section 5.17 of the Unicode
Standard (http://www.unicode.org/unicode/uni2book/u2.html), plus the
Collation section of the ICU User's Guide
(http://oss.software.ibm.com/icu/userguide/Collate_Intro.html).

Mark
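
For the word-search use case described above, one option that may avoid per-country rule lists is to lower the collator's comparison strength instead of appending custom rules. The following is only a minimal sketch under several assumptions: it uses the JDK's `java.text.Collator` rather than ICU4J so that it is self-contained, the class `CollatorSearchSketch` and the method `findWord` are hypothetical names invented for illustration, and words are assumed to be separated by whitespace. At `Collator.PRIMARY` strength, case differences are ignored; whether "ss" and "ß" also compare as equal depends on the collation data shipped with the runtime, so that should be checked against the target locales rather than assumed.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the JDK or ICU4J.
class CollatorSearchSketch {

    // Returns the indexes of the whitespace-separated words in `text`
    // that the given collator considers equal to `word`.
    static List<Integer> findWord(Collator collator, String text, String word) {
        List<Integer> hits = new ArrayList<>();
        String[] tokens = text.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (collator.compare(tokens[i], word) == 0) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Locale-based collator, as in the thread; only the strength changes.
        Collator collator = Collator.getInstance(Locale.GERMAN);
        // PRIMARY strength compares base letters only (assumption: this is
        // loose enough for the search; it ignores case as well).
        collator.setStrength(Collator.PRIMARY);

        // \u00DF is the German sharp s.
        String text = "Wie heissen Sie und wie hei\u00DFen die anderen";
        System.out.println(findWord(collator, text, "heissen"));
    }
}
```

The design choice here is to keep a single code path for every locale (`Collator.getInstance(locale)` plus a strength setting) and to fall back to a custom `RuleBasedCollator` only for locales where testing shows that the built-in rules do not give the desired equivalences.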