RE: Problems matching characters based on locale

Hi Mark,

I understand that the standard way to compare strings using a collator
is:

Collator myCollator = Collator.getInstance(myLocale);

if (myCollator.compare(string1, string2) == 0) {
  doSomething(...);
}
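A minimal sketch of that pattern, assuming the JDK's java.text.Collator rather than ICU4J's com.ibm.icu.text.Collator (the two share the getInstance/compare/setStrength shape). The class and method names are hypothetical. It shows that what counts as "equal" depends on the collator's strength as well as the locale. Whether "ss" and "ß" compare equal at a given strength depends on the collation data in use, so that case is printed rather than asserted.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class LocaleCompareSketch {

    // Returns true when the collator for the given locale, set to the given
    // strength, treats the two strings as equal (compare(...) == 0).
    static boolean sameUnder(Locale locale, int strength, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        Locale german = Locale.GERMAN;

        // Case is a tertiary difference: distinct at TERTIARY, equal at PRIMARY.
        System.out.println(sameUnder(german, Collator.TERTIARY, "Strasse", "strasse"));
        System.out.println(sameUnder(german, Collator.PRIMARY, "Strasse", "strasse"));

        // Result depends on the collation data; inspect it rather than assume.
        System.out.println(sameUnder(german, Collator.PRIMARY, "heissen", "hei\u00DFen"));
    }
}
```

The only locale-specific input is the Locale passed to Collator.getInstance; the strength setting is a separate knob that applies to any locale.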

The above code indicates that, in theory, the same code can be used for
various countries and still make the correct string comparisons: all I
need to provide is the locale.

In my (extremely limited) experience with the collator so far, there
seem to be specific customizations required for certain cases.

E.g. the above code would fail to recognize that German "ss" and "ß"
(the sharp s, which looks like a beta) should be treated as the same.

The way to solve these cases was to create a custom RuleBasedCollator
with the locale-specific rules.

E.g.

if (dealing with Germany) {
    // create a RuleBasedCollator with additional German-specific
    // collation rules (like ss == ß, or ue == ü, etc.) not in the UCA
    RuleBasedCollator rbc = new RuleBasedCollator(additionalRules);
} else if (dealing with China) {
    // create a RuleBasedCollator with additional China-specific
    // collation rules not in the UCA
    RuleBasedCollator rbc = new RuleBasedCollator(additionalRules);
} else if (...) etc. for all countries supported

I'm a bit worried that the character rules not in the UCA (for example,
the UCA considers German ss != ß, ue != ü, etc.) are going to have to be
added by the developer using the collator. I understand custom rules
are needed where the desired string comparisons are NOT the norm for a
particular country, but ss/ß, ue/ü, etc. seem to be standard in
Germany. Going by this, I'm assuming there are going to be cases for
other countries where standard collation rules are excluded from the
UCA.

This would also indicate to me that I would have to do this for every
country I plan to support. That might not be too bad, but I would have
to find out which rules are customary/standard for a particular
country and NOT in the UCA. Once I have that info, I can create a
custom collator with those additional rules tacked on.

So, rather than having code like:

Collator myCollator = Collator.getInstance(myLocale);

if (myCollator.compare(string1, string2) == 0) {
  doSomething(...);
}

I'd have a bunch of if/else checks to create a collator with the
additional rules not covered by the UCA.

What I'd like to do is search through a text file for all occurrences
of a given word. That word might be spelled using different characters
(like heissen and heißen), but I'd still like to be able to find it.
Initially it seemed like the only thing to do was grab a collator for a
given locale and everything would work out. But based on these emails,
it seems I'd need to find out the rules not covered by the UCA for each
country I plan to support and add them myself using a custom
RuleBasedCollator.

Is this correct?

Thanks for the help!
Albert

>So far all my reading (O'Reilly Internationalization Book, various web
>sites) seems to indicate that various languages do consider various
>characters equal, so it is a big problem in internationalization. But
>the code examples I've seen with ICU4J seem to require custom Collators
>or Locales.

The Collator *is* the mechanism by which you compare characters
according to the conventions of a particular locale. You normally just
create a collator based on a locale, like this:

Collator myCollator = Collator.getInstance(myLocale);

You then use that collator to compare strings, e.g.

if (myCollator.compare(string1, string2) == 0) {
  doSomething(...);
}
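
Applied to the word-search use case raised in this thread, the same compare call can be run over each word of a text. This is only a sketch under stated assumptions: it uses the JDK's java.text.Collator and java.text.BreakIterator (not ICU4J), the class and method names are hypothetical, and it matches whole words only. Which spellings the collator treats as equal depends on its strength and on the locale's collation data.

```java
import java.text.BreakIterator;
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class CollatorWordSearchSketch {

    // Returns the start offset of every word in text that the given collator
    // considers equal to target. Word boundaries come from a locale-aware
    // BreakIterator; whitespace and punctuation segments are skipped.
    static List<Integer> findWord(String text, String target,
                                  Collator collator, Locale locale) {
        List<Integer> hits = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            if (!Character.isLetterOrDigit(text.codePointAt(start))) {
                continue; // not a word segment
            }
            if (collator.compare(text.substring(start, end), target) == 0) {
                hits.add(start);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Locale german = Locale.GERMAN;
        Collator collator = Collator.getInstance(german);
        collator.setStrength(Collator.PRIMARY); // ignore case and accent differences

        System.out.println(findWord("heissen oder HEISSEN", "heissen", collator, german));
    }
}
```

Lowering the strength to PRIMARY is what lets differently cased spellings match; no per-country if/else chain is involved in the search itself.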

The code you cite will not do anything useful, since PHONEBOOK is only a
defined variant in some locales (German ones) where it makes sense.

It is very unclear exactly what you think you want to do and why. I also
suggest that you first read over Section 5.17 of the Unicode Standard
(http://www.unicode.org/unicode/uni2book/u2.html), plus the Collation
section of the ICU User's Guide
(http://oss.software.ibm.com/icu/userguide/Collate_Intro.html).

Mark
