RE: Problems matching characters based on locale

Hi Vladimir,

I was hoping to avoid something like:

if (Locale.GERMAN.equals(locale)) {
	collator = Collator.getInstance(new Locale("de", "", "PHONEBOOK"));
} else if (Locale.CANADA.equals(locale)) {
	collator = Collator.getInstance(new Locale("en", "CA", "PHONEBOOK"));
} else if (...) {
etc...

I was hoping just to have a method that takes in a locale and I just pass it to a collator (without checking what country it is).

But after looking at Syn Wee's code again, I could skip the if/else checks on the country by doing:

String lang = myLocale.getLanguage();
String country = myLocale.getCountry();

Collator collator = Collator.getInstance(new Locale(lang, country, "PHONEBOOK"));
collator.setStrength(Collator.SECONDARY);
int indexValue =  collator.compare(text, pattern);

Do you know what this "PHONEBOOK" option is, and can I use it for any locale as I have in the above code?  I can't find any information about it in the Locale Javadoc or on the ICU4J web site.
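
For reference, the if/else chain can be avoided with a small helper that copies the language and country from whatever locale is passed in and attaches the variant. The sketch below is only an illustration. The names VariantLocaleSketch and withVariant are made up, and java.text.Collator from the JDK is used as a stand-in so that the snippet runs without ICU4J on the classpath. Whether a collator actually has a tailoring for a given variant and locale depends on the library's data, which this sketch does not check.

```java
import java.text.Collator;
import java.util.Locale;

class VariantLocaleSketch {
    // Hypothetical helper: keep language and country, attach the requested variant.
    @SuppressWarnings("deprecation") // Locale constructors are deprecated on recent JDKs
    static Locale withVariant(Locale base, String variant) {
        return new Locale(base.getLanguage(), base.getCountry(), variant);
    }

    public static void main(String[] args) {
        Locale phonebook = withVariant(Locale.GERMAN, "PHONEBOOK");
        System.out.println("language=" + phonebook.getLanguage()
                + " country=" + phonebook.getCountry()
                + " variant=" + phonebook.getVariant());

        // Stand-in collator from the JDK; the result of compare() depends on
        // which tailoring data the collation library has for this locale.
        Collator collator = Collator.getInstance(phonebook);
        collator.setStrength(Collator.SECONDARY);
        System.out.println(collator.compare("\u00e4", "ae"));
    }
}
```

The helper only builds the locale. It does not tell you whether the variant was honored, so a locale without a matching tailoring may simply get that locale's ordinary collator.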

I'm hoping German is the only problematic language in this regard and other countries are relatively smoother.

Thanks for your help!

Albert

-----Original Message-----
From: Vladimir Weinstein [mailto:we...@jt...]
Sent: Wednesday, August 28, 2002 5:57 PM
To: Wong, Albert
Cc: syn wee; icu...@ww...
Subject: Re: Problems matching characters based on locale

Albert,

This depends on the locale you are using. UCA tries to give a sorting 
order that is correct for some languages, but it is obvious that you 
cannot have the same sorting order for every language. That is why UCA 
allows for tailoring.

The example given by Syn Wee looks very much like the code you have in 
your message, so I don't understand what the problem is there.

Tailorings that ICU provides are based on the national standards and 
information we got from native speakers. We do not claim that they are 
100% accurate and we try to fix bugs in tailorings. However, for most 
languages the rules are quite accurate.

However, some countries have more than one way of sorting things, as you 
have seen in the example of German. The standard German sorting order 
does not consider ae and a-diaeresis the same, while the sorting order in 
German phonebooks treats a-diaeresis as a variant of 'ae'.
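
To make the difference between the two German orders concrete, here is a rough sketch that spells out the umlauts by hand before comparing. It is only an approximation for illustration: it uses java.text.Collator from the JDK with the root locale, it is not how ICU implements the PHONEBOOK tailoring, and the names PhonebookExpansionSketch, expandUmlauts, compareRoot and comparePhonebookLike are made up. Unlike a real tailoring, it erases the difference between the umlaut and its two-letter spelling at every strength level.

```java
import java.text.Collator;
import java.util.Locale;

class PhonebookExpansionSketch {
    // Hypothetical helper: rewrite umlauts as the two-letter spellings
    // that phonebook order treats them as variants of.
    static String expandUmlauts(String s) {
        return s.replace("\u00e4", "ae").replace("\u00f6", "oe").replace("\u00fc", "ue")
                .replace("\u00c4", "Ae").replace("\u00d6", "Oe").replace("\u00dc", "Ue");
    }

    // Untailored comparison: root-locale collator at SECONDARY strength.
    static int compareRoot(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.SECONDARY);
        return collator.compare(a, b);
    }

    // Phonebook-like comparison: expand the umlauts first, then compare as above.
    static int comparePhonebookLike(String a, String b) {
        return compareRoot(expandUmlauts(a), expandUmlauts(b));
    }

    public static void main(String[] args) {
        String text = "\u00e4";
        String pattern = "ae";
        System.out.println("untailored:     " + compareRoot(text, pattern));
        System.out.println("phonebook-like: " + comparePhonebookLike(text, pattern));
    }
}
```

In the untailored comparison the two strings differ because "ae" carries an extra letter. After expansion they are the same string and compare as equal.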

UCA is a standard that is meant to be tailored according to different 
national customs and standards.

Hope this helps.

Regards,
v.

Wong, Albert wrote:
> Hi Syn Wee
> 
> Thanks for your help.  This is probably a dumb question to ask, but do you know if ICU4J supports any character matching based on locale (e.g. two different characters being considered equal) without any special customization?  So far all my reading (the O'Reilly internationalization book, various web sites) seems to indicate that various languages do consider various characters equal, so it is a big problem in internationalization.  But the code examples I've seen with ICU4J seem to require custom Collators or Locales.
> 
> Can I ever have generic code like this to match characters based on a locale:
> 
> Collator collator = Collator.getInstance(locale);
> collator.setStrength(Collator.SECONDARY);
> int indexValue =  collator.compare(text, pattern);
> 
> this would hopefully match characters using the specific locale.
> 
> Or, do I always have to create custom RuleBasedCollators, locales, etc.
> 
> So far it seems like the Collator will only consider things like "A" being equal to "a" or "A" but nothing like "ae" to "ä" without customization (which would be tough if you're trying to support many different languages, since you would have to know the collation rules for each language)?
> 
> Thanks
> Albert
> 
> 
> -----Original Message-----
> From: syn wee [mailto:syn...@jt...]
> Sent: Wednesday, August 28, 2002 5:12 PM
> To: Wong, Albert; icu...@os...
> Subject: Re: Problems matching characters based on locale
> 
> 
> Hi Albert,
> 
> "ae" does not match "ä" in modern German and in the UCA data.
> Actually, "ae" has a tertiary difference from "ä" even in traditional
> German.
> To use traditional German you would have to use a variant locale provided by
> ICU4J, new Locale("de", "", "PHONEBOOK"). The modified version of your code
> below should set indexValue as 0.
> 
> import java.util.*;
> import com.ibm.icu.text.*;
> 
>  .....
> 
>  public static void main(String [] args) {
>      String text = "ä";
>      String pattern = "ae";
> 
>      Collator collator = Collator.getInstance(new Locale("de", "", "PHONEBOOK"));
>      collator.setStrength(Collator.SECONDARY);
>      int indexValue =  collator.compare(text, pattern);
> 
>  }
> 
> 
> ----- Original Message -----
> From: "Wong, Albert" <Alb...@we...>
> To: <icu...@os...>
> Sent: Wednesday, August 28, 2002 4:34 PM
> Subject: Problems matching characters based on locale
> 
> 
> 
>>Hi
>>
>>I've been using the Collator classes and not having much success at having
>>it match characters that are considered equal by a particular locale.  I'm
>>assuming the sample input I have given are not considered equal by the UCA.
>>
>>I found an example from the ICU web site:
>>http://www-124.ibm.com/icu/userguide/Collate_Intro.html
>>
>>On this page it states
>>
>>--> A letter can be treated as if it were two letters. For example, in
>>traditional German "ä" is compared as if it were "ae".
>>
>>I tried this example (assuming it's a valid example):
>>
>>
>>import java.util.*;
>>import com.ibm.icu.text.*;
>>
>>.....
>>
>>public static void main(String [] args) {
>>    String text = "ä";
>>    String pattern = "ae";
>>
>>    Collator collator = Collator.getInstance(Locale.GERMANY);
>>    int indexValue =  collator.compare(text, pattern);
>>
>>}
>>
>>And it doesn't match.  I've tried Locale.GERMANY and Locale.GERMAN.  One
>>possibility is the UCA disallows characters to be equal based on locale?
>>
>>Does anybody see what I'm doing wrong?
>>
>>Thanks
>>Albert
>>
>>-----Original Message-----
>>From: Wong, Albert
>>Sent: Friday, August 23, 2002 4:40 PM
>>To: 'syn wee'; icu...@os...
>>Subject: RE: StringSearch not matching different representations of the
>>same character (eg. German)
>>
>>
>>Hi
>>
>>Is there a reason why the JDK's version of the RuleBasedCollator doesn't
>>follow the UCA while IBM's version does?  Both versions are written by IBM.
>>
>>Looking at the src code the most likely reason is the one being used in
>>the JDK1.3 and 1.4 was written in 1998, probably before the UCA was
>>established?
>>
>>Is there a reason why Sun isn't upgrading its version of the
>>RuleBasedCollator (to at least a version that supports the UCA?)
> 
>>Thanks
>>Albert
>>
>>-----Original Message-----
>>From: syn wee [mailto:syn...@jt...]
>>Sent: Tuesday, August 20, 2002 4:18 PM
>>To: Wong, Albert; icu...@os...
>>Subject: Re: StringSearch not matching different representations of the
>>same character (eg. German)
>>
>>
>>Albert,
>>
>>The German collator used by StringSearch considers "ss" to be smaller than
>>"\u00df" with a secondary difference. This behaviour is consistent with UCA,
>>please see the default UCA table at
>>http://www.unicode.org/unicode/reports/tr10/#Default_Unicode_Collation_Element_Table.
>>
>>One way around this issue is to tailor your own collator for use.
>>E.g.
>>
>>CharacterIterator iterator = new StringCharacterIterator("Wie hei\u00DFen Sie?");
>>String pattern = "heissen";
>>RuleBasedCollator rbc = new RuleBasedCollator("&ss = \u00df");
>>StringSearch iter = new StringSearch(pattern, iterator, rbc);
>>for (int pos = iter.first(); pos != SearchIterator.DONE; pos = iter.next())
>>{
>>    System.out.println("Found match at " + pos + ", length is "
>>        + iter.getMatchLength());
>>}
>>
>>Another way is to search using only PRIMARY strength. However, searching
>>at PRIMARY strength would ignore accents and case differences.
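
The effect of the strength setting can be seen with a few comparisons. This is a minimal sketch using java.text.Collator from the JDK with the root locale, not ICU4J's StringSearch, and the names StrengthSketch and equalAt are made up for illustration. Canonical decomposition is switched on explicitly so that a precomposed accented letter is handled as a base letter plus a combining accent.

```java
import java.text.Collator;
import java.util.Locale;

class StrengthSketch {
    // Hypothetical helper: are the two strings equal at the given strength?
    static boolean equalAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String plain = "a";
        String upper = "A";
        String accented = "\u00e4"; // a with diaeresis

        // PRIMARY looks at base letters only.
        System.out.println("PRIMARY   a/A: " + equalAt(Collator.PRIMARY, plain, upper));
        System.out.println("PRIMARY   a/\u00e4: " + equalAt(Collator.PRIMARY, plain, accented));
        // SECONDARY also looks at accents.
        System.out.println("SECONDARY a/A: " + equalAt(Collator.SECONDARY, plain, upper));
        System.out.println("SECONDARY a/\u00e4: " + equalAt(Collator.SECONDARY, plain, accented));
        // TERTIARY also looks at case.
        System.out.println("TERTIARY  a/A: " + equalAt(Collator.TERTIARY, plain, upper));
    }
}
```

This is why a PRIMARY search finds more than intended: it treats every accented and cased form of a letter as a match for the plain letter.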
>>
>>----- Original Message -----
>>From: "Wong, Albert" <Alb...@we...>
>>To: <icu...@os...>
>>Sent: Tuesday, August 20, 2002 2:27 PM
>>Subject: StringSearch not matching different representations of the same
>>character (eg. German)
>>
>>
>>
>>>Hi
>>>
>>>I'm using StringSearch to match the German character "ss" with "\u00DF".
>>>
>>>CharacterIterator iterator = new StringCharacterIterator("Wie hei\u00DFen Sie?");
>>>String pattern = "heissen";
>>>StringSearch iter = new StringSearch(pattern, iterator, java.util.Locale.GERMAN);
>>>for (int pos = iter.first(); pos != SearchIterator.DONE; pos = iter.next()) {
>>>    System.out.println("Found match at " + pos + ", length is "
>>>        + iter.getMatchLength());
>>>}
>>>
>>>
>>>I find when using the StringSearch class it doesn't find a match.  "ss"
>>>and \u00DF are two representations of the same character in German.
>>>
>>>Testing further, I downloaded the source code from Ch.7 of the O'Reilly
>>>Java internationalization book.  It provides an implementation of
>>>String.indexOf():
>>
>>>public static int indexOf(String source, String pattern) {
>>>    // Obtain a collator
>>>    RuleBasedCollator rbc=(RuleBasedCollator)Collator.getInstance();
>>>    rbc.setStrength(Collator.SECONDARY);
>>>    System.err.println(rbc.getRules());
>>>
>>>    CollationElementIterator textCEI;
>>>    CollationElementIterator patCEI;
>>>    textCEI = rbc.getCollationElementIterator(source);
>>>    patCEI = rbc.getCollationElementIterator(pattern);
>>>
>>>    // e1 will contain the collation element for the source
>>>    // e2 will contain the collation element for the pattern
>>>    int e1, e2;
>>>    int startMatch = -1;
>>>
>>>    // initialize e2 with the first collation element in the pattern
>>>    e2 = patCEI.next();
>>>
>>>    while ((e1 = textCEI.next())!=CollationElementIterator.NULLORDER) {
>>>      if (e1 == e2) { // if the elements match
>>>        if (startMatch == -1) startMatch = textCEI.getOffset();
>>>        e2 = patCEI.next(); // increment to the next element
>>>        if (e2 == CollationElementIterator.NULLORDER)
>>>          break;
>>>      } else { // elements do not match
>>>        if (startMatch != -1) {
>>>          patCEI.reset();
>>>          e2 = patCEI.next();
>>>          startMatch = -1;
>>>        }
>>>      }
>>>    }
>>>    return startMatch;
>>>  }
>>>
>>>I found that if I import java.text.* the indexOf correctly identifies "ss"
>>>as being equal to \u00DF.  If I switched to  com.ibm.icu.text.* then there
>>>is no match.
>>>
>>>Is there something else that I should be doing to initialize my
>>>StringSearch instance to search text using locale sensitive rules?  I'm
>>>using JDK1.3.
>>
>>>Thanks
>>>Albert
>>>
>>>_______________________________________________
>>>icu4j-support mailing list
>>>icu...@os...
>>>http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu4j-support
>>

-- 
Vladimir Weinstein, IBM GCoC-Unicode/ICU  San Jose, CA we...@jt...
