RE: Problems matching characters based on locale


Also,

1.  I've tried setting the strength to SECONDARY. It still fails.

If the example on the ICU web site also fails to match according to the UCA, can anybody give me an example of two different characters that match based on locale?  So far, nothing I've tried seems to match.

Thanks
Albert

-----Original Message-----
From: Wong, Albert 
Sent: Wednesday, August 28, 2002 4:34 PM
To: 'icu...@os...'
Subject: Problems matching characters based on locale

Hi

I've been using the Collator classes without much success in getting them to match characters that a particular locale considers equal.  I'm assuming the sample input I have given is not considered equal by the UCA.

I found an example from the ICU web site:  http://www-124.ibm.com/icu/userguide/Collate_Intro.html

On this page it states

--> A letter can be treated as if it were two letters. For example, in traditional German "ä" is compared as if it were "ae". 

I tried this example (assuming it's a valid example):

import java.util.*;
import com.ibm.icu.text.*;

.....

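public static void main(String [] args) {
    String text = "ä";
    String pattern = "ae";

    Collator collator = Collator.getInstance(Locale.GERMANY);
    // compare() returns a negative value, zero, or a positive value;
    // zero means the strings are equal at the collator's strength.
    int result = collator.compare(text, pattern);
    System.out.println(result == 0 ? "match" : "no match");
}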

And it doesn't match.  I've tried Locale.GERMANY and Locale.GERMAN.  One possibility is that the UCA does not allow characters to be equal based on locale?

Does anybody see what I'm doing wrong?

Thanks
Albert
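
A minimal sketch for probing a pair of strings at each strength, since compare() returns an ordering result (negative, zero, positive) rather than an index. It uses java.text.Collator from the JDK as a stand-in so that it runs without ICU4J; com.ibm.icu.text.Collator offers the same getInstance, setStrength and compare calls, but its results may differ, as this thread shows. The class name CompareAtStrengths and the helper equalAt are hypothetical, introduced only for illustration.

```java
import java.text.Collator;
import java.util.Locale;

class CompareAtStrengths {
    /** True when the locale's collator treats a and b as equal at the given strength. */
    static boolean equalAt(Locale locale, int strength, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        // compare() yields <0, 0 or >0; only 0 means "equal at this strength".
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        int[] strengths = {Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY};
        String[] names = {"PRIMARY", "SECONDARY", "TERTIARY"};
        for (int i = 0; i < strengths.length; i++) {
            // \u00e4 is "a with diaeresis"; the escape avoids source-encoding issues.
            System.out.println(names[i] + ": \u00e4 vs ae equal? "
                + equalAt(Locale.GERMANY, strengths[i], "\u00e4", "ae"));
        }
    }
}
```

Running the same loop against both collator implementations is a quick way to see at which strength, if any, a given pair is treated as equal.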

-----Original Message-----
From: Wong, Albert 
Sent: Friday, August 23, 2002 4:40 PM
To: 'syn wee'; icu...@os...
Subject: RE: StringSearch not matching different representations of the
same character (eg. German)

Hi 

Is there a reason why the JDK's version of the RuleBasedCollator doesn't follow the UCA while IBM's version does?  Both versions are written by IBM.

Looking at the source code, the most likely reason is that the version used in JDK 1.3 and 1.4 was written in 1998, probably before the UCA was established?

Is there a reason why Sun isn't upgrading its version of the RuleBasedCollator (to at least a version that supports the UCA)?

Thanks
Albert

-----Original Message-----
From: syn wee [mailto:syn...@jt...]
Sent: Tuesday, August 20, 2002 4:18 PM
To: Wong, Albert; icu...@os...
Subject: Re: StringSearch not matching different representations of the
same character (eg. German)

Albert,

The German collator used by StringSearch considers "ss" to be smaller than
"\u00df" with a secondary difference. This behaviour is consistent with UCA,
please see the default UCA table at
http://www.unicode.org/unicode/reports/tr10/#Default_Unicode_Collation_Element_Table.

One way around this issue is to tailor your own collator for use.
E.g.

CharacterIterator iterator = new StringCharacterIterator("Wie hei\u00DFen Sie?");
String pattern = "heissen";
RuleBasedCollator rbc = new RuleBasedCollator("&ss = \u00df");
StringSearch iter = new StringSearch(pattern, iterator, rbc);
for (int pos = iter.first(); pos != SearchIterator.DONE; pos = iter.next()) {
    System.out.println("Found match at " + pos + ", length is "
        + iter.getMatchLength());
}

Another way is to search using only PRIMARY strength (Collator.PRIMARY).
However, searching at PRIMARY strength would ignore accents and case
differences.
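
To make the trade-off concrete, here is a small sketch of what PRIMARY strength ignores. It uses the JDK's java.text.Collator only so that it runs without extra libraries; the same setStrength and compare calls exist on com.ibm.icu.text.Collator, though the exact results for particular characters may differ between the two. The class name PrimaryStrengthTradeoff and the helper compareAt are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

class PrimaryStrengthTradeoff {
    /** Compares a and b with a German collator set to the given strength. */
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.GERMAN);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // Case is a tertiary difference: ignored at PRIMARY, seen at TERTIARY.
        System.out.println("case, PRIMARY equal:  "
            + (compareAt(Collator.PRIMARY, "heissen", "HEISSEN") == 0));
        System.out.println("case, TERTIARY equal: "
            + (compareAt(Collator.TERTIARY, "heissen", "HEISSEN") == 0));

        // Accents are normally a secondary difference, so PRIMARY is expected
        // to ignore them as well; print rather than assume the result.
        System.out.println("accent, PRIMARY equal:   "
            + (compareAt(Collator.PRIMARY, "a", "\u00e4") == 0));
        System.out.println("accent, SECONDARY equal: "
            + (compareAt(Collator.SECONDARY, "a", "\u00e4") == 0));
    }
}
```

If a search must still distinguish case or accents, the tailored collator is the more selective of the two options.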

----- Original Message -----
From: "Wong, Albert" <Alb...@we...>
To: <icu...@os...>
Sent: Tuesday, August 20, 2002 2:27 PM
Subject: StringSearch not matching different representations of the same
character (eg. German)

>
> Hi
>
> I'm using StringSearch to match the German character "ss" with "\u00DF".
>
> CharacterIterator iterator = new StringCharacterIterator("Wie hei\u00DFen Sie?");
> String pattern = "heissen";
> StringSearch iter = new StringSearch(pattern, iterator, java.util.Locale.GERMAN);
> for (int pos = iter.first(); pos != SearchIterator.DONE; pos = iter.next()) {
>     System.out.println("Found match at " + pos + ", length is "
>         + iter.getMatchLength());
> }
>
>
> I find that the StringSearch class doesn't find a match.  "ss" and \u00DF are two representations of the same character in German.
>
> Testing further, I downloaded the source code from Ch. 7 of the O'Reilly Java Internationalization book.  It provides an implementation of String.indexOf():
>
> public static int indexOf(String source, String pattern) {
>     // Obtain a collator
>     RuleBasedCollator rbc=(RuleBasedCollator)Collator.getInstance();
>     rbc.setStrength(Collator.SECONDARY);
>     System.err.println(rbc.getRules());
>
>     CollationElementIterator textCEI;
>     CollationElementIterator patCEI;
>     textCEI = rbc.getCollationElementIterator(source);
>     patCEI = rbc.getCollationElementIterator(pattern);
>
>     // e1 will contain the collation element for the source
>     // e2 will contain the collation element for the pattern
>     int e1, e2;
>     int startMatch = -1;
>
>     // initialize e2 with the first collation element in the pattern
>     e2 = patCEI.next();
>
>     while ((e1 = textCEI.next())!=CollationElementIterator.NULLORDER) {
>       if (e1 == e2) { // if the elements match
>         if (startMatch == -1) startMatch = textCEI.getOffset();
>         e2 = patCEI.next(); // increment to the next element
>         if (e2 == CollationElementIterator.NULLORDER)
>           break;
>       } else { // elements do not match
>         if (startMatch != -1) {
>           patCEI.reset();
>           e2 = patCEI.next();
>           startMatch = -1;
>         }
>       }
>     }
>     return startMatch;
>   }
>
> I found that if I import java.text.*, indexOf correctly identifies "ss" as being equal to \u00DF.  If I switch to com.ibm.icu.text.*, there is no match.
>
> Is there something else that I should be doing to initialize my StringSearch instance to search text using locale sensitive rules?  I'm using JDK1.3.
>
> Thanks
> Albert
>
> _______________________________________________
> icu4j-support mailing list
> icu...@os...
> http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu4j-support
>
