From: Wong, A. <Alb...@we...> - 2002-08-29 03:00:17
Also,

1. I've tried setting the strength to Secondary; it still fails.

If the example on the ICU web site also fails to match according to the UCA, can anybody give me an example of two different characters matching up based on locale? So far, nothing I've tried seems to match.

Thanks
Albert

-----Original Message-----
From: Wong, Albert
Sent: Wednesday, August 28, 2002 4:34 PM
To: 'icu...@os...'
Subject: Problems matching characters based on locale

Hi

I've been using the Collator classes without much success at getting them to match characters that a particular locale considers equal. I'm assuming the sample inputs I have tried are not considered equal by the UCA.

I found an example on the ICU web site:
http://www-124.ibm.com/icu/userguide/Collate_Intro.html

On this page it states
--> A letter can be treated as if it were two letters. For example, in traditional German "ä" is compared as if it were "ae".

I tried this example (assuming it's a valid example):

import java.util.*;
import com.ibm.icu.text.*;
.....
public static void main(String [] args) {
    String text = "ä";
    String pattern = "ae";
    Collator collator = Collator.getInstance(Locale.GERMANY);
    // compare() returns 0 only if the collator treats the two strings as equal
    int indexValue = collator.compare(text, pattern);
}

And it doesn't match. I've tried Locale.GERMANY and Locale.GERMAN. One possibility is that the UCA doesn't allow characters to be equal based on locale?

Does anybody see what I'm doing wrong?

Thanks
Albert

-----Original Message-----
From: Wong, Albert
Sent: Friday, August 23, 2002 4:40 PM
To: 'syn wee'; icu...@os...
Subject: RE: StringSearch not matching different representations of the same character (eg. German)

Hi

Is there a reason why the JDK's version of the RuleBasedCollator doesn't follow the UCA while IBM's version does? Both versions are written by IBM. Looking at the source code, the most likely reason is that the one used in JDK 1.3 and 1.4 was written in 1998, probably before the UCA was established?
Is there a reason why Sun isn't upgrading its version of the RuleBasedCollator (to at least a version that supports the UCA)?

Thanks
Albert

-----Original Message-----
From: syn wee [mailto:syn...@jt...]
Sent: Tuesday, August 20, 2002 4:18 PM
To: Wong, Albert; icu...@os...
Subject: Re: StringSearch not matching different representations of the same character (eg. German)

Albert,

The German collator used by StringSearch considers "ss" to be smaller than "\u00df" with a secondary difference. This behaviour is consistent with the UCA; please see the default UCA table at
http://www.unicode.org/unicode/reports/tr10/#Default_Unicode_Collation_Element_Table

One way around this issue is to tailor your own collator for use. E.g.

CharacterIterator iterator = new StringCharacterIterator("Wie hei\u00DFen Sie?");
String pattern = "heissen";
// Tailoring rule: make "ss" and \u00df compare as equal
RuleBasedCollator rbc = new RuleBasedCollator("&ss = \u00df");
StringSearch iter = new StringSearch(pattern, iterator, rbc);
for (int pos = iter.first(); pos != SearchIterator.DONE; pos = iter.next()) {
    System.out.println("Found match at " + pos + ", length is " + iter.getMatchLength());
}

Another way is to search using only PRIMARY_STRENGTH. However, searching using PRIMARY_STRENGTH would ignore accent and case differences.

----- Original Message -----
From: "Wong, Albert" <Alb...@we...>
To: <icu...@os...>
Sent: Tuesday, August 20, 2002 2:27 PM
Subject: StringSearch not matching different representations of the same character (eg. German)

>
> Hi
>
> I'm using StringSearch to match the German character "ss" with "\u00DF".
>
> CharacterIterator iterator = new StringCharacterIterator("Wie hei\u00DFen Sie?");
> String pattern = "heissen";
> StringSearch iter = new StringSearch(pattern, iterator, java.util.Locale.GERMAN);
> for (int pos = iter.first(); pos != SearchIterator.DONE; pos = iter.next()) {
>     System.out.println("Found match at " + pos + ", length is "
>         + iter.getMatchLength());
> }
>
>
> I find that the StringSearch class doesn't find a match when used this way. "ss" and \u00DF are two representations of the same character in German.
>
> Testing further, I downloaded the source code from Ch. 7 of the O'Reilly Java Internationalization book. It provides an implementation of String.indexOf():
>
> public static int indexOf(String source, String pattern) {
>     // Obtain a collator
>     RuleBasedCollator rbc = (RuleBasedCollator) Collator.getInstance();
>     rbc.setStrength(Collator.SECONDARY);
>     System.err.println(rbc.getRules());
>
>     CollationElementIterator textCEI;
>     CollationElementIterator patCEI;
>     textCEI = rbc.getCollationElementIterator(source);
>     patCEI = rbc.getCollationElementIterator(pattern);
>
>     // e1 will contain the collation element for the source
>     // e2 will contain the collation element for the pattern
>     int e1, e2;
>     int startMatch = -1;
>
>     // initialize e2 with the first collation element in the pattern
>     e2 = patCEI.next();
>
>     while ((e1 = textCEI.next()) != CollationElementIterator.NULLORDER) {
>         if (e1 == e2) { // if the elements match
>             if (startMatch == -1) startMatch = textCEI.getOffset();
>             e2 = patCEI.next(); // increment to the next element
>             if (e2 == CollationElementIterator.NULLORDER)
>                 break;
>         } else { // elements do not match
>             if (startMatch != -1) {
>                 patCEI.reset();
>                 e2 = patCEI.next();
>                 startMatch = -1;
>             }
>         }
>     }
>     return startMatch;
> }
>
> I found that if I import java.text.*, the indexOf correctly identifies "ss" as being equal to \u00DF. If I switch to com.ibm.icu.text.*, there is no match.
>
> Is there something else that I should be doing to initialize my StringSearch instance to search text using locale-sensitive rules? I'm using JDK 1.3.
>
> Thanks
> Albert
>
> _______________________________________________
> icu4j-support mailing list
> icu...@os...
> http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu4j-support
>
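
The remark in the thread that PRIMARY strength ignores accent and case differences can be illustrated with a small sketch. It uses the JDK's own java.text.Collator so that it runs without the ICU4J jar. The class name StrengthDemo and the helper equalAt are hypothetical names made up for this illustration, and Locale.ENGLISH is an arbitrary choice. As the thread itself shows, com.ibm.icu.text.Collator may give different answers for other inputs such as "ss" and "\u00DF", so this is not a statement about ICU4J's results.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo {

    /**
     * Hypothetical helper: returns true if a JDK collator for Locale.ENGLISH
     * treats the two strings as equal at the given strength.
     */
    static boolean equalAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        // compare() returns 0 when the strings are equal at this strength
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // Case difference: a tertiary difference
        System.out.println("a vs A, PRIMARY:   " + equalAt(Collator.PRIMARY, "a", "A"));
        System.out.println("a vs A, TERTIARY:  " + equalAt(Collator.TERTIARY, "a", "A"));

        // Accent difference: a secondary difference
        System.out.println("e vs \u00E9, PRIMARY:   " + equalAt(Collator.PRIMARY, "e", "\u00E9"));
        System.out.println("e vs \u00E9, SECONDARY: " + equalAt(Collator.SECONDARY, "e", "\u00E9"));
    }
}
```

The same idea carries over to searching: lowering the strength widens what counts as a match, at the cost of also matching text that differs only in accents or case.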