From: Syn W. <syn...@jt...> - 2001-07-31 19:01:37
Hi Geert-Jan,

The next ICU release will include new APIs for searching for patterns within
Unicode strings. The APIs make use of collation to perform the matches, so you
can tailor the rules and set the strength used for matching. For details,
please refer to the link below.

http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/searchproposal.html

Syn Wee Quek
IBM GCoC, Cupertino, CA, USA

----- Original Message -----
From: "Geert-Jan van Opdorp" <op...@pi...>
To: <ic...@os...>
Sent: Tuesday, July 31, 2001 7:45 AM
Subject: 'Primary normal form'? (pattern matching problem)

>
> Hi,
>
> I'm working on a pattern matcher that should be able to match using
> primary collation. One problem I encounter is that (e.g. in the en_US
> locale) 'ss' should match U+00DF LATIN SMALL LETTER SHARP S. This
> means that either one one-character wildcard should match 'ss' or that
> two one-character wildcards should match U+00DF LATIN SMALL LETTER
> SHARP S. Now I don't mind having to tell my users that they need two
> wildcards to find U+00DF LATIN SMALL LETTER SHARP S, but I do want one
> and the same search pattern to find 'ss' and U+00DF LATIN SMALL LETTER
> SHARP S.
>
> So it seems I need a collation-mode-aware character iterator, or,
> better yet, some kind of normalized primary form, i.e. something
> equivalent to the primary part of the collation key, but with
> recognizable character boundaries.
>
> It seems to me this must be a common problem - probably I am missing
> the obvious somewhere. Any hints as to how to solve this problem are very
> welcome.
>
> Thanks
> Geert-Jan
>
> Geert-Jan van Opdorp
> op...@pi...
>
> _______________________________________________
> icu mailing list
> ic...@os...
> http://oss.software.ibm.com/developerworks/opensource/mailman/listinfo/icu
>
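
To illustrate the "normalized primary form with recognizable character boundaries" idea from the quoted message, here is a minimal Python sketch. It does not use ICU or real collation keys. As a stand-in for primary strength it assumes that case folding plus removal of combining marks is close enough for simple Latin text, which holds for the 'ss' / U+00DF case but not in general. The function names `primary_units` and `find_primary` are made up for this example.

```python
import unicodedata


def primary_units(text):
    """Return one folded unit per source character.

    Toy approximation of primary strength (an assumption, not ICU
    collation): case-fold each character, decompose it, and drop
    combining marks. U+00DF folds to the two-character unit 'ss'.
    """
    units = []
    for ch in text:
        folded = unicodedata.normalize("NFD", ch.casefold())
        units.append("".join(c for c in folded if not unicodedata.combining(c)))
    return units


def find_primary(text, pattern):
    """Find pattern in text, comparing folded forms.

    Returns (start, end) index pairs into the original text. A match
    is kept only if it starts and ends on a source character boundary,
    so a lone 's' cannot match half of U+00DF.
    """
    units = primary_units(text)
    starts, ends = {}, {}
    pos = 0
    for i, unit in enumerate(units):
        starts[pos] = i          # folded offset -> source index where a unit begins
        pos += len(unit)
        ends[pos] = i + 1        # folded offset -> source index where a unit ends
    haystack = "".join(units)
    needle = "".join(primary_units(pattern))
    matches = []
    if not needle:
        return matches
    at = haystack.find(needle)
    while at != -1:
        end = at + len(needle)
        if at in starts and end in ends:
            matches.append((starts[at], ends[end]))
        at = haystack.find(needle, at + 1)
    return matches
```

With this sketch the same pattern finds both spellings: searching "Straße" for "strasse" and searching "Strasse" for "straße" each report a match covering the whole word, while searching "Maß" for a single "s" reports nothing because the candidate would split the folded unit for U+00DF. A collation-based search such as the ICU APIs mentioned in the reply would take the place of the folding step here.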