From: Nickolay S. <sk...@bs...> - 2003-10-07 14:51:20
Hello, Peter!

>> Things are not so simple. German letter ß (written like Greek beta)
>> collates the same way as the "ss" sequence.

> As a German I can tell you that this is about 25% true on a scale of 0% to
> 100% true-ity. And if this is the only case holding back the
> implementation, we can argue to further decrease the true-ity level.

This is not the only case. Think of the DOCBOOK collation implemented as an
example in Dave's collkit.

>> There are many other artefacts
>> like this. The correct solution is to preprocess both patterns and source
>> string in a way similar to the transformation used for indexing. But this
>> requires some changes to the INTL interface.

> Can you elaborate? I would like to see this work in some way, but for the
> multi-level collations the sortkey returned consists of 2-4 parts and so it
> won't be of any direct use for string searching:
> E.g.: Caféteria will return CAFETERIA333433333211111111
> and Café will return CAFE33342111,
> and for obvious reasons the latter isn't a substring of the former one.
> There is already an unused (?) but designed interface in INTL, to return
> only the primary differences ('partial'), then
> Caféteria will return CAFETERIA and Café will return CAFE,
> so that would fit the bill for nocase/noaccent substring searching.

The needed transformation should return a canonical representation of the
string in terms of string equality. For example, if our string is "Café":
1) If the collation is case-sensitive and accent-sensitive, it should return "Café".
2) If it is case-sensitive and accent-insensitive, it should return "Cafe".
3) If it is case-insensitive and accent-insensitive, it should return "CAFE".
If the collation treats German "ß" as "ss", it should return "strasse" for the
string "straße", etc.

Got the idea? We need a transformer of string data to a canonical
representation that may be used for pattern matching.

> Peter Jacobi

--
Nickolay Samofatov mailto:sk...@bs...
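
The canonical transformation described above can be illustrated with a minimal Python sketch. This is not the INTL interface or any existing engine code: the function name `canonical_form` and its three flags are hypothetical, and accent stripping is approximated with Unicode NFD decomposition, which is only one possible way to model an accent-insensitive collation.

```python
import unicodedata


def canonical_form(text, case_sensitive=True, accent_sensitive=True,
                   expand_sharp_s=False):
    """Return a canonical string usable for equality and substring matching.

    Hypothetical helper: the flags stand in for the properties of a collation.
    """
    if expand_sharp_s:
        # Model a collation that treats the German sharp s as "ss".
        text = text.replace("\u00df", "ss")
    if not accent_sensitive:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    if not case_sensitive:
        # Fold case by uppercasing, matching the "CAFE" example.
        text = text.upper()
    return text


def contains(source, pattern, **collation_flags):
    """Substring search after transforming both sides the same way."""
    return (canonical_form(pattern, **collation_flags)
            in canonical_form(source, **collation_flags))
```

With both flags switched off, `contains("Caféteria", "Café", case_sensitive=False, accent_sensitive=False)` compares "CAFE" against "CAFETERIA", so the pattern matches. The point of the sketch is that pattern and source pass through the same transformation before any matching takes place.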