From: Nickolay S. <sk...@bs...> - 2003-10-07 13:21:39
Hello, Peter,

>> Current LIKE implementation doesn't use collations. Another problem is
>> that if collations are used and INTL interface is unchanged we can
>> forget about intelligent KMP algorithm. Best thing we can use is
>> "brute force" :(((

> For short strings, that wouldn't really be that bad.
> For long string comparisons, you can use general regexp matching and
> preprocess every character in the pattern to a group of all equivalent
> characters in the collation: LIKE '%CAFE%' =>
> /.*[cC][aAäÄáÁàÀ][fF][eEéÉèÈ].*/

General regexp matching is very slow.

> So, the INTL interface may need an addition for a painless retrieval of the
> equivalence classes, but in O(n) speak, it's a minor detail as dumb or
> smart retrieval of equivalence classes are both O(1) ;-)

Things are not so simple. The German letter ß (written like a Greek beta)
collates the same way as the sequence "ss", so one character in the source
can correspond to two characters in the pattern, and per-character
equivalence classes cannot express that. There are many other artefacts
like this. The correct solution is to preprocess both the pattern and the
source string in a way similar to the transformation used for indexing.
But this requires some changes to the INTL interface.

BTW, my implementation of correct LIKE matching is still at the
experimentation stage, so somebody else may address these problems, and I
can share my ideas and the results of my experiments.

The problems with the string_boolean implementation are:
1. LIKE pattern matching is extremely slow
2. collations are not used for string functions
3. BLOBs are processed incorrectly

I think these problems should be addressed together, as they are very
tightly bound. I am thinking of the following solution:
1. Implement a single-pass pattern matching algorithm for LIKE (the
   Knuth-Morris-Pratt algorithm with some extensions seems to fit the
   task perfectly)
2. Use callbacks in the EVL_xx_like and EVL_xx_contains functions
3. Add a collation filter function to the INTL ABI (and drop all the
   like, merge and sleuth functions). It would use one callback to fetch
   data and call another callback to hand back the results.
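To make the three steps above concrete, here is a minimal Python sketch, not the engine code. Everything in it is an assumption for illustration: the names `collation_filter`, `kmp_contains`, `like_contains` and `chunk_reader` are hypothetical, case folding plus accent stripping merely stands in for a real collation, and only the LIKE '%pattern%' form is covered (the `_` and inner `%` wildcards would need the extensions mentioned in step 1).

```python
import unicodedata


def collation_filter(text):
    """Hypothetical stand-in for the proposed INTL filter function.

    Assumption: case folding plus accent stripping approximates a
    case- and accent-insensitive collation. casefold() maps 'ß' to 'ss'.
    """
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def failure_table(pattern):
    """Standard KMP failure table: longest proper prefix that is also a suffix."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_contains(chunks, pattern):
    """Single pass over the source; match state survives chunk boundaries."""
    if not pattern:
        return True
    table = failure_table(pattern)
    k = 0
    for chunk in chunks:
        for ch in chunk:
            while k and ch != pattern[k]:
                k = table[k - 1]
            if ch == pattern[k]:
                k += 1
                if k == len(pattern):
                    return True
    return False


def like_contains(read_chunk, pattern):
    """LIKE '%pattern%': read_chunk is a callback returning '' at end of data."""
    def normalized_chunks():
        while True:
            chunk = read_chunk()
            if not chunk:
                return
            yield collation_filter(chunk)

    return kmp_contains(normalized_chunks(), collation_filter(pattern))


def chunk_reader(text, size):
    """Test helper: serves text in pieces, the way a BLOB segment reader might."""
    pieces = iter([text[i:i + size] for i in range(0, len(text), size)])
    return lambda: next(pieces, "")


print(like_contains(chunk_reader("Hauptstraße 5", 4), "STRASSE"))
```

Because both the pattern and the source go through the same filter before matching, the ß/"ss" case needs no special handling in the matcher, and because the source arrives through a callback, a long string or a BLOB never has to be held in memory as a whole.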
The collation filter function should normalize string data so that it can
be used for pattern matching. This would solve all three problems. I do
not think partial solutions are acceptable, because they would have to be
dropped once the other problems are addressed.

Maybe you, Peter, or Blas can pick up these issues? I can address them
myself, but that would somewhat delay the other points on my TODO list.

> Peter Jacobi

-- 
Nickolay Samofatov mailto:sk...@bs...