From: Wolfgang M. <sb...@we...> - 2002-02-14 17:22:59
You're right: the indexer is currently not very intelligent and treats punctuation characters as word boundaries. I planned to change this in the last release but finally forgot. I will try to write a better tokenizer class if I have some spare time today or tomorrow. However, it's not always easy to define the correct behaviour: some punctuation characters actually mark word boundaries, while others do not, as in 'Q.4068'. I would welcome any suggestions about a good lexical analyzer we may use here.

Wolfgang

[snip]
>> Now if I run any of the following XPath queries I get an empty result:
>> collection('/test')/person[. &= 'My_city']
>> collection('/test')/person[. &= '5/46']
>> collection('/test')/person[. &= 'Q.4068']
>> collection('/test')/person[. &= '123\456']
>>
>> But if I run one of the following then the document is returned:
>> collection('/test')/person[. &= 'My city']
>> collection('/test')/person[. &= '5 46']
>> collection('/test')/person[. &= 'Q 4068']
>> collection('/test')/person[. &= '123 456']
>>
>
> My data doesn't normally contain such characters, but I've created some
> items that do, and I can reproduce the problem. I guess the indexer is
> treating "punctuation" characters as word boundaries, so My_city is getting
> indexed as two separate words, hence the ability to retrieve it only when
> the underscore is left out. I fear this looks like a bug....
>
> Michael
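For what it's worth, one simple heuristic along these lines: treat a punctuation character as part of a token only when it sits directly between two alphanumeric characters (so 'Q.4068' and 'My_city' stay whole, while a trailing full stop still ends a word). A minimal sketch in Java (hypothetical, not the actual eXist indexer code; class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleTokenizer {

    // Split text into tokens. Punctuation between two alphanumeric
    // characters (e.g. the '.' in "Q.4068" or '_' in "My_city") is kept
    // inside the token; all other punctuation acts as a word boundary.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0
                    && i + 1 < text.length()
                    && Character.isLetterOrDigit(text.charAt(i + 1))) {
                // inner punctuation: keep it as part of the current token
                current.append(c);
            } else if (current.length() > 0) {
                // boundary punctuation or whitespace: flush the token
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [He, lives, in, My_city, code, Q.4068]
        System.out.println(tokenize("He lives in My_city, code Q.4068."));
    }
}
```

This is only a starting point: it would still merge things one might want split (e.g. hyphenated compounds), so a real analyzer probably needs per-character rules or a configurable character class table.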