From: Wolfgang M. <wol...@ex...> - 2005-05-24 20:05:38
> Everybody has his own needs. IMHO, case insensitivity is foreign to the
> XML spirit (despite the everyday usage of ISO-XXXX users... like me).

Yes, but I also share Michael's argument that case-insensitive matching
is important if you want to search mixed content efficiently. I'm
currently trying to figure out a good compromise that serves all users ;-)

> To make everybody happy, see
> http://sourceforge.net/tracker/index.php?func=detail&aid=1069335&group_id=17691&atid=367691
> : the basic idea is to send a "stream" (element content, attribute
> content, mixed content, whatever in fact...) to an analyzer that, in
> turn, generates positioned tokens in the index files (positioning is
> important with ambiguous tokens, phrase queries...).
>
> Tokenization, transformation, and filtering would thus be the
> analyzer's job and, therefore, Lucene's contributors' ;-)

Integrating Lucene's analyzer is on my wish list too. I had a look at
the sources a few days ago. I would really like to work on an
integration, but I currently lack the time.

> Sorry to come back to this issue with not even a little patch to help.
> I'm currently tracing eXist's calls to try to understand its indexing
> policy, but it seems that indexing code is becoming more and more
> present throughout eXist's low-level classes.

The range index compares entire node values, so it does not need
tokenization. The main class to be changed is NativeTextEngine. It uses
the package org.exist.storage.analysis for tokenization, in particular
the Tokenizer interface. I think this is the point where Lucene's
analyzer architecture would have to be plugged in, at least for a start.

Wolfgang
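For illustration, the "positioned tokens" idea discussed above (an analyzer that case-folds tokens and records their positions, so case-insensitive and phrase queries become possible) could look roughly like the following. This is a minimal standalone sketch, not eXist's actual Tokenizer interface or Lucene's Analyzer API; the names PositionedAnalyzer, Token, and analyze are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionedAnalyzer {

    // A token plus its ordinal position in the stream. Positions let a
    // phrase query check that matched tokens are adjacent, and help
    // disambiguate repeated/ambiguous tokens.
    public static final class Token {
        public final String text;
        public final int position;
        public Token(String text, int position) {
            this.text = text;
            this.position = position;
        }
    }

    // Splits the incoming "stream" (element content, attribute content,
    // mixed content...) on non-letter characters, lowercases each token
    // (case folding, so matching is case-insensitive), and records its
    // position. In a real integration, tokenization and filtering would
    // be delegated to a pluggable analyzer chain.
    public static List<Token> analyze(String stream) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        for (String raw : stream.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) continue;   // skip empty leading split
            tokens.add(new Token(raw.toLowerCase(), pos++));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : analyze("Mixed Content, efficiently Searched")) {
            System.out.println(t.position + ": " + t.text);
        }
    }
}
```

With this shape, swapping in a different analyzer (stemming, stop-word filtering, locale-aware folding) only changes how `analyze` produces tokens; the index format, which stores token text plus position, stays the same.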