From: Irene V. <ire...@pa...> - 2024-12-03 10:25:02
Hi there! I am trying to add a custom Lucene analyzer that behaves like the WhitespaceAnalyzer with respect to tokenization and (lack of) stemming, but that is also case-insensitive (basically the same as https://sourceforge.net/p/exist/mailman/message/35188378/). I followed what was suggested in the post thread above, i.e. writing the custom analyzer, compiling its class into a JAR and saving it in $EXIST_HOME/lib/user, but it is not working. I also tried putting it in the same folder as the other Lucene JAR files, with the same result. Since both my Java/Lucene and eXist-db knowledge are quite poor, I am struggling to figure out whether the problem is in my code or in eXist-db itself.

This is my custom analyzer code:

package org.custom;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class CaseInsensitiveWhitespaceAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenize on whitespace only (no stemming), then lowercase each token
        final WhitespaceTokenizer source = new WhitespaceTokenizer();
        final TokenStream filter = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, filter);
    }
}

And this is how I reference it in collection.xconf:

<analyzer id="custom" class="org.custom.CaseInsensitiveWhitespaceAnalyzer"/>

I also tested the analyzer outside eXist-db with the following, and it returned the expected tokens:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.custom.CaseInsensitiveWhitespaceAnalyzer;

import java.io.IOException;
import java.io.StringReader;

public class TestAnalyzer {

    public static void main(String[] args) throws IOException {
        String text = "Lucene is a Simple1 123 5% _test - Yet Powerful - Java Based Search Library. I love IT!";
        Analyzer analyzer = new CaseInsensitiveWhitespaceAnalyzer();
        try (TokenStream tokenStream = analyzer.tokenStream("field", new StringReader(text))) {
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                System.out.println(charTermAttribute.toString());
            }
            tokenStream.end();
        }
    }
}

What am I doing wrong? Any suggestions/hints would be highly appreciated :)

Thanks,
Irene
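P.S. In case the surrounding context matters: a minimal sketch of how a custom analyzer is typically declared and then referenced by id in a full collection.xconf (the tei namespace and the tei:p qname here are just illustrative placeholders, not my real index definition):

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <lucene>
            <!-- declare the custom analyzer once, with an id -->
            <analyzer id="custom" class="org.custom.CaseInsensitiveWhitespaceAnalyzer"/>
            <!-- then point individual index definitions at it via analyzer="..." -->
            <text qname="tei:p" analyzer="custom"/>
        </lucene>
    </index>
</collection>
```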