From: Irene V. <ire...@pa...> - 2024-12-03 10:25:02
Hi there! I am trying to add a custom Lucene analyzer that behaves like the WhitespaceAnalyzer with respect to tokenization and (lack of) stemming, but that is also case-insensitive (basically the same as https://sourceforge.net/p/exist/mailman/message/35188378/). I followed what was suggested in the post thread above, i.e. writing the custom analyzer, compiling its class into a JAR and saving it in $EXIST_HOME/lib/user, but it is not working. I also tried putting it in the same folder as the other Lucene JAR files, with the same result. Since both my Java/Lucene and eXist-db knowledge are quite poor, I am struggling to figure out whether the problem is in my code or in eXist-db itself.

This is my custom analyzer code:

package org.custom;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class CaseInsensitiveWhitespaceAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenize on whitespace only (no stemming), then lowercase each token
        final WhitespaceTokenizer source = new WhitespaceTokenizer();
        final TokenStream filter = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, filter);
    }
}

And this is how I reference it in collection.xconf:

<analyzer id="custom" class="org.custom.CaseInsensitiveWhitespaceAnalyzer"/>

I also tested the analyzer outside eXist-db with the following, and it returned the expected tokens:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.custom.CaseInsensitiveWhitespaceAnalyzer;

import java.io.IOException;
import java.io.StringReader;

public class TestAnalyzer {

    public static void main(String[] args) throws IOException {
        String text = "Lucene is a Simple1 123 5% _test - Yet Powerful - Java Based Search Library. I love IT!";
        Analyzer analyzer = new CaseInsensitiveWhitespaceAnalyzer();
        try (TokenStream tokenStream = analyzer.tokenStream("field", new StringReader(text))) {
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                System.out.println(charTermAttribute.toString());
            }
            tokenStream.end();
        }
    }
}

What am I doing wrong? Any suggestions/hints would be highly appreciated :)

Thanks,
Irene
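P.S. In case the surrounding context matters: a minimal sketch of how a custom analyzer is typically declared and then referenced by id in a full collection.xconf (the tei namespace and the tei:p qname here are just illustrative placeholders, not my real index definition):

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <lucene>
            <!-- declare the custom analyzer once, with an id -->
            <analyzer id="custom" class="org.custom.CaseInsensitiveWhitespaceAnalyzer"/>
            <!-- then point individual index definitions at it via analyzer="..." -->
            <text qname="tei:p" analyzer="custom"/>
        </lucene>
    </index>
</collection>
```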