From: Efraim F. <efr...@gm...> - 2011-05-08 16:41:57

Hi,

I'm having some issues setting up a custom tokenizer/analyzer in eXist, and I wanted to know if I'm doing anything obviously wrong.

Here is what I did:

1. Download and compile hebmorph and its Java Lucene analyzer from git (https://github.com/synhershko/HebMorph).

2. Start with an install from a clean copy of eXist 1.5 r14395.

3. Copy lucene.hebrew.jar to $EXIST_HOME/extensions/indexes/lucene/lib/

4. In conf.xml, change the indexer line to:

    <indexer caseSensitive="yes" index-depth="5"
        preserve-whitespace-mixed-content="no" stemming="no"
        suppress-whitespace="none"
        tokenizer="org.apache.lucene.analysis.hebrew.HebrewTokenizer"
        track-term-freq="yes">

5. In the database's /db/system/config/db/collection.xconf, I set:

    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index xmlns:tei="http://www.tei-c.org/ns/1.0"
               xmlns:j="http://jewishliturgy.org/ns/jlptei/1.0">
            <fulltext default="none" attributes="no"/>
            <lucene>
                <analyzer class="org.apache.lucene.analysis.hebrew.MorphAnalyzer"/>
                <inline qname="tei:c"/>
                <text qname="tei:title"/>
                <text qname="j:repository"/>
                <text qname="tei:seg"/>
            </lucene>
        </index>
    </collection>

6. Add some texts to the database, under the /db/text collection.

7. From the eXist Java client, run the XQuery:

    declare namespace tei="http://www.tei-c.org/ns/1.0";
    collection('/db/text')//tei:title[ft:query(.,'בראשית')]

The result is a null pointer exception, where the stack trace looks like it never tries to use the custom tokenizer or analyzer:

org.xmldb.api.base.XMLDBException
    at org.exist.xmldb.LocalXPathQueryService.execute(LocalXPathQueryService.java:398)
    at org.exist.xmldb.LocalXPathQueryService.execute(LocalXPathQueryService.java:145)
    at org.exist.client.QueryDialog$QueryThread.run(QueryDialog.java:531)
Caused by: java.lang.NullPointerException
    at org.apache.lucene.analysis.Tokenizer.close(Tokenizer.java:71)
    at org.apache.lucene.analysis.TokenFilter.close(TokenFilter.java:46)
    at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:619)
    at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1449)
    at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1337)
    at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)
    at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)
    at org.exist.indexing.lucene.LuceneIndexWorker.query(LuceneIndexWorker.java:352)
    at org.exist.xquery.modules.lucene.Query.preSelect(Query.java:198)
    at org.exist.xquery.pragmas.Optimize.eval(Optimize.java:118)
    at org.exist.xquery.ExtensionExpression.eval(ExtensionExpression.java:70)
    at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:70)
    at org.exist.xquery.PathExpr.eval(PathExpr.java:243)
    at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:70)
    at org.exist.xquery.XQuery.execute(XQuery.java:246)
    at org.exist.xquery.XQuery.execute(XQuery.java:201)
    at org.exist.xmldb.LocalXPathQueryService.execute(LocalXPathQueryService.java:391)
    ... 2 more
Caused by: java.lang.NullPointerException
    at org.apache.lucene.analysis.Tokenizer.close(Tokenizer.java:71)
    at org.apache.lucene.analysis.TokenFilter.close(TokenFilter.java:46)
    at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:619)
    at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1449)
    at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1337)
    at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)
    at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)
    at org.exist.indexing.lucene.LuceneIndexWorker.query(LuceneIndexWorker.java:352)
    at org.exist.xquery.modules.lucene.Query.preSelect(Query.java:198)
    at org.exist.xquery.pragmas.Optimize.eval(Optimize.java:118)
    at org.exist.xquery.ExtensionExpression.eval(ExtensionExpression.java:70)
    at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:70)
    at org.exist.xquery.PathExpr.eval(PathExpr.java:243)
    at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:70)
    at org.exist.xquery.XQuery.execute(XQuery.java:246)
    at org.exist.xquery.XQuery.execute(XQuery.java:201)
    at org.exist.xmldb.LocalXPathQueryService.execute(LocalXPathQueryService.java:391)
    at org.exist.xmldb.LocalXPathQueryService.execute(LocalXPathQueryService.java:145)
    at org.exist.client.QueryDialog$QueryThread.run(QueryDialog.java:531)

Am I doing anything obviously wrong from the eXist side?

Thanks,

--
---
Efraim Feinstein
Lead Developer
Open Siddur Project
http://opensiddur.net
http://wiki.jewishliturgy.org

From: Wolfgang M. <wol...@ex...> - 2011-05-08 18:48:55

> I'm having some issues setting up a custom tokenizer/analyzer in eXist,
> and I wanted to know if I'm doing anything obviously wrong.

I can't see anything obviously wrong in the lucene index configuration, but the tokenizer setting in conf.xml is for eXist's old (now deprecated) full-text index and won't work with the lucene tokenizer. Does it help if you reset that to the old setting? I don't expect it does.

Otherwise we will need to try out the analyzer to see where the NPE occurs exactly.

Wolfgang

From: Efraim F. <efr...@gm...> - 2011-05-08 19:14:07

Hi,

On 05/08/2011 02:48 PM, Wolfgang Meier wrote:
>> I'm having some issues setting up a custom tokenizer/analyzer in eXist,
>> and I wanted to know if I'm doing anything obviously wrong.
> I can't see anything obviously wrong in the lucene index
> configuration, but the tokenizer setting in conf.xml is for eXist's
> old (now deprecated) full-text index and won't work with the lucene
> tokenizer. Does it help if you reset that to the old setting? I don't
> expect it does.

Changing the setting back did remove the NPE!

Thanks,

--
---
Efraim Feinstein
Lead Developer
Open Siddur Project
http://opensiddur.net
http://wiki.jewishliturgy.org

From: Efraim F. <efr...@gm...> - 2011-05-08 23:47:22

On 05/08/2011 02:48 PM, Wolfgang Meier wrote:
>> I'm having some issues setting up a custom tokenizer/analyzer in eXist,
>> and I wanted to know if I'm doing anything obviously wrong.
> I can't see anything obviously wrong in the lucene index
> configuration, but the tokenizer setting in conf.xml is for eXist's
> old (now deprecated) full-text index and won't work with the lucene
> tokenizer. Does it help if you reset that to the old setting? I don't
> expect it does.

Unfortunately, I spoke too soon on this one. When I reinstalled the db, I had forgotten to copy the analyzer into the classpath. When it's there, I still get the same NPE.

I also get an additional error when trying to store a file in the database through the admin client:

"Impossible to store a resource [path]: null"

The resource appears anyway, and there's no exception in the logs.

Thanks,

--
---
Efraim Feinstein
Lead Developer
Open Siddur Project
http://opensiddur.net
http://wiki.jewishliturgy.org

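[Editorial aside] The "forgot to copy the analyzer into the classpath" slip above is easy to miss, because the symptom only shows up at query time. The AnalyzerConfig code quoted later in this thread loads the analyzer with Class.forName and clazz.newInstance(), so a quick check is whether the class can be found and whether it has a public no-argument constructor. The following is only a rough sketch using the standard Java reflection API; the class AnalyzerClasspathProbe and its probe method are hypothetical names, not part of eXist or Lucene, and it would need to run with the same classpath as eXist to tell you anything useful.

```java
/** Hypothetical diagnostic helper; not part of eXist or Lucene. */
final class AnalyzerClasspathProbe {

    /** Returns a one-line diagnosis for the given fully qualified class name. */
    static String probe(String className) {
        final Class<?> clazz;
        try {
            // initialize=false: only check that the class can be located and loaded
            clazz = Class.forName(className, false,
                    AnalyzerClasspathProbe.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            return className + ": NOT on the classpath";
        } catch (LinkageError e) {
            // e.g. a dependency of the class is missing or incompatible
            return className + ": found, but failed to link (" + e + ")";
        }
        try {
            // Class.newInstance() needs an accessible no-argument constructor
            clazz.getConstructor();
            return className + ": loadable, public no-argument constructor present";
        } catch (NoSuchMethodException e) {
            return className + ": loadable, but no public no-argument constructor";
        }
    }

    public static void main(String[] args) {
        // The default below is the analyzer class named in the original post.
        String name = args.length > 0
                ? args[0]
                : "org.apache.lucene.analysis.hebrew.MorphAnalyzer";
        System.out.println(probe(name));
    }
}
```

A "NOT on the classpath" result would point at the jar location rather than at the analyzer itself; any other result means the problem lies further along.
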
From: Joe W. <jo...@gm...> - 2011-05-09 00:45:58

Hi Efraim,

I notice that whereas eXist uses Lucene 2.9.2, the Hebrew analyzer's default version is Lucene 3.0.2 - see the lib folder inside of:

https://github.com/synhershko/HebMorph/tree/master/java/lucene.hebrew

It also appears from the commit logs that there was some effort to backport to "2.9", but again the lib folder contains 2.9.3 - close but still newer than eXist's 2.9.2.

It might be worth finding out if this version is going to be compatible with the 2.9.2 release of Lucene.

Another line of inquiry you might track is whether the extra steps performed by Adam in the case of the Snowball analyzer might be necessary (see http://markmail.org/message/gcepf56nkc5huck6); I think Adam's steps render Mike's steps unnecessary but I'm not sure (http://markmail.org/message/kmpetl2leq457t5i).

I anticipate using Hebrew for an upcoming side project, so I'm following this with interest - and I would be happy to confirm tests on sample texts.

Cheers,
Joe

On Sun, May 8, 2011 at 7:47 PM, Efraim Feinstein <efr...@gm...> wrote:
> On 05/08/2011 02:48 PM, Wolfgang Meier wrote:
>>> I'm having some issues setting up a custom tokenizer/analyzer in eXist,
>>> and I wanted to know if I'm doing anything obviously wrong.
>> I can't see anything obviously wrong in the lucene index
>> configuration, but the tokenizer setting in conf.xml is for eXist's
>> old (now deprecated) full-text index and won't work with the lucene
>> tokenizer. Does it help if you reset that to the old setting? I don't
>> expect it does.
>
> Unfortunately, I spoke too soon on this one. When I reinstalled the db,
> I had forgotten to copy the analyzer into the classpath. When it's
> there, I still get the same NPE. I also get an additional error when
> trying to store a file in the database through the admin client:
> "Impossible to store a resource [path]: null"
>
> The resource appears anyway, and there's no exception in the logs.
>
> _______________________________________________
> Exist-open mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-open

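[Editorial aside] Comparing lib folders, as Joe does above, tells you what each project ships; it does not tell you which Lucene jar the running JVM actually resolved. One way to check, sketched below with the standard Java API only, is to ask a loaded class where it came from and what implementation version its jar manifest declares. JarVersionProbe and describe are hypothetical names, and the version is only available if the jar's manifest declares one, which is an assumption here rather than something the thread confirms.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper; not part of eXist or Lucene. */
final class JarVersionProbe {

    /** Describes where a class was loaded from and which version its package declares. */
    static String describe(Class<?> clazz) {
        // The code source is null for classes loaded by the bootstrap loader
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "(bootstrap or unknown location)"
                : source.getLocation().toString();

        // Implementation-Version comes from the jar manifest, if present
        Package pkg = clazz.getPackage();
        String version = (pkg == null || pkg.getImplementationVersion() == null)
                ? "(not declared in manifest)"
                : pkg.getImplementationVersion();

        return clazz.getName() + " loaded from " + location
                + ", implementation version " + version;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass e.g. org.apache.lucene.analysis.Analyzer when Lucene is on the classpath.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(describe(Class.forName(name)));
    }
}
```

If two different Lucene jars are on the classpath, the reported location shows which one won, which is the information the version comparison above needs.
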
From: Efraim F. <efr...@gm...> - 2011-05-09 04:09:44

Hi,

On 05/08/2011 08:45 PM, Joe Wicentowski wrote:
> It also appears from the commit logs that there was some effort to
> backport to "2.9", but again the lib folder contains 2.9.3 - close but
> still newer than eXist's 2.9.2.
>
> It might be worth finding out if this version is going to be
> compatible with the 2.9.2 release of Lucene.

That was a good lead. I tried to compile hebmorph against the same Lucene libs that are distributed with eXist, and got the same NPE on hebmorph running its unit tests, so it looks like this isn't an eXist-specific issue.

Thanks,

--
---
Efraim Feinstein
Lead Developer
Open Siddur Project
http://opensiddur.net
http://wiki.jewishliturgy.org

From: Martin H. <mh...@uv...> - 2011-05-11 20:13:07

> Another line of inquiry you might track is whether the extra steps
> performed by Adam in the case of the Snowball analyzer might be
> necessary (see http://markmail.org/message/gcepf56nkc5huck6); I think
> Adam's steps render Mike's steps unnecessary but I'm not sure
> (http://markmail.org/message/kmpetl2leq457t5i).

I tried Adam's patch, but I couldn't get eXist to compile with the patch included. Did anyone else get it working?

This is the patch file Adam sent me:

+import java.lang.reflect.Constructor;
+import java.lang.reflect.Field;
+import java.lang.reflect.InvocationTargetException;
+import java.lang.reflect.TypeVariable;
 import org.w3c.dom.Element;
 import org.exist.util.DatabaseConfigurationException;
 import org.apache.lucene.analysis.Analyzer;
 import java.util.Map;
 import java.util.TreeMap;
+import org.w3c.dom.NamedNodeMap;
+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
 public class AnalyzerConfig {
@@ -24,23 +31,78 @@
     }
     public void addAnalyzer(Element config) throws DatabaseConfigurationException {
-        String id = config.getAttribute(ID_ATTRIBUTE);
         Analyzer analyzer = configureAnalyzer(config);
-        if (id == null || id.length() == 0)
+        String id = config.getAttribute(ID_ATTRIBUTE);
+        if (id == null || id.length() == 0) {
             defaultAnalyzer = analyzer;
-        else
+        } else {
             analyzers.put(id, analyzer);
         }
+    }
     protected static Analyzer configureAnalyzer(Element config) throws DatabaseConfigurationException {
         String className = config.getAttribute(CLASS_ATTRIBUTE);
         if (className != null && className.length() != 0) {
             try {
                 Class<?> clazz = Class.forName(className);
-                if (!Analyzer.class.isAssignableFrom(clazz))
-                    throw new DatabaseConfigurationException("Lucene index: analyzer class has to be" +
-                            " a subclass of " + Analyzer.class.getName());
-                return (Analyzer) clazz.newInstance();
+                if (!Analyzer.class.isAssignableFrom(clazz)) {
+                    throw new DatabaseConfigurationException("Lucene index: analyzer class has to be a subclass of " + Analyzer.class.getName());
+                }
+
+                NodeList params = config.getElementsByTagName("param");
+                if(params.getLength() == 0){
+                    return (Analyzer)clazz.newInstance();
+                } else {
+
+                    Object args[] = new Object[params.getLength()];
+
+                    for(Constructor constructor : clazz.getConstructors()) {
+                        TypeVariable typeVars[] = constructor.getTypeParameters();
+                        if(typeVars.length == params.getLength()) {
+
+                            boolean matched = false;
+                            //found a constructor of the same length
+                            for(int i = 0; i < typeVars.length; i++) {
+                                Node param = params.item(i);
+
+                                NamedNodeMap attrs = param.getAttributes();
+                                String name = attrs.getNamedItem("name").getNodeValue();
+                                String type = attrs.getNamedItem("type").getNodeValue();
+                                String value = attrs.getNamedItem("value").getNodeValue();
+
+                                //either field or string - could be extended
+                                if(type != null && type.equals("java.lang.reflect.Field")){
+                                    String clazzName = value.substring(0, name.lastIndexOf("."));
+                                    String fieldName = value.substring(name.indexOf(".") + 1);
+
+                                    Class fieldClazz = Class.forName(clazzName);
+                                    Field field = fieldClazz.getField(fieldName);
+
+                                    //does the field type match the constructor var type?
+                                    if(field.getType().getName().equals(typeVars[i].getName())) {
+                                        args[i] = field.get(fieldClazz.newInstance());
+                                        matched = true;
+                                    } else {
+                                        matched = false;
+                                        break;
+                                    }
+                                } else if(typeVars[i].getName().equals("java.lang.String")) {
+                                    args[i] = value;
+                                    matched = true;
+                                } else {
+                                    matched = false;
+                                    break;
+                                }
+                            }
+
+                            if(matched) {
+                                return (Analyzer) constructor.newInstance(args);
+                            }
+                        }
+                    }
+                }
+
+
             } catch (ClassNotFoundException e) {
                 throw new DatabaseConfigurationException("Lucene index: analyzer class " + className + " not found.");
@@ -50,6 +112,12 @@
             } catch (InstantiationException e) {
                 throw new DatabaseConfigurationException("Exception while instantiating analyzer class " +
                         className + ": " + e.getMessage(), e);
+            } catch(InvocationTargetException e) {
+                throw new DatabaseConfigurationException("Exception while instantiating analyzer class " +
+                        className + ": " + e.getMessage(), e);
+            } catch(NoSuchFieldException e) {
+                throw new DatabaseConfigurationException("Exception while instantiating analyzer class " +
+                        className + ": " + e.getMessage(), e);
             }
         }
         return null;

Cheers,
Martin

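[Editorial aside] One detail in the quoted patch is worth flagging, whether or not it has anything to do with the compile failure Martin reports (the thread does not say what the compiler error was). The patch matches constructors using constructor.getTypeParameters(). In the Java reflection API that method returns the generic type variables declared on the constructor itself (an empty array for most constructors), whereas the declared argument types come from getParameterTypes(). The sketch below shows the getParameterTypes() variant for the all-String case only; StringConstructorMatcher and instantiate are hypothetical names, it is not Adam's code or eXist's, and the Field-typed parameters handled in the patch are left out.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;

/** Hypothetical illustration; not part of eXist or of the quoted patch. */
final class StringConstructorMatcher {

    /**
     * Instantiates clazz through the public constructor whose parameters are
     * all of type String and whose arity equals values.length.
     */
    static Object instantiate(Class<?> clazz, String... values)
            throws ReflectiveOperationException {
        Class<?>[] wanted = new Class<?>[values.length];
        Arrays.fill(wanted, String.class);

        for (Constructor<?> ctor : clazz.getConstructors()) {
            // getParameterTypes() lists the declared argument types;
            // getTypeParameters() would list generic type variables such as <T>
            if (Arrays.equals(ctor.getParameterTypes(), wanted)) {
                return ctor.newInstance((Object[]) values);
            }
        }
        throw new NoSuchMethodException(clazz.getName()
                + " has no public constructor taking " + values.length + " String argument(s)");
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        // StringBuilder has a public StringBuilder(String) constructor
        Object built = instantiate(StringBuilder.class, "abc");
        System.out.println(built);
    }
}
```

Matching on the exact parameter types keeps the lookup unambiguous when a class has several constructors of the same length, which is the situation the patch's "constructor of the same length" comment describes.
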