We need to build some text search indexes on an existing journal (Blazegraph 1.5.3, but eventually on 2.x and above) after updating some Lucene properties in the journal to make more of the text searchable. I have figured out how to do this with a Java utility that calls LexiconRelation.rebuildTextIndex() on a journal that has been taken offline.
The offline solution will probably suffice, but I'm curious whether it is possible to rebuild the index while Blazegraph is running. The standalone utility won't run when the journal is in use, but maybe it could be run in a separate thread in the Blazegraph instance. Assume we won't be using bds:search until the rebuild is complete. Would this cause problems for any other Blazegraph operations?
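For readers following along, the kind of property change described above could be sketched as follows. The post does not say which properties were changed, so this uses the analyzer setting quoted later in the thread as the example. The factory key `com.bigdata.search.FullTextIndex.analyzerFactoryClass` and the class name `AnalyzerProperties` are assumptions for illustration and should be checked against the Blazegraph version in use.

```java
import java.util.Properties;

// Hypothetical helper, not part of Blazegraph: builds the analyzer-related
// journal properties discussed in this thread.
final class AnalyzerProperties {

    // Key quoted verbatim later in the thread.
    static final String ANALYZER_CLASS_KEY =
        "com.bigdata.search.ConfigurableAnalyzerFactory.analyzer._.analyzerClass";

    // Assumption: the key that selects the analyzer factory. Verify it
    // against the documentation for your Blazegraph release.
    static final String FACTORY_KEY =
        "com.bigdata.search.FullTextIndex.analyzerFactoryClass";

    // Returns a copy of the given properties with the analyzer settings
    // added. The input object is left unmodified.
    static Properties withStandardAnalyzer(Properties base) {
        Properties p = new Properties();
        p.putAll(base);
        p.setProperty(FACTORY_KEY,
            "com.bigdata.search.ConfigurableAnalyzerFactory");
        p.setProperty(ANALYZER_CLASS_KEY,
            "org.apache.lucene.analysis.standard.StandardAnalyzer");
        return p;
    }

    public static void main(String[] args) {
        // Print the settings in key=value form for inspection.
        Properties p = withStandardAnalyzer(new Properties());
        p.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Going by the thread, changing these settings is only the first step: literals already in the journal still need a text index rebuild before the new analyzer applies to them.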
Paul,
Support for reindexing is being introduced in the next release, based on the
same code that you are using. This is being done to support a data
migration required for updated Lucene dependencies. Brad can provide you
with more information about the procedure, and the documentation for
it should be up on the wiki soon (if it is not already).
Thanks,
Bryan
Bryan Thompson
Chief Scientist & Founder
Blazegraph
e: bryan@blazegraph.com
w: http://blazegraph.com
On Thu, Mar 24, 2016 at 1:26 PM, Paul Callahan paulcsyapse@users.sf.net
wrote:
Thanks! That sounds very promising.
Paul,
Definitely. It will also update Blazegraph to Lucene 5.3.0:
https://jira.blazegraph.com/browse/BLZG-1328.
Thanks, --Brad
On Thu, Mar 24, 2016 at 3:42 PM, Paul Callahan paulcsyapse@users.sf.net
wrote:
The description in https://wiki.blazegraph.com/wiki/index.php/Rebuild_Text_Index_Procedure#Rebuild_Text_Index_Utility looks exactly like what I want, particularly the link in the workbench namespaces page. I just looked in v2.0.1, and it does not seem to be there. When will it be released?
Paul,
Thank you. 2.1.0 has cleared release testing and will be out this week.
Thanks, Brad
On Apr 4, 2016 10:35 AM, "Paul Callahan" paulcsyapse@users.sf.net wrote:
I had a chance to try out 2.1, but I get this exception a lot (see below). Lucene StandardAnalyzer is the value I assign to com.bigdata.search.ConfigurableAnalyzerFactory.analyzer._.analyzerClass. It happens sometimes (not always) when I use bds:search, but always happens when I try to rebuild the index from the namespace page.
Caused by: java.lang.IllegalStateException: TokenStream contract violation: close() call missing
at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:315)
at org.apache.lucene.analysis.standard.StandardAnalyzer$1.setReader(StandardAnalyzer.java:110)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:143)
at com.bigdata.search.FullTextIndex.getTokenStream(FullTextIndex.java:883)
at com.bigdata.search.FullTextIndex.index(FullTextIndex.java:825)
at com.bigdata.search.FullTextIndex.tokenize(FullTextIndex.java:1041)
at com.bigdata.search.FullTextIndex._search(FullTextIndex.java:1143)
at com.bigdata.search.FullTextIndex.search(FullTextIndex.java:955)
at com.bigdata.rdf.sparql.ast.eval.SearchServiceFactory$SearchCall.getHiterator(SearchServiceFactory.java:531)
at com.bigdata.rdf.sparql.ast.eval.SearchServiceFactory$SearchCall.call(SearchServiceFactory.java:661)
at com.bigdata.rdf.sparql.ast.eval.SearchServiceFactory$SearchCall.call(SearchServiceFactory.java:362)
at com.bigdata.bop.controller.ServiceCallJoin$ChunkTask$ServiceCallTask.doBigdataServiceCall(ServiceCallJoin.java:770)
at com.bigdata.bop.controller.ServiceCallJoin$ChunkTask$ServiceCallTask.doServiceCall(ServiceCallJoin.java:707)
Yes. See BLZG-1876. A fix just went through CI (thanks Jeremy!)
Bryan
On Apr 11, 2016 7:28 PM, "Paul Callahan" paulcsyapse@users.sf.net wrote:
After adding Jeremy's patch, the reindexer appears to run, but I don't get any results back for bds:search. That's true whether I use the URL (from either the workbench page or curl) or if I run the standalone utility com.bigdata.rdf.store.RebuildTextIndex.
But it does work (I get results from bds:search) if I run my own code with these lines:
What is the blazegraph 2.1 code doing that's different?
Last edit: Paul Callahan 2016-04-13
I also noticed that lexiconRelation.rebuildTextIndex() has changed in the 2.1 code, but only in that it takes a new argument to force creation of a new index. If I use this code snippet, built against the 2.1 jar, I also get results with bds:search. (But not with the documented index rebuilding tools.)
Last edit: Paul Callahan 2016-04-13
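To compare the rebuild paths discussed above, it can help to run the same bds:search query after each rebuild and see whether any hits come back. The following is a minimal sketch of such a check, assuming Java 11 or later. The class name `SearchCheck`, the search term, and the endpoint URL (a local server with a namespace called `kb`) are placeholders for illustration, not values taken from this thread.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: posts a bds:search query to a SPARQL endpoint.
final class SearchCheck {

    // Builds a SPARQL query that looks up literals matching the term in the
    // full text index and returns the statements that use them.
    static String searchQuery(String term) {
        // Escape backslashes first, then double quotes, for the SPARQL literal.
        String escaped = term.replace("\\", "\\\\").replace("\"", "\\\"");
        return "PREFIX bds: <http://www.bigdata.com/rdf/search#>\n"
             + "SELECT ?s ?p ?o WHERE {\n"
             + "  ?o bds:search \"" + escaped + "\" .\n"
             + "  ?s ?p ?o .\n"
             + "} LIMIT 10";
    }

    // Encodes the query as an application/x-www-form-urlencoded body.
    static String formBody(String sparql) {
        return "query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Assumed defaults; pass the real endpoint and term as arguments.
        String endpoint = args.length > 0
            ? args[0] : "http://localhost:9999/blazegraph/namespace/kb/sparql";
        String term = args.length > 1 ? args[1] : "example";

        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .header("Accept", "application/sparql-results+json")
            .POST(HttpRequest.BodyPublishers.ofString(formBody(searchQuery(term))))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // An empty "bindings" array in the JSON body means no hits.
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```

Running it once after the workbench or RebuildTextIndex rebuild and once after the custom rebuildTextIndex() call, with a term known to occur in the data, would show whether the two paths really leave the index in different states.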