From: Bryan T. <br...@sy...> - 2015-11-13 22:31:32
The query engine makes its own decisions about when the different parts of a query run. If you want something to run first, you need to indicate this explicitly, either with a query hint or by disabling the query optimizer.

You cannot have duplicate triples in the database. If you are in quads mode, then there could be identical triples in different named graphs. But this is a separate question from how evaluation is occurring.

I would be surprised if DISTINCT were necessary with just the two triples in each endpoint. But with many triples in the two endpoints, I think that it is necessary, since the SERVICE is not otherwise guaranteed to be invoked exactly once. The same ?concept could be submitted in different invocations of the service, in which case duplicates would occur.

To avoid duplicates, either force the SERVICE call to run exactly once (which might delay the first solution for the query if there are a lot of solutions for the first triple pattern) or use DISTINCT.

Thanks,
Bryan

----
Bryan Thompson
Chief Scientist & Founder
SYSTAP, LLC
4501 Tower Road
Greensboro, NC 27410
br...@sy...
http://blazegraph.com
http://blog.blazegraph.com

Blazegraph™ <http://www.blazegraph.com/> is our ultra high-performance graph database that supports both RDF/SPARQL and Tinkerpop/Blueprints APIs. Blazegraph is now available with GPU acceleration using our disruptive technology to accelerate data-parallel graph analytics and graph query.

CONFIDENTIALITY NOTICE: This email and its contents and attachments are for the sole use of the intended recipient(s) and are confidential or proprietary to SYSTAP. Any unauthorized review, use, disclosure, dissemination or copying of this email or its contents or attachments is prohibited. If you have received this communication in error, please notify the sender by reply email and permanently delete all copies of the email and its contents and attachments.

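For reference, the two remedies mentioned in this reply (use DISTINCT, or force the SERVICE call to run exactly once) can be sketched against the example query from the original message. These are untested sketches, not verified fixes. The prefix declarations are the standard Dublin Core and SKOS namespaces, and the endpoint address is assumed to be http://localhost:9999/bigdata/namespace/B/sparql.

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>

# DISTINCT collapses identical (?doc, ?concept, ?label) rows,
# regardless of how many times the SERVICE was invoked.
SELECT DISTINCT ?doc ?concept ?label
WHERE {
  ?doc dcterms:subject ?concept .
  SERVICE <http://localhost:9999/bigdata/namespace/B/sparql> {
    ?concept skos:prefLabel ?label .
  }
}
```

The second variant follows the suggestion of pushing the SERVICE into a subquery and marking it with the runOnce query hint. It assumes that the hint, as described on the QueryHints wiki page, applies to a sub-select that wraps the SERVICE call; whether it behaves this way on 1.5.2 should be checked with EXPLAIN.

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX hint:    <http://www.bigdata.com/queryHints#>

SELECT ?doc ?concept ?label
WHERE {
  ?doc dcterms:subject ?concept .
  {
    # Assumption: runOnce makes the engine evaluate this sub-select once.
    SELECT ?concept ?label
    WHERE {
      hint:SubQuery hint:runOnce true .
      SERVICE <http://localhost:9999/bigdata/namespace/B/sparql> {
        ?concept skos:prefLabel ?label .
      }
    }
  }
}
```

Note that ?concept is unbound inside the sub-select, so the remote endpoint would presumably return every skos:prefLabel in namespace B. That is the trade-off mentioned above: a single invocation, but possibly a slower first solution.
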
On Fri, Nov 13, 2015 at 5:24 PM, Reul, Quentin <que...@wo...> wrote:

> Hi Bryan,
>
> I have checked that there are no duplicate triples in either namespace.
>
> For the query in the attached file, I should retrieve 21 records in my result set (which I do when DISTINCT is used). However, I retrieve 79 records when DISTINCT is not used. I have tried placing the SERVICE part first in the query, but I obtained the same results.
>
> Kind regards,
>
> *Quentin Reul*
>
> *From:* Bryan Thompson [mailto:br...@sy...]
> *Sent:* Friday, November 13, 2015 4:01 PM
> *To:* Reul, Quentin
> *Cc:* big...@li...
> *Subject:* Re: [Bigdata-developers] Duplicate in SELECT queries when using SERVICE
>
> Quentin,
>
> Can you please look at the EXPLAIN of the query (if you are using the workbench, this is a checkbox under the advanced options; otherwise you can just add &explain to the query, see [1]). I am curious whether the triple pattern is running before or after the SERVICE call.
>
> Can you please confirm that you are observing that behavior on endpoints having just those two triples each? If there are many triples in the first endpoint, then one possibility is that the SERVICE call is being invoked more than once because multiple "chunks" of solutions are presented to it. In this case, the join could well have duplicates (same ?concept in different chunks leading to more than one solution with the same bindings from the B endpoint) and a DISTINCT would be required.
>
> Another option would be to run the SERVICE first. This could be accomplished with a query hint. See [2]. It is also possible to force certain operations to run exactly once, for example by pushing something into a subquery and using the runOnce query hint.
>
> Another thing that you can try is to enable the solutions logger.
> This logger provides details on the inputs and outputs of each operator. You will be able to see the solutions produced by the triple pattern and those produced by the SERVICE call for each invocation of those operators. This will help you to understand whether multiple SERVICE invocations are occurring.
>
> ##
> # Solutions trace (tab delimited file). Uncomment the next line to enable.
> #log4j.logger.com.bigdata.bop.engine.SolutionsLog=INFO,solutionsLog
> log4j.additivity.com.bigdata.bop.engine.SolutionsLog=false
> log4j.appender.solutionsLog=org.apache.log4j.ConsoleAppender
> #log4j.appender.solutionsLog=org.apache.log4j.FileAppender
> log4j.appender.solutionsLog.Threshold=ALL
> #log4j.appender.solutionsLog.File=solutions.csv
> #log4j.appender.solutionsLog.Append=true
> # I find that it is nicer to have this unbuffered since you can see what
> # is going on and to make sure that I have complete rule evaluation logs
> # on shutdown.
> #log4j.appender.solutionsLog.BufferedIO=false
> log4j.appender.solutionsLog.layout=org.apache.log4j.PatternLayout
> log4j.appender.solutionsLog.layout.ConversionPattern=SOLUTION:\t%m
>
> If your investigations do not suggest an obvious solution, then it might be best if you create a ticket from this query and attach the EXPLAIN (which is an HTML page) to that ticket. Please add both myself and Michael Schmidt to the ticket as watchers so we will see any updates on the ticket.
>
> Thanks,
> Bryan
>
> [1] https://wiki.blazegraph.com/wiki/index.php/Explain#NSS_Explain_Mode
> [2] https://wiki.blazegraph.com/wiki/index.php/QueryHints
>
> On Fri, Nov 13, 2015 at 11:35 AM, Reul, Quentin <que...@wo...> wrote:
>
> > Hi all,
> >
> > I'm encountering a weird behaviour when running SPARQL SELECT queries that include SERVICE in the WHERE clause. More specifically, I seem to retrieve duplicate records in the result set when DISTINCT is not used. We are using RemoteRepositoryManager to access the Blazegraph 1.5.2 instance, and both namespaces are defined on the same machine.
> >
> > Let us imagine that I have the following triples in namespace A:
> >
> > <doc1> dcterms:subject <conceptB>
> > <doc1> dcterms:title "Title of Document"^^xsd:string
> >
> > and some triples in namespace B:
> >
> > <conceptB> rdf:type skos:Concept
> > <conceptB> skos:prefLabel "concept label"@en
> >
> > If I run the following SELECT query
> >
> > SELECT ?doc ?concept ?label
> > WHERE {
> >   ?doc dcterms:subject ?concept .
> >   SERVICE <http://localhost:9999/bigdata/namespace/B/sparql> {
> >     ?concept skos:prefLabel ?label .
> >   }
> > }
> >
> > then I would get the following result set:
> >
> > || ?doc || ?concept || ?label ||
> > | <doc1> | <conceptB> | "concept label" |
> > | <doc1> | <conceptB> | "concept label" |
> > | <doc1> | <conceptB> | "concept label" |
> >
> > Interestingly, the number of duplicated records can change from one run to the next. Is this something that other people have encountered?
> >
> > Kind regards,
> >
> > *Quentin Reul*
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Bigdata-developers mailing list
> Big...@li...
> https://lists.sourceforge.net/lists/listinfo/bigdata-developers