Re: [Bigdata-developers] Duplicate in SELECT queries when using SERVICE


Hi Quentin,

> With regard to the query, is the use of hint:Prior hint:runFirst "true" in a query similar to having DISTINCT in the SELECT clause? I have tested the query with the added pattern and it returned the result set that I expected.

Actually, no: the query hint “runFirst” only tells the optimizer that the preceding construct (“hint:Prior”), namely the SERVICE, should be executed first. It should not change the semantics (i.e. the outcome) of the query, so in both cases the DISTINCT keyword *should* not be required. In other words: when the SERVICE is run last there seems to be a problem in the query engine (which is what you reported), while the engine's behavior when the SERVICE is run first is correct.
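
To illustrate why the evaluation order should not matter, here is a minimal Python sketch of a join under SPARQL's bag semantics. It is not Blazegraph code: `join`, `as_bag` and the toy data are hypothetical names made up for this illustration, modelled on the two-namespace example from this thread. Whichever side is evaluated first, the multiset of solutions is the same, so a correct engine should return a single row for that example either way.

```python
from collections import Counter


def join(left, right):
    """Bag-semantics join of two lists of solutions (dicts of variable -> value)."""
    out = []
    for l in left:
        for r in right:
            # Two solutions are compatible if they agree on every shared variable.
            if all(l[v] == r[v] for v in l.keys() & r.keys()):
                out.append({**l, **r})
    return out


def as_bag(solutions):
    """Multiset view of a solution list, ignoring row order."""
    return Counter(frozenset(s.items()) for s in solutions)


# Toy data (assumed for illustration only).
docs = [{"doc": "doc1", "concept": "conceptB"}]               # ?doc dcterms:subject ?concept
labels = [{"concept": "conceptB", "label": "concept label"}]  # ?concept skos:prefLabel ?label

service_last = join(docs, labels)    # triple pattern first, SERVICE second
service_first = join(labels, docs)   # SERVICE first (what hint:runFirst asks for)
```

Comparing `as_bag(service_last)` with `as_bag(service_first)` shows that both orders produce the same bag of solutions, which is why needing DISTINCT in only one of the two cases points to an engine problem rather than to the query.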

I’ve created a ticket at https://jira.blazegraph.com/browse/BLZG-1636 describing the behavior and will have a look at it in the coming days. If you are able to share the data (or can provide a snippet of the data that allows us to reproduce the problem), please let me know; this would help with debugging.

Best,
Michael

>  
> With regard to the runtime statistics, I have attached a screenshot of them.
>  
> Thanks again for your help.
>  
> Quentin Reul
> Advanced Technology | Global Platform Organization | +1 (917) 891 5490 
> Email: Que...@wo...
> Skype: quentin_reul
>  
> From: Michael Schmidt [mailto:ms...@me...] 
> Sent: Sunday, November 15, 2015 3:42 AM
> To: Reul, Quentin
> Cc: Bryan Thompson; big...@li...
> Subject: Re: [Bigdata-developers] Duplicate in SELECT queries when using SERVICE
>  
> Dear Quentin,
>  
> I’ve quickly set up your initial scenario with two triples in each namespace and was not able to reproduce the behavior, i.e. I always get one result there.
>  
> Your actual query is much more complex. In fact, ?profileURI might be bound to the same URI multiple times when evaluating the SERVICE, which could lead to duplicates (@Bryan: could it be that we’re missing a distinct projection here? Though this would not explain why results vary from time to time.). Could you try running the query *without* the SERVICE call and confirm that the number of results is stable? Also, how many results do you get then? And do you know whether there is a skos:prefLabel for each of them? (Just trying to nail down what’s going on.)
>  
> Also, right at the end of the query plan in the EXPLAIN there is a table showing runtime statistics (in particular, how many results were flowing through the operators). It would be quite useful to have that one too (both for the original query and the query without SERVICE). Could you share that as well, preferably as a screenshot?
>  
> Regarding the option to run the SERVICE first: as you mentioned already, simply inverting the order does not help, because the optimizer makes its own decision, independently of the order in which you write things. What you need is a query hint that forces the optimizer to run the SERVICE first (see the triple pattern in the last line, to be placed right after the SERVICE):
>  
> PREFIX dcterms: <http://purl.org/dc/terms/>
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>  
> SELECT ?doc ?concept ?label
> WHERE {
>   ?doc dcterms:subject ?concept .
>   SERVICE <http://localhost:9999/bigdata/namespace/skos/sparql> {
>     ?concept skos:prefLabel ?label .
>   }
>   hint:Prior hint:runFirst "true" .
> }
>  
> However, using that hint for your query would mean extracting all “?profileStatusURI skos:prefLabel ?profileStatusLabel” patterns from the remote endpoint first (without any restriction), so from a performance perspective this might not be the best option. But I’d be interested in the result of this query, i.e. whether you still get duplicates.
>  
> Best,
> Michael
>  
> On 13 Nov 2015, at 23:24, Reul, Quentin <que...@wo...> wrote:
>  
> Hi Bryan,
>  
> I have checked that there are no duplicate triples in either namespace.
>  
> For the query in the attached file, I should retrieve 21 records in my result set (which I do when DISTINCT is used). However, I retrieve 79 records in my result set when DISTINCT is not used. I have tried to use the SERVICE part of the query as the first part of the query, but I obtained the same results.
>  
> Kind regards,
>  
> Quentin Reul
>  
> From: Bryan Thompson [mailto:br...@sy...] 
> Sent: Friday, November 13, 2015 4:01 PM
> To: Reul, Quentin
> Cc: big...@li...
> Subject: Re: [Bigdata-developers] Duplicate in SELECT queries when using SERVICE
>  
> Quentin,
>  
> Can you please look at the EXPLAIN of the query (if you are using the workbench, this is a checkbox under the advanced options, otherwise you can just add &explain to the query - see [1]).  I am curious whether the triple pattern is running before or after the SERVICE call.  
>  
> Can you please confirm that you are observing that behavior on endpoints having just those two triples each?  If there are many triples in the first endpoint, then one possibility is that the SERVICE call is being invoked more than once because multiple "chunks" of solutions are presented to it.  In this case, the join could well have duplicates (same ?concept in different chunks leading to more than one solution with the same bindings from the B endpoint) and a DISTINCT would be required.  
>  
> Another option would be to run the SERVICE first. This could be accomplished with a query hint; see [2].  It is also possible to force certain operations to run exactly once, for example by pushing something into a subquery and using the runOnce query hint.
>  
> Another thing that you can try is to enable the solutions logger.  This logger provides details on the inputs and outputs of each operator. You will be able to see the solutions produced by the triple pattern and those produced by the SERVICE call for each invocation of those operators.  This will help you to understand whether multiple SERVICE invocations are occurring.
>  
> ## 
> # Solutions trace (tab delimited file).  Uncomment the next line to enable.
> #log4j.logger.com.bigdata.bop.engine.SolutionsLog=INFO,solutionsLog
> log4j.additivity.com.bigdata.bop.engine.SolutionsLog=false
> log4j.appender.solutionsLog=org.apache.log4j.ConsoleAppender
> #log4j.appender.solutionsLog=org.apache.log4j.FileAppender
> log4j.appender.solutionsLog.Threshold=ALL
> #log4j.appender.solutionsLog.File=solutions.csv
> #log4j.appender.solutionsLog.Append=true
> # I find that it is nicer to have this unbuffered since you can see what
> # is going on and to make sure that I have complete rule evaluation logs
> # on shutdown.
> #log4j.appender.solutionsLog.BufferedIO=false
> log4j.appender.solutionsLog.layout=org.apache.log4j.PatternLayout
> log4j.appender.solutionsLog.layout.ConversionPattern=SOLUTION:\t%m
>  
> If your investigations do not suggest an obvious solution, then it might be best if you create a ticket from this query and attach the EXPLAIN (which is an html page) to that ticket.  Please add both myself and Michael Schmidt to the ticket as watchers so we will see any updates on the ticket.
>  
> Thanks,
> Bryan
>  
> [1] https://wiki.blazegraph.com/wiki/index.php/Explain#NSS_Explain_Mode
> [2] https://wiki.blazegraph.com/wiki/index.php/QueryHints
> 
> ----
> Bryan Thompson
> Chief Scientist & Founder
> SYSTAP, LLC
> 4501 Tower Road
> Greensboro, NC 27410
> br...@sy...
> http://blazegraph.com
> http://blog.blazegraph.com
>  
>  
> On Fri, Nov 13, 2015 at 11:35 AM, Reul, Quentin <que...@wo...> wrote:
> Hi all,
>  
> I'm encountering a weird behaviour when running SPARQL SELECT queries including SERVICE in the WHERE clause. More specifically, I seem to retrieve duplicate records in the result set when DISTINCT is not used. We are using RemoteRepositoryManager to access the BlazeGraph 1.5.2 instance and both namespaces are defined on the same machine.
>  
> Let us imagine that I have the following triples in namespace A:
>   <doc1> dcterms:subject <conceptB>
>   <doc1> dcterms:title "Title of Document"^^xsd:string
>  
> and some triples in namespace B:
>   <conceptB> rdf:type skos:Concept
>   <conceptB> skos:prefLabel "concept label"@en
>  
> If I run the following SELECT query
>   SELECT ?doc ?concept ?label
>   WHERE {
>     ?doc dcterms:subject ?concept .
>     SERVICE <http://localhost:9999/bigdata/namespace/B/sparql> {
>       ?concept skos:prefLabel ?label .
>     }
>   }
> then I would get the following result set:
> || ?doc   || ?concept  || ?label        ||
> | <doc1>  | <conceptB> | "concept label" |
> | <doc1>  | <conceptB> | "concept label" |
> | <doc1>  | <conceptB> | "concept label" |
>  
> Interestingly, the number of duplicated records can change from run to run. Is this something that other people have encountered?
>  
> Kind regards,
>  
> Quentin Reul
>  
> 
> ------------------------------------------------------------------------------
> 
> _______________________________________________
> Bigdata-developers mailing list
> Big...@li...
> https://lists.sourceforge.net/lists/listinfo/bigdata-developers
>  
> <sparql_query_explain.txt>
>  
> <BG - Runtime statistics on SERVICE.jpg>
