Re: [dotNetRDF-bugs] Problems with SPARQL queries

Ah, I think I see what the problem is (well, there are two in fact).

One is that the sub-query is getting scheduled too early in the query, which
I have fixed.

The other, which I have just found, was likely introduced by a commit that
went into 1.0.4, hence why I was asking if this was a regression from 1.0.3.
It relates to algebra generation and means we're potentially executing the
GRAPH clause too many times.  This is probably going to be a little trickier
to fix, but I will aim to have it fixed for 1.0.5 and try to get you a
pre-release build with a fix as soon as I can.

Rob

From:  Tomek Pluskiewicz <to...@pl...>
Reply-To:  dotNetRDF Bug Report tracking and resolution
<dot...@li...>
Date:  Thursday, 22 May 2014 16:18
To:  dotNetRDF Bug Report tracking and resolution
<dot...@li...>
Subject:  Re: [dotNetRDF-bugs] Problems with SPARQL queries

> 
> I tried 1.0.4 and 1.0.5-pre2 and both are equally slow.
> 
> Tom
> 
> On May 22, 2014 4:40 PM, "Rob Vesse" <rv...@do...> wrote:
>> Tom
>> 
>> Are you saying that performance is substantially worse with 1.0.4 versus
>> 1.0.3 or the performance is just as bad across all recent releases?
>> 
>> Rob
>> 
>> From:  Tomasz Pluskiewicz <tom...@gm...>
>> Reply-To:  dotNetRDF Bug Report tracking and resolution
>> <dot...@li...>
>> Date:  Thursday, 22 May 2014 14:48
>> To:  dotNetRDF Bug Report tracking and resolution
>> <dot...@li...>
>> Subject:  Re: [dotNetRDF-bugs] Problems with SPARQL queries
>> 
>>> Rob, thanks for responding.
>>> 
>>> Always +1 for additional diagnostic tools (I mean the ExplainProcessor
>>> enhancement).
>>> 
>>> I've been fiddling with our query and the ?s ?p ?o pattern seems to have a
>>> small but noticeable impact on the synthetic dataset. But indeed moving the
>>> subquery as-is outside the first GRAPH ?var speeds the query up by an order
>>> of magnitude. I've also tried removing the duplicate triple patterns on both
>>> GRAPH ?v patterns but it doesn't help much either. Interestingly, a query
>>> which combines the subquery moved, ?s ?p ?o extracted and duplicate triple
>>> patterns removed is significantly slower than the one with just the subquery
>>> moved outside the GRAPH ?var.
>>> 
>>> I've run all kinds of queries against our real-life data (20k quads in over
>>> 900 graphs) and the conclusions are the same. Moving the subquery and the
>>> ?s ?p ?o graph pattern gives the best results.
>>> 
>>> Regarding the ORDER BY, it still seems like a bug. I wanted to blame the
>>> inconsistent results on the fact that the subquery is nested inside the
>>> GRAPH ?var, but with the subquery moved I observe the same behaviour.
>>> 
>>> All the above is true for 1.0.3. Now regarding 1.0.4+ there are additional
>>> problems, as I wrote yesterday. With the real-life data the original query
>>> takes over 2.5 minutes to complete, while in the previous version only about
>>> a quarter of a second was needed! The optimized queries actually took so
>>> long that I never let them finish.
>>> 
>>> Tom
>>> 
>>> 
>>> On Wed, May 21, 2014 at 3:47 PM, Rob Vesse <rv...@do...> wrote:
>>>> Tom
>>>> 
>>>> Thanks for the report, I haven't done any debugging yet but I have a few
>>>> thoughts based on what you've described
>>>> 
>>>> ORDER BY causing indeterminate results could be a bug but it also could
>>>> just be an artefact of two things:
>>>> 1. SPARQL only defines a partial ordering so there are some combinations of
>>>> terms for which ordering is left to the implementation though since we're
>>>> just talking about dotNetRDF such indeterminate orderings should be defined
>>>> consistently
>>>> 2. That you have multiple terms in the data that compare to be equivalent,
>>>> in this case we're at the mercy of .Net's sort implementation for which
>>>> items float to the top and so are returned each time
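The two points above can be illustrated with a small, engine-independent sketch. It is not dotNetRDF code: the rows and the `top_k` / `distinct_outcomes` helpers are made up for illustration. It mimics LIMIT 2 over rows that may arrive in any order, first with no ORDER BY, then with an ordering that leaves a tie, then with the tie broken.

```python
from itertools import permutations

# Hypothetical solution rows (product IRI, name). Two rows share a name,
# so ordering by name alone leaves a tie between them.
ROWS = [
    ("ex:Wrench1", "Wrench"),
    ("ex:Wrench2", "Wrench"),
    ("ex:Washer1", "Washer"),
]

def top_k(rows, k, key=None):
    """Mimic an optional ORDER BY (when key is given) followed by LIMIT k."""
    ordered = sorted(rows, key=key) if key else list(rows)
    return tuple(ordered[:k])

def distinct_outcomes(k, key=None):
    """Collect every LIMIT-k result over all possible arrival orders of ROWS."""
    return {top_k(p, k, key) for p in permutations(ROWS)}

no_order = distinct_outcomes(2)                           # no ORDER BY at all
by_name = distinct_outcomes(2, key=lambda r: r[1])        # ORDER BY with a tie
total = distinct_outcomes(2, key=lambda r: (r[1], r[0]))  # tie broken by IRI

print(len(no_order), len(by_name), len(total))
```

Python's sort is stable, so in this sketch tied rows simply keep their arrival order; with an unstable sort the tie order would depend on the sort algorithm instead. Either way, only an ordering that breaks every tie pins the LIMIT result down to a single outcome.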
>>>> 
>>>> GRAPH ?var can be quite expensive because what it does is evaluate the
>>>> inner operations over each individual named graph in the dataset in turn.
>>>> Where ?var is already bound this might be a small subset, but given the
>>>> structure of your query I suspect there are at least some places where this
>>>> is happening.  So with two points in your query where you have GRAPH ?var
>>>> being potentially unbound (or bound to a large number of possible values)
>>>> you would get the O(n^2) scaling behaviour you describe, i.e. quadratic in
>>>> the number of graphs.
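A rough cost model makes the scaling concrete. This is only a sketch under a stated assumption, namely that a subquery nested inside GRAPH ?var is re-run once for every graph the outer clause visits; it does not describe how dotNetRDF actually evaluates the algebra, and `count_graph_evaluations` is a hypothetical helper.

```python
def count_graph_evaluations(num_graphs: int, hoist_subquery: bool) -> int:
    """Count evaluations of a pattern against a single named graph.

    Assumption (not taken from dotNetRDF): a nested subquery is re-run for
    every graph the outer GRAPH ?var iterates over.
    """
    evaluations = 0
    if hoist_subquery:
        # Subquery runs once: its own GRAPH ?var visits every graph once.
        for _inner in range(num_graphs):
            evaluations += 1
        # The outer GRAPH ?var then visits every graph once.
        for _outer in range(num_graphs):
            evaluations += 1
    else:
        for _outer in range(num_graphs):
            evaluations += 1  # outer patterns against this graph
            for _inner in range(num_graphs):
                evaluations += 1  # nested subquery scans every graph again
    return evaluations

for n in (3, 30, 900):
    print(n, count_graph_evaluations(n, False), count_graph_evaluations(n, True))
```

Under this model the nested form grows as n * (n + 1) while the hoisted form grows as 2 * n, which is the difference between quadratic and linear growth in the number of graphs.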
>>>> 
>>>> Also the ?s ?p ?o in the start of your first GRAPH clause may be causing a
>>>> substantial increase in intermediate results early on in the query.  It
>>>> might be better to have a separate GRAPH clause after the first GRAPH
>>>> clause to pull out all the triples once you've determined the graphs you
>>>> actually care about.
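One way the restructuring could look is sketched below, held in a Python string so it can be handed to whichever SPARQL client is in use. This is a hypothetical rewrite of the query from this thread, not something verified against dotNetRDF: the subquery is moved in front of the first GRAPH ?Gp0, ORDER BY is enabled, and ?s ?p ?o gets its own GRAPH clause at the end. The constant name `RESTRUCTURED_QUERY` is made up for illustration.

```python
# Hypothetical rewrite of the query discussed in this thread.
# Not verified against dotNetRDF; variable names follow the original query.
RESTRUCTURED_QUERY = '''
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?s ?p ?o ?Gp0 ?p0
WHERE
{
  # 1. Subquery first, outside any GRAPH ?var.
  {
    SELECT DISTINCT ?p0_sub
    WHERE
    {
      GRAPH ?Gp0_sub
      {
        ?p0_sub rdf:type schema:Product .
        ?p0_sub schema:name ?name0_sub .
        FILTER (CONTAINS(UCASE(?name0_sub), "W"^^xsd:string))
      }
      GRAPH <http://foo.com/productList/>
      {
        ?Gp0_sub foaf:primaryTopic ?p0_sub .
      }
    }
    ORDER BY ?p0_sub
    LIMIT 2
  }
  # 2. Narrow down the graphs of interest.
  GRAPH ?Gp0
  {
    ?p0_sub schema:name ?name0_sub .
    FILTER (CONTAINS(UCASE(?name0_sub), "W"^^xsd:string))
    ?p0 rdf:type schema:Product .
    FILTER (?p0_sub = ?p0)
  }
  GRAPH <http://foo.com/productList/>
  {
    ?Gp0 foaf:primaryTopic ?p0 .
  }
  # 3. Only now pull every triple out of the selected graphs.
  GRAPH ?Gp0
  {
    ?s ?p ?o .
  }
}
'''
```

The patterns and filters are the same as in the original query; only their position changes, so the intent is that the results stay the same while the expensive parts are evaluated fewer times. Whether a given engine version actually benefits would need to be measured.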
>>>> 
>>>> There is of course a possibility that dotNetRDF is optimising the query
>>>> badly but that will require some debugging to figure out if this is the
>>>> case.
>>>> 
>>>> Using the ExplainQueryProcessor
>>>> (http://www.dotnetrdf.org/api/index.asp?Topic=VDS.RDF.Query.ExplainQueryProcessor)
>>>> with the ExplanationLevel turned up to Full as described at
>>>> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/HowTo/Debug%20SPARQL%20Queries.wiki#!debugging-sparql-queries
>>>> might be enlightening since it'll include things like intermediate result
>>>> counts.  It doesn't currently analyse how many graphs a given GRAPH clause
>>>> has to consider, though, which will make it hard to spot that repeated
>>>> looping on GRAPH ?var if that is the culprit. That would certainly be
>>>> interesting information so I may try and add that in the future.
>>>> 
>>>> Let me know if you guys figure anything more out, I'll aim to take a proper
>>>> look and debug this later in the week
>>>> 
>>>> Cheers,
>>>> 
>>>> Rob
>>>> 
>>>> From:  Tomek Pluskiewicz <to...@pl...>
>>>> Reply-To:  dotNetRDF Bug Report tracking and resolution
>>>> <dot...@li...>
>>>> Date:  Wednesday, 21 May 2014 13:46
>>>> To:  dotNetRDF Bug Report tracking and resolution
>>>> <dot...@li...>
>>>> Subject:  Re: [dotNetRDF-bugs] Problems with SPARQL queries
>>>> 
>>>>> Also, here's a test repo https://bitbucket.org/tpluscode/sparql-test
>>>>> 
>>>>> 
>>>>> On Wed, May 21, 2014 at 2:18 PM, Tomek Pluskiewicz <to...@pl...>
>>>>> wrote:
>>>>>> Hi Rob
>>>>>> 
>>>>>> We've been developing an ORM solution complete with Linq for some time
>>>>>> now. It will be open-sourced at some point. Currently we've been
>>>>>> experiencing problems with query speed and reliability. Let me acquaint
>>>>>> you with how things work.
>>>>>> 
>>>>>> Each resource is contained within its own named graph and additionally
>>>>>> there is a meta-graph, which connects graphs and the described entities
>>>>>> (there could be many graphs for one resource). For example
>>>>>> 
>>>>>> # meta graph
>>>>>> <http://foo.com/productList/>
>>>>>> {
>>>>>>   ex:Wrench1 foaf:primaryTopic ex:Wrench1 .
>>>>>> }
>>>>>> 
>>>>>> # wrench
>>>>>> ex:Wrench1 { ex:Wrench1 a sch:Product ; sch:name "Wrench" . }
>>>>>> 
>>>>>> The problem is with a query
>>>>>> 
>>>>>> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>>>>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>>>> PREFIX schema: <http://schema.org/>
>>>>>> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>>>>>> 
>>>>>> SELECT ?s ?p ?o ?Gp0 ?p0
>>>>>> WHERE
>>>>>> {
>>>>>>   GRAPH ?Gp0
>>>>>>   {
>>>>>>     ?s ?p ?o .
>>>>>>     ?p0_sub schema:name ?name0_sub .
>>>>>>     FILTER (CONTAINS(UCASE(?name0_sub),"W"^^xsd:string))
>>>>>>     ?p0 rdf:type schema:Product .
>>>>>>     {
>>>>>>       SELECT DISTINCT ?p0_sub
>>>>>>       WHERE
>>>>>>       {
>>>>>>         GRAPH ?Gp0_sub
>>>>>>         {
>>>>>>           ?p0_sub rdf:type schema:Product .
>>>>>>           ?p0_sub schema:name ?name0_sub .
>>>>>>           FILTER (CONTAINS(UCASE(?name0_sub),"W"^^xsd:string))
>>>>>>         }
>>>>>>         GRAPH <http://foo.com/productList/>
>>>>>>         {
>>>>>>           ?Gp0_sub foaf:primaryTopic ?p0_sub .
>>>>>>         }
>>>>>>       }
>>>>>>       #ORDER BY ?p0_sub
>>>>>>       LIMIT 2
>>>>>>     }
>>>>>>     FILTER(?p0_sub=?p0)
>>>>>>   }
>>>>>> 
>>>>>>   GRAPH <http://foo.com/productList/>
>>>>>>   {
>>>>>>     ?Gp0 foaf:primaryTopic ?p0 .
>>>>>>   }
>>>>>> }
>>>>>> 
>>>>>> transformed from the following Linq
>>>>>> 
>>>>>> Query<IProduct>().Where(p =>
>>>>>> p.Name.ToUpper().Contains(name.ToUpper())).Take(2)
>>>>>> 
>>>>>> There are two problems here. The query returns different results on
>>>>>> subsequent runs against the same dataset, and it runs very slowly.
>>>>>> Uncommenting the ORDER BY helps with the varying result count, though
>>>>>> I'm not exactly sure why it should be necessary. However, I'm not sure
>>>>>> what is wrong with performance. Obviously it has something to do with
>>>>>> the subquery but I was unable to alter this SELECT so that it executed
>>>>>> quickly. Even as small a dataset as 9 quads (3 resources * (2 triples +
>>>>>> 1 meta-triple)) takes 1 second to complete and the time seems to
>>>>>> increase exponentially. At 90 quads/30 graphs it is already taking
>>>>>> close to 3 minutes.
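The synthetic dataset sizes quoted above follow directly from the graph layout. A minimal sketch of a generator is shown below; the `http://example.org/` IRIs, the generated names and the `build_quads` helper are assumptions for illustration, and the actual test data lives in the repository linked elsewhere in this thread.

```python
META_GRAPH = "http://foo.com/productList/"

def build_quads(num_resources: int) -> list:
    """Return (subject, predicate, object, graph) tuples: two triples per
    resource in its own named graph plus one meta-graph triple."""
    quads = []
    for i in range(1, num_resources + 1):
        resource = f"http://example.org/Wrench{i}"  # hypothetical IRI
        graph = resource  # graph named after the resource, as in the example
        quads.append((resource, "rdf:type", "schema:Product", graph))
        quads.append((resource, "schema:name", f"Wrench {i}", graph))
        quads.append((graph, "foaf:primaryTopic", resource, META_GRAPH))
    return quads

for n in (3, 30):
    quads = build_quads(n)
    graphs = {q[3] for q in quads}
    print(n, "resources:", len(quads), "quads in", len(graphs), "graphs")
```

With 3 resources this yields the 9 quads mentioned above, and with 30 resources the 90 quads; the graph count here includes the meta graph on top of the per-resource graphs.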
>>>>>> 
>>>>>> We first observed the performance problems with version 1.0.4, but with
>>>>>> a synthetic dataset the same issues arise in previous releases and
>>>>>> 1.0.5+.
>>>>>> 
>>>>>> Hope you can help. Would you like any additional info?
>>>>>> 
>>>>>> Regards,
>>>>>> Tom
>>>>> 
>>>>> ------------------------------------------------------------------------------
>>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>>>>> Instantly run your Selenium tests across 300+ browser/OS combos.
>>>>> Get unparalleled scalability from the best Selenium testing platform
>>>>> available
>>>>> Simple to use. Nothing to install. Get started now for free."
>>>>> http://p.sf.net/sfu/SauceLabs
>>>>> _______________________________________________
>>>>> dotNetRDF-bugs mailing list
>>>>> dot...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/dotnetrdf-bugs
>>>> 
>>> 
>> 