|
From: Rob V. <rv...@do...> - 2014-05-22 15:53:50
|
Ah, I think I see what the problem is (well, there are two in fact).

One is that the sub-query is getting scheduled too early in the query, which I have fixed.

The other, which I have just found, was likely introduced by a commit that went into 1.0.4, hence why I was asking if this was a regression from 1.0.3. It relates to algebra generation and means we're potentially executing the graph clause too many times. This is probably going to be a little trickier to fix, but I will aim to have it fixed for 1.0.5 and try to get you a pre-release build with a fix as soon as I can.

Rob

From: Tomek Pluskiewicz <to...@pl...>
Reply-To: dotNetRDF Bug Report tracking and resolution <dot...@li...>
Date: Thursday, 22 May 2014 16:18
To: dotNetRDF Bug Report tracking and resolution <dot...@li...>
Subject: Re: [dotNetRDF-bugs] Problems with SPARQL queries

>
> I tried 1.0.4 and 1.0.5-pre2 and both are equally slow.
>
> Tom
>
> On May 22, 2014 4:40 PM, "Rob Vesse" <rv...@do...> wrote:
>> Tom
>>
>> Are you saying that performance is substantially worse with 1.0.4 versus
>> 1.0.3 or the performance is just as bad across all recent releases?
>>
>> Rob
>>
>> From: Tomasz Pluskiewicz <tom...@gm...>
>> Reply-To: dotNetRDF Bug Report tracking and resolution
>> <dot...@li...>
>> Date: Thursday, 22 May 2014 14:48
>> To: dotNetRDF Bug Report tracking and resolution
>> <dot...@li...>
>> Subject: Re: [dotNetRDF-bugs] Problems with SPARQL queries
>>
>>> Rob, thanks for responding.
>>>
>>> Always +1 for additional diagnostic tools (I mean the ExplainProcessor
>>> enhancement).
>>>
>>> I've been fiddling with our query and the ?s ?p ?o pattern seems to have
>>> little but noticeable impact on the synthetic dataset. But indeed moving the
>>> subquery as-is outside the first GRAPH ?var boosts the query by an order of
>>> magnitude. I've also tried to remove the duplicate triple patterns on both
>>> GRAPH ?v patterns but it doesn't help much either.
>>> Interestingly, a query which combines subquery moved, ?s ?p ?o extracted and
>>> duplicate triple patterns removed is significantly slower than the one with
>>> just the subquery moved outside the GRAPH ?var.
>>>
>>> I've run all kinds of queries against our real-life data (20k quads in over
>>> 900 graphs) and the conclusions are the same. Moving subquery and ?s ?p ?o
>>> graph pattern gives best results.
>>>
>>> Regarding the ORDER BY it still seems like a bug. I wanted to blame
>>> inconsistent results on the fact that the subquery is nested inside the
>>> GRAPH ?var but with the subquery moved I observe the same behaviour.
>>>
>>> All the above is true for 1.0.3. Now regarding 1.0.4+ there are additional
>>> problems as I wrote yesterday. With the real-life data the original query
>>> takes over 2.5 minutes to complete, while in the previous version only about
>>> a quarter of a second is needed! The optimized queries actually took so long
>>> that I never had them finished.
>>>
>>> Tom
>>>
>>>
>>> On Wed, May 21, 2014 at 3:47 PM, Rob Vesse <rv...@do...> wrote:
>>>> Tom
>>>>
>>>> Thanks for the report. I haven't done any debugging yet but I have a few
>>>> thoughts based on what you've described.
>>>>
>>>> ORDER BY causing indeterminate results could be a bug but it also could
>>>> just be an artefact of two things:
>>>> 1. SPARQL only defines a partial ordering so there are some combinations of
>>>> terms for which ordering is left to the implementation, though since we're
>>>> just talking about dotNetRDF such indeterminate orderings should be defined
>>>> consistently
>>>> 2. That you have multiple terms in the data that compare to be equivalent,
>>>> in this case we're at the mercy of .Net's sort implementation for which
>>>> items float to the top and so are returned each time
>>>>
>>>> GRAPH ?var can be quite expensive because what it does is evaluate the
>>>> inner operations over each individual named graph in the dataset in turn.
>>>> Where ?var is already bound this might be a small subset but given the
>>>> structure of your query I suspect there are at least some places where this
>>>> is happening. So with two points in your query where you have GRAPH ?var
>>>> being potentially unbound (or bound to a large number of possible values)
>>>> you would get the O(n^2) (quadratic) scaling behaviour you describe.
>>>>
>>>> Also the ?s ?p ?o at the start of your first GRAPH clause may be causing a
>>>> substantial increase in intermediate results early on in the query. It
>>>> might be better to have a separate GRAPH clause after the first GRAPH
>>>> clause to pull out all the triples once you've determined the graphs you
>>>> actually care about.
>>>>
>>>> There is of course a possibility that dotNetRDF is optimising the query
>>>> badly but that will require some debugging to figure out if this is the
>>>> case.
>>>>
>>>> Using the ExplainQueryProcessor
>>>> (http://www.dotnetrdf.org/api/index.asp?Topic=VDS.RDF.Query.ExplainQueryProcessor)
>>>> with the ExplanationLevel turned up to Full as described at
>>>> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/HowTo/Debug%20SPARQL%20Queries.wiki#!debugging-sparql-queries
>>>> might be enlightening since it'll include things like intermediate result
>>>> count. Though it doesn't currently analyse how many graphs a given GRAPH
>>>> clause has to consider, which will make it hard to spot that nested looping
>>>> on GRAPH ?var if that is the culprit. That would certainly be interesting
>>>> information so I may try and add that in the future.
>>>>
>>>> Let me know if you guys figure anything more out, I'll aim to take a proper
>>>> look and debug this later in the week
>>>>
>>>> Cheers,
>>>>
>>>> Rob
>>>>
>>>> From: Tomek Pluskiewicz <to...@pl...>
>>>> Reply-To: dotNetRDF Bug Report tracking and resolution
>>>> <dot...@li...>
>>>> Date: Wednesday, 21 May 2014 13:46
>>>> To: dotNetRDF Bug Report tracking and resolution
>>>> <dot...@li...>
>>>> Subject: Re: [dotNetRDF-bugs] Problems with SPARQL queries
>>>>
>>>>> Also, here's a test repo https://bitbucket.org/tpluscode/sparql-test
>>>>>
>>>>>
>>>>> On Wed, May 21, 2014 at 2:18 PM, Tomek Pluskiewicz <to...@pl...>
>>>>> wrote:
>>>>>> Hi Rob
>>>>>>
>>>>>> We've been developing an ORM solution complete with Linq for some time
>>>>>> now. It will be open-sourced at some point. Currently we've been
>>>>>> experiencing problems with query speed and reliability. Let me acquaint
>>>>>> you with how things work.
>>>>>>
>>>>>> Each resource is contained within its own named graph and additionally
>>>>>> there is a meta-graph, which connects graphs and the described entities
>>>>>> (there could be many graphs for one resource). For example
>>>>>>
>>>>>> # meta graph
>>>>>> <http://foo.com/productList/>
>>>>>> {
>>>>>>   ex:Wrench1 foaf:primaryTopic ex:Wrench1 .
>>>>>> }
>>>>>>
>>>>>> # wrench
>>>>>> ex:Wrench1 { ex:Wrench1 a sch:Product ; sch:name "Wrench" . }
>>>>>>
>>>>>> The problem is with a query
>>>>>>
>>>>>> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>>>>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>>>> PREFIX schema: <http://schema.org/>
>>>>>> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>>>>>>
>>>>>> SELECT ?s ?p ?o ?Gp0 ?p0
>>>>>> WHERE
>>>>>> {
>>>>>>   GRAPH ?Gp0
>>>>>>   {
>>>>>>     ?s ?p ?o .
>>>>>>     ?p0_sub schema:name ?name0_sub .
>>>>>>     FILTER (CONTAINS(UCASE(?name0_sub),"W"^^xsd:string))
>>>>>>     ?p0 rdf:type schema:Product .
>>>>>>     {
>>>>>>       SELECT DISTINCT ?p0_sub
>>>>>>       WHERE
>>>>>>       {
>>>>>>         GRAPH ?Gp0_sub
>>>>>>         {
>>>>>>           ?p0_sub rdf:type schema:Product .
>>>>>>           ?p0_sub schema:name ?name0_sub .
>>>>>>           FILTER (CONTAINS(UCASE(?name0_sub),"W"^^xsd:string))
>>>>>>         }
>>>>>>         GRAPH <http://foo.com/productList/>
>>>>>>         {
>>>>>>           ?Gp0_sub foaf:primaryTopic ?p0_sub .
>>>>>>         }
>>>>>>       }
>>>>>>       #ORDER BY ?p0_sub
>>>>>>       LIMIT 2
>>>>>>     }
>>>>>>     FILTER(?p0_sub=?p0)
>>>>>>   }
>>>>>>
>>>>>>   GRAPH <http://foo.com/productList/>
>>>>>>   {
>>>>>>     ?Gp0 foaf:primaryTopic ?p0 .
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> transformed from the following Linq
>>>>>>
>>>>>> Query<IProduct>().Where(p => p.Name.ToUpper().Contains(name.ToUpper())).Take(2)
>>>>>>
>>>>>> There are two problems here. The query returns different results on
>>>>>> subsequent runs against the same dataset and it runs very slowly.
>>>>>> Uncommenting the ORDER BY helps with the varying result count though I'm
>>>>>> not exactly sure why it should be necessary. However I'm not sure what's
>>>>>> with performance. Obviously it has something to do with the subquery but
>>>>>> I was unable to alter this SELECT so that it executed quickly. Even as
>>>>>> small a dataset as 9 quads (3 resources * (2 triples + 1 meta-triple))
>>>>>> takes 1 second to complete and the time seems to increase exponentially.
>>>>>> At 90 quads/30 graphs it is already taking close to 3 minutes.
>>>>>>
>>>>>> We first observed the performance problems with version 1.0.4 but with
>>>>>> a synthetic dataset the same issues arise in previous releases and
>>>>>> 1.0.5+.
>>>>>>
>>>>>> Hope you can help. Would you like any additional info?
>>>>>>
>>>>>> Regards,
>>>>>> Tom
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>>>>> Instantly run your Selenium tests across 300+ browser/OS combos.
>>>>> Get unparalleled scalability from the best Selenium testing platform available
>>>>> Simple to use. Nothing to install. Get started now for free."
>>>>> http://p.sf.net/sfu/SauceLabs
>>>>> _______________________________________________
>>>>> dotNetRDF-bugs mailing list
>>>>> dot...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/dotnetrdf-bugs
|
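To illustrate the point made in the thread that GRAPH ?var evaluates its inner operations once per named graph, here is a rough cost model in Python. It is only a toy sketch: the function name graph_pattern_evaluations and the counting rule are assumptions made for illustration, not dotNetRDF's actual evaluation strategy, and the meta-graph is ignored. It compares the original query shape (subquery nested inside GRAPH ?Gp0) with the rewrite Tom describes (subquery moved outside).

```python
def graph_pattern_evaluations(num_graphs: int, subquery_nested: bool) -> int:
    """Count per-graph pattern evaluations under a simplified cost model.

    Assumption (hypothetical model, not dotNetRDF internals): an unbound
    GRAPH ?var visits every named graph, and a subquery nested inside it
    is re-evaluated for each visited graph.
    """
    evaluations = 0
    if not subquery_nested:
        # Subquery runs once; its GRAPH ?Gp0_sub visits every named graph.
        evaluations += num_graphs
    for _ in range(num_graphs):  # outer GRAPH ?Gp0, unbound
        evaluations += 1  # match the outer triple patterns in this graph
        if subquery_nested:
            # Inner GRAPH ?Gp0_sub visits every named graph again.
            evaluations += num_graphs
    return evaluations


if __name__ == "__main__":
    # 3 and 30 graphs match the synthetic datasets in the report;
    # 900 approximates the real-life dataset.
    for n in (3, 30, 900):
        nested = graph_pattern_evaluations(n, subquery_nested=True)
        moved = graph_pattern_evaluations(n, subquery_nested=False)
        print(f"{n} graphs: nested={nested}, subquery moved out={moved}")
```

Under this model the nested form grows as n * (n + 1) while the rewritten form grows as 2 * n. That matches the direction of the speed-up reported when the subquery is moved out of the first GRAPH clause, though the real timings depend on the engine's optimiser.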