From: Tomek P. <to...@pl...> - 2014-05-23 12:50:09
Thanks. I'm always equally impressed with the speed and efficiency!

Any idea though why the ORDER BY is required for the query to return correct results reliably?

We're good with 1.0.3 for now so you need not rush.

Cheers,
Tom

On May 22, 2014 5:53 PM, "Rob Vesse" <rv...@do...> wrote:
>
> Ah, I think I see what the problem is (well there's two in fact)
>
> One is that the sub-query is getting scheduled too early in the query which I have fixed
>
> The other I have just found was likely introduced by a commit that went into 1.0.4 hence why I was asking if this was a regression from 1.0.3. It relates to algebra generation and means we're potentially executing the graph clause too many times. This is probably gonna be a little trickier to fix but I will aim to have it fixed for 1.0.5 and try and get you a pre-release build with a fix as soon as I can
>
> Rob
>
> From: Tomek Pluskiewicz <to...@pl...>
> Reply-To: dotNetRDF Bug Report tracking and resolution <dot...@li...>
> Date: Thursday, 22 May 2014 16:18
> To: dotNetRDF Bug Report tracking and resolution <dot...@li...>
> Subject: Re: [dotNetRDF-bugs] Problems with SPARQL queries
>
> I tried 1.0.4 and 1.0.5-pre2 and both are equally slow.
>
> Tom
>
> On May 22, 2014 4:40 PM, "Rob Vesse" <rv...@do...> wrote:
>>
>> Tom
>>
>> Are you saying that performance is substantially worse with 1.0.4 versus 1.0.3 or the performance is just as bad across all recent releases?
>>
>> Rob
>>
>> From: Tomasz Pluskiewicz <tom...@gm...>
>> Reply-To: dotNetRDF Bug Report tracking and resolution <dot...@li...>
>> Date: Thursday, 22 May 2014 14:48
>> To: dotNetRDF Bug Report tracking and resolution <dot...@li...>
>> Subject: Re: [dotNetRDF-bugs] Problems with SPARQL queries
>>
>> Rob, thanks for responding.
>>
>> Always +1 for additional diagnostic tools (I mean the ExplainProcessor enhancement).
>>
>> I've been fiddling with our query and the ?s ?p ?o pattern seems to have little but noticeable impact on the synthetic dataset.
>> But indeed moving the subquery as-is outside the first GRAPH ?var boosts the query by an order of magnitude. I've also tried to remove the duplicate triple patterns on both GRAPH ?v patterns but it doesn't help much either. Interestingly a query which combines subquery moved, ?s ?p ?o extracted and duplicate triple patterns removed is significantly slower than the one with just subquery moved outside the GRAPH ?var.
>>
>> I've run all kinds of queries against our real-life data (20k quads in over 900 graphs) and the conclusions are the same. Moving subquery and ?s ?p ?o graph pattern gives best results.
>>
>> Regarding the ORDER BY it still seems like a bug. I wanted to blame inconsistent results on the fact that the subquery is nested inside the GRAPH ?var but with the subquery moved I observe the same behaviour.
>>
>> All the above is true for 1.0.3. Now regarding 1.0.4+ there are additional problems as I wrote yesterday. With the real-life data the original query takes over 2.5 minutes to complete, while in previous version only about a quarter of a second is needed! The optimized queries actually took so long that I never had them finished.
>>
>> Tom
>>
>>
>> On Wed, May 21, 2014 at 3:47 PM, Rob Vesse <rv...@do...> wrote:
>>>
>>> Tom
>>>
>>> Thanks for the report, I haven't done any debugging yet but I have a few thoughts based on what you've described
>>>
>>> ORDER BY causing indeterminate results could be a bug but it also could just be an artefact of two things:
>>>
>>> SPARQL only defines a partial ordering so there are some combinations of terms for which ordering is left to the implementation though since we're just talking about dotNetRDF such indeterminate orderings should be defined consistently
>>> That you have multiple terms in the data that compare to be equivalent, in this case we're at the mercy of .Net's sort implementation for which items float to the top and so are returned each time
>>>
>>> GRAPH ?var can be quite expensive because what it does is evaluate the inner operations over each individual named graph in the dataset in turn. Where ?var is already bound this might be a small subset but given the structure of your query I suspect there are at least some places where this is happening. So with two points in your query where you have GRAPH ?var being potentially unbound (or bound to a large number of possible values) you would get the O(n^2) exponential scaling behaviour you describe
>>>
>>> Also the ?s ?p ?o in the start of your first GRAPH clause may be causing a substantial increase in intermediate results early on in the query. It might be better to have a separate GRAPH clause after the first GRAPH clause to pull out all the triples once you've determined the graphs you actually care about.
>>>
>>> There is of course a possibility that dotNetRDF is optimising the query badly but that will require some debugging to figure out if this is the case.
>>>
>>> Using the ExplainQueryProcessor (http://www.dotnetrdf.org/api/index.asp?Topic=VDS.RDF.Query.ExplainQueryProcessor) with the ExplanationLevel turned up to Full as described at https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/HowTo/Debug%20SPARQL%20Queries.wiki#!debugging-sparql-queries might be enlightening since it'll include things like intermediate result count. Though it doesn't currently analyse how many graphs a given GRAPH clause has to consider, which will make it hard to spot that exponential looping on GRAPH ?var if that is the culprit. That would certainly be interesting information so I may try and add that in the future.
>>>
>>> Let me know if you guys figure anything more out, I'll aim to take a proper look and debug this later in the week
>>>
>>> Cheers,
>>>
>>> Rob
>>>
>>> From: Tomek Pluskiewicz <to...@pl...>
>>> Reply-To: dotNetRDF Bug Report tracking and resolution <dot...@li...>
>>> Date: Wednesday, 21 May 2014 13:46
>>> To: dotNetRDF Bug Report tracking and resolution <dot...@li...>
>>> Subject: Re: [dotNetRDF-bugs] Problems with SPARQL queries
>>>
>>> Also, here's a test repo https://bitbucket.org/tpluscode/sparql-test
>>>
>>>
>>> On Wed, May 21, 2014 at 2:18 PM, Tomek Pluskiewicz <to...@pl...> wrote:
>>>>
>>>> Hi Rob
>>>>
>>>> We've been developing an ORM solution complete with Linq for some time now. It will be open-sourced at some point. Currently we've been experiencing problems with query speed and reliability. Let me acquaint you with how things work.
>>>>
>>>> Each resource is contained within its own named graph and additionally there is a meta-graph, which connects graphs and the described entities (there could be many graphs for one resource). For example
>>>>
>>>> # meta graph
>>>> <http://foo.com/productList/>
>>>> {
>>>>   ex:Wrench1 foaf:primaryTopic ex:Wrench1 .
>>>> }
>>>>
>>>> # wrench
>>>> ex:Wrench1 { ex:Wrench1 a sch:Product ; sch:name "Wrench" .
>>>> }
>>>>
>>>> The problem is with a query
>>>>
>>>> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>> PREFIX schema: <http://schema.org/>
>>>> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>>>>
>>>> SELECT ?s ?p ?o ?Gp0 ?p0
>>>> WHERE
>>>> {
>>>>   GRAPH ?Gp0
>>>>   {
>>>>     ?s ?p ?o .
>>>>     ?p0_sub schema:name ?name0_sub .
>>>>     FILTER (CONTAINS(UCASE(?name0_sub),"W"^^xsd:string))
>>>>     ?p0 rdf:type schema:Product .
>>>>     {
>>>>       SELECT DISTINCT ?p0_sub
>>>>       WHERE
>>>>       {
>>>>         GRAPH ?Gp0_sub
>>>>         {
>>>>           ?p0_sub rdf:type schema:Product .
>>>>           ?p0_sub schema:name ?name0_sub .
>>>>           FILTER (CONTAINS(UCASE(?name0_sub),"W"^^xsd:string))
>>>>         }
>>>>         GRAPH <http://foo.com/productList/>
>>>>         {
>>>>           ?Gp0_sub foaf:primaryTopic ?p0_sub .
>>>>         }
>>>>       }
>>>>       #ORDER BY ?p0_sub
>>>>       LIMIT 2
>>>>     }
>>>>     FILTER(?p0_sub=?p0)
>>>>   }
>>>>
>>>>   GRAPH <http://foo.com/productList/>
>>>>   {
>>>>     ?Gp0 foaf:primaryTopic ?p0 .
>>>>   }
>>>> }
>>>>
>>>> transformed from the following Linq
>>>>
>>>> Query<IProduct>().Where(p => p.Name.ToUpper().Contains(name.ToUpper())).Take(2)
>>>>
>>>> There are two problems here. The query returns different results on subsequent runs against the same dataset and it runs very slowly. Uncommenting the ORDER BY helps with the varying result count though I'm not exactly sure why it should be necessary. However I'm not sure what's with performance. Obviously it has something to do with the subquery but I was unable to alter this SELECT so that it executed quickly. Even as small a dataset as 9 quads (3 resources * (2 triples + 1 meta-triple)) takes 1 second to complete and the time seems to increase exponentially. At 90 quads/30 graphs it is already taking close to 3 minutes.
>>>>
>>>> We first observed the performance problems with version 1.0.4 but with a synthetic dataset the same issues arise in previous releases and 1.0.5+.
>>>>
>>>> Hope you can help. Would you like any additional info?
>>>>
>>>> Regards,
>>>> Tom
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>>> Instantly run your Selenium tests across 300+ browser/OS combos.
>>> Get unparalleled scalability from the best Selenium testing platform available
>>> Simple to use. Nothing to install. Get started now for free."
>>> http://p.sf.net/sfu/SauceLabs
>>> _______________________________________________
>>> dotNetRDF-bugs mailing list
>>> dot...@li...
>>> https://lists.sourceforge.net/lists/listinfo/dotnetrdf-bugs
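
Editorial sketch, not part of the original thread: Rob's point that GRAPH ?var evaluates its inner operations over each named graph in turn can be illustrated with a toy model. The Python below is only a rough illustration under stated assumptions. It is not dotNetRDF's engine, it ignores the meta graph and the LIMIT 2, and every name in it (make_dataset, match_products, evaluate_nested, evaluate_hoisted) is hypothetical. It counts how often the product pattern gets evaluated when the sub-query sits inside the outer GRAPH clause compared with when it is moved outside, as Tom tried.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Dataset = Dict[str, List[Triple]]  # named graph -> triples in that graph


def make_dataset(num_products: int) -> Dataset:
    """Hypothetical data: one named graph per product, two triples each."""
    dataset: Dataset = {}
    for i in range(num_products):
        product = f"ex:Wrench{i}"
        dataset[product] = [
            (product, "rdf:type", "schema:Product"),
            (product, "schema:name", f"Wrench {i}"),
        ]
    return dataset


def match_products(triples: List[Triple], needle: str) -> Set[str]:
    """Products in one graph whose upper-cased name contains `needle`."""
    products = {s for s, p, o in triples
                if p == "rdf:type" and o == "schema:Product"}
    return {s for s, p, o in triples
            if p == "schema:name" and s in products and needle in o.upper()}


def evaluate_nested(dataset: Dataset, needle: str) -> Tuple[Set[str], int]:
    """Sub-query inside the outer GRAPH clause: re-run for every outer graph."""
    evaluations = 0
    results: Set[str] = set()
    for outer_triples in dataset.values():        # GRAPH ?Gp0
        candidates: Set[str] = set()
        for inner_triples in dataset.values():    # GRAPH ?Gp0_sub, repeated
            candidates |= match_products(inner_triples, needle)
            evaluations += 1
        results |= match_products(outer_triples, needle) & candidates
        evaluations += 1
    return results, evaluations


def evaluate_hoisted(dataset: Dataset, needle: str) -> Tuple[Set[str], int]:
    """Sub-query moved outside: evaluated once, then joined with the outer graphs."""
    evaluations = 0
    candidates: Set[str] = set()
    for inner_triples in dataset.values():        # GRAPH ?Gp0_sub, once
        candidates |= match_products(inner_triples, needle)
        evaluations += 1
    results: Set[str] = set()
    for outer_triples in dataset.values():        # GRAPH ?Gp0
        results |= match_products(outer_triples, needle) & candidates
        evaluations += 1
    return results, evaluations


if __name__ == "__main__":
    data = make_dataset(30)
    print(evaluate_nested(data, "W")[1], evaluate_hoisted(data, "W")[1])
```

In this model both forms return the same products, but with n graphs the nested form performs n * (n + 1) pattern evaluations and the hoisted form 2 * n, so 930 against 60 for the 30-graph case mentioned in the thread. That is quadratic growth in the number of graphs. The timings Tom reports grow faster than that, which fits Rob's later finding that 1.0.4 was also executing the graph clause too many times.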