From: Rob V. <rv...@do...> - 2014-05-21 14:04:38
Tom

Thanks for the report. I haven't done any debugging yet, but I have a few thoughts based on what you've described.

ORDER BY causing indeterminate results could be a bug, but it could also just be an artefact of two things:

1. SPARQL only defines a partial ordering, so there are some combinations of terms for which ordering is left to the implementation, though since we're just talking about dotNetRDF such indeterminate orderings should be defined consistently
2. You have multiple terms in the data that compare as equivalent, in which case we're at the mercy of .Net's sort implementation for which items float to the top and so are returned each time

GRAPH ?var can be quite expensive because what it does is evaluate the inner operations over each individual named graph in the dataset in turn. Where ?var is already bound this might be a small subset, but given the structure of your query I suspect there are at least some places where the full iteration is happening. So with two points in your query where you have GRAPH ?var being potentially unbound (or bound to a large number of possible values) you would get the O(n^2) (quadratic) scaling behaviour you describe.

Also, the ?s ?p ?o at the start of your first GRAPH clause may be causing a substantial increase in intermediate results early on in the query. It might be better to have a separate GRAPH clause after the first GRAPH clause to pull out all the triples once you've determined the graphs you actually care about.

There is of course a possibility that dotNetRDF is optimising the query badly, but that will require some debugging to figure out if this is the case. Using the ExplainQueryProcessor (http://www.dotnetrdf.org/api/index.asp?Topic=VDS.RDF.Query.ExplainQueryProcessor) with the ExplanationLevel turned up to Full as described at https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/HowTo/Debug%20SPARQL%20Queries.wiki#!debugging-sparql-queries might be enlightening since it'll include things like intermediate result count.
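
To make the GRAPH ?var cost described above concrete, here is a minimal Python sketch of a toy cost model. It is only an illustration under stated assumptions, not dotNetRDF's actual evaluator: the function name `count_inner_evaluations` and the idea of counting one "evaluation" per combination of graphs are hypothetical, chosen to show how two unbound GRAPH ?var clauses multiply the work.

```python
def count_inner_evaluations(num_graphs: int, unbound_graph_clauses: int) -> int:
    """Toy model: every unbound GRAPH ?var clause loops over all named graphs.

    Returns how many times the innermost pattern would be evaluated if each
    of the `unbound_graph_clauses` clauses had to try every one of the
    `num_graphs` named graphs. Hypothetical model, not dotNetRDF code.
    """
    evaluations = 1
    for _ in range(unbound_graph_clauses):
        # Each additional unbound clause multiplies the work by the graph count.
        evaluations *= num_graphs
    return evaluations


if __name__ == "__main__":
    for n in (3, 30):
        one = count_inner_evaluations(n, 1)
        two = count_inner_evaluations(n, 2)
        print(f"{n} graphs: {one} evaluations with one unbound clause, {two} with two")
```

In this model, going from 3 to 30 graphs with two unbound clauses raises the count from 9 to 900, a 100-fold increase, which is at least roughly in line with a jump from about a second to a few minutes. The ExplainQueryProcessor output would show the real intermediate result counts to compare against such a model.
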
Though it doesn't currently analyse how many graphs a given GRAPH clause has to consider, which will make it hard to spot that repeated looping on GRAPH ?var if that is the culprit. That would certainly be interesting information, so I may try and add that in the future.

Let me know if you guys figure anything more out. I'll aim to take a proper look and debug this later in the week.

Cheers,

Rob

From: Tomek Pluskiewicz <to...@pl...>
Reply-To: dotNetRDF Bug Report tracking and resolution <dot...@li...>
Date: Wednesday, 21 May 2014 13:46
To: dotNetRDF Bug Report tracking and resolution <dot...@li...>
Subject: Re: [dotNetRDF-bugs] Problems with SPARQL queries

> Also, here's a test repo https://bitbucket.org/tpluscode/sparql-test
>
> On Wed, May 21, 2014 at 2:18 PM, Tomek Pluskiewicz <to...@pl...> wrote:
>> Hi Rob
>>
>> We've been developing an ORM solution complete with Linq for some time now.
>> It will be open-sourced at some point. Currently we've been experiencing
>> problems with query speed and reliability. Let me acquaint you with how
>> things work.
>>
>> Each resource is contained within its own named graph and additionally there
>> is a meta-graph, which connects graphs and the described entities (there
>> could be many graphs for one resource). For example
>>
>> # meta graph
>> <http://foo.com/productList/>
>> {
>>   ex:Wrench1 foaf:primaryTopic ex:Wrench1 .
>> }
>>
>> # wrench
>> ex:Wrench1 { ex:Wrench1 a sch:Product ; sch:name "Wrench" . }
>>
>> The problem is with a query
>>
>> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>> PREFIX schema: <http://schema.org/>
>> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>>
>> SELECT ?s ?p ?o ?Gp0 ?p0
>> WHERE
>> {
>>   GRAPH ?Gp0
>>   {
>>     ?s ?p ?o .
>>     ?p0_sub schema:name ?name0_sub .
>>     FILTER (CONTAINS(UCASE(?name0_sub),"W"^^xsd:string))
>>     ?p0 rdf:type schema:Product .
>>     {
>>       SELECT DISTINCT ?p0_sub
>>       WHERE
>>       {
>>         GRAPH ?Gp0_sub
>>         {
>>           ?p0_sub rdf:type schema:Product .
>>           ?p0_sub schema:name ?name0_sub .
>>           FILTER (CONTAINS(UCASE(?name0_sub),"W"^^xsd:string))
>>         }
>>         GRAPH <http://foo.com/productList/>
>>         {
>>           ?Gp0_sub foaf:primaryTopic ?p0_sub .
>>         }
>>       }
>>       #ORDER BY ?p0_sub
>>       LIMIT 2
>>     }
>>     FILTER(?p0_sub=?p0)
>>   }
>>
>>   GRAPH <http://foo.com/productList/>
>>   {
>>     ?Gp0 foaf:primaryTopic ?p0 .
>>   }
>> }
>>
>> transformed from the following Linq
>>
>> Query<IProduct>().Where(p =>
>>     p.Name.ToUpper().Contains(name.ToUpper())).Take(2)
>>
>> There are two problems here. The query returns different results on
>> subsequent runs against the same dataset, and it runs very slowly.
>> Uncommenting the ORDER BY helps with the varying result count, though I'm
>> not exactly sure why it should be necessary. However, I'm not sure what's
>> going on with performance. Obviously it has something to do with the
>> subquery, but I was unable to alter this SELECT so that it executed quickly.
>> Even a dataset as small as 9 quads (3 resources * (2 triples + 1
>> meta-triple)) takes 1 second to complete, and the time seems to increase
>> exponentially. At 90 quads/30 graphs it is already taking close to 3 minutes.
>>
>> We first observed the performance problems with version 1.0.4, but with a
>> synthetic dataset the same issues arise in previous releases and 1.0.5+.
>>
>> Hope you can help. Would you like any additional info?
>>
>> Regards,
>> Tom
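
The varying results reported with LIMIT 2 while the ORDER BY is commented out can be illustrated with a small Python sketch. It assumes, hypothetically, that an engine may return the same solutions in a different order on different runs; the product names and the `limit` helper are made up for illustration, and nothing here reflects how dotNetRDF is actually implemented.

```python
from typing import Iterable, List


def limit(solutions: Iterable[str], n: int) -> List[str]:
    """Return the first n solutions in whatever order they arrive (like LIMIT n)."""
    return list(solutions)[:n]


# The same three hypothetical solutions, arriving in two different orders,
# as could happen between runs when no ORDER BY is specified.
run_a = ["ex:Wrench1", "ex:Wrench2", "ex:Wrench3"]
run_b = ["ex:Wrench3", "ex:Wrench1", "ex:Wrench2"]

# Without ordering, LIMIT 2 keeps whichever two happen to come first.
unordered_a = limit(run_a, 2)
unordered_b = limit(run_b, 2)

# Sorting first (like ORDER BY) makes the kept pair independent of arrival order.
ordered_a = limit(sorted(run_a), 2)
ordered_b = limit(sorted(run_b), 2)

if __name__ == "__main__":
    print("unordered:", unordered_a, unordered_b)
    print("ordered:  ", ordered_a, ordered_b)
```

If several solutions compared as equal under the sort key, which of them survive the LIMIT would again depend on how the sort implementation breaks ties, which is the second point raised in the reply.
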