Re: [dotNetRDF-bugs] Problems with SPARQL queries

Tom

Thanks for the report.  I haven't done any debugging yet, but I have a few
thoughts based on what you've described.

ORDER BY causing indeterminate results could be a bug, but it could also just
be an artefact of two things:
1. SPARQL only defines a partial ordering, so there are some combinations of
terms for which ordering is left to the implementation.  Since we're just
talking about dotNetRDF, though, such orderings should still be defined
consistently.
2. You have multiple terms in the data that compare as equivalent.  In that
case we're at the mercy of .Net's sort implementation for which items float
to the top and so are returned each time.
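
As a rough illustration of point 2 (a sketch, not taken from your query): if
several solutions tie on the sort key, adding a second key that is unique per
solution leaves nothing for the sort implementation to decide.  The variable
names below are made up, and it assumes every ?product is an IRI so that
STR() always yields a comparable string.

```sparql
PREFIX schema: <http://schema.org/>

# Hypothetical example: ?name may tie across products, so the product IRI
# (as a string) is added as a tie-breaker before LIMIT cuts the results.
SELECT ?product ?name
WHERE
{
  GRAPH ?g { ?product schema:name ?name . }
}
ORDER BY ?name STR(?product)
LIMIT 2
```
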

GRAPH ?var can be quite expensive because what it does is evaluate the inner
operations over each individual named graph in the dataset in turn.  Where
?var is already bound this might be a small subset, but given the structure
of your query I suspect there are at least some places where it is not.  So
with two points in your query where you have GRAPH ?var being potentially
unbound (or bound to a large number of possible values) you would get
O(n^2) scaling in the number of graphs, quadratic rather than strictly
exponential, which may well account for the behaviour you describe.
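
To make the cost concrete, here is roughly how GRAPH with an unbound variable
behaves under the SPARQL algebra, written out for a made-up dataset with just
two named graphs (ex:g1 and ex:g2 are placeholders, not from your data).  This
is a sketch of the semantics, not necessarily of how dotNetRDF implements it.

```sparql
PREFIX ex: <http://example.org/>

# GRAPH ?g { ?s ?p ?o } over named graphs ex:g1 and ex:g2 gives the same
# answers as evaluating the inner pattern once per graph and taking the union:
SELECT ?g ?s ?p ?o
WHERE
{
  { GRAPH ex:g1 { ?s ?p ?o . } BIND(ex:g1 AS ?g) }
  UNION
  { GRAPH ex:g2 { ?s ?p ?o . } BIND(ex:g2 AS ?g) }
}
```

With n named graphs that is n branches, and a second unbound GRAPH evaluated
inside each branch gives n * n evaluations of the innermost pattern.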

Also the ?s ?p ?o at the start of your first GRAPH clause may be causing a
substantial increase in intermediate results early on in the query.  It
might be better to have a separate GRAPH clause after the first GRAPH clause
to pull out all the triples once you've determined the graphs you actually
care about.
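
Something along these lines is what I have in mind.  It is an untested sketch
and not strictly equivalent to your query: it returns the triples of every
graph that the meta graph links to a selected product, whereas yours only
returns graphs that themselves contain the type and name triples.  I've also
enabled the ORDER BY and renamed the subquery variable to ?p0 so no extra
FILTER is needed for the join.  The order in which an engine actually
evaluates the three parts is still up to its optimiser.

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?s ?p ?o ?Gp0 ?p0
WHERE
{
  # 1. Select the (at most two) matching products.
  {
    SELECT DISTINCT ?p0
    WHERE
    {
      GRAPH ?Gp0_sub
      {
        ?p0 rdf:type schema:Product .
        ?p0 schema:name ?name0 .
        FILTER (CONTAINS(UCASE(?name0), "W"^^xsd:string))
      }
      GRAPH <http://foo.com/productList/>
      {
        ?Gp0_sub foaf:primaryTopic ?p0 .
      }
    }
    ORDER BY ?p0
    LIMIT 2
  }

  # 2. Look up the graphs describing those products in the meta graph.
  GRAPH <http://foo.com/productList/>
  {
    ?Gp0 foaf:primaryTopic ?p0 .
  }

  # 3. Pull out all triples, restricted to the graphs found in step 2.
  GRAPH ?Gp0
  {
    ?s ?p ?o .
  }
}
```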

There is of course a possibility that dotNetRDF is optimising the query
badly but that will require some debugging to figure out if this is the
case.

Using the ExplainQueryProcessor
(http://www.dotnetrdf.org/api/index.asp?Topic=VDS.RDF.Query.ExplainQueryProcessor)
with the ExplanationLevel turned up to Full, as described at
https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/HowTo/Debug%20SPARQL%20Queries.wiki#!debugging-sparql-queries
might be enlightening, since it'll include things like intermediate result
counts.  It doesn't currently analyse how many graphs a given GRAPH clause
has to consider, though, which will make it hard to spot that nested
looping on GRAPH ?var if that is the culprit.  That would certainly be
interesting information, so I may try to add it in the future.

Let me know if you guys figure anything more out.  I'll aim to take a proper
look and debug this later in the week.

Cheers,

Rob

From:  Tomek Pluskiewicz <to...@pl...>
Reply-To:  dotNetRDF Bug Report tracking and resolution
<dot...@li...>
Date:  Wednesday, 21 May 2014 13:46
To:  dotNetRDF Bug Report tracking and resolution
<dot...@li...>
Subject:  Re: [dotNetRDF-bugs] Problems with SPARQL queries

> Also, here's a test repo https://bitbucket.org/tpluscode/sparql-test
> 
> 
> On Wed, May 21, 2014 at 2:18 PM, Tomek Pluskiewicz <to...@pl...>
> wrote:
>> Hi Rob
>> 
>> We've been developing an ORM solution complete with Linq for some time now.
>> It will be open-sourced at some point. Currently we've been experiencing
>> problems with query speed and reliability. Let me acquaint you with how
>> things work.
>> 
>> Each resource is contained within its own named graph and additionally there
>> is a meta-graph, which connects graphs and the described entities (there
>> could be many graphs for one resource). For example
>> 
>> # meta graph
>> <http://foo.com/productList/>
>> {
>>   ex:Wrench1 foaf:primaryTopic ex:Wrench1 .
>> }
>> 
>> # wrench
>> ex:Wrench1 { ex:Wrench1 a sch:Product ; sch:name "Wrench" . }
>> 
>> The problem is with a query
>> 
>> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>> PREFIX schema: <http://schema.org/>
>> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>> 
>> SELECT ?s ?p ?o ?Gp0 ?p0
>> WHERE 
>> { 
>> GRAPH ?Gp0 
>> { 
>> ?s ?p ?o .
>> ?p0_sub schema:name ?name0_sub .
>> FILTER (CONTAINS(UCASE(?name0_sub),"W"^^xsd:string))
>> ?p0 rdf:type schema:Product .
>> {
>> SELECT DISTINCT ?p0_sub
>> WHERE 
>> {
>> GRAPH ?Gp0_sub 
>> { 
>> ?p0_sub rdf:type schema:Product .
>> ?p0_sub schema:name ?name0_sub .
>> FILTER (CONTAINS(UCASE(?name0_sub),"W"^^xsd:string))
>> } 
>> GRAPH <http://foo.com/productList/>
>> {
>> ?Gp0_sub foaf:primaryTopic ?p0_sub .
>> } 
>> } 
>> #ORDER BY ?p0_sub
>> LIMIT 2 
>> }
>> FILTER(?p0_sub=?p0)
>> } 
>> 
>> GRAPH <http://foo.com/productList/>
>> { 
>> ?Gp0 foaf:primaryTopic ?p0 .
>> } 
>> }
>> 
>> transformed from the following Linq
>> 
>> Query<IProduct>().Where(p =>
>> p.Name.ToUpper().Contains(name.ToUpper())).Take(2)
>> 
>> There are two problems here. The query returns different results on
>> subsequent runs against the same dataset, and it runs very slowly. Uncommenting
>> the ORDER BY helps with the varying result count though I'm not exactly sure
>> why it should be necessary. However I'm not sure what's with performance.
>> Obviously it has something to do with the subquery but I was unable to alter
>> this SELECT so that it executed quickly. Even as small a dataset as 9 quads
>> (3 resources * (2 triples + 1 meta-triple)) takes 1 second to complete and
>> the time seems to increase exponentially. At 90 quads/30 graphs it is already
>> taking close to 3 minutes.
>> 
>> We first observed the performance problems with version 1.0.4 but with a
>> synthetic dataset the same issues arise in previous releases and 1.0.5+.
>> 
>> Hope you can help. Would you like any additional info?
>> 
>> Regards,
>> Tom