Re: [dotNetRDF-Develop] Spaqrl Query Performance


Adam

Comments inline:

From:  Fedja Adam <ad...@ad...>
Reply-To:  dotNetRDF Developer Discussion and Feature Request
<dot...@li...>
Date:  Tuesday, 3 March 2015 10:45
To:  <dot...@li...>
Subject:  [dotNetRDF-Develop] Spaqrl Query Performance

>     
>  Hello dotNetRDF Team,
>  
>  I'm quite new to your library and RDF as well, and I have run into some
> performance issues I don't seem to be able to solve by myself. Not being sure
> whether this is a problem on my side or simply an algorithmic or
> implementation issue, I'm writing you for some feedback and / or help.
>  
>  The setup is reasonably simple: I'm using an InMemoryDataset consisting of
> (potentially multiple, but in this case a single) Graph(s). The dataset is
> very small (73 lines of Turtle)

Providing a sample dataset (with data redacted/obfuscated as necessary) would
be helpful if you'd like us to investigate further, should our other comments
not help.

> and the queries I'm performing shouldn't be too complex either. This is the
> code I'm using for querying:
>  
>> SparqlParameterizedString queryString = new SparqlParameterizedString();
>>  queryString.CommandText = query;
>>  // [Adding namespaces here]
>>  // [Setting parameters here]
>>  SparqlQuery sparqlQuery = this.parser.ParseFromString(queryString);
>>  SparqlResultSet resultSet = this.processor.ProcessQuery(sparqlQuery) as SparqlResultSet;
>>  
>  I'm using a SparqlQueryParser and a LeviathanQueryProcessor. Everything
> happens locally on my machine, no web stuff involved. The problem is that a
> single ProcessQuery call takes about 5 - 7 ms,

Is this running under the debugger?

Under the debugger the observed performance can be orders of magnitude
worse. Please make sure you take any timings with a release build with
no debugger attached.

>  which is too much for my purposes. I need to perform a lot of differently
> parameterized queries in a row.
>  
>  Is there any way I could improve performance? Calling "Optimize" on the query
> before executing doesn't seem to have an effect. A representative example
> query is this one:

The parser automatically calls Optimize (unless you've disabled
optimisation) when it finishes parsing a query, so calling it again will be a
no-op.

> 
>  
>> SELECT ?obj
>>  WHERE
>>  {
>>    ?obj Knowledge:IsA* ?actor .
Do you actually need to use property paths here (the * syntax)?

Property paths are expensive to evaluate, especially arbitrary-length paths
like * (zero or more).  Note that using * will potentially bind all triples
with that predicate in the data (depending on the order in which the engine
evaluates the matches), so if you do need property paths, using + (at least
one step) is typically better, though it won't be as fast as avoiding
property paths altogether.

If the nodes of interest are directly connected to each other by a single
instance of the Knowledge:IsA predicate, then omit the */+.   If they will be
connected within a limited number of hops, consider using the {n,m} syntax
instead, as that can be evaluated more efficiently.
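
As a rough sketch of the alternatives described above, here is the first
pattern from your query written each way. These are fragments only, and the
hop bound of 3 is an assumed value for illustration. Note also that the
{n,m} form is not part of the final SPARQL 1.1 grammar, so this assumes the
parser you are using accepts it. Keep in mind that * also matches zero
steps (so ?obj may be the same node as ?actor), which the other forms do
not, so results may differ depending on your data.

```sparql
# Original: zero or more IsA steps (most expensive)
?obj Knowledge:IsA* ?actor .

# One or more IsA steps
?obj Knowledge:IsA+ ?actor .

# Between 1 and 3 IsA steps (the bound 3 is an assumed example value)
?obj Knowledge:IsA{1,3} ?actor .

# Exactly one IsA step: a plain triple pattern, no property path at all
?obj Knowledge:IsA ?actor .
```

Which form is appropriate depends on how deep the IsA chains in your data
actually are.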

>> 
>>    ?actor Knowledge:HasAttribute Knowledge:Actor .
>>    ?obj Knowledge:IsA* ?prey .
Same comment as above applies to the use of property paths here

>> 
>>    @MainActor Knowledge:PredatorOf ?prey .
>>    MINUS
>>    {
>>      ?obj Knowledge:HasAttribute Knowledge:Abstract .
>>    }
>>    FILTER (!sameTerm(?obj, @MainActor))
>>  }
>>  
>  If you spot any wild problems in the query itself, let me know.

The use of property paths is the only obvious concern without knowing more
about your data.

> 
>  
>  I've also looked at some parts of the RDF Querying code and it seems like
> there is some kind of Algebra evaluation using classes - is there maybe a way
> to "Compile" them similar to C# Expression trees in order to improve
> performance?

Well, a query is "compiled" in a sense to an algebra (which is the formal
representation of the query), but our engine does not do any kind of caching
of the algebra as a query plan as a traditional RDBMS might.  If you
really wanted to, you could modify the algebra once you have it to substitute
your parameters that way.  However, that is not for the faint of heart, nor
would it necessarily yield any performance improvement, since property paths
are likely the biggest single factor in the execution time.

Rob

> 
>  
>  Regards,
>  Adam
>  