New BGP query handler for the Semantic Web Client Library reduces query times to a third

I have added a BGP query handler to my new in-memory storage solution for the Semantic Web Client Library (SWClLib). With this addition the execution of queries over a completely filled cache can be reduced to about 31.7% of the time required by the old Jena/NG4J based solution.

Details

A BGP query basically is a set of triple patterns; i.e., a set of RDF triples that may have query variables in the subject, predicate, and object position. A solution to a BGP query is a solution for all triple patterns in the query. Hence, each solution corresponds to a set of matching triples -one for each triple pattern- in the queried RDF dataset. In the context of NG4J and the SWClLib the matching triples may be part of different named graphs from the local graph set.
NG4J does not implement the evaluation of BGP queries, but, simply relies on the BGP query handler in Jena. The Jena BGP query handler executes the query by evaluating the triple patterns in an iterative manner. The iterators issue triple pattern queries -also called find(SPO) queries- to the underlying graph store. In case of NG4J the underlying graph store is a set of named graphs (i.e. an implementation of the NamedGraphSet interface) which is either the old NG4J implementation or my new storage solution. As described earlier, my solution is based on identifiers for RDF terms (e.g. URIs and literals) instead of the RDF terms itself. For this reason, the RDF term based triple pattern queries issued by the Jena iterators must be translated to identifier based triple patterns. This translation happens by looking-up the RDF terms in a dictionary. Furthermore, a retranslation is necessary for the solutions determined for the triple patterns. Since these translations take time I was wondering whether I can improve query execution by developing a custom BGP query handler for my identifier-based storage solution. This new query handler uses iterators which work with identifier-based triple patterns. This approach obsoletes the need for translations between RDF nodes and their identifiers and, thus, should reduce query execution times.

Evaluation

As usual, I used the Berlin SPARQL Benchmark (BSBM), together with my Linked Data like data generator, to evaluate the new approach. Notice, the data generator creates a dataset with a fairly large number of comparatively small RDF graphs which is typical for the local cache of the SWClLib. The following table lists the average (geometric mean) times to execute the BSBM query mix (10 runs and 3 additional warm-up runs) over datasets created with the scaling factors (pc) of 10 to 80. The table is accompanied with a chart that visualizes the measures.

Scaling factor: # of named graphs: Overall # of triples: Avg. query mix exec. times for Jena/NG4J-based store: Avg. query mix exec. times for new storage solution: Comparison:
10 613 4,971 2.96s 1.01s 34.2%
20 928 8,485 4.96s 1.56s 31.9%
30 1,245 11,999 7.43s 2.37s 31.9%
40 1,845 16,918 17.76s 5.36s 30.2%
50 2,599 22,616 47.93s 12.92s 27.0%
60 2,914 26,108 61.44s 23.93s 39.0%
70 3,230 29,601 80.11s 23.98s 29.9%
80 3,544 33,110 88.33s 26.08s 29.5%

measures
As can be seen from the measures, the new store with its new BGP query handler reduces the execution time of queries over a local graph set to about 31.7%. Honestly, I wasn’t really expecting such a huge improvement. Great news for the SWClLib and for SQUIN. Cheers!

Olaf

Tags: , ,

Leave a Reply

You must be logged in to post a comment.