Reconsidering the evaluation of my new storage solution for the Semantic Web Client Library (SWClLib), I realized that the graph set generated by the Berlin SPARQL Benchmark (BSBM) does not adequately represent the set of graphs retrieved by the SWClLib. BSBM generates a named-graph-based dataset that contains one graph for every data publisher (e.g. producer, vendor, rating site); each of these graphs contains all data from the corresponding publisher. In contrast, linked data servers that publish a dataset usually slice it in order to respond to URI-based requests with small subsets; each of these subsets contains only those data items that describe the resource identified by the requested URI. Thus, in most cases, the RDF graphs retrieved by looking up URIs in SWClLib contain only a few triples. Furthermore, since SWClLib usually looks up many URIs during the evaluation of a query, the queried graph set ends up containing a large number of these fairly small RDF graphs. The graph set generated by BSBM, in contrast, comprises very few RDF graphs (one for each publisher), each of which contains comparatively many triples (all data from the corresponding publisher). For this reason, I decided to develop a data generator that creates a set of graphs more similar to the graph set usually queried by the SWClLib.
The new data generator is based on the BSBM data generator. In fact, the new generator slices a BSBM-created dataset (the non-named graphs version) as if a linked data server were publishing it. Thus, the graph set created by the new generator contains a separate RDF graph for each product type, product feature, producer, product, vendor, offer, person, and review in the BSBM-generated dataset. To slice the dataset, the new generator simply queries the given dataset for all relevant resources, issues a DESCRIBE query for each of these resources, and adds the graphs resulting from the DESCRIBE queries to a set of named graphs.
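The slicing procedure can be illustrated with a minimal sketch. It is not the generator's actual code: triples are plain Python tuples, the `ex:` names are made-up placeholders, and `describe` is a simplified stand-in that returns only the triples having the resource as subject (what a real DESCRIBE query returns depends on the SPARQL implementation).

```python
RDF_TYPE = "rdf:type"  # placeholder for the rdf:type property


def describe(dataset, resource):
    """Simplified stand-in for DESCRIBE: all triples with `resource` as subject."""
    return {triple for triple in dataset if triple[0] == resource}


def slice_dataset(dataset, relevant_types):
    """Return a dict mapping each relevant resource to its own small graph."""
    # Step 1: find all relevant resources (typed with one of the given classes).
    resources = {s for (s, p, o) in dataset if p == RDF_TYPE and o in relevant_types}
    # Steps 2 and 3: describe each resource and store the result as a
    # named graph, using the resource identifier as the graph name.
    return {resource: describe(dataset, resource) for resource in resources}


# Tiny hypothetical dataset with one product, one vendor, and one offer.
dataset = {
    ("ex:product1", RDF_TYPE, "ex:Product"),
    ("ex:product1", "ex:label", "Product 1"),
    ("ex:vendor1", RDF_TYPE, "ex:Vendor"),
    ("ex:offer1", RDF_TYPE, "ex:Offer"),
    ("ex:offer1", "ex:product", "ex:product1"),
    ("ex:offer1", "ex:vendor", "ex:vendor1"),
}

named_graphs = slice_dataset(dataset, {"ex:Product", "ex:Vendor", "ex:Offer"})
for name, graph in sorted(named_graphs.items()):
    print(name, len(graph), "triples")
```

Each resulting graph holds only the handful of triples about one resource, which is the characteristic the new generator is meant to reproduce.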
You can download the source file for the new data generator.
Using data created with the new generator I was able to repeat the evaluation of the new storage solution under more realistic conditions. The following tables list i) the average (geometric mean) times to execute the BSBM query mix (10 runs and 3 additional warm-up runs) and ii) the sizes of the graph sets used for the evaluation, respectively.
| Scaling factor | Number of named graphs | Overall number of triples | Avg. query mix exec. time, Jena/NG4J-based store | Avg. query mix exec. time, new storage solution | Comparison (new / Jena/NG4J) |
| --- | --- | --- | --- | --- | --- |
| 10 | 613 | 4,971 | 3.272955s | 2.437405s | 74.5% |
| 20 | 928 | 8,485 | 5.396082s | 4.014113s | 74.4% |
| 30 | 1,245 | 11,999 | 7.656801s | 6.098843s | 79.7% |
| 40 | 1,845 | 16,918 | 18.648374s | 14.892227s | 79.9% |
| 50 | 2,599 | 22,616 | 53.226853s | 40.460950s | 76.0% |
| 60 | 2,914 | 26,108 | 62.611587s | 49.309202s | 78.8% |

| Scaling factor | Named graph set size in bytes, Jena/NG4J-based store | Named graph set size in bytes, new storage solution | Comparison (new / Jena/NG4J) |
| --- | --- | --- | --- |
| 10 | 4,505,292 | 36,196,383 | 803.4% |
| 20 | 7,647,495 | 55,421,109 | 724.7% |
| 30 | 10,809,110 | 74,738,347 | 691.4% |
| 40 | 15,972,110 | 110,337,845 | 690.8% |
| 50 | 21,814,179 | 154,492,554 | 708.2% |
| 60 | 25,164,175 | 173,915,141 | 691.1% |
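
The comparison columns give the value for the new storage solution as a percentage of the value for the Jena/NG4J-based store. As a quick check, the following sketch recomputes these percentages and their arithmetic means from the numbers in the two tables; the variable and function names are only illustrative.

```python
# (scaling factor, Jena/NG4J-based store, new storage solution)
exec_times_s = [
    (10, 3.272955, 2.437405),
    (20, 5.396082, 4.014113),
    (30, 7.656801, 6.098843),
    (40, 18.648374, 14.892227),
    (50, 53.226853, 40.460950),
    (60, 62.611587, 49.309202),
]
sizes_bytes = [
    (10, 4505292, 36196383),
    (20, 7647495, 55421109),
    (30, 10809110, 74738347),
    (40, 15972110, 110337845),
    (50, 21814179, 154492554),
    (60, 25164175, 173915141),
]


def comparison_percent(rows):
    """New value as a percentage of the Jena/NG4J value, one entry per row."""
    return [100.0 * new / old for _, old, new in rows]


def mean(values):
    return sum(values) / len(values)


time_pct = comparison_percent(exec_times_s)
size_pct = comparison_percent(sizes_bytes)

print("time:", [round(p, 1) for p in time_pct], "mean:", round(mean(time_pct), 1))
print("size:", [round(p, 1) for p in size_pct], "mean:", round(mean(size_pct), 1))
```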
These measurements indicate that the new SWClLib storage solution reduces the execution time of queries to about 77% for graph sets with linked data characteristics. Even if this reduction is not as striking as the 67% measured during the previous evaluation, it is still a significant improvement. Unfortunately, the memory requirements, about 718% of those of the Jena/NG4J-based store, are unacceptable. As already mentioned in the previous post, the amount of memory required is dominated by the number of buckets in the hashtables. Currently, each hashtable consists of 2048 buckets. For fairly small graphs (with at most a few hundred triples), the majority of the hashtable buckets remain empty. Unfortunately, buckets require space even if they are empty. Thus, 2048 buckets per hashtable are far too many for the small RDF graphs typically retrieved by SWClLib (and therefore created by the new data generator). For this reason, I will reduce the number of buckets in a follow-up experiment. I will try to find an optimal number that maximizes the reduction in required memory while, at the same time, retaining the performance improvements.
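To get a feeling for how many buckets stay empty, here is a rough back-of-the-envelope sketch. It assumes idealized uniform hashing and one hashtable entry per triple, neither of which is taken from the actual implementation. Under these assumptions the expected fraction of empty buckets is (1 - 1/m)^n for m buckets and n entries. The entry counts 8, 100, and 300 are illustrative; 8 is roughly the average number of triples per graph at scaling factor 10 (4,971 triples in 613 graphs).

```python
def expected_empty_fraction(num_buckets: int, num_entries: int) -> float:
    """Expected share of empty buckets, assuming uniform hashing."""
    return (1.0 - 1.0 / num_buckets) ** num_entries


# Compare the current bucket count (2048) with some smaller, purely
# illustrative candidates for graphs of different sizes.
for num_buckets in (2048, 256, 64, 16):
    for num_entries in (8, 100, 300):
        share = expected_empty_fraction(num_buckets, num_entries)
        print(f"{num_buckets:5d} buckets, {num_entries:3d} entries: "
              f"{100.0 * share:5.1f}% of buckets expected empty")
```

The sketch only looks at occupancy. It says nothing about lookup speed, which is why the right bucket count still has to be determined experimentally.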
Olaf