[Bigdata-developers] ALPP performance


This message raises a high-level issue: the performance of arbitrary length property paths (ALPPs) versus materialized versions of the same query.

Yesterday I finished porting the final piece of the Syapse application's "normal user" functionality from our legacy knowledge base to bigdata.
This piece was the facetted browser, which depends heavily on some typing functionality: partial queries that I was writing as

[A] ?object rdf:type / rdfs:subClassOf * ?class

(this is a very small part of a big query that populates every cell of a facetted browse page)

The performance of the initial cut was very significantly lower than the legacy system: I got a big boost by pulling in a recent change from Mike; but even so I was not in the right ball-park.

On analysis the issue seemed to come down to the rdfs:subClassOf * expressions, and I can meet my performance expectations by materializing the reflexive transitive closure of this property so that the query becomes

[B] ?object rdf:type / syapse:optimizedSubClassOf ?class

(approx: I got a factor of 10 from Mike's changes and a further factor of maybe 5 from materializing)
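
To illustrate the difference between [A] and [B], here is a minimal in-memory sketch in Python. It is not how bigdata evaluates either query; the class names, the data, and the helper functions are all hypothetical. It only shows that once the reflexive transitive closure has been computed, the second step of the path becomes a single lookup instead of a traversal.

```python
from collections import defaultdict

def reflexive_transitive_closure(sub_class_of):
    """Map each class to itself plus all of its direct and indirect superclasses."""
    parents = defaultdict(set)
    nodes = set()
    for sub, sup in sub_class_of:
        parents[sub].add(sup)
        nodes.update((sub, sup))
    closure = {}
    for start in nodes:
        seen = {start}  # reflexive step: the zero-length path
        stack = [start]
        while stack:
            current = stack.pop()
            for parent in parents.get(current, ()):
                if parent not in seen:  # also guards against cycles
                    seen.add(parent)
                    stack.append(parent)
        closure[start] = seen
    return closure

def classes_of(obj, rdf_type, closure):
    """Shape of query [B]: one rdf:type step, then one lookup per asserted type."""
    result = set()
    for cls in rdf_type.get(obj, ()):
        # A class with no subClassOf triples still matches itself under '*'.
        result |= closure.get(cls, {cls})
    return result

# Hypothetical example data, not taken from the Syapse ontology.
SUB_CLASS_OF = [("Dog", "Mammal"), ("Mammal", "Animal"), ("Cat", "Mammal")]
RDF_TYPE = {"rex": ["Dog"], "blob": ["Thing"]}
CLOSURE = reflexive_transitive_closure(SUB_CLASS_OF)
```

In this sketch the traversal cost is paid once, when CLOSURE is built, rather than on every evaluation of the path.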

The architectural question is:

- Should the ALPP code itself do the materialization (which would need to be invalidated on update)? This would probably be controlled by an optimization hint, or by counting (e.g. if rdfs:subClassOf * is evaluated sufficiently often compared with the updates, then we should materialize).

If it did, I imagine that the performance of the initial query [A] could approach that of the optimized query [B].
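
The counting idea can be sketched as follows. This is only an illustration in Python under stated assumptions: the class name CountingClosureCache, the reads_per_update threshold, and the compute callback are all hypothetical, and nothing here reflects existing bigdata code. The result is materialized only after it has been requested often enough since the last update, and any update discards it.

```python
class CountingClosureCache:
    """Materialize an expensive result once it is read often enough between updates."""

    def __init__(self, compute, reads_per_update=10):
        self._compute = compute              # hypothetical callback that evaluates the closure
        self._threshold = reads_per_update   # assumed tuning knob, not a real bigdata setting
        self._reads_since_update = 0
        self._materialized = None
        self.computations = 0                # instrumentation: how often compute() actually ran

    def get(self):
        if self._materialized is not None:
            return self._materialized        # fast path: behaves like query [B]
        self._reads_since_update += 1
        result = self._compute()             # slow path: behaves like query [A]
        self.computations += 1
        if self._reads_since_update >= self._threshold:
            self._materialized = result      # read often enough, so keep the result
        return result

    def invalidate(self):
        """Call on every update to the underlying property; drops the materialization."""
        self._materialized = None
        self._reads_since_update = 0
```

An optimization hint would amount to setting the threshold to 1 for a given path, so that the first evaluation after each update materializes the result.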

Arguments against (other than time and prioritization) are:
- this optimization is better done by the end user (as I am doing), where it can be guided by application knowledge (which is true for me - syapse:optimizedSubClassOf is strictly less than rdfs:subClassOf *, e.g. it is only reflexive on classes, and only on those classes that I care about in the sort of query I am supporting)
- the cache invalidation is also hard to get right in a general setting, whereas application level knowledge can make cache invalidation trivial (in the syapse application any change to the ontology is a pretty rare admin function, and we can invalidate all ontological caches for every change without any issue)

Arguments for are:
- this is otherwise an improvement that is conceptually straightforward

Jeremy J Carroll
Principal Architect
Syapse, Inc.
