[Bigdata-developers] Improving locality and compression in the lexicon


All,

We should do something to improve our IRI compression in the reverse lexicon (ID2TERM) and potentially our IRI locality in the forward lexicon (TERM2ID).   

Right now things stand as follows:

TERM2ID: This index uses front coding (prefix compression).  This works quite nicely.  We could improve locality for web graph data by transforming URIs such that the domain part of the URI looks like "com.bigdata.www" rather than "www.bigdata.com".  This would place anything in "com.bigdata.blog" close to "com.bigdata.www" rather than close to "org.foo.blog", which is where "blog.bigdata.com" sorts today (next to "blog.foo.org").  This transformation would only exist in the TERM2ID keys.  The reverse index (ID2TERM) would store the untransformed URI.
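To make the proposed key transformation concrete, here is a minimal sketch in Python (not the bigdata code base; the function name term2id_key is made up for illustration).  It assumes URIs of the plain "scheme://[userinfo@]host[:port]/..." form and leaves everything else (urn:, mailto:, bracketed IPv6 hosts) untouched.  It works on the raw string rather than a URI parser so that nothing gets normalized and distinct URIs keep distinct keys.

```python
def term2id_key(uri: str) -> str:
    """Return a TERM2ID sort key with the host labels reversed.

    Hypothetical helper: only the key is transformed, the stored URI is not.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep or ":" in scheme or "/" in scheme:
        return uri  # no authority component (e.g. urn:isbn:...), leave as-is

    # The authority ends at the first '/', '?' or '#', or at end of string.
    end = len(rest)
    for ch in "/?#":
        i = rest.find(ch)
        if i != -1:
            end = min(end, i)
    authority, tail = rest[:end], rest[end:]

    userinfo, at, hostport = authority.rpartition("@")
    if hostport.startswith("["):
        return uri  # bracketed IPv6 literal, nothing sensible to reverse
    host, colon, port = hostport.partition(":")

    # "www.bigdata.com" -> "com.bigdata.www"; case is preserved on purpose.
    reversed_host = ".".join(reversed(host.split(".")))
    return scheme + sep + userinfo + at + reversed_host + colon + port + tail


uris = [
    "http://www.bigdata.com/a",
    "http://blog.foo.org/x",
    "http://blog.bigdata.com/b",
]
# Plain ordering interleaves the domains; key ordering groups bigdata.com.
plain_order = sorted(uris)
key_order = sorted(uris, key=term2id_key)
```

Because reversing the labels is its own inverse, applying the function twice gives back the original string, so the mapping from URI to key is one-to-one on the URIs it touches.  With the three sample URIs, the key ordering puts the two bigdata.com hosts next to each other, which is also what lets front coding share a longer prefix between neighboring keys.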

ID2TERM: The term identifiers are not well correlated with the term types (literals, URIs, etc.), so a given leaf holds a mix of types.  The compression scheme here is also dead simple (basically, no compression).  We would do better if we moved the flag bits which indicate the type (literal, URI, bnode, or statement identifier) into the high order bits of the term identifier, so most leaves would only have a single type of value, e.g., all literals, all URIs, etc.  We could then do type-specific compression rather easily, e.g., handling URIs by segmenting them into a domain and a sequence of path names and coding those in a dictionary.  Likewise, with the type flag bits in the high order bits, each shard (after the initial shard) could be constrained to have only a single kind of data (e.g., only URIs, only literals, etc.).  That would probably improve access patterns as well.
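A small Python sketch of what "type bits in the high order bits" buys us.  The layout is an assumption for illustration only: a 64-bit identifier treated as unsigned, with the top 2 bits holding the type code and the low 62 bits holding a counter.  The type codes, names, and bit widths are made up and are not the current encoding.

```python
from enum import IntEnum


class TermType(IntEnum):
    # Hypothetical 2-bit type codes; the real flag values may differ.
    URI = 0
    LITERAL = 1
    BNODE = 2
    SID = 3  # statement identifier


TYPE_BITS = 2
COUNTER_BITS = 64 - TYPE_BITS
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def pack_term_id(term_type: TermType, counter: int) -> int:
    """Place the type code in the high bits and the counter in the low bits."""
    if not 0 <= counter <= COUNTER_MASK:
        raise ValueError("counter does not fit in %d bits" % COUNTER_BITS)
    return (int(term_type) << COUNTER_BITS) | counter


def unpack_term_id(term_id: int):
    """Recover (type, counter) from a packed identifier."""
    return TermType(term_id >> COUNTER_BITS), term_id & COUNTER_MASK


def id2term_key(term_id: int) -> bytes:
    """Unsigned big-endian key, so byte order matches numeric order."""
    return term_id.to_bytes(8, "big")


# Identifiers assigned in mixed order, as happens during a load.
assigned = [
    pack_term_id(TermType.LITERAL, 7),
    pack_term_id(TermType.URI, 3),
    pack_term_id(TermType.BNODE, 1),
    pack_term_id(TermType.URI, 4),
    pack_term_id(TermType.LITERAL, 8),
]
# Sorting the keys, as the index does, yields one contiguous run per type.
types_in_key_order = [
    unpack_term_id(int.from_bytes(k, "big"))[0]
    for k in sorted(id2term_key(t) for t in assigned)
]
```

In key order the sample comes out as URI, URI, LITERAL, LITERAL, BNODE, i.e., one run per type, which is what would let a leaf (or a shard) pick a type-specific coder.  Note that this relies on unsigned comparison of the keys; with a signed 64-bit long the codes that set the top bit would sort first.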

Also, in terms of ID2TERM, there has been some off-list discussion and we are inclined to introduce transparent "blobs" for long literals.  This has some bearing on compression techniques, since we would then only expect to find literals of modest length inline.

Right now, if you load an ontology into the system, all URIs with the same prefix will be assigned term identifiers which are relatively close to one another and most probably dense.  This is true even in scale-out, since term identifiers are assigned shard-wise by the TERM2ID index.  It might be that we could do more to exploit this fact.

Thoughts?

Bryan
