Re: [Bigdata-developers] CSPO or SPOC?

Matt,

This notion of "primary" is a bit misleading.  There are two (well, maybe three) issues at stake.

First, whether we use an alternative set of covering indices, e.g., in order to have SCPO as an index.

Second, for high-throughput writes in scale-out without transactional isolation, we can apply updates at the shards of the "primary" statement index and have eventually consistent updates at the secondary indices. For this to work, we need to impose the constraint that all statements for a given "S" (or "C", depending on the application's information architecture) are on the same shard. That gives us ACID guarantees, without distributed locks, for operations which inspect or update other statements for the same S (or C) during an update.
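
To make the shard constraint concrete, here is a minimal Java sketch. It is not bigdata code: the class SubjectBoundarySplit and the method splitIndex are made up for illustration, and quads are modelled as plain {s, p, o, c} arrays of term identifiers. It assumes key-range shards over an SPOC-ordered index and only ever picks a split point where S changes, so every statement for a given S stays on one shard.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: pick a shard split point that never separates statements sharing a subject. */
final class SubjectBoundarySplit {

    /**
     * @param spocSorted quads as {s, p, o, c} term ids, already sorted in SPOC order
     * @param target     the preferred split position (e.g. the midpoint)
     * @return the first index at or after target where the subject changes,
     *         or spocSorted.size() if no such boundary exists
     */
    static int splitIndex(List<long[]> spocSorted, int target) {
        int i = Math.max(target, 1);
        while (i < spocSorted.size() && spocSorted.get(i)[0] == spocSorted.get(i - 1)[0]) {
            i++; // still inside the same subject: keep moving right
        }
        return Math.min(i, spocSorted.size());
    }

    public static void main(String[] args) {
        List<long[]> quads = Arrays.asList(
            new long[] {1, 10, 100, 7},
            new long[] {2, 10, 100, 7},
            new long[] {2, 11, 101, 8},
            new long[] {2, 12, 102, 9},
            new long[] {3, 10, 100, 7});
        // The midpoint (2) falls inside subject 2, so the split moves to index 4.
        System.out.println(splitIndex(quads, quads.size() / 2));
    }
}
```

The same idea applies with a "C" constraint on a CSPO-ordered index: compare the context component instead of the subject.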

We currently maintain the following covering indices for the quads mode:
"SPOC",//
"POCS",//
"OCSP",//
"CSPO",//
"PCSO",//
"SOPC" //
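
As a rough illustration of why these six orders are "covering", the Java sketch below picks, for any combination of bound positions in a quad pattern, an index whose leading key components are exactly those bound positions. The class QuadKeyOrders and its methods are invented for this example and do not reflect how SPOKeyOrder is actually implemented; bound positions are written as a string of distinct letters such as "SC".

```java
import java.util.Arrays;

/** Hypothetical sketch of index selection over the six quad-mode key orders. */
final class QuadKeyOrders {

    static final String[] ORDERS = {"SPOC", "POCS", "OCSP", "CSPO", "PCSO", "SOPC"};

    /** Counts how many leading key components of the order are bound in the pattern. */
    static int boundPrefixLength(String order, String bound) {
        int n = 0;
        while (n < order.length() && bound.indexOf(order.charAt(n)) >= 0) {
            n++;
        }
        return n;
    }

    /** Returns the first index whose key prefix covers every bound position. */
    static String chooseIndex(String bound) {
        for (String order : ORDERS) {
            if (boundPrefixLength(order, bound) == bound.length()) {
                return order;
            }
        }
        throw new IllegalStateException("not a covering set for: " + bound);
    }

    /** Reorders an {s, p, o, c} quad into the component order of the given index. */
    static long[] toKey(String order, long[] spoc) {
        long[] key = new long[4];
        for (int i = 0; i < 4; i++) {
            key[i] = spoc["SPOC".indexOf(order.charAt(i))];
        }
        return key;
    }

    public static void main(String[] args) {
        System.out.println(chooseIndex("SC"));                                    // CSPO
        System.out.println(Arrays.toString(toKey("CSPO", new long[] {1, 2, 3, 4}))); // [4, 1, 2, 3]
    }
}
```

Note that no order in this set begins with "SC", which is why adding SCPO would mean choosing a different covering set rather than simply appending a seventh index.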

For the SPOC index, the constraint that the "S" may not cross a shard boundary is useful if you believe that you might do validation during updates which cross contexts.  However, SPOC has less locality of reference within a context than the CSPO index (maybe this is what you meant?  That CSPO is better for reading off all statements for a context?)

Likewise, if you believe that updates (and validation) would only occur within the same context, then the CSPO index with a "C" constraint would be sufficient.  However, we cannot atomically assemble a view across different contexts for the same subject with CSPO (which raises the question of whether or not this is a requirement for anyone's information architecture).

Martyn's proposal of an SCPO index with an "S" constraint would be much the same as the SPOC index with an "S" constraint, except that it would have better locality within a given context.  However, we would need to identify a different set of covering indices which included SCPO if we went that route.
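
The locality trade-off between SPOC, CSPO and SCPO can be illustrated with a toy Java sketch. Everything here is an assumption made for the example: the class KeyOrderLocality, the sample data, and the use of "span" (the width of the index range that has to be scanned to see every matching statement) as a stand-in for locality of reference. A pattern component of -1 means "unbound".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: compare how tightly different key orders cluster matching quads. */
final class KeyOrderLocality {

    /** Illustrative {s, p, o, c} quads: subjects 1 and 2, contexts 7 and 8. */
    static List<long[]> sample() {
        return Arrays.asList(
            new long[] {1, 10, 100, 7},
            new long[] {1, 10, 100, 8},
            new long[] {1, 11, 101, 7},
            new long[] {1, 11, 101, 8},
            new long[] {2, 10, 100, 7},
            new long[] {2, 10, 100, 8});
    }

    /** Returns a copy of the quads sorted by the given key order, e.g. "CSPO". */
    static List<long[]> sortBy(String order, List<long[]> quads) {
        List<long[]> sorted = new ArrayList<>(quads);
        sorted.sort((a, b) -> {
            for (int i = 0; i < 4; i++) {
                int pos = "SPOC".indexOf(order.charAt(i));
                int cmp = Long.compare(a[pos], b[pos]);
                if (cmp != 0) {
                    return cmp;
                }
            }
            return 0;
        });
        return sorted;
    }

    /** True if every bound (non-negative) pattern component equals the quad's component. */
    static boolean matches(long[] quad, long[] pattern) {
        for (int i = 0; i < 4; i++) {
            if (pattern[i] >= 0 && pattern[i] != quad[i]) {
                return false;
            }
        }
        return true;
    }

    /** Width of the index range from the first to the last matching quad (0 if none match). */
    static int span(String order, List<long[]> quads, long[] pattern) {
        List<long[]> sorted = sortBy(order, quads);
        int first = -1;
        int last = -1;
        for (int i = 0; i < sorted.size(); i++) {
            if (matches(sorted.get(i), pattern)) {
                if (first < 0) {
                    first = i;
                }
                last = i;
            }
        }
        return first < 0 ? 0 : last - first + 1;
    }

    public static void main(String[] args) {
        long[] context7 = {-1, -1, -1, 7};
        long[] subject1InContext7 = {1, -1, -1, 7};
        for (String order : new String[] {"SPOC", "CSPO", "SCPO"}) {
            System.out.println(order
                + " context=" + span(order, sample(), context7)
                + " subject-in-context=" + span(order, sample(), subject1InContext7));
        }
    }
}
```

On this sample data, reading all of context 7 spans 5 entries under SPOC and SCPO but only 3 under CSPO, while reading subject 1 within context 7 spans 3 entries under SPOC and 2 under both CSPO and SCPO.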

One other thread that is relevant here is the notion of partially denormalizing the RDF Values into an index used for star-joins and filters based on values.  I've been thinking about this in terms of denormalizing "small datatype values", e.g., not string literals, but xsd:int, xsd:long, xsd:float, xsd:double, etc.  The star join would operate against this index (which I have been assuming was also the primary index) and could directly decode the value of the tuple into the appropriate xsd datatype value for filters and to materialize attribute values without indirection through the lexicon for datatyped attributes.
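
A minimal Java sketch of the denormalization idea follows. The encoding is purely an assumption for illustration (a one-byte type tag plus 64 bits of payload), and InlineValueSketch, Slot and numericValue are made-up names; the lexicon is modelled as a simple map from term identifier to lexical form. The point is that a filter can decode xsd:int and xsd:double values straight from the tuple, and only falls back to the lexicon for non-inlined terms.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: small datatype values stored inline instead of as lexicon ids. */
final class InlineValueSketch {

    static final byte TERM_ID = 0;    // payload is a lexicon term identifier
    static final byte XSD_INT = 1;    // payload is the int value itself
    static final byte XSD_DOUBLE = 2; // payload is the raw IEEE 754 bits of the double

    /** The object slot of an index tuple: a type tag plus 64 bits of payload. */
    static final class Slot {
        final byte tag;
        final long bits;

        Slot(byte tag, long bits) {
            this.tag = tag;
            this.bits = bits;
        }
    }

    static Slot termId(long id) {
        return new Slot(TERM_ID, id);
    }

    static Slot inlineInt(int value) {
        return new Slot(XSD_INT, value);
    }

    static Slot inlineDouble(double value) {
        return new Slot(XSD_DOUBLE, Double.doubleToLongBits(value));
    }

    /** Numeric value for use in a filter; only TERM_ID slots touch the lexicon. */
    static double numericValue(Slot slot, Map<Long, String> lexicon) {
        switch (slot.tag) {
            case XSD_INT:
                return (int) slot.bits;
            case XSD_DOUBLE:
                return Double.longBitsToDouble(slot.bits);
            default:
                // Indirection through the lexicon, which inlining is meant to avoid.
                return Double.parseDouble(lexicon.get(slot.bits));
        }
    }

    public static void main(String[] args) {
        Map<Long, String> lexicon = new HashMap<>();
        lexicon.put(9L, "3.25");
        System.out.println(numericValue(inlineInt(-42), lexicon));
        System.out.println(numericValue(inlineDouble(1.5), lexicon));
        System.out.println(numericValue(termId(9L), lexicon));
    }
}
```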

It does seem like CSPO would do better for this last purpose than SPOC if your application is more likely to read within a given context (CSPO has better locality here) than to read within a given subject without regard to its context (SPOC has better locality here).

Concerning query performance, have you tried overriding SPOKeyOrder#isPrimaryIndex() to return true for CSPO, and do you observe a performance benefit for query?  Looking at the code, I see that there are a few hard-coded assumptions that SPOC is the sole access path / primary index, but not that many (they appear to all be in SPORelation).  We could probably parameterize this with an Option for the AbstractTripleStore and move the tests for SPOKeyOrder#isPrimaryIndex() onto the AbstractRelation (which would affect SPOIndexRemover, SPOIndexWriter, SPOIndexWriteProc, and the AsynchronousStatementBufferFactory).  I can definitely see how CSPO could work better for applications where context is king.
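
The parameterization could look roughly like the Java sketch below. This is only a guess at the shape of such a change: the class PrimaryIndexOptionSketch and the property name "example.quads.primaryKeyOrder" are hypothetical and are not existing bigdata options. It shows the primary key order being read once from configuration, with the "is this the primary index?" test answered by the relation rather than hard-coded against SPOC.

```java
import java.util.Properties;

/** Hypothetical sketch: the primary key order comes from configuration, not a constant. */
final class PrimaryIndexOptionSketch {

    /** Made-up property name, used only in this example. */
    static final String PRIMARY_KEY_ORDER = "example.quads.primaryKeyOrder";

    /** Default matching today's behaviour, where SPOC is the primary index. */
    static final String DEFAULT_PRIMARY_KEY_ORDER = "SPOC";

    private final String primaryKeyOrder;

    PrimaryIndexOptionSketch(Properties properties) {
        this.primaryKeyOrder =
            properties.getProperty(PRIMARY_KEY_ORDER, DEFAULT_PRIMARY_KEY_ORDER);
    }

    /** Relation-level test that replaces a hard-coded comparison against SPOC. */
    boolean isPrimaryIndex(String keyOrder) {
        return primaryKeyOrder.equals(keyOrder);
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty(PRIMARY_KEY_ORDER, "CSPO");
        PrimaryIndexOptionSketch relation = new PrimaryIndexOptionSketch(properties);
        System.out.println(relation.isPrimaryIndex("CSPO"));
        System.out.println(relation.isPrimaryIndex("SPOC"));
    }
}
```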

Bryan

________________________________
From: Matthew Roy [mailto:mr...@ca...]
Sent: Tuesday, February 23, 2010 11:47 AM
To: big...@li...
Subject: Re: [Bigdata-developers] CSPO or SPOC?

Coming from a system where the Context is the main unit of management for statements, CSPO feels like the correct primary index.  One question would be what effect the choice of primary index has on addition/deletion efficiency.  More specifically, if within a transaction additions/deletions usually occur with a high number of statements per context, does the proximity of the changed statements within the primary index help performance?

Matt

On 2/22/2010 6:38 PM, Bryan Thompson wrote:

I would like to solicit some input on the question of whether the primary index for the quad store should be SPOC (it is today) or CSPO.  There has been some discussion on this issue in the past.  I am raising the issue again in the light of discussions where an entire context corresponding to a relatively large collection of statements is to be dropped, e.g., Wikipedia when mapped onto a single context, and when eventual consistency is being used for the secondary indices (that is, we handle conflict resolution on the primary statement index, e.g., SPOC, and then have a restart-safe protocol guaranteeing eventual updates on the secondary statement indices).

I have come around to the opinion that mapping that much data onto a single context is generally wrong. The information would be more readily managed by mapping it onto a set of contexts corresponding to individual Wikipedia entries, each of which was then associated with the source using statements about that context.

Thoughts?

Bryan