I'm going to push to have our GSOC email discussions in the dspace-
developer list from now on so that we have a thread history we can
refer to in our wiki pages, etc. I'd recommend using a [gsoc-...]
prefix on your emails; I've tried to document this for each separate
project here...
This particular thread is a brainstorm on the DSpace relational
database tables and their structure, in relation to the table
structures commonly found in relationally backed triple stores.
Feel free to comment...
Begin forwarded message:
> From: "Peter Coetzee" <peter@...>
> Date: May 21, 2008 4:47:14 PM PDT
> To: "Mark Diggory" <mdiggory@...>
> Subject: Re: http://www.w3.org/2007/03/RdfRDB/papers/d2rq-
> On Thu, May 22, 2008 at 12:11 AM, Mark Diggory <mdiggory@...>
>> On May 21, 2008, at 3:29 PM, Peter Coetzee wrote:
>>> Hi Mark,
>>> Thanks for sending this through - it's definitely a potential
>>> alternative way to implement the database layer; particularly as it
>>> would let us leave existing installations' databases untouched. The
>>> more I think about it, the better it seems, actually. I'll add
>>> this to
>>> my list of benchmarks to explore; it'll be interesting to see how it
>>> performs compared to Sesame and the various Jena triple stores. Is
>>> there a large-ish dump of DSpace data available somewhere I can load
>>> up locally for testing (I'd rather test it with DSpace's data, so
>>> my tests better reflect the performance we'll perceive!), or shall I
>>> just generate some database contents synthetically?
>> I can get you a copy of the dspace.mit.edu database. It's not
>> terribly huge (28,000 items) but represents a real, living dataset.
>> We've already seen issues with this dataset and Longwell when doing
>> faceted browsing: unique values for a predicate vary dramatically in
>> frequency, and when trying to get unique values for a particular
>> facet like subject or author, the query time blows up. There has
>> been a lot of work trying to solve the issue with things like prop
>> tables, but I've not yet seen the payoff of that work.
> That'd be handy, thanks; I'm guessing 28,000 items will blow up to a
> couple of hundred thousand triples, so a decent enough size to
> hopefully make my benchmarks slow down a little, and give them some
> significance beyond IOWait!
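The facet-query blow-up described above can be reproduced in miniature with any tall, thin triple table. A sketch in Python with sqlite3 (table and column names are illustrative, not DSpace's actual schema):

```python
import sqlite3

# A minimal "tall thin" triple table, as used by naive triple stores.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

rows = [
    ("item/1", "dc.subject", "physics"),
    ("item/1", "dc.contributor.author", "Smith, J."),
    ("item/2", "dc.subject", "physics"),
    ("item/2", "dc.subject", "optics"),
    ("item/3", "dc.subject", "biology"),
]
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)

# The facet query: every distinct value for one predicate, with counts.
# Over millions of rows this forces a scan (or a large index range scan)
# of the whole table -- the query-time blow-up described in the thread.
facets = conn.execute(
    "SELECT object, COUNT(*) FROM triples "
    "WHERE predicate = 'dc.subject' "
    "GROUP BY object ORDER BY COUNT(*) DESC, object"
).fetchall()
print(facets)  # [('physics', 2), ('biology', 1), ('optics', 1)]
```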
>> I was talking with our DBA about a few of the schemas that the prop
>> document was alluding to. He wasn't too fond of altering table
>> columns to insert new predicates; he thought that wasn't going to
>> perform well on some databases, because they are not tuned for
>> altering tables that way in production (i.e. under transactional
>> loads, etc.). He was also concerned about normalization and the need
>> to be able to store more than one predicate/object pair for any
>> subject.
> That's an interesting comment; I wonder how well it performs
> otherwise, though; if your schema is known at the time of the
> deployment (as is more-or-less the case for DSpace), your property
> tables need not expand dynamically, as the predicates you need to
> encode are all already known. Otherwise, I agree - there's a definite
> chance for a performance hit for having to alter large tables in-situ.
> I still don't have a good answer to the normalisation issue!
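The distinction between a statically known property table and one that expands dynamically (the DBA's concern) can be sketched as follows; all table and column names here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Property-table layout: one row per subject, one column per predicate.
# If the predicate set is fixed at deployment time, this DDL runs once
# and the table never changes shape.
conn.execute("""
    CREATE TABLE item_props (
        id      INTEGER PRIMARY KEY,
        title   TEXT,
        issued  TEXT
    )
""")
conn.execute("INSERT INTO item_props VALUES (1, 'On Optics', '2008-05-21')")

# The dynamic alternative: discovering a new predicate at runtime forces
# an ALTER TABLE against a live, possibly huge, table -- cheap here, but
# potentially costly under production transactional load.
conn.execute("ALTER TABLE item_props ADD COLUMN subject TEXT")
conn.execute("UPDATE item_props SET subject = 'physics' WHERE id = 1")

row = conn.execute("SELECT title, subject FROM item_props WHERE id = 1").fetchone()
print(row)  # ('On Optics', 'physics')
```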
>> He also thought the use of a table for each predicate was more
>> attractive, and we speculated that any joins across them to
>> construct a full set of statements for a subject might not be as
>> costly as having to search a tall, thin triplestore table to
>> construct them.
> Thinking about this with a slightly clearer head, it does sound like
> an attractive layout for a store; I believe some db engines end up
> creating copies of datasets as temporary tables when you ask them to
> join a table onto itself, which requires some care from the
> framework's side. This would obviate the need for that, as the joins
> are conducted across multiple physical tables, keeping cases with
> large joins of a table onto itself to a minimum.
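The two join shapes being compared above can be sketched side by side; the schema is hypothetical, and both queries produce the same answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Tall thin store: reassembling one subject's statements means
    -- joining the table onto itself, once per predicate wanted.
    CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT);
    INSERT INTO triples VALUES
        ('item/1', 'dc.title',  'On Optics'),
        ('item/1', 'dc.issued', '2008');

    -- Predicate tables: each predicate is its own two-column table,
    -- so the same query joins across distinct physical tables.
    CREATE TABLE dc_title  (subject TEXT, object TEXT);
    CREATE TABLE dc_issued (subject TEXT, object TEXT);
    INSERT INTO dc_title  VALUES ('item/1', 'On Optics');
    INSERT INTO dc_issued VALUES ('item/1', '2008');
""")

# Self-join form: one logical copy of 'triples' per predicate, which is
# where some engines resort to temporary-table copies.
self_join = conn.execute("""
    SELECT t1.object, t2.object
    FROM triples t1 JOIN triples t2 ON t1.subject = t2.subject
    WHERE t1.predicate = 'dc.title' AND t2.predicate = 'dc.issued'
""").fetchone()

# Predicate-table form: the join spans separate physical tables instead.
pred_join = conn.execute("""
    SELECT dc_title.object, dc_issued.object
    FROM dc_title JOIN dc_issued ON dc_title.subject = dc_issued.subject
""").fetchone()

print(self_join, pred_join)  # ('On Optics', '2008') ('On Optics', '2008')
```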
>> Ultimately, I think I came to the conclusion from our discussion
>> that there were really "three cases" of structure for the tables in
>> DSpace.
>> 1.) Typed Subject Tables:
>> A table per DSpaceObject "type" that captures one record per subject
>> (Community, Collection, Item, Bundle, Bitstream, EPerson, Group,
>> ...). These tables have a fixed set of predicates that occur 0..1
>> times for the subject.
> I guess this is pretty close to a statically defined (i.e. unchangeable)
> version of Kevin Wilkinson's property tables, only with an
> index-friendly 'id' instead of 'subject' keying the records. Other
> than flexibility, is there anything making this unsuitable for the
> metadata? I could imagine, if it were normalised nicely, it could be
> mapped quite efficiently with D2RQ.
>> 2.) Triplestore Tables:
>> The table contains subject, predicate, object. Metadatavalue is the
>> best example, but it could be expanded to support things like
>> Literal and URI from RDF.
>> 3.) Typed Predicate Tables:
>> A predicate table exists explicitly for a specific predicate. There
>> are a number of these in DSpace (Community2Community,
>> Community2Collection, Collection2Item, EPerson2Group, ...). These
>> examples are primarily containership relations, but they can be
>> considered predicate tables because they map explicitly one type of
>> Subject (Community) to one type of Object (Collection) and
>> explicitly define that mapping as a predicate via their existence
>> (Community2Collection is the dcterms:hasPart relation in DSpace).
>> So, I think the whole database could be "classified" based on these
>> criteria, and it would pretty much reflect how such tables would
>> eventually end up expressed in RDF. It could also point to a
>> guideline for how new tables would be added to the database over the
>> long run, ultimately leading to guidelines/best practices that could
>> be used for future development and extension of the database, which
>> is sorely lacking from our current development documentation.
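The three "cases" above can be caricatured side by side; all names below are illustrative stand-ins, not DSpace's actual DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 1) Typed subject table: one row per DSpaceObject of a given type,
    --    with fixed columns for predicates occurring 0..1 times per subject.
    CREATE TABLE item (item_id INTEGER PRIMARY KEY, in_archive INTEGER);

    -- 2) Triplestore table: subject/predicate/object rows, in the
    --    spirit of DSpace's metadatavalue.
    CREATE TABLE metadatavalue (item_id INTEGER, field TEXT, text_value TEXT);

    -- 3) Typed predicate table: the table's existence *is* the predicate,
    --    e.g. Community2Collection standing for dcterms:hasPart.
    CREATE TABLE community2collection (community_id INTEGER, collection_id INTEGER);
""")

conn.execute("INSERT INTO item VALUES (1, 1)")
conn.execute("INSERT INTO metadatavalue VALUES (1, 'dc.title', 'On Optics')")
conn.execute("INSERT INTO community2collection VALUES (10, 20)")

# Each pattern would map to RDF differently: the row itself (case 1),
# the row's field/value pair (case 2), or the row's mere existence (case 3).
title = conn.execute(
    "SELECT text_value FROM metadatavalue WHERE item_id = 1 AND field = 'dc.title'"
).fetchone()[0]
print(title)  # On Optics
```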
> It's interesting that there are all three of these manifesting
> themselves in the current DSpace schema; it would indeed be
> interesting to profile their characteristics in order to get some kind
> of idea as to what should inform the design decision towards one
> pattern over another. When I test SPARQL queries over the various
> storage engines, I'll see if I can identify any characteristics for
> the various relations; see which maps to more efficient queries over
> RDF. That won't be the be-all-and-end-all of the design decision, but
> could prove interesting! Thanks for sharing your thoughts and
> conversations with your DBA!
>> Mark R. Diggory - DSpace Developer and Systems Manager
>> MIT Libraries, Systems and Technology Services
>> Massachusetts Institute of Technology