From: Mitch S. <mit...@be...> - 2007-02-12 19:04:29
This is sort of a brain dump; I'm not sure what I really think about this but I'm hoping for some discussion. This email therefore meanders a bit, which is dangerous given that people are already not reading my email all the way through, but some decisions in this area need to be made in the near future and I want to have some thoughts written down about them. Also, given that this is somewhat fuzzy in my head at the moment, there's some risk of going into architecture-astronaut mode and getting lost in abstruse philosophical questions. However, given that there are people out there in the middle of implementing that abstruse stuff, if we want to piggyback on their work then we have to have some idea about what we want/need. So there are some concrete and immediate things to consider. Also, I know there are some people on this list who know more about this stuff than I do, so hopefully rather than feeling patronized they'll respond to tell me what's up.

I've been thinking about how to integrate the relatively stable, well-understood, structured parts of the annotations with the less well understood, less structured aspects. For example, a feature usually has a start and an end point on some reference sequence: there are a few complications (0-based, 1-based, interbase) but generally speaking this is pretty basic and widespread and baked into a variety of software. A highly structured data store like a relational database is a good choice for this kind of information; knowing the structure of your information allows you to store and query it very efficiently. A relational database is kind of like the chain saw of data management, if the chain saw were mounted on an extremely precise industrial robot.

On the other hand, there are other things that are harder to predict. Given that there's new research going on all the time producing new kinds of data, it'll be a while before there's a chado module for storing those. It's a bad idea to try to design a database schema to store this information now, when it's not so well (or widely) understood (c.f. organism vs. taxonomy in chado), but we do want to store it (right?), so IMO we also have to have something less structured than a relational database schema. It's certainly possible to have too little structure, though--every time I hear someone complain about feeling too restricted by a relational schema I want to tell them, "hey, I've got a perfectly general format for storing data: a stream of bits". Having a restriction on the data is just the flip side of knowing something about the data.

We do want to be able to query the data efficiently; free text search is nice, but even in the google age we still have to wade through lots of irrelevant results. And we want to be able to write software to process the data without having to solve the problem of natural language understanding. So, like Goldilocks, we want to find just the right amount of structure. Papa bear is clearly a relational database; mama bear is XML (or possibly a non-semantic wiki), whose document-oriented history makes it a little soupy for my taste, though this could be debated (and I'd be happy to debate it if anyone wants to); and baby bear is RDF. I don't want to write an RDF-advocacy essay, especially since there's already been so much unfulfilled Semantic Web hype. I just want to say that I think it's Just Right structure-wise. And there's a decently large and growing number of tools for dealing with it.
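As a teaser before the proper introduction below, here's a minimal sketch of that structured-plus-unstructured mix in Python with rdflib. Everything here is made up for illustration--the namespaces, the feature URI, and the property names are not from any real vocabulary:

============
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Hypothetical namespaces -- not real, published vocabularies.
FEAT = Namespace("http://example.org/features/")
ANN = Namespace("http://example.org/terms/")

g = Graph()
foo = FEAT["foo"]

# The stable, well-structured part: the kind of thing chado handles well.
g.add((foo, RDF.type, ANN["gene"]))
g.add((foo, ANN["reference"], FEAT["chr2L"]))
g.add((foo, ANN["start"], Literal(10000)))
g.add((foo, ANN["end"], Literal(12500)))

# The unanticipated part: a new kind of assertion that nobody designed
# a schema column for.  No ALTER TABLE, no new chado module required.
g.add((foo, ANN["someBrandNewKindOfEvidence"], Literal("whatever it is")))

print(g.serialize(format="turtle"))
============

The point being that both kinds of statement are completely uniform at the storage level, which is (I think) the "just right" property.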
If you're not familiar with RDF, here's the wikipedia introduction:

============
Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata model but which has come to be used as a general method of modeling knowledge, through a variety of syntax formats. The RDF metadata model is based upon the idea of making statements about resources in the form of subject-predicate-object expressions, called triples in RDF terminology. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, one way to represent the notion "The sky has the color blue" in RDF is as a triple of specially formatted strings: a subject denoting "the sky", a predicate denoting "has the color", and an object denoting "blue".
==============

If you buy this so far, then the main problem to consider is how to integrate the stuff that fits well in a relational database (feature, reference sequence, start, end) with the stuff that doesn't (? need some examples). In Goldilocks terms I want to have papa bear and baby bear all rolled into one. In web terms I want both relational and semi-structured data to play a role in generating the representation for a single resource (e.g., to serve the data for a single feature entity I want to query both chado (or BioSQL?) and an RDF triplestore and combine the results into an RDF graph).

So I've been doing some googling and I've noticed that there are some systems for taking a relational database and serving RDF. Chris, how do you like D2R so far? Do you think chado and BioSQL would work equally well with it, or is one better than the other? It appears that it doesn't integrate directly with a triplestore, is that right? If the client is only aware of RDF, how do we insert and update information? And how do we make sure that information that's added via RDF ends up in the right place in the relational tables?

In my googling I've also come across samizdat http://www.nongnu.org/samizdat/ which appears to do the relational table/triplestore integration thing. However, it doesn't appear to support SPARQL. And judging by the mailing list the community there seems pretty small.

One of the really interesting aspects of samizdat is that it uses RDF reification to do moderation-type stuff. RDF reification, if you're not familiar, allows you to make RDF statements about other RDF statements. For example, without reification you could make statements like "the sky has the color blue"; reification allows you to say "Mitch says (the sky has the color blue)"--the original statement gets reified into the space of subjects and objects and can then participate in other RDF statements. This all sounds fairly abstruse to me, but IMO it's pretty much exactly what we would want in a community annotation system. We want to store data with some structure but not too much (RDF) and we also want to take those bits of data and allow people to make statements about their source and quality ("annotation foo is from the holmes lab", "annotation foo is computationally-generated", "annotation bar was manually curated", "(annotation bar was manually curated) by so-and-so"). And then we want to take that information about how good a bit of data is and use it to filter or highlight features in the browser or something.
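To make reification a bit more concrete, here's a rough sketch using the same made-up rdflib vocabulary as above. The rdf:Statement / rdf:subject / rdf:predicate / rdf:object terms are the standard reification vocabulary from the RDF spec; everything else is hypothetical:

============
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import RDF

ANN = Namespace("http://example.org/terms/")
FEAT = Namespace("http://example.org/features/")
PEOPLE = Namespace("http://example.org/people/")

g = Graph()

# The base statement: "annotation bar was manually curated".
stmt = (FEAT["bar"], ANN["curationStatus"], Literal("manually curated"))
g.add(stmt)

# Reify it: mint a node that stands for the statement itself, and
# describe that node with the standard reification vocabulary.
s = BNode()
g.add((s, RDF.type, RDF.Statement))
g.add((s, RDF.subject, stmt[0]))
g.add((s, RDF.predicate, stmt[1]))
g.add((s, RDF["object"], stmt[2]))

# Now the statement can participate in other statements:
# "(annotation bar was manually curated) by so-and-so".
g.add((s, ANN["assertedBy"], PEOPLE["so-and-so"]))
============

One detail worth noticing: reifying a statement doesn't automatically assert it, so whether the base triple also goes in the graph (as above) is a convention we'd have to pick.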
"show me all the features I've commented on", "show me all the features from so-and-so", "show me all the features approved by members of my group", "click these buttons to increase/decrease the quality score for this feature", "show me only features with a quality score above 6", and so on. Reification seems like a somewhat more obscure part of the RDF spec, so I'm not sure how well it's supported in RDF tools in general, or even to what extent it needs to be specifically supported. Specifically, I need to try and figure out if the wiki editing in Semantic MediaWiki can be used to enter RDF statements using reification. Or maybe we need to develop some specialized UI for this in any case. As I understand it, one drawback of reification is that you're taking something that was first-order and making it higher-order, which tends to throw lots of computational tractability guarantees out the window. But I don't know what specifically we'd be giving up there. I wonder if we'd be better off avoiding reification and trying to collapse all meta-statements onto their referents somehow (e.g., instead of "Mitch says (the sky is blue)" have something like "the sky is blue" and "the sky was color-determined by Mitch"). Also, I was originally vaguely thinking of trying to squeeze RDF into the DAS2 feature property mechanism but I'm wondering whether or not it would just be better to dispense with DAS2 entirely and just use RDF to describe feature boundaries, type, relationships and whatever else DAS2 covers. I thought DAS2 had some momentum but in trying to get the gmod das2 server running I actually came across what appears to be a syntax error in one of its dependencies (MAGE::XML::Writer from CPAN) so I'm having doubts about how much it's actually getting used. What would be the pros and cons of doing a SPARQL query via D2R<->chado vs. a DAS2 query against chado? IMO the main relevant considerations are query flexibility, query performance, and how easy it is to do in javascript with XHR. I think I'm going to experiment a little with D2R and Virtuoso and see how things go. I believe representing everything with RDF serves Chris' goal of being "semantically transparent", which allows for lots of interesting integration scenarios ("mashups"). And I agree, it's one of those things that buys you lots of power almost for free. RDF is certainly more widely supported than DAS2 is. Also, even though I'm relatively ignorant I'd like to respond to this: http://www.bioontology.org/wiki/index.php/OBD:SPARQL-GO#Representing_the_links_between_genes_and_GO_types and say that although I'm not exactly sure what "interoperation" means here, it seems to me that given a feature URI anyone can make an RDF statement about that concrete feature instance. And all the assertions that have been made about classes can be "exploded" onto the individual instances, right?. So concrete instances seem to me to be the more interoperable way to go. I suppose that if you do everything with individuals it's hard to go back and make assertions about classes--whats's a specific use case for that? I guess the thing that worries me about making universal assertions in biology is that there are so many exceptions. In math/logic/CS you can make universally quantified assertions about abstractions because you make up the abstractions and construct systems using them. The classes/abstractions that you create are endogenous to the systems. 
I guess the thing that worries me about making universal assertions in biology is that there are so many exceptions. In math/logic/CS you can make universally quantified assertions about abstractions because you make up the abstractions and construct systems using them: the classes/abstractions that you create are endogenous to the systems. But in biology the abstractions are exogenous; the cell doesn't care about the central dogma (e.g., with ncRNA). So classes/abstractions in biology will generally have to grow hairs and distinctions over time, and then what happens to the concrete instances that have been tagged with a certain class name? They have to be manually reclassified, AFAICS. Hence the continuing presence of cvterms where is_obsolete is true.

So I guess I'm saying that I think with community annotation it's fine for people to make statements about concrete instances rather than classes, and I believe they'll generally find it easier to do so. I suppose the question of what's "natural" is one to do user testing on eventually. If we do in fact "let a thousand flowers bloom", then a good query/search engine can still give us digestible pieces to work with, right? I hope.

Sorry for the length and the stream-of-consciousness-ness. I'm sure a lot of what I'm saying is not new, but I think we have to have these discussions--unless this is already well-settled territory and someone can point me to a review paper.

Mitch