From: Mitch S. <mit...@be...> - 2007-02-13 23:48:57
Chris Mungall wrote:
> Existing relational databases can be wrapped using tools such as D2RQ.
> There are definitely efficiency considerations. I'm exploring some
> alternatives of the home-grown variety but don't have anything to
> report yet. I think writing back to a non-generic schema from RDF is
> difficult, but I'm not sure we need to do this.

Well, I was vaguely thinking of having a semantic wiki be the interface
to editing all of the data. For example, from chado we could generate
semantic wiki text something like this:

=============
Feature [http://genome.biowiki.org/genomes/scer/I#foo foo] is a
[[feature type::SOFA:gene]] on [[reference sequence::SGD:I]] from
[[start base:=5000bp]] to [[end base:=6000bp]]. It is involved with
[[go term::GO:0019202|amino acid kinase activity]].
=============

This is using the Semantic Wikipedia syntax:
http://ontoworld.org/wiki/Help:Annotation

So when someone edits that wiki text and saves it, I was hoping that the
right relational table<->RDF mapping (e.g., with D2R) would take the
attributes that came from chado originally and automagically put them
back in the right place in chado. In other words, the D2R mapping (or
possibly the semantic wiki software) could be taught to treat the
"feature type", "reference sequence", "start base", "end base", and
"go term" attributes specially in the RDF->DB direction. If this isn't
already implemented, I think it's worthwhile to do.

Also, any attributes that didn't have a "treat specially" mapping would
automagically go into a triplestore. I'm hoping not to have to implement
that myself, but I think it's worthwhile as well.

This is a big part of what "genome wiki" means to me--being able to edit
all of the information (both from chado and from the triplestore),
hopefully all through the same interface. Also, if this kind of editing
is already implemented, then that saves us from having to implement a
custom genomic information editor in the browser.
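To make the "treat specially" routing concrete, here is a rough sketch of the split I have in mind (nothing here is existing D2R or Semantic MediaWiki code; the property names come from the example above, and the regex is a simplification of the real SMW syntax):

```python
import re

# Toy sketch: pull Semantic-MediaWiki-style [[property::value]] and
# [[property:=value]] annotations out of wiki text, then route the
# "treat specially" properties toward chado and the rest to a triplestore.
ANNOTATION = re.compile(r"\[\[([^:\]]+):[:=]([^\]|]+)(?:\|[^\]]*)?\]\]")

# Properties that map back onto chado columns (per the example above);
# anything else falls through to the triplestore.
CHADO_PROPERTIES = {"feature type", "reference sequence",
                    "start base", "end base"}

def split_annotations(wiki_text):
    chado_attrs, triplestore = {}, []
    for prop, value in ANNOTATION.findall(wiki_text):
        prop, value = prop.strip(), value.strip()
        if prop in CHADO_PROPERTIES:
            chado_attrs[prop] = value        # RDF->DB direction
        else:
            triplestore.append((prop, value))  # generic triples
    return chado_attrs, triplestore

text = ("Feature foo is a [[feature type::SOFA:gene]] on "
        "[[reference sequence::SGD:I]] from [[start base:=5000bp]] to "
        "[[end base:=6000bp]]. It is involved with "
        "[[go term::GO:0019202|amino acid kinase activity]].")

attrs, extra = split_annotations(text)
```

In the real system the chado-bound half would go through the D2R mapping rather than a hand-written parser, of course; this just shows the intended partition of the data.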
If we wanted to, later on we might implement some kind of click&drag
editing interface in the browser, or somehow plug in AmiGO for adding GO
terms, but that would be optional.

I agree with all of the things you say below, but it seems like you're
mostly talking about the us->community direction. Querying
chado+triplestore seems relatively straightforward <fingers crossed>,
and annotation uploading makes sense to me (e.g.,
gmod_bulk_load_gff3.pl); it's the editing that I'm more worried about. I
had the impression (hope) that it was mostly implemented and we could
just wire it all together in a smart way, but if not I'd be inclined to
take a stab at it.

Mitch

> We want a community-based way of sharing data that fits neatly into
> the 1d feature paradigm, and we want this to be fast, standards-based
> and interoperable with current genomics tools, so genomics datamodels
> and exchange formats will continue to play a part. We may also want a
> way of exposing the inherent semantics in those little boxes to
> computers that don't speak genomics. It's unclear exactly who gains,
> when and how, but the cost is not so high (avenues include: SPARQL
> queries for genome databases; Das2rdf; use of microformats and rdf in
> gbrowse display).
>
> Then there are the annotations on these little boxes; statements about
> the underlying biological entities. On the one hand this is the wild,
> untrammelled frontier - these entities may be linked to other entities
> which are themselves described by composites of other interlinked
> entities. We can take a ride traversing these links through multiple
> levels of biological granularity, from atomic structures through to
> anatomical structures, physiological processes, phenotypes,
> environments, life-forms living in the hydrothermal vents on Jupiter's
> moons... OK, perhaps RDF can't deliver on the astrobiology quite yet,
> but it seems that this open-ended world beyond genomics is a good
> reason to try RDF.
> Orthogonal to this is the "reification" model. Even in our wiki-esque
> community model we want to conform to good annotation practice and
> encourage all links to be accompanied with provenance, evidence and so
> on.
>
> What does this mean in terms of implementation? It could be fairly
> simple. GBrowse could be augmented by a 3rd-party triplestore. The
> primary datastore would continue to be the genomics schema of choice,
> e.g. chado, but freeform 3rd-party annotations on features could go in
> the triplestore. I have a few ideas about how this could be layered
> on top of a gbrowse-type display, and you have the advantage of
> transparency to generic semweb software, to the extent it exists in
> usable forms at the moment.
>
> This seems a fairly low-risk approach to the community annotation
> store problem. In fact, other approaches will be higher risk as they
> will require rolling your own technology. Triplestores can be slow for
> complex multi-join queries, but I think many of your use cases will
> involve simple neighbourhood graphs. Queries such as "find all genes
> upstream of genes in a pathway implicated in disease X with function
> Y" will perform dreadfully if you take the ontological closure into
> account. We're working on technology for this in the Berkeley
> Ontologies Project, but you shouldn't place any dependencies on this
> yet.
>
> Well, I've gone on a bit and haven't really covered all the bases - my
> recommendation is to proceed enthusiastically but cautiously. As you
> can see, I'm part gung-ho about rdf/semweb and part skeptical. The
> basic idea of linking by URIs is simple and cool and powerful.
> Ironically, I think it is the semantic part that is somewhat lacking,
> with the lack of scalable OWL support, but this is changing....
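P.S. For the "primary datastore plus 3rd-party triplestore" layering Chris describes, the query side could look roughly like this (toy sketch only: SQLite stands in for chado, a plain tuple list stands in for the triplestore, and the table/column names are a much-simplified version of chado's feature/featureloc):

```python
import sqlite3

# In-memory stand-in for the relational (chado-like) side.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE feature (uniquename TEXT, type TEXT, fmin INT, fmax INT)")
db.execute("INSERT INTO feature VALUES ('foo', 'gene', 5000, 6000)")

# Stand-in for the 3rd-party triplestore: freeform (subject, predicate,
# object) statements keyed by feature URI. URI base is from the wiki example.
BASE = "http://genome.biowiki.org/genomes/scer/I#"
triples = [
    (BASE + "foo", "go term", "GO:0019202"),
]

def feature_report(uniquename):
    """Join the relational record with any triplestore annotations."""
    row = db.execute(
        "SELECT type, fmin, fmax FROM feature WHERE uniquename = ?",
        (uniquename,)).fetchone()
    uri = BASE + uniquename
    extra = [(p, o) for s, p, o in triples if s == uri]
    return {"type": row[0], "fmin": row[1], "fmax": row[2],
            "annotations": extra}

report = feature_report("foo")
```

The point is just that the two stores stay separate and are only joined at display/query time by the feature's URI, which is what keeps the genomics schema authoritative.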
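P.P.S. On why "taking the ontological closure into account" hurts performance: before you can filter on "function Y" you have to expand every annotation term over the ontology's is_a links. A minimal sketch of that expansion (the hierarchy fragment below is invented for illustration, not real GO structure):

```python
# Toy is_a hierarchy: child -> parents. In GO this graph has tens of
# thousands of terms, which is why closure-aware queries get expensive.
IS_A = {
    "amino acid kinase activity": ["kinase activity"],
    "kinase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
}

def closure(term):
    """All ancestors of `term`, including itself (transitive is_a closure)."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(IS_A.get(t, []))
    return seen

def has_function(annotated_term, query_term):
    """A feature annotated to a specific term also matches any ancestor."""
    return query_term in closure(annotated_term)
```

Doing this per-annotation inside a multi-join query is what makes the naive version perform dreadfully; precomputing the closure as a table is the usual workaround.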