From: Mitch S. <mit...@be...> - 2007-02-19 09:19:54
Sorry for the brain dump earlier--here's a shorter, better-digested version.

As I see it, the main point of having a genome wiki is to make genomic data editable. It's important to note that making *data* editable is different from making *documents* editable--I expect data to be interpretable using software, but while documents can be managed by software, actually interpreting them using software is still very much an unsolved problem. The data/document distinction is reflected in the difference between a semantic wiki and a regular wiki: in a semantic wiki the content contains handles for software to grab onto, whereas the slippery, hard-to-parse natural language content of a non-semantic wiki is much, much harder for software to pull information out of.

For data editing, lots of UIs exist already, of course--there's an army of Visual Basic programmers out there putting editing interfaces in front of relational databases. However, those data-editing UIs (and the databases behind them) are relatively inflexible; if some new situation arises and you want to store some new kind of information, you're SOL until your local programmer can get around to adding support for it. This is the reason for the appalling success of Excel and Access as data-management systems. Having done data-management work in the biological trenches, literally right next to the lab benches, I can tell you that this is an ongoing pain point. Flexibility is especially important in a community annotation context, where you want people to be able to add information without having to agree on a data model first.

So the semantic wiki and its RDF data model occupy a nice middle ground between relational databases, which are fast and efficient but relatively inflexible, and the document-style wiki, which is flexible but not really queryable. The data content of a semantic wiki is more useful than pure natural-language wiki content because you can pull data out of the semantic wiki and do something with it, like adding graphical decorations to features that have certain kinds of wiki annotations. Generic software that handles RDF (like Piggy Bank) can also make use of the semantic wiki data.

To some extent we can have our cake and eat it too by integrating RDF data stores ("triplestores") with relational databases. You can start out with a fast, efficient relational skeleton that's already supported by lots of software (like chado) and then hang whatever new kinds of information you want off of it. The new kinds of information go into the triplestore, and at query time, data from the relational tables and from the triplestore can be blended together.

Over time, I expect some kinds of new information to become better understood. Once there is consensus on how a particular kind of information should be modeled, it can be moved from the triplestore into a set of relational tables. When that happens, it's possible to keep the same client-side RDF view of the data; the only differences are that the whole system gets faster and software for processing and analyzing the new data gets easier to write.
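To make the "pull data out and do something with it" point concrete, here's a rough sketch of the kind of query a browser or script could run once the relational data (exported through something like a D2R mapping) and the wiki-added statements live in the same RDF space. Everything here--file names, URIs, property names--is invented for illustration, and it uses rdflib just because it's handy:

    from rdflib import Graph

    g = Graph()
    # Triples exported from the relational (chado-style) side, e.g. by a
    # D2R mapping -- the file names here are hypothetical.
    g.parse("chado_features.rdf")
    # Triples contributed freely through the semantic wiki.
    g.parse("wiki_annotations.rdf")

    # Join across the two sources: which features carry a wiki-added
    # 'curation_status' note?  (Both property URIs are invented.)
    rows = g.query("""
        PREFIX chado: <http://example.org/chado/>
        PREFIX wiki:  <http://example.org/genomewiki/property/>
        SELECT ?name ?status WHERE {
            ?f chado:feature_name   ?name .
            ?f wiki:curation_status ?status .
        }
    """)

    # e.g. use the results to decorate matching features in a browser track
    decorations = {str(name): str(status) for name, status in rows}
    print(decorations)

The same query could just as well go against a live SPARQL endpoint instead of parsed files; the point is only that the relational skeleton and the wiki's free-form additions can be joined in one place.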
So, if you buy all this, then IMO the next steps in this area are:

1. Evaluate RDF/relational integration tools. The main contenders appear to be D2R and Virtuoso. D2R is nice because it works with existing databases. Virtuoso is nice because it has good relational/triplestore integration. Whether it's easier to integrate D2R with a triplestore or port chado to Virtuoso is an open question.

2. Get Semantic MediaWiki to talk to the chosen triplestore.

3. Figure out how the namespaces/idspaces ought to work. We want a system that's flat enough that it's easy for people to make wiki links between entities, but deep enough that IDs from various sources/applications don't step on each other (a toy sketch of one way to slice that is in the P.S. below).

My first priority at the moment is to try and get some kind of persistent feature upload/display working; my hope is that we'll have thought through the IDspace issues by the time we get to implementing that part.

Regards,
Mitch
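P.S. On the idspace question, just to make the flat-vs-deep trade-off concrete (not a proposal, and the idspaces, accessions, and base URL below are invented): something as simple as an "idspace:accession" convention keeps wiki links short enough to type by hand while still keeping identifiers from different sources out of each other's way, and the same string maps trivially onto a URI for the triplestore side.

    # Toy sketch only: the idspaces, accessions, and base URL are made up.
    WIKI_BASE = "http://example.org/genomewiki/"

    def page_title(idspace, accession):
        """Flat, human-typable wiki link target, e.g. 'FlyBase:FBgn0000490'."""
        return f"{idspace}:{accession}"

    def entity_uri(idspace, accession):
        """The same identifier as a URI for the RDF/triplestore side."""
        return WIKI_BASE + page_title(idspace, accession)

    print(page_title("FlyBase", "FBgn0000490"))  # -> FlyBase:FBgn0000490
    print(entity_uri("SO", "0000704"))           # -> http://example.org/genomewiki/SO:0000704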