From: Mitch S. <mit...@be...> - 2007-02-19 09:19:54
Sorry for the brain dump earlier--here's a shorter, better-digested version.

As I see it, the main point of having a genome wiki is to make genomic data editable. It's important to note that making *data* editable is different from making *documents* editable--I expect data to be interpretable using software, but while documents can be managed by software, actually interpreting them using software is still very much an unsolved problem. The data/document distinction is reflected in the difference between a semantic wiki and a regular wiki: in a semantic wiki the content contains handles for software to grab onto, whereas the slippery, hard-to-parse natural language content of a non-semantic wiki is much, much harder for software to pull information out of.

For data editing, lots of UIs exist already, of course--there's an army of Visual Basic programmers out there putting editing interfaces in front of relational databases. However, those data-editing UIs (and the databases behind them) are relatively inflexible; if some new situation arises and you want to store some new kind of information, you're SOL until your local programmer can get around to adding support for it. This is the reason for the appalling success of Excel and Access as data-management systems. Having done data-management work in the biological trenches, literally right next to the lab benches, I can tell you that this is an ongoing pain point. Flexibility is especially important in a community annotation context, where you want people to be able to add information without having to agree on a data model first.

So the semantic wiki and its RDF data model occupy a nice middle ground between relational databases, which are fast and efficient but relatively inflexible, and the document-style wiki, which is flexible but not really queryable. The data content of a semantic wiki is more useful than pure natural-language wiki content because you can pull data out of the semantic wiki and do something with it, like adding graphical decorations to features that have certain kinds of wiki annotations. Generic software that handles RDF (like Piggy Bank) can also make use of the semantic wiki data.

To some extent we can have our cake and eat it too by integrating RDF data stores ("triplestores") with relational databases. You can start out with a fast, efficient relational skeleton that's already supported by lots of software (like chado) and then hang whatever new kinds of information you want off of it. The new kinds of information go into the triplestore, and at query time, data from the relational tables and from the triplestore can be blended together.

Over time, I expect some kinds of new information to become better understood. Once there is consensus on how a particular kind of information should be modeled, it can be moved from the triplestore into a set of relational tables. When that happens, it's possible to keep the same client-side RDF view of the data; the only differences are that the whole system gets faster and software for processing and analyzing the new data gets easier to write.
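To make the "pull data out and do something with it" point concrete, here's a rough sketch of the kind of query a browser or script could run once the relational data (exported through something like a D2R mapping) and the wiki-added statements live in the same RDF space. Everything here--file names, URIs, property names--is invented for illustration, and it uses rdflib just because it's handy:

    from rdflib import Graph

    g = Graph()
    # Triples exported from the relational (chado-style) side, e.g. by a
    # D2R mapping -- the file names here are hypothetical.
    g.parse("chado_features.rdf")
    # Triples contributed freely through the semantic wiki.
    g.parse("wiki_annotations.rdf")

    # Join across the two sources: which features carry a wiki-added
    # 'curation_status' note?  (Both property URIs are invented.)
    rows = g.query("""
        PREFIX chado: <http://example.org/chado/>
        PREFIX wiki:  <http://example.org/genomewiki/property/>
        SELECT ?name ?status WHERE {
            ?f chado:feature_name   ?name .
            ?f wiki:curation_status ?status .
        }
    """)

    # e.g. use the results to decorate matching features in a browser track
    decorations = {str(name): str(status) for name, status in rows}
    print(decorations)

The same query could just as well go against a live SPARQL endpoint instead of parsed files; the point is only that the relational skeleton and the wiki's free-form additions can be joined in one place.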
So, if you buy all this, then IMO the next steps in this area are:

1. Evaluate RDF/relational integration tools. The main contenders appear to be D2R and Virtuoso. D2R is nice because it works with existing databases. Virtuoso is nice because it has good relational/triplestore integration. Whether it's easier to integrate D2R with a triplestore or port chado to Virtuoso is an open question.

2. Get Semantic MediaWiki to talk to the chosen triplestore.

3. Figure out how the namespaces/idspaces ought to work. We want a system that's flat enough that it's easy for people to make wiki links between entities, but deep enough that IDs from various sources/applications don't step on each other (a toy sketch of one way to slice that is in the P.S. below).

My first priority at the moment is to try and get some kind of persistent feature upload/display working; my hope is that we'll have thought through the IDspace issues by the time we get to implementing that part.

Regards,
Mitch
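P.S. On the idspace question, just to make the flat-vs-deep trade-off concrete (not a proposal, and the idspaces, accessions, and base URL below are invented): something as simple as an "idspace:accession" convention keeps wiki links short enough to type by hand while still keeping identifiers from different sources out of each other's way, and the same string maps trivially onto a URI for the triplestore side.

    # Toy sketch only: the idspaces, accessions, and base URL are made up.
    WIKI_BASE = "http://example.org/genomewiki/"

    def page_title(idspace, accession):
        """Flat, human-typable wiki link target, e.g. 'FlyBase:FBgn0000490'."""
        return f"{idspace}:{accession}"

    def entity_uri(idspace, accession):
        """The same identifier as a URI for the RDF/triplestore side."""
        return WIKI_BASE + page_title(idspace, accession)

    print(page_title("FlyBase", "FBgn0000490"))  # -> FlyBase:FBgn0000490
    print(entity_uri("SO", "0000704"))           # -> http://example.org/genomewiki/SO:0000704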