From: Mitch S. <mit...@be...> - 2007-02-12 19:04:29
This is sort of a brain dump; I'm not sure what I really think about this but I'm hoping for some discussion. This email therefore meanders a bit, which is dangerous given that people are already not reading my email all the way through, but some decisions in this area need to be made in the near future and I want to have some thoughts written down about them. Also, given that this is somewhat fuzzy in my head at the moment, there's some risk of going into architecture-astronaut mode and getting lost in abstruse philosophical questions. However, given that there are people out there in the middle of implementing that abstruse stuff, if we want to piggyback on their work then we have to have some idea about what we want/need. So there are some concrete and immediate things to consider. Also, I know there are some people on this list who know more about this stuff than I do, so hopefully rather than feeling patronized they'll respond to tell me what's up.

I've been thinking about how to integrate the relatively stable, well-understood, structured parts of the annotations with the less well understood, less structured aspects. For example, a feature usually has a start and an end point on some reference sequence: there are a few complications (0-based, 1-based, interbase) but generally speaking this is pretty basic and widespread and baked into a variety of software. A highly structured data store like a relational database is a good choice for this kind of information; knowing the structure of your information allows you to store and query it very efficiently. A relational database is kind of like the chain saw of data management, if the chain saw were mounted on an extremely precise industrial robot.

On the other hand, there are other things that are harder to predict. Given that there's new research going on all the time producing new kinds of data, it'll be a while before there's a chado module for storing those. It's a bad idea to try to design a database schema to store this information now, when it's not so well (or widely) understood (c.f. organism vs. taxonomy in chado), but we do want to store it (right?), so IMO we also have to have something less structured than a relational database schema. It's certainly possible to have too little structure, though--every time I hear someone complain about feeling too restricted by a relational schema I want to tell them, "hey, I've got a perfectly general format for storing data: a stream of bits". Having a restriction on the data is just the flip side of knowing something about the data.

We do want to be able to query the data efficiently; free text search is nice, but even in the google age we still have to wade through lots of irrelevant results. And we want to be able to write software to process the data without having to solve the problem of natural language understanding. So, like Goldilocks, we want to find just the right amount of structure. Papa bear is clearly a relational database; mama bear is XML (or possibly a non-semantic wiki), whose document-oriented history makes it a little soupy for my taste, though this could be debated (and I'd be happy to debate it if anyone wants to); and baby bear is RDF. I don't want to write an RDF-advocacy essay, especially since there's already been so much unfulfilled Semantic Web hype. I just want to say that I think it's Just Right structure-wise. And there's a decently large and growing number of tools for dealing with it.
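As a teaser before the proper introduction below, here's a minimal sketch of that structured-plus-unstructured mix in Python with rdflib. Everything here is made up for illustration--the namespaces, the feature URI, and the property names are not from any real vocabulary:

============
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Hypothetical namespaces -- not real, published vocabularies.
FEAT = Namespace("http://example.org/features/")
ANN = Namespace("http://example.org/terms/")

g = Graph()
foo = FEAT["foo"]

# The stable, well-structured part: the kind of thing chado handles well.
g.add((foo, RDF.type, ANN["gene"]))
g.add((foo, ANN["reference"], FEAT["chr2L"]))
g.add((foo, ANN["start"], Literal(10000)))
g.add((foo, ANN["end"], Literal(12500)))

# The unanticipated part: a new kind of assertion that nobody designed
# a schema column for.  No ALTER TABLE, no new chado module required.
g.add((foo, ANN["someBrandNewKindOfEvidence"], Literal("whatever it is")))

print(g.serialize(format="turtle"))
============

The point being that both kinds of statement are completely uniform at the storage level, which is (I think) the "just right" property.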
If you're not familiar with RDF, here's the wikipedia introduction:

============
Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata model but which has come to be used as a general method of modeling knowledge, through a variety of syntax formats. The RDF metadata model is based upon the idea of making statements about resources in the form of subject-predicate-object expressions, called triples in RDF terminology. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, one way to represent the notion "The sky has the color blue" in RDF is as a triple of specially formatted strings: a subject denoting "the sky", a predicate denoting "has the color", and an object denoting "blue".
==============

If you buy this so far, then the main problem to consider is how to integrate the stuff that fits well in a relational database (feature, reference sequence, start, end) with the stuff that doesn't (? need some examples). In Goldilocks terms I want to have papa bear and baby bear all rolled into one. In web terms I want both relational and semi-structured data to play a role in generating the representation for a single resource (e.g., to serve the data for a single feature entity I want to query both chado (or BioSQL?) and an RDF triplestore and combine the results into an RDF graph).

So I've been doing some googling and I've noticed that there are some systems for taking a relational database and serving RDF. Chris, how do you like D2R so far? Do you think chado and BioSQL would work equally well with it, or is one better than the other? It appears that it doesn't integrate directly with a triplestore, is that right? If the client is only aware of RDF, how do we insert and update information? And how do we make sure that information that's added via RDF ends up in the right place in the relational tables?

In my googling I've also come across samizdat http://www.nongnu.org/samizdat/ which appears to do the relational table/triplestore integration thing. However, it doesn't appear to support SPARQL. And judging by the mailing list the community there seems pretty small.

One of the really interesting aspects of samizdat is that it uses RDF reification to do moderation-type stuff. RDF reification, if you're not familiar, allows you to make RDF statements about other RDF statements. For example, without reification you could make statements like "the sky has the color blue"; reification allows you to say "Mitch says (the sky has the color blue)"--the original statement gets reified into the space of subjects and objects and can then participate in other RDF statements. This all sounds fairly abstruse to me, but IMO it's pretty much exactly what we would want in a community annotation system. We want to store data with some structure but not too much (RDF) and we also want to take those bits of data and allow people to make statements about their source and quality ("annotation foo is from the holmes lab", "annotation foo is computationally-generated", "annotation bar was manually curated", "(annotation bar was manually curated) by so-and-so"). And then we want to take that information about how good a bit of data is and use it to filter or highlight features in the browser or something.
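To make reification a bit more concrete, here's a rough sketch using the same made-up rdflib vocabulary as above. The rdf:Statement / rdf:subject / rdf:predicate / rdf:object terms are the standard reification vocabulary from the RDF spec; everything else is hypothetical:

============
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import RDF

ANN = Namespace("http://example.org/terms/")
FEAT = Namespace("http://example.org/features/")
PEOPLE = Namespace("http://example.org/people/")

g = Graph()

# The base statement: "annotation bar was manually curated".
stmt = (FEAT["bar"], ANN["curationStatus"], Literal("manually curated"))
g.add(stmt)

# Reify it: mint a node that stands for the statement itself, and
# describe that node with the standard reification vocabulary.
s = BNode()
g.add((s, RDF.type, RDF.Statement))
g.add((s, RDF.subject, stmt[0]))
g.add((s, RDF.predicate, stmt[1]))
g.add((s, RDF["object"], stmt[2]))

# Now the statement can participate in other statements:
# "(annotation bar was manually curated) by so-and-so".
g.add((s, ANN["assertedBy"], PEOPLE["so-and-so"]))
============

One detail worth noticing: reifying a statement doesn't automatically assert it, so whether the base triple also goes in the graph (as above) is a convention we'd have to pick.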
"show me all the features I've commented on", "show me all the features from so-and-so", "show me all the features approved by members of my group", "click these buttons to increase/decrease the quality score for this feature", "show me only features with a quality score above 6", and so on. Reification seems like a somewhat more obscure part of the RDF spec, so I'm not sure how well it's supported in RDF tools in general, or even to what extent it needs to be specifically supported. Specifically, I need to try and figure out if the wiki editing in Semantic MediaWiki can be used to enter RDF statements using reification. Or maybe we need to develop some specialized UI for this in any case. As I understand it, one drawback of reification is that you're taking something that was first-order and making it higher-order, which tends to throw lots of computational tractability guarantees out the window. But I don't know what specifically we'd be giving up there. I wonder if we'd be better off avoiding reification and trying to collapse all meta-statements onto their referents somehow (e.g., instead of "Mitch says (the sky is blue)" have something like "the sky is blue" and "the sky was color-determined by Mitch"). Also, I was originally vaguely thinking of trying to squeeze RDF into the DAS2 feature property mechanism but I'm wondering whether or not it would just be better to dispense with DAS2 entirely and just use RDF to describe feature boundaries, type, relationships and whatever else DAS2 covers. I thought DAS2 had some momentum but in trying to get the gmod das2 server running I actually came across what appears to be a syntax error in one of its dependencies (MAGE::XML::Writer from CPAN) so I'm having doubts about how much it's actually getting used. What would be the pros and cons of doing a SPARQL query via D2R<->chado vs. a DAS2 query against chado? IMO the main relevant considerations are query flexibility, query performance, and how easy it is to do in javascript with XHR. I think I'm going to experiment a little with D2R and Virtuoso and see how things go. I believe representing everything with RDF serves Chris' goal of being "semantically transparent", which allows for lots of interesting integration scenarios ("mashups"). And I agree, it's one of those things that buys you lots of power almost for free. RDF is certainly more widely supported than DAS2 is. Also, even though I'm relatively ignorant I'd like to respond to this: http://www.bioontology.org/wiki/index.php/OBD:SPARQL-GO#Representing_the_links_between_genes_and_GO_types and say that although I'm not exactly sure what "interoperation" means here, it seems to me that given a feature URI anyone can make an RDF statement about that concrete feature instance. And all the assertions that have been made about classes can be "exploded" onto the individual instances, right?. So concrete instances seem to me to be the more interoperable way to go. I suppose that if you do everything with individuals it's hard to go back and make assertions about classes--whats's a specific use case for that? I guess the thing that worries me about making universal assertions in biology is that there are so many exceptions. In math/logic/CS you can make universally quantified assertions about abstractions because you make up the abstractions and construct systems using them. The classes/abstractions that you create are endogenous to the systems. 
I guess the thing that worries me about making universal assertions in biology is that there are so many exceptions. In math/logic/CS you can make universally quantified assertions about abstractions because you make up the abstractions and construct systems using them: the classes/abstractions that you create are endogenous to the systems. But in biology the abstractions are exogenous; the cell doesn't care about the central dogma (e.g., with ncRNA). So classes/abstractions in biology will generally have to grow hairs and distinctions over time, and then what happens to the concrete instances that have been tagged with a certain class name? They have to be manually reclassified, AFAICS. Hence the continuing presence of cvterms where is_obsolete is true.

So I guess I'm saying that I think with community annotation it's fine for people to make statements about concrete instances rather than classes, and I believe they'll generally find it easier to do so. I suppose the question of what's "natural" is one to do user testing on eventually. If we do in fact "let a thousand flowers bloom", then a good query/search engine can still give us digestible pieces to work with, right? I hope.

Sorry for the length and the stream-of-consciousness-ness. I'm sure a lot of what I'm saying is not new, but I think we have to have these discussions--unless this is already well-settled territory and someone can point me to a review paper.

Mitch