From: Mitch S. <mit...@be...> - 2007-02-12 19:04:29
|
This is sort of a brain dump; I'm not sure what I really think about this but I'm hoping for some discussion. This email therefore meanders a bit, which is dangerous given that people are already not reading my email all the way through, but some decisions in this area need to be made in the near future and I want to have some thoughts written down about them.

Also, given that this is somewhat fuzzy in my head at the moment there's some risk of going into architecture-astronaut mode and getting lost in abstruse philosophical questions. However, given that there are people out there that are in the middle of implementing that abstruse stuff, if we want to piggyback on their work then we have to have some idea about what we want/need. So there are some concrete and immediate things to consider.

Also, I know there are some people on this list that know more about this stuff than I do, so hopefully rather than feeling patronized they'll respond to tell me what's up.

I've been thinking about how to integrate the relatively stable, well-understood, structured parts of the annotations with the less well understood, less structured aspects. For example, a feature usually has a start and an end point on some reference sequence: there are a few complications (0-based, 1-based, interbase) but generally speaking this is pretty basic and widespread and baked into a variety of software. A highly structured data store like a relational database is a good choice for this kind of information; knowing the structure of your information allows you to store and query it very efficiently. A relational database is kind of like the chain saw of data management, if the chain saw were mounted on an extremely precise industrial robot.

On the other hand, there are other things that are harder to predict. Given that there's new research going on all the time that's producing new kinds of data, it'll be a while before there's a chado module for storing those. It's a bad idea to try and design a database schema to store this information now when it's not so well (or widely) understood (c.f. organism vs. taxonomy in chado), but we do want to store it (right?), so IMO we also have to have something less structured than a relational database schema.

It's certainly possible to have too little structure, though--every time I hear someone complain about feeling too restricted by a relational schema I want to tell them, "hey, I've got a perfectly general format for storing data: a stream of bits". Having a restriction on the data is just the flip side of knowing something about the data. We do want to be able to efficiently query the data; free text search is nice but even in the google age we still have to wade through lots of irrelevant results. And we want to be able to write software to process the data without having to solve the problem of natural language understanding.

So, like Goldilocks, we want to find just the right amount of structure. Papa bear is clearly a relational database; mama bear is XML (or possibly a non-semantic wiki), whose document-oriented history makes both a little soupy for my taste, though this could be debated (and I would be happy to if anyone wants to); and baby bear is RDF. I don't want to write an RDF-advocacy essay, especially since there's already been so much unfulfilled Semantic Web hype. I just want to say that I think it's Just Right structure-wise. And there's a decently large and growing number of tools for dealing with it.
If you're not familiar with RDF, here's the wikipedia introduction:
============
Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata model but which has come to be used as a general method of modeling knowledge, through a variety of syntax formats.

The RDF metadata model is based upon the idea of making statements about resources in the form of subject-predicate-object expressions, called triples in RDF terminology. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, one way to represent the notion "The sky has the color blue" in RDF is as a triple of specially formatted strings: a subject denoting "the sky", a predicate denoting "has the color", and an object denoting "blue".
============

If you buy this so far, then the main problem to consider is how to integrate the stuff that fits well in a relational database (feature, reference sequence, start, end) with the stuff that doesn't (? need some examples). In Goldilocks terms I want to have papa bear and baby bear all rolled into one. In web terms I want both relational and semi-structured data to play a role in generating the representation for a single resource (e.g., to serve the data for a single feature entity I want to query both chado (or BioSQL?) and an RDF triplestore and combine the results into an RDF graph).

So I've been doing some googling and I've noticed that there are some systems for taking a relational database and serving RDF. Chris, how do you like D2R so far? Do you think chado and BioSQL would work equally well with it, or is one better than the other? It appears that it doesn't integrate directly with a triplestore, is that right? If the client is only aware of RDF, how do we insert and update information? And how do we make sure that information that's added via RDF ends up in the right place in the relational tables?

In my googling I've also come across samizdat
http://www.nongnu.org/samizdat/
which appears to do the relational table/triplestore integration thing. However, it doesn't appear to support SPARQL. And judging by the mailing list the community there seems pretty small.

One of the really interesting aspects of samizdat is that it uses RDF reification to do moderation-type stuff. RDF reification, if you're not familiar, allows you to make RDF statements about other RDF statements. For example, without reification you could make statements like "the sky has the color blue"; reification allows you to say "Mitch says (the sky has the color blue)"--the original statement gets reified into the space of subjects and objects and can then participate in other RDF statements.

This all sounds fairly abstruse to me, but IMO it's pretty much exactly what we would want in a community annotation system. We want to store data with some structure but not too much (RDF) and we also want to take those bits of data and allow people to make statements about their source and quality ("annotation foo is from the holmes lab", "annotation foo is computationally-generated", "annotation bar was manually curated", "(annotation bar was manually curated) by so-and-so"). And then we want to take that information about how good a bit of data is and use it to filter or highlight features in the browser or something.
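To make the reification bit concrete, here's a rough sketch of the sky/color example using Python and rdflib (the namespace, property names, and score are all made up for illustration--this isn't a proposed vocabulary):
============
from rdflib import Graph, Literal, Namespace, BNode
from rdflib.namespace import RDF

EX = Namespace("http://genome.biowiki.org/example#")  # hypothetical namespace
g = Graph()

# The base statement: "the sky has the color blue"
g.add((EX.sky, EX.hasColor, EX.blue))

# Reify it, so the statement itself becomes something we can talk about
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.sky))
g.add((stmt, RDF.predicate, EX.hasColor))
g.add((stmt, RDF.object, EX.blue))

# Provenance/quality claims now hang off the reified statement
g.add((stmt, EX.assertedBy, EX.Mitch))
g.add((stmt, EX.qualityScore, Literal(7)))

print(g.serialize(format="n3"))
============
Swap the sky for a feature URI and the same pattern covers "annotation foo is from the holmes lab" and friends; queries over those reified statements are what would drive the filtering.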
"show me all the features I've commented on", "show me all the features from so-and-so", "show me all the features approved by members of my group", "click these buttons to increase/decrease the quality score for this feature", "show me only features with a quality score above 6", and so on. Reification seems like a somewhat more obscure part of the RDF spec, so I'm not sure how well it's supported in RDF tools in general, or even to what extent it needs to be specifically supported. Specifically, I need to try and figure out if the wiki editing in Semantic MediaWiki can be used to enter RDF statements using reification. Or maybe we need to develop some specialized UI for this in any case. As I understand it, one drawback of reification is that you're taking something that was first-order and making it higher-order, which tends to throw lots of computational tractability guarantees out the window. But I don't know what specifically we'd be giving up there. I wonder if we'd be better off avoiding reification and trying to collapse all meta-statements onto their referents somehow (e.g., instead of "Mitch says (the sky is blue)" have something like "the sky is blue" and "the sky was color-determined by Mitch"). Also, I was originally vaguely thinking of trying to squeeze RDF into the DAS2 feature property mechanism but I'm wondering whether or not it would just be better to dispense with DAS2 entirely and just use RDF to describe feature boundaries, type, relationships and whatever else DAS2 covers. I thought DAS2 had some momentum but in trying to get the gmod das2 server running I actually came across what appears to be a syntax error in one of its dependencies (MAGE::XML::Writer from CPAN) so I'm having doubts about how much it's actually getting used. What would be the pros and cons of doing a SPARQL query via D2R<->chado vs. a DAS2 query against chado? IMO the main relevant considerations are query flexibility, query performance, and how easy it is to do in javascript with XHR. I think I'm going to experiment a little with D2R and Virtuoso and see how things go. I believe representing everything with RDF serves Chris' goal of being "semantically transparent", which allows for lots of interesting integration scenarios ("mashups"). And I agree, it's one of those things that buys you lots of power almost for free. RDF is certainly more widely supported than DAS2 is. Also, even though I'm relatively ignorant I'd like to respond to this: http://www.bioontology.org/wiki/index.php/OBD:SPARQL-GO#Representing_the_links_between_genes_and_GO_types and say that although I'm not exactly sure what "interoperation" means here, it seems to me that given a feature URI anyone can make an RDF statement about that concrete feature instance. And all the assertions that have been made about classes can be "exploded" onto the individual instances, right?. So concrete instances seem to me to be the more interoperable way to go. I suppose that if you do everything with individuals it's hard to go back and make assertions about classes--whats's a specific use case for that? I guess the thing that worries me about making universal assertions in biology is that there are so many exceptions. In math/logic/CS you can make universally quantified assertions about abstractions because you make up the abstractions and construct systems using them. The classes/abstractions that you create are endogenous to the systems. 
But in biology the abstractions are exogenous; the cell doesn't care about the central dogma (e.g., with ncRNA). So classes/abstractions in biology will generally have to grow hairs and distinctions over time, and then what happens to the concrete instances that have been tagged with a certain class name? They have to be manually reclassified, AFAICS. Hence the continuing presence of cvterms where is_obsolete is true.

So I guess I'm saying that I think with community annotation it's fine for people to make statements about concrete instances rather than classes, and I believe that they'll generally find it easier to do so. I suppose the question of what's "natural" is one to do user testing on eventually. If we do in fact "let a thousand flowers bloom" then a good query/search engine can still give us digestible pieces to work with, right? I hope.

Sorry for the length and stream-of-consciousness-ness. I'm sure a lot of what I'm saying is not new, but I think we have to have these discussions. Unless this is already well-settled territory and someone can point me to a review paper.

Mitch
|
From: Chris M. <cj...@fr...> - 2007-02-13 07:22:46
|
Hi Mitch

Wow, that's quite a lot packed into that email! In a good way. You ask some good questions, and I certainly don't know the answer to all of them.

RDF is certainly no panacea. There are definitely strikes against it. The way it is layered on top of XML is problematic (there are other syntaxes to choose from, some quite pleasant like n3, but this all just serves to make the barrier to entry higher). Tools and libraries can insulate you from this, to an extent. The layering of OWL (the web ontology language) onto RDF is also tricky, and at best RDF is quite a low-level way of expressing OWL. All relations in RDF are binary; the subject-predicate-object triple: you can say "Socrates_beard has_color white", but if you want to time-index this to, say, 400 BC, you have to introduce ontologically problematic entities such as "socrates beard in 400 BC". This isn't a big deal for the semantic web for various reasons, but is important for accurate representation of biological entities that exist in time.

Having said that, RDF is definitely our best shot at exposing some amount of database semantics in a maximally accessible and interoperable way, with a minimum amount of coordination and schema churn. (I've seen a lot of grand interoperation schemes come and go over the years, so this is actually quite a strong statement.) Note that Chado isn't so far off RDF with its various subject-predicate-object linking tables; we just chose to go with more of a hybrid approach; Chado is intended to control the range of what can be stated more than is possible with RDF and related technologies. The result is in principle quite easy to map to RDF, giving some of the benefits of both.

I don't think it's an either-or thing when it comes to RDF vs domain-specific exchange formats. You correctly identified the tradeoffs with, for example, DAS2 vs RDF; whilst those tradeoffs exist there is room for both to live side by side. Now this may not be true forever - I'm longing for the day when it is possible to specify the semantics of data in a way that is computable and efficient, but we're not quite there yet. This doesn't mean we can't make a start, and some kind of RDF encoding of chado-style feature graphs and feature location graphs would be a good start. This would give us a way of wrapping DAS2 and genomics databases that the wider semantic web can understand.

You identify reification - statements about statements - as key for annotations - I agree. You may also want to check out Named Graphs too. Unfortunately the tool support for either is not as mature yet; you can still use reification, just in a low-level way.
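To sketch what I mean by the named-graph route (purely illustrative; the URIs are invented, and I'm using rdflib's ConjunctiveGraph just because it's handy): each contributor's statements go into their own graph, and provenance attaches to the graph rather than to a reified copy of every statement:
============
from rdflib import ConjunctiveGraph, Literal, Namespace, URIRef

EX = Namespace("http://genome.biowiki.org/example#")            # hypothetical
MITCH_GRAPH = URIRef("http://genome.biowiki.org/graphs/mitch")  # hypothetical

store = ConjunctiveGraph()

# Mitch's assertions live in a graph named after him
mitch = store.get_context(MITCH_GRAPH)
mitch.add((EX.featureFoo, EX.curatorComment, Literal("looks like a pseudogene")))

# Provenance is stated once, about the graph itself (in the default graph)
store.add((MITCH_GRAPH, EX.contributedBy, EX.Mitch))
============
Same end result - statements about statements - but batched by contributor rather than reified one at a time.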
I'm not so worried by the seeming higher-order aspects of reification. But I won't go into this, as it's fairly abstruse, and I'm not sure I believe myself, which is a kind of curious higher-order statement in itself.

Existing relational databases can be wrapped using tools such as D2RQ. There are definitely efficiency considerations. I'm exploring some alternatives of the home-grown variety but don't have anything to report yet. I think writing back to a non-generic schema from RDF is difficult, but I'm not sure we need to do this.

OK, before we get too carried away we should check what problems we are trying to solve. Annotation means different things to different people (and something slightly different in the semantic web world, unfortunately).

We want a community-based way of sharing data that fits neatly into the 1D feature paradigm, and we want this to be fast, standards-based and interoperable with current genomics tools, so genomics datamodels and exchange formats will continue to play a part. We may also want a way of exposing the inherent semantics in those little boxes to computers that don't speak genomics. It's unclear exactly who gains, when and how, but the cost is not so high (avenues include: SPARQL queries for genome databases; Das2rdf; use of microformats and RDF in gbrowse display).

Then there are the annotations on these little boxes; statements about the underlying biological entities. On the one hand this is the wild untrammelled frontier - these entities may be linked to other entities which are themselves described by composites of other interlinked entities. We can take a ride traversing these links through multiple levels of biological granularity, from atomic structures through to anatomical structures, physiological processes, phenotypes, environments, life-forms living in the hydrothermal vents on Jupiter's moons... OK, perhaps RDF can't deliver on the astrobiology quite yet, but it seems that this open-ended world beyond genomics is a good reason to try RDF.

Orthogonal to this is the "reification" model. Even in our wiki-esque community model we want to conform to good annotation practice and encourage all links to be accompanied with provenance, evidence and so on.

What does this mean in terms of implementation? It could be fairly simple. GBrowse could be augmented by a 3rd-party triple-store. The primary datastore would continue to be the genomics schema of choice, e.g. chado, but freeform 3rd-party annotations on features could go in the triple-store. I have a few ideas about how this could be layered on top of a gbrowse-type display, and you have the advantage of transparency to generic semweb software, to the extent it exists in usable forms at the moment.

This seems a fairly low-risk approach to the community annotation store problem. In fact, other approaches will be higher risk as they will require rolling your own technology. Triplestores can be slow for complex multi-join queries but I think many of your use cases will involve simple neighbourhood graphs. Queries such as "find all genes upstream of genes in a pathway implicated in disease X with function Y" will perform dreadfully if you take the ontological closure into account. We're working on technology for this in the Berkeley Ontologies Project but you shouldn't place any dependencies on this yet.

Well I've gone on a bit and haven't really covered all the bases - my recommendation is to proceed enthusiastically but cautiously. As you can see I'm part gung-ho about rdf/semweb and part skeptical. The basic idea of linking by URIs is simple and cool and powerful. Ironically, I think it is the semantic part that is somewhat lacking, given the lack of scalable OWL support, but this is changing....
|
From: Mitch S. <mit...@be...> - 2007-02-13 23:48:57
|
Chris Mungall wrote:
> Existing relational databases can be wrapped using tools such as D2RQ.
> There are definitely efficiency considerations. I'm exploring some
> alternatives of the home-grown variety but don't have anything to
> report yet. I think writing back to a non-generic schema from RDF is
> difficult, but I'm not sure we need to do this.

Well, I was vaguely thinking of having a semantic wiki be the interface to editing all of the data. For example, from chado we could generate semantic wiki text something like this:
=============
Feature [http://genome.biowiki.org/genomes/scer/I#foo foo] is a [[feature type::SOFA:gene]] on [[reference sequence::SGD:I]] from [[start base:=5000bp]] to [[end base:=6000bp]]. It is involved with [[go term::GO:0019202|amino acid kinase activity]].
=============
This is using the Semantic Wikipedia syntax:
http://ontoworld.org/wiki/Help:Annotation

So when someone edits that wiki text and saves it, I was hoping that the right relational table<->RDF mapping (e.g., with D2R) would take the attributes that came from chado originally and automagically put them back in the right place in chado. In other words, the D2R mapping (or possibly the semantic wiki software) could be taught to treat the "feature type", "reference sequence", "start base", "end base", and "go term" attributes specially in the RDF->DB direction. If this isn't already implemented, I think it's worthwhile to do. Also, if there were attributes that didn't have a "treat specially" mapping, they would automagically go into a triplestore. I'm hoping not to have to implement that myself but I think it's worthwhile as well.
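In pseudo-ish terms, the save path I'm imagining looks something like this (a sketch only; the predicate names are the ones from the wiki text above, but the chado/triplestore calls are hypothetical stand-ins for whatever D2R or the wiki extension actually provides):
=============
# Route an edited attribute back to the right store (sketch, not real APIs).
CHADO_MAPPED = {"feature type", "reference sequence", "start base",
                "end base", "go term"}

def save_attribute(feature_uri, predicate, value, chado, triplestore):
    if predicate in CHADO_MAPPED:
        # Known attribute: the D2R-style mapping knows which chado
        # table/column it lives in (e.g. "start base" -> featureloc.fmin)
        chado.update(feature_uri, predicate, value)
    else:
        # No mapping: the statement just becomes a triple in the triplestore
        triplestore.add((feature_uri, predicate, value))
=============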
This is a big part of what "genome wiki" means to me--being able to edit all of the information (both from chado and from the triplestore), hopefully all through the same interface. Also, if this kind of editing is already implemented then that saves us from having to implement a custom genomic information editor in the browser. If we wanted to, later on we might implement some kind of click&drag editing interface in the browser, or somehow plug in AmiGO for adding GO terms, but that would be optional.

I agree with all of the things you say, but it seems like you're mostly talking about the us->community direction. Querying chado+triplestore seems relatively straightforward <fingers crossed>, and annotation uploading makes sense to me (e.g., gmod_bulk_load_gff3.pl); it's the editing that I'm more worried about. I had the impression (hope) that it was mostly implemented and we could just wire it all together in a smart way but if not I'd be inclined to take a stab at it.

Mitch
|
From: Mitch S. <mit...@be...> - 2007-02-19 09:19:54
|
Sorry for the brain dump earlier--here's a shorter, better-digested version.

As I see it, the main point of having a genome wiki is to make genomic data editable. It's important to note that making *data* editable is different from making *documents* editable--I expect data to be interpretable using software, but while documents can be managed by software, actually interpreting them using software is definitely an unsolved problem. The data/document distinction is reflected in the difference between a semantic wiki and a regular wiki--in a semantic wiki the content contains handles for software to grab onto, but the slippery, hard-to-parse natural language content of a non-semantic wiki is much, much harder for software to pull information out of.

For data editing, lots of UIs exist already, of course. There's an army of visual basic programmers out there putting editing interfaces in front of relational databases. However, those data-editing UIs (and the databases behind them) are relatively inflexible; if some new situation arises and you want to store some new kind of information then you're SOL until your local programmer can get around to adding support for it. This is the reason for the appalling success of Excel and Access as data-management systems. Having done data-management work in the biological trenches literally right next to the lab benches, I can tell you that this is an ongoing pain point. Flexibility is especially important in a community annotation context, where you want people to be able to add information without having to agree on a data model first.

So the semantic wiki and its RDF data model occupy a nice middle ground between fast and efficient but relatively inflexible relational databases and the document-style wiki that's flexible but not really queryable. The data content of a semantic wiki is more useful than pure natural language wiki content because you can pull data out of the semantic wiki and do something with it, like adding graphical decorations to features that have certain kinds of wiki annotations. Generic software that handles RDF (like Piggy Bank) can also make use of the semantic wiki data.

To some extent we can have our cake and eat it too by integrating RDF data stores ("triplestores") with relational databases. You can start out with a fast, efficient relational skeleton that's already supported by lots of software (like chado) and then hang whatever new kinds of information you want off of it. The new kinds of information go into the triplestore, and at query time, data from the relational tables and from the triplestore can be blended together.

Over time, I expect some kinds of new information to get better understood. Once there is consensus on how a particular kind of information should be modeled, it can be moved from the triplestore into a set of relational tables. When this happens, it's possible to keep the same client-side RDF view of the data, with the only differences being that the whole system gets faster, and software for processing and analyzing the new data gets easier to write.
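To sketch what the query-time blending could look like (the URLs and predicates here are made-up placeholders; rdflib again just for illustration), the client wouldn't need to know which statements came from which store:
============
from rdflib import Graph

g = Graph()
# Feature skeleton exposed as RDF from chado (e.g. via D2R) -- placeholder URL
g.parse("http://example.org/d2r/feature/foo", format="xml")
# Freeform community statements from the triplestore -- placeholder URL
g.parse("http://example.org/triplestore/feature/foo", format="xml")

# One query now spans both sources
results = g.query("""
    PREFIX ex: <http://genome.biowiki.org/example#>
    SELECT ?start ?end ?comment WHERE {
        ?f ex:startBase ?start ;
           ex:endBase ?end .
        OPTIONAL { ?f ex:curatorComment ?comment }
    }
""")
for row in results:
    print(row)
============
And if "curatorComment" eventually graduates into a chado table, the same query keeps working; only the source behind the first parse() call changes.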
So, if you buy all this, then IMO the next steps in this area are:

1. Evaluate RDF/relational integration tools. The main contenders appear to be D2R and Virtuoso. D2R is nice because it works with existing databases. Virtuoso is nice because it has good relational/triplestore integration. Whether it's easier to integrate D2R with a triplestore or port chado to Virtuoso is an open question.

2. Get semantic mediawiki to talk to the chosen triplestore.

3. Figure out how the namespaces/idspaces ought to work. We want to have a system that's flat enough that it's easy for people to make wiki links between entities, but deep enough that IDs from various sources/applications don't step on each other.

My first priority at the moment is to try and get some kind of persistent feature upload/display working; my hope is that we'll have thought through the IDspace issues by the time we get to implementing that part.

Regards,
Mitch
|
From: Hilmar L. <hl...@gm...> - 2007-02-20 04:19:46
|
It might be interesting to have a look at 'WikiProteins' (of which there seems to be only a flash demo so far):
http://www.wikiprofessional.info/
This was featured in a Nature news article. Apparently it's coming out of a company called Knewco:
http://www.knewco.com/
The people of that company also presented at KR-MED 2006 ('An Online Ontology: WiktionaryZ'), see http://ontoworld.org/wiki/WiktionaryZ. There is an RDF export.

-hilmar
|
From: Ian H. <ih...@be...> - 2007-02-20 23:48:01
|
Mitch,

Thanks for all this; I'm a little behind you & Chris on the discussion of RDF and semantic wikis and so on; but I did have a vague hand-wavy comment regarding this:

> As I see it, the main point of having a genome wiki is to make genomic
> data editable.

I would broaden this slightly & say that the main point of having a "genome wiki", whatever that actually ends up being, is to serve community annotation needs, and that "making genomic data editable" is a key step in this direction.

There are some important use cases we should look at, illustrating how people are going about doing community annotation in practice. These include...

(1) The "AAA wiki" for Drosophila comparative annotation:
http://rana.lbl.gov/drosophila/wiki/index.php/Main_Page

(2) The honeybee genome project (advanced as a model for community annotation; there is a workshop on this right before CSHL Biology of Genomes; actually going to BOTH could be a really good idea)
http://www.genome.org/cgi/content/full/16/11/1329
http://meetings.cshl.edu/meetings/honeyb07.shtml
http://meetings.cshl.edu/meetings/genome07.shtml
[scratch going to both; Biology of Genomes is oversubscribed]

(3) The GONUTS gene ontology wiki:
http://gowiki.tamu.edu/GO/wiki/index.php/Main_Page

These all offer slightly different perspectives on the problem. The genome annotation projects in particular reveal a wider array of data than just GFF files. There are alignments, protein sequences, GO terms, associations, phenotypes and various other data that need a place to "hang". In my experience one of the problems with wikis is that there are no fixed slots to put things: of course this anarchy is a strength too, but it does make it hard to find stuff. A semantic wiki might help somewhat, in that searching it becomes easier.

In any case I view all of these issues as somewhat downstream, as you say:

> My first priority at the moment is to try and get some kind of
> persistent feature upload/display working; my hope is that we'll have
> thought through the IDspace issues by the time we get to implementing
> that part.

I agree: I think this does all need some thinking through; but if we can make a reasonably robust/intuitive persistent version of GFF upload (or perhaps, eventually, a persistent version of the current "transient" upload functionality that is built into GBrowse, with all its fancy glyph display & grouping options) then we will have made a significant step in framing these questions about richer meta-content. More importantly perhaps, we will have a real tool that could fit into these existing kinds of genome annotation effort, and then we can start to prioritize future improvements in the best possible way: via direct feedback from users. :-)

Ian
|