From: Chris M. <cj...@fr...> - 2002-11-27 01:24:16
I have some code almost ready to check in. I'm not quite sure where it
all fits in yet so I thought I'd run it by you all.

This all revolves around the nascent chado schema, info available from
www.gmod.org

quick summary - multiple modules covering different biological domains.
the sequence module has the concept of features as nodes in a feature
relationship graph.

this is the code and the repositories i'm thinking of -

bioperl:
=======

Bio::SeqIO::chadoxml
Bio::SearchIO::chadoxml

So far these only go from bioperl objects --> chado, that's what I'm
mostly interested in at the moment. It works by turning the bioperl
objects into a tree / hierarchical structured tag representation of the
chado schema. This tree can be represented as XML (or S-expressions,
which I prefer).

This leverages a whole bunch of incredibly useful bioperl code for
chado. It drops a lot of data at the moment; I plan to add this as I
need it.

note that the chado schema and the corresponding chado xml are not yet
stable, but i'm charging on with this anyway, perhaps to the
consternation of the other chado developers - sorry guys.

bioperl-db:
==========

Bio::DB::ChadoSQL::*

Not quite there yet. I'm just working on chado-xml -> chado db at the
moment. This is actually super simple as the chado-xml tree
representation maps almost directly to the schema (there are a few
denormalisations in my version of the chado-xml to make it nicer). It's
just a matter of recursively descending the tree and updating/inserting
based on the unique constraints.

I'll also have adaptors (eg SeqAdaptor, SeqFeatureAdaptor), but these
will basically be simple wrappers that use the IO classes to make a tree
object, then generically store the tree. Personally I quite like this
separation between schema and objects.

go:
==

go2chadoxml

this is already checked in to the go-dev repository - takes GO flat
files and turns them into a tree representation that maps to the cv
module in the chado schema - also computes the closure.
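
To make "computes the closure" concrete: every term ends up linked to
all of its ancestors, not just its direct parents. What follows is only
a rough sketch in Python, not the perl that go2chadoxml uses; the
`closure` function and the term names are made up for illustration, and
it assumes the ontology is a DAG.

```python
def closure(parents):
    """Map each term to the set of all its ancestors (transitive closure).

    `parents` maps a term to its direct parents. Assumes the graph is a
    DAG; a cycle would recurse forever.
    """
    ancestors = {}

    def visit(term):
        # Memoise so each term is expanded only once.
        if term not in ancestors:
            found = set()
            for parent in parents.get(term, ()):
                found.add(parent)
                found |= visit(parent)
            ancestors[term] = found
        return ancestors[term]

    for term in parents:
        visit(term)
    return ancestors


# Hypothetical mini-ontology: child -> direct parents
direct = {
    "mitotic cell cycle": ["cell cycle"],
    "cell cycle": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}

for term, anc in closure(direct).items():
    print(term, "->", sorted(anc))
```

Each (term, ancestor) pair in the result is the kind of redundant link
that a closure adds on top of the direct relationships.
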
the generic chado loader (above) can then take these trees and store
them.

gmod: (part of the chado repository or independent?)
====

Bio::Chado::Transform::*

not sure about the namespace yet. this is a collection of transforms
operating over chado tree representations. XSLT-like, but without the
XSLT - the transforms are specified in perl, operating over trees rather
than objects.

the transforms include:

* location inference: eg take a feature graph and fill in begin/end
  coords for non-leaf features (eg transcripts) based on coords of leaf
  features (eg exons)

* coordinate transformation: move a feature from one assembly level to
  another, or represent redundant locations on multiple assembly levels

* feature inference: generate redundant implicit features (eg introns,
  splice sites, UTR) from explicit features

* sequence inference: calculate residues based on feature coordinates
  and central dogma, taking into account various biological weirdnesses
  the sequence ontology and chado are designed to cope with - eg
  trans-splicing, transcript editing, stop codon readthroughs etc etc

another useful transform would be taking a direct mapping of genbank to
chado, then turning this into something more usable (eg correctly
organising the mRNA and CDS features in a feature graph, mapping to the
Sequence Ontology)

lots of other transforms possible, for variation features, ontologies,
comp analyses, genetic interactions....

I also have some code for viewing chado feature graphs as interconnected
coloured bubbles.

the idea is you can pipe a bunch of these transforms together to get a
tree that is most useful to your application (also useful for building
warehouse versions of the database)

modules required
////////////////

all this code depends on a module Data::Stag that I'm about to upload to
CPAN. this is a tree / structured tag module that happens to play well
with XML.
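
For anyone who hasn't met the structured tag idea, here is a minimal
sketch of one tree rendered both as XML and as an S-expression. It is
plain Python and does not use or imitate the Data::Stag API; the
(tag, value) node shape, the `to_xml` / `to_sxpr` helpers and the tag
names (feature, featureloc, begin, end) are all assumptions for the
example, not the actual chado-xml.

```python
from xml.sax.saxutils import escape


def to_xml(node, indent=0):
    """Render a (tag, value) node as indented XML.

    `value` is either a scalar or a list of child nodes.
    """
    tag, value = node
    pad = "  " * indent
    if isinstance(value, list):
        inner = "\n".join(to_xml(child, indent + 1) for child in value)
        return f"{pad}<{tag}>\n{inner}\n{pad}</{tag}>"
    return f"{pad}<{tag}>{escape(str(value))}</{tag}>"


def to_sxpr(node, indent=0):
    """Render the same node as an S-expression."""
    tag, value = node
    pad = "  " * indent
    if isinstance(value, list):
        inner = "\n".join(to_sxpr(child, indent + 1) for child in value)
        return f"{pad}({tag}\n{inner})"
    # Escape backslashes and quotes so the string literal stays valid.
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'{pad}({tag} "{text}")'


# Hypothetical feature tree; tag names are illustrative only.
tree = ("feature", [
    ("name", "exon-1"),
    ("type", "exon"),
    ("featureloc", [
        ("begin", 100),
        ("end", 250),
    ]),
])

print(to_xml(tree))
print(to_sxpr(tree))
```

The same nested data drives both renderings, which is the point: the
tree is the representation, and XML is just one way of writing it down.
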
I also find Data::Stag to be a nice alternative to using objects and
object models, which I have recently taken an aversion to.

discussion
//////////

rather than spreading all this code over multiple cvs repositories in
bioperl and gmod I'm entertaining the idea of collecting them together
as a single codebase (I guess the SeqIO and SearchIO should stay in
bioperl).

There are various motives behind this plan. I'm attempting to get out of
software engineering, and anyway, I don't have a great track record
supporting my software. I see this code as my own handy data toolkit
that I'd like to make available to anyone who would find it useful (as
opposed to a Grand Engineering project).

Another reason is that this code embodies a programming paradigm that is
perhaps a little idiosyncratic to some, particularly the eschewing of
object modeling, and the use of a tree data structure as an alternative
to XML. Perhaps I am insane and it all won't work, in which case I don't
want to take anyone down with me, unless they are insane too. Besides, I
may decide to rewrite the whole thing in lisp halfway through. So maybe
it's all a bit experimental for bioperl/gmod?

Another thing is my chado-xml/chado-trees may diverge a bit from the
official chado xml-dtd/schema. this remains to be seen. the divergence
would be purely syntactic, not semantic.

A question for Lincoln: if I house something in GMOD do I then have to
commit to a certain level of support, documentation for the non-hacker,
bug fixes, etc? I *am* fully committed to this for the chado schema, I
don't mind doing this for the chado SeqIO and SearchIO, and GO too - but
the rest is just my crazy stuff.

There is also talk of a chado API and a chado object model and possibly
chado UML.
I won't be going anywhere near this, but if someone is going to take
this on and fully support it, provide bioperl interoperability, base
applications around it, etc, then I'd rather steer my code well out of
the way, to let this behemoth through to do its business (I guess this
would most likely be in java?).

Even if my code becomes a distinct toolkit I'd like to keep the Bio::
namespace.

Gosh, this email is actually longer than the code itself....

Anyway, I should have some mostly broken code ready to check in next
week.