Re: [GMOD-devel] gmod_update_gff.pl?

Hi Scott, Kara, Robert, Don and others interested,

> Sorry for the delay in answering.  I don't think anyone is working on
> such a thing, though it did come up during the GMOD meeting last week.
> It feels to me like the best thing to do would be to build a loader
> based on Eric's ModWare.  Then the loader/updater could build a bioperl
> object based on a line of GFF, look in the database for a matching
> object (based on whatever parameters the user wanted, like name, type,
> and coordinates I suppose) and do an update or create as necessary.
> 
> Of course, there are several things complicating matters a bit, like how
> exactly to determine that two objects are sufficiently 'the same' in
> order to allow an update.  How does one make that decision?  How do you
> report to the user what was updated and what was inserted as brand
> new stuff?  What if it does the wrong thing?  How does the user ever
> find out that the wrong thing happened?  It could be a real can of
> worms.
> 
> OK, ignoring what I wrote above, it might be better if there were a
> GFF->chadoxml converter, and then use XORT to load the results, and then
> let XORT worry about the details.
> 
> Scott

Incremental updates are always a painful task for any database. I hope
our experience can be of help to others.
1. All our updates are archived in chado XML.
2. We update chado by object. So far, we have successfully updated our
bibliography data incrementally (pub, pubauthor, pubprop, pub_dbxref,
etc. in chado).

Updates can be grouped into four categories:
1. Add extra information.
   This is simple: XORT will insert those records.
2. Update the database WITHOUT adding or deleting records; in other
words, you only update non-unique-key field(s).
   This is also simple: just create chadoXML as if you were doing an
initial insert, and XORT will take care of the update.
   For example, if you want to update feature.seqlen, create chadoXML
for this feature with the new seqlen.

3. Update the database by altering the unique key(s) of an existing
record. This is the case Scott raised: the 'same' object turning into a
different object.
4. Delete 'extra' information. This includes removing an additional
'object' (for instance, removing a gene, or a transcript of a gene) or
an additional bit of information from an existing record (removing the
residues of a feature).
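
To make category 2 concrete, here is a minimal sketch (in Python, not
part of XORT) that writes a chadoXML fragment for one feature carrying a
new seqlen. The element nesting simply follows the chado column names;
the uniquename, the organism, and the cv name "SO" are hypothetical
placeholders, so check them against your own instance before loading
anything.

```python
import xml.etree.ElementTree as ET

def feature_seqlen_xml(uniquename, genus, species, so_type, seqlen):
    """Build a chadoXML-style fragment for one feature with a new seqlen.

    The unique key of feature (uniquename, organism_id, type_id) is spelled
    out so the loader can find the existing row; seqlen is the non-key
    field being changed. The layout is an assumption, not a verified schema.
    """
    chado = ET.Element("chado")
    feature = ET.SubElement(chado, "feature")
    ET.SubElement(feature, "uniquename").text = uniquename

    # organism_id is a foreign key, written here as a nested organism record
    organism = ET.SubElement(ET.SubElement(feature, "organism_id"), "organism")
    ET.SubElement(organism, "genus").text = genus
    ET.SubElement(organism, "species").text = species

    # type_id is a foreign key to cvterm; the cv name "SO" is an assumption
    cvterm = ET.SubElement(ET.SubElement(feature, "type_id"), "cvterm")
    ET.SubElement(cvterm, "name").text = so_type
    cv = ET.SubElement(ET.SubElement(cvterm, "cv_id"), "cv")
    ET.SubElement(cv, "name").text = "SO"

    # the non-unique-key field to update
    ET.SubElement(feature, "seqlen").text = str(seqlen)
    return ET.tostring(chado, encoding="unicode")

# Hypothetical feature, for illustration only
xml_text = feature_seqlen_xml("CG0001", "Drosophila", "melanogaster", "gene", 4321)
print(xml_text)
```

The point of the sketch is only that the XML looks the same as it would
for an initial insert: the key fields identify the row and the remaining
fields carry the new values.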

As you can imagine, the hard part is figuring out what that 'extra'
information is (#3 and #4).
Taking the gff update as an example, I can think of the following steps:
1. Load the new gff into a separate instance (via the gff loader?).
2. Use XORT to dump each object and its related information from BOTH
the new and old instances, one object per file.
3. Use XORT's XORTDiff script to compare the two files for the same
object. It can tell which 'extra' objects are in the OLD instance (we
don't need to worry about extra objects in the new instance, because
those are objects we want to keep). Then write a script to remove those
'extra' objects before or after loading the new data set.
4. Convert the gff into static chadoxml (here you don't need to worry
about updates again), and load it into the existing instance using XORT.
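
The bookkeeping around steps 2 and 3 could be sketched roughly as
follows. This is not XORTDiff and does not reproduce what it does; it
only compares object names, assuming (hypothetically) that each dump
directory holds one file per object named <uniquename>.xml. Comparing
the content of the paired files would still be XORTDiff's job.

```python
from pathlib import Path

def object_names(dump_dir):
    """Names of the per-object dump files in dump_dir, without extension.

    Assumes one file per object, named <uniquename>.xml (an assumption
    about how the dumps were written, not an XORT convention).
    """
    return {p.stem for p in Path(dump_dir).glob("*.xml")}

def compare_dumps(old_names, new_names):
    """Split object names into the three groups the steps above care about."""
    old_names, new_names = set(old_names), set(new_names)
    return {
        # only in OLD: candidates for removal from the existing instance
        "delete_from_old": sorted(old_names - new_names),
        # in both: pairs of files to hand to a diff tool
        "compare": sorted(old_names & new_names),
        # only in NEW: nothing to do, the load will insert them
        "new_only": sorted(new_names - old_names),
    }

# Hypothetical object names, for illustration only
report = compare_dumps({"gene1", "gene2", "gene3"}, {"gene2", "gene3", "gene4"})
print(report)
```

With real dumps one would call compare_dumps(object_names(old_dir),
object_names(new_dir)) and review the "delete_from_old" list by hand
before removing anything.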

What is missing here:
1. A gff-to-chadoxml converter.
2. Auto-generation of operational chadoxml when comparing two chadoxml
files (I plan to do this sometime).
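
For the first missing piece, a converter might start out something like
the sketch below. It is only an illustration under assumptions: the
element names follow the chado column names, the function names are made
up, and a real converter would also need organism, type cv, parent/child
relationships, and the rest of the attributes. The one firm detail is the
coordinate shift: GFF3 is 1-based and inclusive, while chado featureloc
is interbase, so fmin = start - 1 and fmax = end.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote

STRAND = {"+": 1, "-": -1}  # anything else is stored as 0 (unknown/unstranded)

def parse_gff3_line(line):
    """Parse one GFF3 feature line into a small dict of chado-style values."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = unquote(value)  # GFF3 values are URL-escaped
    return {
        "seqid": seqid,
        "type": ftype,
        "fmin": int(start) - 1,  # 1-based inclusive -> interbase
        "fmax": int(end),
        "strand": STRAND.get(strand, 0),
        "id": attributes.get("ID"),
        "name": attributes.get("Name"),
    }

def to_feature_element(rec):
    """Build a <feature> element with a nested <featureloc> (assumed layout)."""
    feature = ET.Element("feature")
    ET.SubElement(feature, "uniquename").text = rec["id"]
    if rec["name"]:
        ET.SubElement(feature, "name").text = rec["name"]
    cvterm = ET.SubElement(ET.SubElement(feature, "type_id"), "cvterm")
    ET.SubElement(cvterm, "name").text = rec["type"]
    loc = ET.SubElement(feature, "featureloc")
    # the reference sequence is itself a feature, identified by uniquename
    src = ET.SubElement(ET.SubElement(loc, "srcfeature_id"), "feature")
    ET.SubElement(src, "uniquename").text = rec["seqid"]
    ET.SubElement(loc, "fmin").text = str(rec["fmin"])
    ET.SubElement(loc, "fmax").text = str(rec["fmax"])
    ET.SubElement(loc, "strand").text = str(rec["strand"])
    return feature

# Hypothetical GFF3 line, for illustration only
rec = parse_gff3_line("chr1\texample\tgene\t100\t200\t.\t-\t.\tID=g1;Name=abc")
elem = to_feature_element(rec)
print(ET.tostring(elem, encoding="unicode"))
```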

Does that make sense?

pinglei

> 
> On Wed, 2007-01-17 at 14:54 -0500, Kara Dolinski wrote:
> > Hey,
> > 
> > I think this came up on the list before, but I couldn't seem to find  
> > the thread.  Does anyone have (or is working on) a script that  
> > updates a chado database from a gff3 file, just making the updates as  
> > necessary, rather than deleting everything and re-loading, as the  
> > current loaders do?
> > 
> > Thanks,
> > Kara
> > 
> > -------------------------------------------------------------------------
> > Take Surveys. Earn Cash. Influence the Future of IT
> > Join SourceForge.net's Techsay panel and you'll get the chance to share your
> > opinions on IT & business topics through brief surveys - and earn cash
> > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > _______________________________________________
> > Gmod-devel mailing list
> > Gmo...@li...
> > https://lists.sourceforge.net/lists/listinfo/gmod-devel
> -- 
> ------------------------------------------------------------------------
> Scott Cain, Ph. D.                                         ca...@cs...
> GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
> Cold Spring Harbor Laboratory
> 