From: Steve F. <sfi...@pc...> - 2005-06-13 14:26:48
I am not persuaded that this functionality will be used by many other plugins. Most do inserts, not updates. And many that do updates are given difference files and have stable identifiers, so the problems of this plugin don't apply. We'll start out with the table in App space, and when we have a second plugin that needs it we'll move it over to Core.

As far as the schema is concerned, you've reminded me that I left something out. We need a fourth column, giving this: digest, primary_key, type, ext_db_rls_id. The ext_db_rls_id differentiates the different datasets stored in the table.

You mentioned date and algorithm stuff. That is handled by the standard GUS overhead rows, which this table must have if it's going to be a GUS object.

I am confused by your proposal that the lifetime of the digest data is the lifetime of the version. Data is versioned continually, whenever it is modified. I said the lifetime of the project; by that I mean the digest data must remain as long as it is possible that there may be any more updates. (Maybe that's what you meant.)

steve

Ed Robinson wrote:
>I followed along and thought everything was great until you
>created the state table. If we are going to make a state
>table, I would recommend finding someplace for it in the
>schema, preferably Core. What we are creating here is a
>methodology that all plugins should follow, so we don't want
>to recreate another case of plugins competing for temp table
>names, which is even worse than not specifying controlled
>vocabularies.
>
>If checksums and restarts are going to be a standard part of
>our architecture, then we need to make the entire architecture
>transparent by making the table a permanent part of the
>architecture, and all plugins should use the same table. The
>data should remain for the lifetime of the version, not the
>project; i.e., this table should disappear when the data
>loaded is versioned and passed to the version tables. So long
>as the data is live, i.e.
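As a rough illustration of the four-column layout proposed above (digest, primary_key, type, ext_db_rls_id), here is a minimal in-memory sketch in Python. This is not the real table: the actual GUS object would be a database table carrying the standard GUS overhead columns as well, and the row values below are invented.

```python
from collections import namedtuple

# Illustrative row shape for the proposed digest table. The real GUS table
# would also carry the standard overhead columns; these values are made up.
DigestRow = namedtuple("DigestRow", ["digest", "primary_key", "type", "ext_db_rls_id"])

rows = [
    DigestRow("9e107d9d372bb6826bd81d3542a419d6", 101, "seq", 7),
    DigestRow("e4d909c290d0fb1ca068ffaddf22cbd0", 102, "seq", 7),
    DigestRow("d41d8cd98f00b204e9800998ecf8427e", 555, "feat", 8),
]

def rows_for_release(rows, ext_db_rls_id):
    """ext_db_rls_id is what lets multiple datasets share one table."""
    return [r for r in rows if r.ext_db_rls_id == ext_db_rls_id]

print(len(rows_for_release(rows, 7)))
```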
>updatable, you will need this state information. My suggestion
>is the following:
>
>Core.DataDigest
>Date, Digest, type, primary_key, AlgorithmID (to id the
>plugin), Algorithm_version.
>
>Also, type in this case is up to the plugin. LSF would have two
>types, Seq and Feats; other plugins could have whatever types
>they want to checksum. This field does NOT need to be
>controlled because the key is multi-column (it includes the
>AlgID).
>
>-ed
>
>---- Original message ----
>
>>Date: Sun, 12 Jun 2005 22:28:48 -0400
>>From: Steve Fischer <sfi...@pc...>
>>Subject: [GUSDEV] using checksums for loading seqs and features
>>To: gusdev-gusdev <gus...@li...>, an...@ma...
>>
>>folks-
>>
>>LoadSequencesAndFeatures is a new name for LoadAnnotatedSequences,
>>the replacement for the GBParser and the TIGR xml and EMBL plugins
>>that Ed developed. (Aaron felt that "annotated sequences" connoted
>>an annotation center's output, while the plugin is broader than
>>that...)
>>
>>Aaron and I have come up with a design for using digests (MD5) to
>>help manage restart and updating. Using this design, the logic of
>>the plugin is the same whether doing an insert, a restart or an
>>update.
>>
>>The design requires state in the database. Rather than pollute the
>>GUS schema with it, the plugin will take as a command line argument
>>the name of an application-specific table that has three columns:
>>digest, type (seq or feat), primary_key. The table persists for the
>>duration of the project. We'll call it DigestTable here. DigestTable
>>must also have a GUS object for itself if we want transaction-level
>>robustness.
>>
>>We assume for now that the organism we are using isn't too huge,
>>i.e., that we can hold DigestTable in memory.
>>
>>SEQUENCES
>>
>>Initialization:
>> - read the digests for the sequences from DigestTable.
>>   Write them into a hash, with the digest as key and the
>>   na_sequence_id as the value. This is the SequenceDigest hash.
>> - read the source_ids for the sequences from GUS, and place them
>>   as keys in a hash, with their na_sequence_id as the value. This
>>   is the SequenceSourceId hash.
>>
>>For each sequence:
>> - create the digest as follows:
>>   - unpack all the info from the bioperl sequence object and its
>>     children, but excluding feature children.
>>   - unpack it into a hash, with the name of the attribute as key
>>     and the value as value.
>>   - for weakly typed fields, use the tag name as key and the
>>     value as the value.
>>   - loop through the keys in sorted order (using Perl's sort),
>>     and concatenate the values into a string
>>   - pass the string to the MD5 processor
>> - create a DigestTable object from the na_sequence_id and the
>>   digest value
>> - add that object as a child of the NASequence
>>
>> - use the digest as an index into the SequenceDigest hash. If it
>>   is found, then the sequence record in the db is fine. If it is
>>   not found, then either:
>>   - if it is not in the SequenceSourceId hash, then it is a new
>>     sequence, in which case we do a normal insert
>>   - otherwise we fall into update logic. We trace the objects
>>     that are associated with this sequence in the database
>>     (excluding features) to get their foreign keys, build up an
>>     updated gus object tree, and submit, letting the object layer
>>     handle the update.
>>
>> - when we submit the sequence, the DigestTable child object will
>>   be submitted as part of the same transaction.
>>
>>Because sequences have stable identifiers (source_ids), it is
>>possible for us to identify a sequence in the database even if
>>some of its values have changed.
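The per-sequence digest recipe and the insert/restart/update decision described above can be sketched as follows. This is illustrative Python, not the plugin's actual Perl; the attribute names are invented, and the two hashes stand in for the SequenceDigest and SequenceSourceId hashes built at initialization.

```python
import hashlib

def sequence_digest(attrs):
    """Concatenate attribute values in sorted-key order and MD5 the
    result, mirroring the 'sort keys, concatenate values' recipe."""
    concatenated = "".join(str(attrs[k]) for k in sorted(attrs))
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

def classify(digest, source_id, seq_digest_hash, source_id_hash):
    """Decide what to do with one incoming sequence."""
    if digest in seq_digest_hash:
        return "unchanged"          # record in the db is fine
    if source_id not in source_id_hash:
        return "insert"             # brand-new sequence
    return "update"                 # known source_id, changed values

# Hypothetical unpacked sequence attributes (not real bioperl fields):
attrs = {"source_id": "AB000001", "length": 1042, "description": "demo"}

d1 = sequence_digest(attrs)
# Same attributes fed in a different insertion order give the same
# digest, because the keys are sorted before concatenation:
d2 = sequence_digest(dict(reversed(list(attrs.items()))))
print(d1 == d2)
```

The key property is reproducibility: as long as the unpacking yields the same key/value pairs, the digest is stable, so a rerun (restart) recognizes already-loaded records without comparing them field by field.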
>>This allows us to do a real update and, in theory, to keep some of
>>the analysis against the sequence if irrelevant bits of it have
>>changed.
>>
>>FEATURES
>>
>>Features, however, are different. They don't have stable ids. Nor
>>do they have alternate keys (no, type and location is not good
>>enough). This means that if a feature has changed, we have no
>>choice but to take the delete-and-insert approach to updating.
>>Here is how we do it....
>>
>>Initialization: read from DigestTable and create the FeatureDigest
>>hash with digest as key and na_feature_id as value.
>>
>>Because we are treating a feature tree as a unit, all the features
>>that are in a tree will have the same digest. They will each have
>>their own row in the DigestTable.
>>
>>For each bioperl feature tree:
>> - generate a string representation of the feature tree by:
>>   - initializing an empty string to hold the string version of
>>     the feature tree
>>   - recursively traversing the tree in a reproducible way
>>   - for each individual feature (nodes of the tree), get all its
>>     values, sort by tag name, and concatenate to the growing string
>> - when done recursing, make a digest with that string
>> - use the digest as an index into the FeatureDigestHash
>>   - if we find one or more features, then the feature tree is ok;
>>     remove those features from the FeatureDigestHash
>>   - if we don't find any:
>>     - for each feature in the tree, make a new DigestTable object
>>       with the tree's digest and the feature's feature_id. Add
>>       each DigestTable object to the corresponding feature
>>     - insert the tree
>>
>>When all features have been processed, delete from the database
>>any feature remaining in the FeatureDigestHash.
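The feature-tree handling above — one digest per whole tree, with stale trees swept away at the end — might look like this in outline. Again this is illustrative Python with a hypothetical dict-based tree shape; the real plugin walks bioperl feature objects in Perl.

```python
import hashlib

def tree_string(feature, parts=None):
    """Recursively flatten a feature tree into a reproducible string:
    each node contributes its tag values in sorted tag order."""
    if parts is None:
        parts = []
    tags = feature.get("tags", {})
    parts.append("".join(str(tags[t]) for t in sorted(tags)))
    for child in feature.get("children", []):
        tree_string(child, parts)
    return "".join(parts)

def tree_digest(feature):
    return hashlib.md5(tree_string(feature).encode("utf-8")).hexdigest()

# Hypothetical feature tree (gene -> exon); tag names are made up:
gene = {"tags": {"type": "gene", "start": 10, "end": 500},
        "children": [{"tags": {"type": "exon", "start": 10, "end": 200}}]}

# FeatureDigest hash as loaded at initialization: digest -> na_feature_ids.
feature_digest_hash = {tree_digest(gene): [41, 42]}

d = tree_digest(gene)
if d in feature_digest_hash:
    # Tree unchanged: its features survive; remove them from the hash.
    feature_digest_hash.pop(d)
# (Otherwise: insert the tree and a DigestTable row per feature.)

# Whatever remains in the hash after all trees are processed belongs to
# trees that disappeared or changed, and gets deleted from the database.
print(feature_digest_hash)
```

Note how the delete-and-insert policy falls out naturally: a changed tree produces a new digest, so its old features are never matched and remain in the hash for the final delete pass.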
>>steve
>>
>>_______________________________________________
>>Gusdev-gusdev mailing list
>>Gus...@li...
>>https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
>
>-----------------
>Ed Robinson
>Center for Tropical and Emerging Global Diseases
>University of Georgia, Athens, GA 30602
>ero...@ug.../(706)542.1447/254.8883