From: Steve F. <sfi...@pc...> - 2005-06-13 02:28:35
folks-

LoadSequencesAndFeatures is a new name for LoadAnnotatedSequences, the replacement for the GBParser and the TIGR XML and EMBL plugins that Ed developed. (Aaron felt that "annotated sequences" connoted an annotation center's output, while the plugin is broader than that...)

Aaron and I have come up with a design for using digests (MD5) to help manage restart and updating. With this design the logic of the plugin is the same whether doing an insert, a restart, or an update.

The design requires state in the database. Rather than pollute the GUS schema with it, the plugin will take as a command line argument the name of an application-specific table that has three columns: digest, type (seq or feat), and primary_key. The table persists for the duration of the project. We'll call it DigestTable here. DigestTable must also have a GUS object for itself if we want transaction-level robustness. We assume for now that the organism we are using isn't too huge, i.e., that we can hold DigestTable in memory.

SEQUENCES

Initialization:
- read the digests for the sequences from DigestTable. Write them into a hash, with the digest as key and the na_sequence_id as value. This is the SequenceDigest hash.
- read the source_ids for the sequences from GUS, and place them as keys in a hash, with their na_sequence_id as value. This is the SequenceSourceId hash.

For each sequence:
- create the digest as follows:
  - unpack all the info from the bioperl sequence object and its children, excluding feature children.
  - unpack it into a hash, with the name of the attribute as key and the value as value.
  - for weakly typed fields, use the tag name as key and the value as value.
  - loop through the keys in sorted order (using Perl's sort) and concatenate the values into a string.
  - pass the string to the MD5 processor.
- create a DigestTable object from the na_sequence_id and the digest value.
- add that object as a child of the NASequence.
- use the digest as an index into the SequenceDigest hash:
  - if it is found, then the sequence record in the db is fine.
  - if it is not found, then either:
    - if it is not in the SequenceSourceId hash, it is a new sequence, in which case we do a normal insert;
    - otherwise we fall into update logic: we trace the objects that are associated with this sequence in the database (excluding features) to get their foreign keys, build up an updated GUS object tree, and submit, letting the object layer handle the update.
- when we submit the sequence, the DigestTable child object will be submitted as part of the same transaction.

Because sequences have stable identifiers (source_ids), it is possible for us to identify a sequence in the database even if some of its values have changed. This allows us to do a real update and, in theory, to keep some of the analysis against the sequence if irrelevant bits of it have changed.

FEATURES

Features, however, are different. They don't have stable ids. Nor do they have alternate keys (no, type and location is not good enough). This means that if a feature has changed, we have no choice but to take the delete-and-insert approach to updating. Here is how we do it....

Initialization:
- read from DigestTable and create the FeatureDigest hash, with digest as key and na_feature_id as value. Because we are treating a feature tree as a unit, all the features in a tree will have the same digest. They will each have their own row in DigestTable.
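Both the sequence pass and the feature pass rely on the same digest construction: concatenate the values in sorted key order and hash the result. A minimal sketch (Python for illustration; the plugin itself is Perl, and the function and attribute names here are hypothetical):

```python
import hashlib

def digest_of(attrs):
    """Build a reproducible MD5 digest from an attribute hash.

    attrs maps attribute/tag names to values; iterating the keys in
    sorted order makes the digest independent of unpacking order.
    """
    concatenated = "".join(str(attrs[key]) for key in sorted(attrs))
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

# The same attributes always yield the same digest, regardless of order...
a = digest_of({"source_id": "AB123", "length": 1042, "description": "hypothetical protein"})
b = digest_of({"length": 1042, "description": "hypothetical protein", "source_id": "AB123"})
assert a == b

# ...while any changed value yields a different digest, which is what
# routes a sequence into the insert-or-update branch described above.
c = digest_of({"source_id": "AB123", "length": 1043, "description": "hypothetical protein"})
assert a != c
```

Whether reconciliation then does nothing, a plain insert, or an update falls out of the two lookups (SequenceDigest, then SequenceSourceId) without any branch-specific plugin logic.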
For each bioperl feature tree:
- generate a string representation of the feature tree by:
  - initializing an empty string to hold the string version of the feature tree;
  - recursively traversing the tree in a reproducible way;
  - for each individual feature (the nodes of the tree), getting all its values, sorting by tag name, and concatenating them onto the growing string.
- when done recursing, make a digest from that string.
- use the digest as an index into the FeatureDigest hash:
  - if we find one or more features, then the feature tree is ok; remove those features from the FeatureDigest hash.
  - if we don't find any:
    - for each feature in the tree, make a new DigestTable object with the tree's digest and the feature's na_feature_id, and add each DigestTable object to the corresponding feature;
    - insert the tree.

When all features have been processed, delete from the database any feature remaining in the FeatureDigest hash.

steve
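The feature pass described above, including the final delete of leftovers, can be sketched as follows (again Python for illustration rather than the plugin's Perl; the dict-based tree shape and all names are hypothetical, and database interaction is reduced to returning work lists):

```python
import hashlib

def tree_digest(feature, pieces=None):
    """Recursively build a reproducible string for a feature tree, then MD5 it.

    A feature is assumed to be a dict with 'tags' (name -> value) and 'children'.
    """
    top = pieces is None
    if top:
        pieces = []
    for name in sorted(feature["tags"]):      # per-node values, sorted by tag name
        pieces.append(str(feature["tags"][name]))
    for child in feature["children"]:         # fixed child order => reproducible traversal
        tree_digest(child, pieces)
    if top:
        return hashlib.md5("".join(pieces).encode("utf-8")).hexdigest()

def reconcile(feature_trees, feature_digest_hash):
    """Return (trees_to_insert, na_feature_ids_to_delete).

    feature_digest_hash maps digest -> list of na_feature_ids already in the db.
    """
    to_insert = []
    for tree in feature_trees:
        d = tree_digest(tree)
        if d in feature_digest_hash:
            del feature_digest_hash[d]        # unchanged tree: keep its rows
        else:
            to_insert.append((d, tree))       # new or changed: insert the whole tree
    # anything still in the hash no longer appears in the input: delete it
    to_delete = [fid for ids in feature_digest_hash.values() for fid in ids]
    return to_insert, to_delete
```

A changed tree simply fails the digest lookup, so its old rows stay in the hash and are swept up by the final delete, which is exactly the delete-and-insert behavior the lack of stable feature ids forces.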