From: Ed R. <ero...@ug...> - 2005-06-13 14:09:56
I followed along and thought everything was great until you created the state table. If we are going to make a state table, I would recommend finding someplace for it in the schema, preferably core. What we are creating here is a methodology that all plugins should follow, so we don't want to recreate another case of plugins competing for temp_table names, which is even worse than not specifying controlled vocabularies. If checksums and restarts are going to be a standard part of our architecture, then we need to make the entire architecture transparent by making the table a permanent part of the architecture, and all plugins should use the same table.

The data should remain for the lifetime of the version, not the project, i.e. this table should disappear when the data loaded is versioned and passed to the version tables. So long as the data is live, i.e. updatable, you will need this state information.

My suggestion is the following:

Core.DataDigest
  Date, Digest, type, primary_key, AlgorithmID (to id the plugin), Algorithm_version

Also, type in this case is up to the plugin. LSF would have two types, Seq and Feats; other plugins could have whatever types they want to checksum. This field does NOT need to be controlled because the key is multi-column (it includes the AlgID).

-ed

---- Original message ----
>Date: Sun, 12 Jun 2005 22:28:48 -0400
>From: Steve Fischer <sfi...@pc...>
>Subject: [GUSDEV] using checksums for loading seqs and features
>To: gusdev-gusdev <gus...@li...>, an...@ma...
>
>folks-
>
>LoadSequencesAndFeatures is a new name for LoadAnnotatedSequences, the
>replacement for the GBParser and the TIGR xml and EMBL plugins that Ed
>developed. (Aaron felt that "annotated sequences" connoted an
>annotation center's output while the plugin is broader than that...)
>
>Aaron and I have come up with a design for using digests (MD5) to help
>manage restart and updating.
>Using this design the logic of the plugin is the
>same whether doing an insert, a restart or an update.
>
>The design requires state in the database. Rather than pollute the GUS
>schema with it, the plugin will take as a command-line argument the name
>of an application-specific table that has three columns: digest, type
>(seq or feat), primary_key. The table persists for the duration of the
>project. We'll call it DigestTable here. DigestTable must also have
>a GUS object for itself if we want transaction-level robustness.
>
>We assume for now that the organism we are using isn't too huge, i.e.,
>that we can hold DigestTable in memory.
>
>SEQUENCES
>
>Initialization:
>  - read the digests for the sequences from DigestTable. write them
>into a hash, with the digest as a key and the na_sequence_id as the
>value. This is the SequenceDigest hash
>  - read the source_ids for the sequences from GUS, and place them as a
>key in a hash, and put their na_sequence_id as value. This is the
>SequenceSourceId hash
>
>For each sequence:
>  - create the digest as follows:
>      - unpack all the info from the bioperl sequence object and its
>children, but excluding feature children.
>      - unpack it into a hash, with the name of the attribute as key
>and the value as value.
>      - for weakly typed fields, use the tag name as key and the value
>as the value.
>      - loop through the keys in sorted order (using Perl's sort), and
>concatenate the values into a string
>      - pass the string to the MD5 processor
>  - create a DigestTable object from the na_sequence_id and the
>digest value
>  - add that object as a child of the NASequence
>
>  - use the digest as an index into the SequenceDigest hash. if it is
>found then the sequence record in the db is fine. if it is not found
>then either:
>      - if it is not in the SequenceSourceId hash then it is a new
>sequence, in which case we do a normal insert
>      - otherwise we fall into update logic.
>We trace the objects
>that are associated with this sequence in the database (excluding
>features) to get their foreign keys, build up an updated gus object
>tree, and submit, letting the object layer handle the update.
>
>  - when we submit the sequence the DigestTable child object will be
>submitted as part of the same transaction.
>
>Because sequences have stable identifiers (source_ids), it is possible
>for us to identify a sequence in the database even if some of its values
>have changed. this allows us to do a real update and, in theory, to
>keep some of the analysis against the sequence if irrelevant bits of it
>have changed.
>
>FEATURES
>
>Features, however, are different. They don't have stable ids. Nor do
>they have alternate keys (no, type and location is not good enough).
>This means that if a feature has changed, we have no choice but to take
>the delete-and-insert approach to updating. Here is how we do it....
>
>Initialization: read from DigestTable and create the FeatureDigest hash
>with digest as key and na_feature_id as value.
>
>Because we are treating a feature tree as a unit, all the features that
>are in a tree will have the same digest. They will each have their own
>row in the DigestTable.
>
>For each bioperl feature tree:
>  - generate a string representation of the feature tree by:
>      - initializing an empty string to hold the string version of
>the feature tree
>      - recursively traversing the tree in a reproducible way
>      - for each individual feature (nodes of the tree), get all its
>values, sort by tag name, and concatenate to the growing string
>  - when done recursing, make a digest with that string
>  - use the digest as an index into the FeatureDigestHash
>      - if we find one or more features, then the feature tree is ok;
>remove those features from the FeatureDigestHash
>      - if we don't find any:
>          - for each feature in the tree, make a new DigestTable
>object with the tree's digest and the feature's feature_id.
>add each DigestTable object to the corresponding feature
>          - insert the tree
>
>When all features have been processed, delete from the database any
>feature remaining in the FeatureDigestHash.
>
>steve

-----------------
Ed Robinson
Center for Tropical and Emerging Global Diseases
University of Georgia, Athens, GA 30602
ero...@ug.../(706)542.1447/254.8883
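
A minimal sketch of the digest logic described in the quoted message, written in Python for illustration (the plugin itself is Perl). All function names here are hypothetical and not part of GUS. Two points are assumptions rather than anything the thread specifies: the separator placed between concatenated values, and the use of a list of feature ids per digest (the message says every feature in a tree shares the tree's digest).

```python
import hashlib


def record_digest(fields):
    """MD5 hex digest of the values in `fields`, taken in sorted-key order."""
    # Sorting the keys makes the digest independent of hash ordering.
    # ASSUMPTION: the "\x1f" separator is not in the original design, which
    # only says "concatenate the values"; without a separator the value
    # pairs ("ab", "c") and ("a", "bc") would produce the same string.
    joined = "\x1f".join(str(fields[key]) for key in sorted(fields))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def classify_sequence(fields, source_id, sequence_digests, sequence_source_ids):
    """Return (action, digest), where action is 'unchanged', 'insert' or 'update'."""
    digest = record_digest(fields)
    if digest in sequence_digests:
        return "unchanged", digest   # record in the db is fine
    if source_id not in sequence_source_ids:
        return "insert", digest      # never seen: normal insert
    return "update", digest          # known source_id, changed content


def reconcile_feature_trees(tree_digests, feature_digests):
    """Return (digests of trees to insert, feature ids to delete).

    `feature_digests` maps digest -> list of feature ids (ASSUMPTION: a list,
    because all features of one tree share that tree's digest).
    """
    remaining = {digest: list(ids) for digest, ids in feature_digests.items()}
    to_insert = []
    for digest in tree_digests:
        if digest in remaining:
            del remaining[digest]     # tree unchanged; keep its db rows
        else:
            to_insert.append(digest)  # new or changed tree: insert it
    # Anything left was not seen in the input, so it is deleted.
    to_delete = sorted(fid for ids in remaining.values() for fid in ids)
    return to_insert, to_delete
```

With this shape the same code path serves insert, restart and update: a restart finds the already-loaded digests and skips them, while an update only touches records whose digest is missing.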