From: Steve F. <sfi...@pc...> - 2005-06-13 02:28:35
folks-

LoadSequencesAndFeatures is a new name for LoadAnnotatedSequences, the replacement for the GBParser and the TIGR XML and EMBL plugins that Ed developed. (Aaron felt that "annotated sequences" connoted an annotation center's output, while the plugin is broader than that...)

Aaron and I have come up with a design for using digests (MD5) to help manage restart and updating. With this design the logic of the plugin is the same whether doing an insert, a restart, or an update.

The design requires state in the database. Rather than pollute the GUS schema with it, the plugin will take as a command line argument the name of an application-specific table that has three columns: digest, type (seq or feat), and primary_key. The table persists for the duration of the project. We'll call it DigestTable here. DigestTable must also have a GUS object for itself if we want transaction-level robustness. We assume for now that the organism we are using isn't too huge, i.e., that we can hold DigestTable in memory.

SEQUENCES

Initialization:
- read the digests for the sequences from DigestTable. Write them into a hash, with the digest as key and the na_sequence_id as value. This is the SequenceDigest hash.
- read the source_ids for the sequences from GUS, and place them as keys in a hash, with their na_sequence_id as value. This is the SequenceSourceId hash.

For each sequence:
- create the digest as follows:
  - unpack all the info from the bioperl sequence object and its children, excluding feature children.
  - unpack it into a hash, with the name of the attribute as key and the value as value.
  - for weakly typed fields, use the tag name as key and the value as value.
  - loop through the keys in sorted order (using Perl's sort) and concatenate the values into a string.
  - pass the string to the MD5 processor.
- create a DigestTable object from the na_sequence_id and the digest value.
- add that object as a child of the NASequence.
- use the digest as an index into the SequenceDigest hash:
  - if it is found, then the sequence record in the db is fine.
  - if it is not found, then either:
    - if it is not in the SequenceSourceId hash, it is a new sequence, in which case we do a normal insert;
    - otherwise we fall into update logic: we trace the objects that are associated with this sequence in the database (excluding features) to get their foreign keys, build up an updated GUS object tree, and submit, letting the object layer handle the update.
- when we submit the sequence, the DigestTable child object will be submitted as part of the same transaction.

Because sequences have stable identifiers (source_ids), it is possible for us to identify a sequence in the database even if some of its values have changed. This allows us to do a real update and, in theory, to keep some of the analysis against the sequence if irrelevant bits of it have changed.

FEATURES

Features, however, are different. They don't have stable ids. Nor do they have alternate keys (no, type and location is not good enough). This means that if a feature has changed, we have no choice but to take the delete-and-insert approach to updating. Here is how we do it....

Initialization:
- read from DigestTable and create the FeatureDigest hash, with digest as key and na_feature_id as value. Because we are treating a feature tree as a unit, all the features in a tree will have the same digest. They will each have their own row in DigestTable.
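Both the sequence pass and the feature pass rely on the same digest construction: concatenate the values in sorted key order and hash the result. A minimal sketch (Python for illustration; the plugin itself is Perl, and the function and attribute names here are hypothetical):

```python
import hashlib

def digest_of(attrs):
    """Build a reproducible MD5 digest from an attribute hash.

    attrs maps attribute/tag names to values; iterating the keys in
    sorted order makes the digest independent of unpacking order.
    """
    concatenated = "".join(str(attrs[key]) for key in sorted(attrs))
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

# The same attributes always yield the same digest, regardless of order...
a = digest_of({"source_id": "AB123", "length": 1042, "description": "hypothetical protein"})
b = digest_of({"length": 1042, "description": "hypothetical protein", "source_id": "AB123"})
assert a == b

# ...while any changed value yields a different digest, which is what
# routes a sequence into the insert-or-update branch described above.
c = digest_of({"source_id": "AB123", "length": 1043, "description": "hypothetical protein"})
assert a != c
```

Whether reconciliation then does nothing, a plain insert, or an update falls out of the two lookups (SequenceDigest, then SequenceSourceId) without any branch-specific plugin logic.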
For each bioperl feature tree:
- generate a string representation of the feature tree by:
  - initializing an empty string to hold the string version of the feature tree;
  - recursively traversing the tree in a reproducible way;
  - for each individual feature (the nodes of the tree), getting all its values, sorting by tag name, and concatenating them onto the growing string.
- when done recursing, make a digest from that string.
- use the digest as an index into the FeatureDigest hash:
  - if we find one or more features, then the feature tree is ok; remove those features from the FeatureDigest hash.
  - if we don't find any:
    - for each feature in the tree, make a new DigestTable object with the tree's digest and the feature's na_feature_id, and add each DigestTable object to the corresponding feature;
    - insert the tree.

When all features have been processed, delete from the database any feature remaining in the FeatureDigest hash.

steve
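The feature pass described above, including the final delete of leftovers, can be sketched as follows (again Python for illustration rather than the plugin's Perl; the dict-based tree shape and all names are hypothetical, and database interaction is reduced to returning work lists):

```python
import hashlib

def tree_digest(feature, pieces=None):
    """Recursively build a reproducible string for a feature tree, then MD5 it.

    A feature is assumed to be a dict with 'tags' (name -> value) and 'children'.
    """
    top = pieces is None
    if top:
        pieces = []
    for name in sorted(feature["tags"]):      # per-node values, sorted by tag name
        pieces.append(str(feature["tags"][name]))
    for child in feature["children"]:         # fixed child order => reproducible traversal
        tree_digest(child, pieces)
    if top:
        return hashlib.md5("".join(pieces).encode("utf-8")).hexdigest()

def reconcile(feature_trees, feature_digest_hash):
    """Return (trees_to_insert, na_feature_ids_to_delete).

    feature_digest_hash maps digest -> list of na_feature_ids already in the db.
    """
    to_insert = []
    for tree in feature_trees:
        d = tree_digest(tree)
        if d in feature_digest_hash:
            del feature_digest_hash[d]        # unchanged tree: keep its rows
        else:
            to_insert.append((d, tree))       # new or changed: insert the whole tree
    # anything still in the hash no longer appears in the input: delete it
    to_delete = [fid for ids in feature_digest_hash.values() for fid in ids]
    return to_insert, to_delete
```

A changed tree simply fails the digest lookup, so its old rows stay in the hash and are swept up by the final delete, which is exactly the delete-and-insert behavior the lack of stable feature ids forces.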