From: Steve F. <sfi...@pc...> - 2005-06-13 02:28:35
folks-
LoadSequencesAndFeatures is the new name for LoadAnnotatedSequences, the
replacement for the GBParser and the TIGR XML and EMBL plugins that Ed
developed. (Aaron felt that "annotated sequences" connoted an
annotation center's output, while the plugin is broader than that...)
Aaron and I have come up with a design for using digests (MD5) to help
manage restart and updating. With this design the logic of the plugin
is the same whether doing an insert, a restart or an update.
The design requires state in the database. Rather than pollute the GUS
schema with it, the plugin will take as a command line argument the name
of an application-specific table that has three columns: digest, type
(seq or feat), and primary_key. The table persists for the duration of
the project. We'll call it DigestTable here. DigestTable must also have
a GUS object for itself if we want transaction-level robustness.
We assume for now that the organism we are using isn't too huge, i.e.,
that we can hold DigestTable in memory.
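A minimal sketch of the DigestTable layout, using Python's sqlite3 purely
for illustration (the real table would live in the project's GUS database;
the column types and the sample digest below are assumptions):

```python
import sqlite3

# Illustrative only: an in-memory stand-in for the application-specific
# DigestTable with its three columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE DigestTable (
           digest      TEXT NOT NULL,    -- MD5 of the record's attributes
           type        TEXT NOT NULL,    -- 'seq' or 'feat'
           primary_key INTEGER NOT NULL  -- na_sequence_id or na_feature_id
       )"""
)
conn.execute("INSERT INTO DigestTable VALUES (?, ?, ?)",
             ("d41d8cd98f00b204e9800998ecf8427e", "seq", 101))
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM DigestTable").fetchone()[0])  # 1
```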
SEQUENCES
Initialization:
- read the digests for the sequences from DigestTable. write them
into a hash, with the digest as a key and the na_sequence_id as the
value. This is the SequenceDigest hash
- read the source_ids for the sequences from GUS, and place them as
keys in a hash, with their na_sequence_id as the value. This is the
SequenceSourceId hash
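The initialization step above can be sketched as follows; the row shapes
and the sample values are hypothetical stand-ins for what the plugin
would fetch from DigestTable and GUS (the plugin itself is Perl, Python
here is for illustration):

```python
# Rows from DigestTable: (digest, type, primary_key).
digest_rows = [
    ("abc123", "seq", 101),
    ("def456", "seq", 102),
    ("feat01", "feat", 900),  # feature rows are skipped in this pass
]
# Pairs from GUS: (source_id, na_sequence_id).
gus_rows = [("AB000001", 101), ("AB000002", 102)]

# SequenceDigest hash: digest -> na_sequence_id
sequence_digest = {d: pk for d, typ, pk in digest_rows if typ == "seq"}
# SequenceSourceId hash: source_id -> na_sequence_id
sequence_source_id = dict(gus_rows)

print(sequence_digest)       # {'abc123': 101, 'def456': 102}
print(sequence_source_id)    # {'AB000001': 101, 'AB000002': 102}
```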
For each sequence:
- create the digest as follows:
- unpack all the info from the bioperl sequence object and its
children, but excluding feature children.
- unpack it into a hash, with the name of the attribute as key
and the value as value.
- for weakly typed fields, use the tag name as key and the value
as the value.
- loop through the keys in sorted order (using Perl's sort), and
concatenate the values into a string
- pass the string to the MD5 processor
- create a DigestTable object from the na_sequence_id and the
digest value
- add that object as a child of the NASequence
- use the digest as an index into the SequenceDigest hash. if it is
found then the sequence record in the db is fine. if it is not found
then either:
- if it is not in the SequenceSourceId hash then it is a new
sequence, in which case we do a normal insert
- otherwise we fall into update logic. We trace the objects
that are associated with this sequence in the database (excluding
features) to get their foreign keys, build up an updated gus object
tree, and submit, letting the object layer handle the update.
- when we submit the sequence the DigestTable child object will be
submitted as part of the same transaction.
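The digest construction and the insert/restart/update decision described
above can be sketched like this; the function names and sample attributes
are hypothetical, and Python stands in for the plugin's Perl:

```python
import hashlib

def sequence_digest_value(attrs):
    """Build the MD5 digest for one sequence: sort the attribute names
    and concatenate the values in that order, so the digest is
    reproducible across runs."""
    text = "".join(str(attrs[k]) for k in sorted(attrs))
    return hashlib.md5(text.encode()).hexdigest()

def classify(attrs, source_id, sequence_digest, sequence_source_id):
    """Decide what to do with one incoming sequence, mirroring the
    SequenceDigest and SequenceSourceId hash lookups above."""
    d = sequence_digest_value(attrs)
    if d in sequence_digest:
        return "unchanged"          # record in the db is fine
    if source_id not in sequence_source_id:
        return "insert"             # brand-new sequence
    return "update"                 # known source_id, changed content

seq_digest = {}
seq_source = {"AB000001": 101}
attrs = {"length": 1200, "description": "hypothetical protein"}
print(classify(attrs, "AB000002", seq_digest, seq_source))  # insert
print(classify(attrs, "AB000001", seq_digest, seq_source))  # update
```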
Because sequences have stable identifiers (source_ids), it is possible
for us to identify a sequence in the database even if some of its values
have changed. this allows us to do a real update and, in theory, to
keep some of the analysis against the sequence if irrelevant bits of it
have changed.
FEATURES
Features, however, are different. They don't have stable ids. Nor do
they have alternate keys (no, type and location is not good enough).
This means that if a feature has changed, we have no choice but to take
the delete-and-insert approach to updating. Here is how we do it....
Initialization: read from DigestTable and create the FeatureDigest hash
with digest as key and na_feature_id as value.
Because we are treating a feature tree as a unit, all the features that
are in a tree will have the same digest. They will each have their own
row in the DigestTable.
For each bioperl feature tree:
- generate a string representation of the feature tree by:
- initializing an empty string to hold the string version of
the feature tree
- recursively traversing the tree in a reproducible way
- for each individual feature (nodes of the tree), get all its
values, sort by tag name, and concatenate to the growing string
- when done recursing, make a digest from that string
- use the digest as an index into the FeatureDigest hash
- if we find one or more features, then the feature tree is ok;
remove those features from the FeatureDigest hash
- if we don't find any:
- for each feature in the tree, make a new DigestTable
object with the tree's digest and the feature's na_feature_id. add each
DigestTable object to the corresponding feature
- insert the tree
When all features have been processed, delete from the database any
feature remaining in the FeatureDigest hash.
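The feature-tree digest and the final cleanup can be sketched as follows;
the tree representation and names are hypothetical, and in this sketch
the FeatureDigest hash maps each digest to the list of na_feature_ids
that share it (since every feature in a tree carries the tree's digest):

```python
import hashlib

def tree_string(feature, out):
    """Recursively serialize a feature tree in a reproducible way:
    each node contributes its values sorted by tag name, then its
    children in order."""
    for tag in sorted(feature["tags"]):
        out.append(str(feature["tags"][tag]))
    for child in feature.get("children", []):
        tree_string(child, out)

def tree_digest(feature):
    out = []
    tree_string(feature, out)
    return hashlib.md5("".join(out).encode()).hexdigest()

tree = {"tags": {"type": "gene", "start": 10, "end": 500},
        "children": [{"tags": {"type": "exon", "start": 10, "end": 200}}]}

# FeatureDigest hash: digest -> na_feature_ids of the tree's features.
feature_digest = {tree_digest(tree): [900, 901]}

d = tree_digest(tree)
if d in feature_digest:
    feature_digest.pop(d)   # tree unchanged: keep its rows in the db
else:
    pass                    # new or changed tree: insert it, with fresh
                            # DigestTable rows for each feature

# Whatever remains in feature_digest would be deleted from the database.
print(feature_digest)  # {}
```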
steve