Re: [GUSDEV] using checksums for loading seqs and features


Sorry for the late reply on this, but I would like to put this 
conversation in a proper context. Let's review what the GBParser 
currently does:

First, GBParser has no restart mechanism other than looping 
through the records until it finds one that needs an update or insert.

Every GB record has a modification date associated with it. 
Accessions and modification dates are stored in the NAEntry table. Any 
GB record that does not match the data stored in NAEntry is put through 
the update process; all others are skipped.
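
To make the date check above concrete, here is a minimal Python sketch. 
It is not GBParser code: the dict standing in for the NAEntry table, the 
accessions, the dates, and the name needs_update are all made up for 
illustration.

```python
from datetime import date

def needs_update(stored, accession, mod_date):
    """True when the accession is new or its stored date differs."""
    return stored.get(accession) != mod_date

# Hypothetical stand-in for the NAEntry table: accession -> modification date.
na_entry = {
    "AB000001": date(2003, 1, 15),
    "AB000002": date(2003, 2, 1),
}

# Hypothetical flat file records as (accession, modification date) pairs.
records = [
    ("AB000001", date(2003, 1, 15)),  # matches NAEntry, skipped
    ("AB000002", date(2003, 3, 9)),   # date differs, goes to update
    ("AB000003", date(2003, 3, 9)),   # not in NAEntry, goes to insert
]

to_process = [acc for acc, d in records if needs_update(na_entry, acc, d)]
```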

GUS object trees are built from the database entry (dbTree) and from 
the record in the flat file (ffTree). Each feature of the dbTree is 
compared against each feature of the ffTree by scoring how close their 
values are. Perfect matches are deleted from the ffTree. Any dbTree 
feature left unmatched is marked deleted. Any feature left in the ffTree 
is added to the dbTree.
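
The matching step can be sketched roughly as follows. This is a 
simplified Python illustration, not the actual GBParser logic: features 
are modelled as flat dicts, the score is simply the fraction of agreeing 
values, and only a perfect score counts as a match. The names score and 
reconcile are hypothetical.

```python
def score(a, b):
    """Fraction of keys (from either feature) whose values agree."""
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    return sum(1 for k in keys if a.get(k) == b.get(k)) / len(keys)

def reconcile(db_features, ff_features):
    """Return (kept, deleted, added) following the steps described above."""
    remaining = list(ff_features)  # ffTree features not yet matched
    kept, deleted = [], []
    for db_feat in db_features:
        # Closest flat file feature still available, if any.
        best = max(remaining, key=lambda f: score(db_feat, f), default=None)
        if best is not None and score(db_feat, best) == 1.0:
            remaining.remove(best)   # perfect match: drop it from the ffTree
            kept.append(db_feat)
        else:
            deleted.append(db_feat)  # unmatched dbTree feature: mark deleted
    # Whatever is left in the ffTree is new and gets added to the dbTree.
    return kept, deleted, remaining
```

Every dbTree feature is scored against every remaining ffTree feature, 
so the cost grows with the product of the two feature counts.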

We need the new algorithm because:
1- This matching is not optimal, and an MD5 checksum would come in very 
handy.
2- We cannot rely on other external DBs to provide modification dates, 
hence the need for the checksum on the sequence entry.
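
For reference, a checksum on a sequence entry could be computed along 
these lines. This is only a sketch using Python's standard hashlib; the 
normalisation (dropping whitespace, upper-casing) and the name 
sequence_digest are assumptions, not something GUS defines.

```python
import hashlib

def sequence_digest(sequence):
    """MD5 hex digest of a sequence, ignoring whitespace and letter case."""
    # Assumed normalisation: line breaks and case should not change the digest.
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()

# The same sequence wrapped differently yields the same digest.
same = sequence_digest("ACGT\nACGT") == sequence_digest("acgtacgt")
```

Comparing the stored digest with the digest of the incoming entry then 
plays the role that the modification date plays for GB records.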

(more comments below)
Steve Fischer wrote:

> i am not persuaded that this functionality will be used by many other 
> plugins.   Most do inserts, not updates.   And, many that do updates 
> are given difference files and have stable identifiers, so the 
> problems of this plugin don't apply.
>
If what Steve says is true, then we do not need a table to store 
MD5 sums, since GB entries can rely on the modification date in the 
NAEntry table to indicate when an update is needed. I would like to 
hear of at least one other plugin for which a checksum table would be 
useful, in order to see some value in a database table.

> as far as the schema is concerned, you've reminded me that i left 
> something out.   we need to have a fourth column, giving this:
>   digest, primary_key, type, ext_db_rls_id

> the ext_db_rls_id differentiates different datasets stored in the table.
>
I don't think so. The primary key will change across different 
datasets (e.g. external_db_rls_ids).

So to summarize, I am not convinced that a table is needed for anything 
beyond enabling re-starts of the current load process. If that is the 
case, then a simple flat file log will do.
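
A flat file log for re-starts could look something like this. It is a 
minimal Python sketch under assumed conventions: one accession per line, 
appended only after the record has loaded, with hypothetical names 
load_done, run, and load_record.

```python
import os

def load_done(log_path):
    """Read the accessions already recorded in the log, if it exists."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path) as fh:
        return {line.strip() for line in fh if line.strip()}

def run(accessions, log_path, load_record):
    """Load each accession not yet in the log, recording it on success."""
    done = load_done(log_path)
    with open(log_path, "a") as log:
        for acc in accessions:
            if acc in done:
                continue          # loaded in an earlier run, skip
            load_record(acc)      # an exception here leaves acc unlogged
            log.write(acc + "\n")
            log.flush()           # keep the log current if the run dies
```

After an interruption, calling run again with the same input and log 
path resumes at the first record that was not logged.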

I am also not convinced that we need this file at all if we can 
efficiently compute these values on the fly from the DB and flat file 
entries.

Last, before we go through the trouble of implementing this, I would 
like to see it be useful for other plugins.

-angel