From: Angel P. <an...@ma...> - 2005-06-14 17:55:41
Sorry for the late reply on this, but I would like to put this conversation in a proper context. Let's review what the GBParser currently does:

First, GBParser has no restart function other than looping through the records until it finds one that needs an update or insert.

Every GB record has a modification date associated with it. Accessions and modification dates are stored in the NAEntry table. Any GB record that does not match the data stored in NAEntry gets put through the update process; all others get skipped.

GUS object trees are built from the database entry (dbTree) and from the record in the flat file (ffTree). The parser tries to match each feature of the dbTree to each feature of the ffTree by scoring how close the values are. Perfect matches are deleted from the ffTree. Any feature in the dbTree that is not matched is marked deleted. Any feature left in the ffTree is added to the dbTree.

We need the new algorithm because:
1- This matching is not optimal, and an MD5SUM would come in very handy.
2- We cannot rely on other external DBs to provide modification dates, hence the need for a checksum on the sequence entry.

(more comments below)

Steve Fischer wrote:
> i am not persuaded that this functionality will be used by many other
> plugins. Most do inserts, not updates. And, many that do updates
> are given difference files and have stable identifiers, so the
> problems of this plugin don't apply.
>

If what Steve says is true, then we do not need a table to store md5sums, since GB entries can rely on the modification date in the NAEntry table to decide when an update is needed. I would like to hear of at least one other plugin that would benefit from a checksum table, so that I can see some value in adding a database table.

> as far as the schema is concerned, you've reminded me that i left
> something out. we need to have a fourth column, giving this:
> digest, primary_key, type, ext_db_rls_id
> the ext_db_rls_id differentiates different datasets stored in the table.
>

I don't think so. The primary key will already differ across datasets (e.g. external_db_rls_ids).

So to summarize, I am not convinced that enabling restarts requires a table on top of the current load process. If that is the case, then a simple flat-file log will do. I am also not convinced that we need this file at all if we can efficiently compute these values on the fly from the DB and flat-file entries. Last, before we go through the trouble of implementing this, I would like to see that it is useful for other plugins.

-angel
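
To make the checksum idea discussed above concrete, here is a minimal sketch of deciding between skip, update and insert by comparing MD5 digests. It is not GBParser code: it is written in Python purely for illustration, the names `record_digest` and `classify` are hypothetical, and the plain dictionary stands in for whatever would actually hold the digests (a table, a flat-file log, or values computed on the fly from the DB).

```python
import hashlib


def record_digest(record_text: str) -> str:
    """Return the MD5 hex digest of a flat-file record.

    Trailing whitespace on each line is stripped first (an assumption of
    this sketch) so that cosmetic differences do not trigger an update.
    """
    normalized = "\n".join(line.rstrip() for line in record_text.splitlines())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def classify(accession: str, record_text: str, stored_digests: dict) -> str:
    """Decide what to do with one record.

    stored_digests maps accession -> digest of the version already loaded.
    It is a hypothetical stand-in for a checksum table or flat-file log.
    """
    old_digest = stored_digests.get(accession)
    if old_digest is None:
        return "insert"   # accession has never been loaded
    if old_digest == record_digest(record_text):
        return "skip"     # content unchanged, nothing to do
    return "update"       # content changed, run the update process


# Example with made-up accessions and record text.
stored = {"X1": record_digest("LOCUS X1\nORIGIN acgt\n")}
print(classify("X1", "LOCUS X1\nORIGIN acgt\n", stored))
print(classify("X1", "LOCUS X1\nORIGIN acgg\n", stored))
print(classify("Y2", "LOCUS Y2\nORIGIN tttt\n", stored))
```

With a comparison like this, a restart amounts to re-reading the flat file and skipping every record whose digest already matches, and no modification date from the external DB is needed.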