From: <pi...@pc...> - 2004-10-12 19:51:18
I'm soliciting opinions on whether to rewrite LoadTaxon.pm because of the following problem.

The LoadTaxon plugin is designed to load taxon information from NCBI into multiple tables, including sres.taxon and sres.taxonname. On occasion, tax_ids are deleted or old tax_ids are replaced with new ones. The merged.dmp and delnodes.dmp files contained in the same tarball are not always consistent with these changes. In the example below, neither tax_id was in any available merged.dmp file (up through July), and the change showed up only in a delnodes file in a later tarball.

Example: Lactobacillus kefiranofaciens
  tax_id = 190905  through March 2004
  tax_id = 267818  April 2004 and later

The LoadTaxon plugin was written with the assumption that tax_ids are relatively stable. It deletes only taxonname entries, and only when the row for a specific taxon_id and name_class has been replaced. Taxon rows are updated but never deleted. This is not really an acceptable approach, because rows eventually accumulate in taxonname that share the same name and name_class but have different taxon_ids.

I think the LoadTaxon plugin should be rewritten to avoid these duplicates while retaining taxon_id stability. The plugin should not delete taxon rows, because of the obvious referencing constraint problems. Instead, it should keep updating the rows as it does now, but the update should also include ncbi_tax_id.

Does anyone have any suggestions or insight on this subject?
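
To make the proposal concrete, here is a rough sketch of the update logic in Python (not the plugin's Perl, and not the existing LoadTaxon code). Everything in it is hypothetical: the TaxonStore class and its two in-memory collections only stand in for sres.taxon and sres.taxonname. It also assumes that a matching scientific name is an acceptable way to recognize a replaced tax_id when merged.dmp says nothing, which is exactly the kind of point I would like feedback on.

```python
from dataclasses import dataclass, field

SCIENTIFIC = "scientific name"


@dataclass
class TaxonStore:
    """Hypothetical in-memory stand-in for sres.taxon and sres.taxonname."""
    taxon: dict = field(default_factory=dict)    # taxon_id -> ncbi_tax_id
    taxonname: set = field(default_factory=set)  # (taxon_id, name, name_class)
    next_id: int = 1

    def find_by_ncbi(self, ncbi_tax_id):
        """Return the taxon_id currently holding this ncbi_tax_id, or None."""
        for taxon_id, current in self.taxon.items():
            if current == ncbi_tax_id:
                return taxon_id
        return None

    def find_by_name(self, name, name_class):
        """Return the taxon_id that owns this name and name_class, or None."""
        for taxon_id, n, c in self.taxonname:
            if n == name and c == name_class:
                return taxon_id
        return None

    def load(self, ncbi_tax_id, name, name_class=SCIENTIFIC):
        """Load one NCBI name record and return the (stable) taxon_id."""
        taxon_id = self.find_by_ncbi(ncbi_tax_id)
        if taxon_id is None and name_class == SCIENTIFIC:
            # Assumption: same scientific name means NCBI replaced the tax_id.
            taxon_id = self.find_by_name(name, name_class)
            if taxon_id is not None:
                # Update ncbi_tax_id in place; the taxon row is never deleted.
                self.taxon[taxon_id] = ncbi_tax_id
        if taxon_id is None:
            # Genuinely new taxon: allocate a fresh taxon_id.
            taxon_id = self.next_id
            self.next_id += 1
            self.taxon[taxon_id] = ncbi_tax_id
        # A set cannot hold the same (taxon_id, name, name_class) twice.
        self.taxonname.add((taxon_id, name, name_class))
        return taxon_id


store = TaxonStore()
old = store.load(190905, "Lactobacillus kefiranofaciens")  # March 2004 dump
new = store.load(267818, "Lactobacillus kefiranofaciens")  # April 2004 dump
```

With the example above, both loads return the same taxon_id, the taxon row ends up with ncbi_tax_id 267818, and taxonname holds a single row for the name. A real implementation would also have to decide what to do when two different organisms legitimately share a name, which this sketch ignores.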