[Treebase-devel] Data Migration: TB1 dump and TI mapping

This is mainly for Vladimir's benefit:

I have created the dump file for the migration here:

http://www.treebase.org/treebase/migration/Dec-09/dump_Dec09_utf8.zip

This only contains metadata about studies added since Jan 09. As Mark Jason's migration scripts parse this file, whenever a matrix or analysis result is referenced, they will fetch the corresponding matrix or tree file and pass it through headless Mesquite. The matrix and tree files are all in these archives:

http://www.treebase.org/treebase/migration/Dec-09/trees.zip
http://www.treebase.org/treebase/migration/Dec-09/characters.zip

So for the scripts to work, they must have access to the dump_Dec09_utf8.txt file as well as the trees/ and characters/ directories. The trees/ and characters/ directories actually contain *all* trees and *all* matrices in TreeBASE1, even though the dump_Dec09_utf8.txt file only references those that are new to TreeBASE2.

Once these data have been imported, the taxon_variant and taxa tables can be imported, and the taxon_labels (generated when headless Mesquite parsed the various files) can each be linked to their respective taxon_variant record. Not all are linkable, though: some 25% are orphaned. The general model is this:

[taxonlabel] >-- [taxonvariant] >-- [taxon]

i.e., many taxon label records map to a taxon variant record, and many taxon variant records map to a taxon record. The contents of the taxonlabel table are generated by extracting labels from parsed tree files and parsed matrix files. 
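
To make the cardinality concrete, here is a minimal Python sketch of the model. It uses an in-memory SQLite database purely as a stand-in for the real one. The three table names and the taxonvariant_id column come from this message; every other column name and all sample rows are made up for illustration.

```python
import sqlite3

def build_model(conn):
    """Create the three tables; columns other than taxonvariant_id are hypothetical."""
    conn.executescript("""
        CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE taxonvariant (
            taxonvariant_id INTEGER PRIMARY KEY,
            taxon_id INTEGER REFERENCES taxon(taxon_id),
            fullname TEXT
        );
        CREATE TABLE taxonlabel (
            taxonlabel_id INTEGER PRIMARY KEY,
            taxonlabel TEXT,
            taxonvariant_id INTEGER REFERENCES taxonvariant(taxonvariant_id)
        );
    """)

def resolve_labels(conn):
    """Return (label, taxon name) pairs; orphaned labels resolve to None."""
    return conn.execute("""
        SELECT l.taxonlabel, t.name
        FROM taxonlabel l
        LEFT JOIN taxonvariant v ON v.taxonvariant_id = l.taxonvariant_id
        LEFT JOIN taxon t ON t.taxon_id = v.taxon_id
        ORDER BY l.taxonlabel_id
    """).fetchall()

conn = sqlite3.connect(":memory:")
build_model(conn)
# Made-up sample data: two variants of one taxon, three labels, one orphan.
conn.execute("INSERT INTO taxon VALUES (1, 'Homo sapiens')")
conn.executemany("INSERT INTO taxonvariant VALUES (?, ?, ?)",
                 [(10, 1, "Homo sapiens"), (11, 1, "Homo sapiens L.")])
conn.executemany("INSERT INTO taxonlabel VALUES (?, ?, ?)",
                 [(100, "Homo_sapiens", 10),
                  (101, "H. sapiens L.", 11),
                  (102, "Unknown sp. 7", None)])
rows = resolve_labels(conn)
```

The LEFT JOINs keep orphaned labels in the result with a NULL taxon, which is how the unlinkable labels would show up in a query like this.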

You can get the taxon intel files here:

http://www.treebase.org/treebase/migration/Dec-09/TI_for_Dec09_utf8.zip

The tables that these data will go into are scoped to the entire database, so this should not be viewed as an incremental update but as a complete refresh. To begin with, I think the taxonvariant_id column in the taxonlabel table should be set to NULL for all records, because these values will be completely refreshed. The taxonvariant and taxon tables should be emptied and reloaded with my new data from TI_for_Dec09_utf8.zip. Mark Jason's scripts should then go through each value in the taxonlabel column of the taxonlabel table, look it up in my taxon_labels.tab file (which is compressed inside TI_for_Dec09_utf8.zip), and use the corresponding taxon_variant_id to map the taxonlabel table to the taxonvariant table.
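
The refresh described above could be sketched roughly as follows. This is not Mark Jason's implementation, just an illustration under stated assumptions: SQLite stands in for the real database, taxon_labels.tab is ASSUMED to hold two tab-separated columns (label, then taxon_variant_id) with no header row, and the column names other than taxonlabel and taxonvariant_id are hypothetical.

```python
import csv
import io
import sqlite3

def refresh_taxon_mapping(conn, taxa, variants, labels_tab):
    """Full refresh: clear old links, reload taxon/taxonvariant, relink labels.

    taxa:       iterable of (taxon_id, name)
    variants:   iterable of (taxon_variant_id, taxon_id, name)
    labels_tab: open text file, ASSUMED layout "label<TAB>taxon_variant_id"
    Returns the number of taxonlabel rows left unlinked (orphans).
    """
    reader = csv.reader(labels_tab, delimiter="\t", quoting=csv.QUOTE_NONE)
    lookup = {row[0]: int(row[1]) for row in reader if len(row) >= 2}
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("UPDATE taxonlabel SET taxonvariant_id = NULL")
        conn.execute("DELETE FROM taxonvariant")
        conn.execute("DELETE FROM taxon")
        # IDs from the source files are inserted as-is, not regenerated.
        conn.executemany("INSERT INTO taxon VALUES (?, ?)", taxa)
        conn.executemany("INSERT INTO taxonvariant VALUES (?, ?, ?)", variants)
        labels = conn.execute(
            "SELECT taxonlabel_id, taxonlabel FROM taxonlabel").fetchall()
        for label_id, label in labels:
            if label in lookup:
                conn.execute(
                    "UPDATE taxonlabel SET taxonvariant_id = ? WHERE taxonlabel_id = ?",
                    (lookup[label], label_id))
    return conn.execute(
        "SELECT COUNT(*) FROM taxonlabel WHERE taxonvariant_id IS NULL").fetchone()[0]

# Made-up data: two labels pointing at a stale variant, one of them mappable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE taxonvariant (taxonvariant_id INTEGER PRIMARY KEY, taxon_id INTEGER, fullname TEXT);
    CREATE TABLE taxonlabel (taxonlabel_id INTEGER PRIMARY KEY, taxonlabel TEXT, taxonvariant_id INTEGER);
    INSERT INTO taxon VALUES (1, 'stale');
    INSERT INTO taxonvariant VALUES (5, 1, 'stale');
    INSERT INTO taxonlabel VALUES (100, 'Homo sapiens', 5);
    INSERT INTO taxonlabel VALUES (101, 'Unknown sp. 7', 5);
""")
labels_tab = io.StringIO("Homo sapiens\t4242\n")
orphans = refresh_taxon_mapping(
    conn, [(9606, "Homo sapiens")], [(4242, 9606, "Homo sapiens")], labels_tab)
links = conn.execute(
    "SELECT taxonvariant_id FROM taxonlabel ORDER BY taxonlabel_id").fetchall()
```

Running the whole refresh in a single transaction means a failure partway through leaves the old links in place rather than a half-refreshed table.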

I don't know exactly how Mark Jason has implemented this, but it seems to me that the only way for this to work is if TreeBASE2 uses my taxon_variant_id and taxon_id values instead of autoincrementing its own, seeing as my taxon_labels.tab file is the key for mapping the taxonlabel table to the taxonvariant table. This is something to look out for: if IDs are created de novo, it will be hard to run through the taxonlabel table and know what value to put in the taxonvariant_id FK.
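
A small Python sketch of why the IDs matter, again with SQLite as a stand-in and with made-up IDs, names, and a hypothetical fullname column: the same two variants are loaded once with their source IDs and once with freshly generated ones, and an ID taken from taxon_labels.tab is then looked up in both.

```python
import sqlite3

def load_variants(conn, variants, keep_source_ids):
    """Insert (source_id, name) rows, keeping or discarding the source IDs."""
    conn.execute(
        "CREATE TABLE taxonvariant (taxonvariant_id INTEGER PRIMARY KEY, fullname TEXT)")
    for source_id, name in variants:
        if keep_source_ids:
            conn.execute("INSERT INTO taxonvariant VALUES (?, ?)", (source_id, name))
        else:
            # Let the database assign its own ID (1, 2, ...).
            conn.execute("INSERT INTO taxonvariant (fullname) VALUES (?)", (name,))

def name_for(conn, variant_id):
    """Look up a variant by ID; None if no such row exists."""
    row = conn.execute(
        "SELECT fullname FROM taxonvariant WHERE taxonvariant_id = ?",
        (variant_id,)).fetchone()
    return row[0] if row else None

variants = [(4242, "Homo sapiens"), (4243, "Pan troglodytes")]  # made-up IDs

kept = sqlite3.connect(":memory:")
load_variants(kept, variants, keep_source_ids=True)

fresh = sqlite3.connect(":memory:")
load_variants(fresh, variants, keep_source_ids=False)

# 4242 is the ID a taxon_labels.tab row would carry for this variant.
kept_name = name_for(kept, 4242)
fresh_name = name_for(fresh, 4242)
```

With regenerated IDs the lookup finds nothing, so the FK could only be filled by matching on names instead. If the real ID columns are backed by a database sequence, that sequence would presumably also need to be advanced past the largest imported ID after the load.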

-- Note that we will probably also do a final Jan10 migration, to deal with data that has been added since early December. But this should not take long. 

-- Finally, of course, keep in mind that the "dev" data is actually "production" data, so be sure to do a pg_dump before running any data migration scripts. 
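
For the backup step, a minimal Python wrapper around pg_dump might look like this. The database name and output file name are placeholders, not the real ones; connection details such as host and user are assumed to come from the usual PostgreSQL environment variables or defaults.

```python
import subprocess

def pg_dump_command(dbname, outfile):
    """Build a pg_dump invocation that writes a custom-format archive."""
    return ["pg_dump", "--format=custom", "--file=" + outfile, dbname]

def backup(dbname, outfile):
    """Run pg_dump and raise CalledProcessError if it exits non-zero."""
    subprocess.run(pg_dump_command(dbname, outfile), check=True)

# Placeholder names; substitute the actual database and a dated file name.
cmd = pg_dump_command("treebasedev", "treebasedev_pre_migration.dump")
```

A custom-format archive can be restored with pg_restore. Calling backup() before the migration scripts, and letting check=True abort on failure, keeps the migration from starting without a dump in hand.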

regards,

Bill