Re: [Treebase-devel] Towards a "pristine" TB instance

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

On Jan 14, 2010, at 4:44 PM, Hilmar Lapp wrote:

> Vladimir - I think you can use the first backup that there is. Or one  
> before the first tests were run.

Maybe the one before the first tests were run would be best. The reasoning being that I have made some ad-hoc changes to the data (but very few -- it not a big deal if those edits were lost). But more importantly, we did make some bigger changes to the data:

- One change we made (some verbiage below about it) occurred around Nov 30 of 2009. We had discovered tons of duplicate taxonlabel records. I wrote a script to prune these, which drops the number of records in taxonlabel from 232,766 to 131,357. (actually, I'm not positive that this script was run on the data -- we could still have the duplicate records -- I think Rutger said he'd run it, but I'm not sure that it happened). 

- Another change we made was to take all migrated studies (i.e. almost all studies currently in TreeBASE2) in and give them a phony "owner" (I think the username "tb1").  These were missing an owner because they were migrated in instead of created by a submitter. The absence of an owner id was causing (I think) a unit test failure. Hence I did a massive update, giving tb1's ID to all submissions with a user_id of NULL. 

I guess this is just to say that we have, at times, been fixing data problems. So the safest backup to resort to is the most recent one that does not include the test stuff. 

bp

Begin forwarded message:

> I slapped together a script that I think works. I've tested it on a synopsis of TB2 that only includes tables that use taxonlabel_id as a FK. The following zip file contains the perl script and the "before" and "after" versions of the TB2 synopsis. You can find it here: 
> 
> http://www.treebase.org/~piel/taxlabels_fix.zip
> 
> You can test it by creating a local database and executing the originalsynops.sql file to populate it, then running the fixtaxonlabels.pl script on it. It should remove all duplicate taxonlabels. The number of records in the taxonlabel table drop from 232,766 to 131,357.
> 
> You'll need to run it on the treebasedev database because I cannot make a backup myself (pg_dump will not work with the username I'm using because the auto-sequences are owned by rvos, which the treebase_app user cannot access).  The fixtaxonlabels.pl script was not written using transactions, so you'll definitely want to have a pg_dump backup handy. 
> 
> I think the steps are the following:
> 
> 1. create an index for the taxonlabel field:
> 
> CREATE INDEX taxonlabel_i ON taxonlabel USING btree (taxonlabel);
> 
> 2. make a backup (e.g. with pg_dump)
> 
> 3. uncomment lines 26-28 and 142 in the fixtaxonlabels.pl  script
> 
> 4. run the fixtaxonlabels.pl script, probably with nohup because it takes about 6 hours on my macbook. 
> 
> I think this should work -- at least I have not noticed any glaring problems. Course, I can't actually test it on real data, only the synopsis.  After running this on treebasedev, we should check that the database has not been corrupted -- any problems and we just drop the database and reload from the pg_dump backup.
> 
> I'm not fully clear why we have a sub_taxonlabel table. At any rate, I ended up deleting all records in sub_taxonlabel that match duplicate records in taxonlabel that require deleting. 
> 
> take care,
> 
> Bill