From: Vladimir G. <vla...@du...> - 2010-01-15 15:56:05
|
Pardon me for being so incremental. I does not look like taxonlabels were pruned - the table has 232770 rows at the moment. So, Bill, could you send me the pruning script, as well as the null- >tb1 script (if you recorded it). --Vladimir On Jan 15, 2010, at 10:28 AM, Vladimir Gapeyev wrote: > Thanks! Then Jon and I will try to find and restore from a backup > between Nov 30 and Dec 8-10, when I may have accidentally ran the > tests while setting up my environment. > --Vladimir > > > On Jan 14, 2010, at 11:14 PM, William Piel wrote: > >> >> On Jan 14, 2010, at 4:44 PM, Hilmar Lapp wrote: >> >>> Vladimir - I think you can use the first backup that there is. Or >>> one >>> before the first tests were run. >> >> Maybe the one before the first tests were run would be best. The >> reasoning being that I have made some ad-hoc changes to the data >> (but very few -- it not a big deal if those edits were lost). But >> more importantly, we did make some bigger changes to the data: >> >> - One change we made (some verbiage below about it) occurred around >> Nov 30 of 2009. We had discovered tons of duplicate taxonlabel >> records. I wrote a script to prune these, which drops the number of >> records in taxonlabel from 232,766 to 131,357. (actually, I'm not >> positive that this script was run on the data -- we could still >> have the duplicate records -- I think Rutger said he'd run it, but >> I'm not sure that it happened). >> >> - Another change we made was to take all migrated studies (i.e. >> almost all studies currently in TreeBASE2) in and give them a phony >> "owner" (I think the username "tb1"). These were missing an owner >> because they were migrated in instead of created by a submitter. >> The absence of an owner id was causing (I think) a unit test >> failure. Hence I did a massive update, giving tb1's ID to all >> submissions with a user_id of NULL. >> >> I guess this is just to say that we have, at times, been fixing >> data problems. So the safest backup to resort to is the most recent >> one that does not include the test stuff. >> >> bp >> >> >> >> >> Begin forwarded message: >> >>> I slapped together a script that I think works. I've tested it on >>> a synopsis of TB2 that only includes tables that use taxonlabel_id >>> as a FK. The following zip file contains the perl script and the >>> "before" and "after" versions of the TB2 synopsis. You can find it >>> here: >>> >>> http://www.treebase.org/~piel/taxlabels_fix.zip >>> >>> You can test it by creating a local database and executing the >>> originalsynops.sql file to populate it, then running the >>> fixtaxonlabels.pl script on it. It should remove all duplicate >>> taxonlabels. The number of records in the taxonlabel table drop >>> from 232,766 to 131,357. >>> >>> You'll need to run it on the treebasedev database because I cannot >>> make a backup myself (pg_dump will not work with the username I'm >>> using because the auto-sequences are owned by rvos, which the >>> treebase_app user cannot access). The fixtaxonlabels.pl script >>> was not written using transactions, so you'll definitely want to >>> have a pg_dump backup handy. >>> >>> I think the steps are the following: >>> >>> 1. create an index for the taxonlabel field: >>> >>> CREATE INDEX taxonlabel_i ON taxonlabel USING btree (taxonlabel); >>> >>> 2. make a backup (e.g. with pg_dump) >>> >>> 3. uncomment lines 26-28 and 142 in the fixtaxonlabels.pl script >>> >>> 4. run the fixtaxonlabels.pl script, probably with nohup because >>> it takes about 6 hours on my macbook. >>> >>> I think this should work -- at least I have not noticed any >>> glaring problems. Course, I can't actually test it on real data, >>> only the synopsis. After running this on treebasedev, we should >>> check that the database has not been corrupted -- any problems and >>> we just drop the database and reload from the pg_dump backup. >>> >>> I'm not fully clear why we have a sub_taxonlabel table. At any >>> rate, I ended up deleting all records in sub_taxonlabel that match >>> duplicate records in taxonlabel that require deleting. >>> >>> take care, >>> >>> Bill >> >> ------------------------------------------------------------------------------ >> Throughout its 18-year history, RSA Conference consistently >> attracts the >> world's best and brightest in the field, creating opportunities for >> Conference >> attendees to learn about information security's most important >> issues through >> interactions with peers, luminaries and emerging and established >> companies. >> http://p.sf.net/sfu/rsaconf-dev2dev_______________________________________________ >> Treebase-devel mailing list >> Tre...@li... >> https://lists.sourceforge.net/lists/listinfo/treebase-devel > > ------------------------------------------------------------------------------ > Throughout its 18-year history, RSA Conference consistently attracts > the > world's best and brightest in the field, creating opportunities for > Conference > attendees to learn about information security's most important > issues through > interactions with peers, luminaries and emerging and established > companies. > http://p.sf.net/sfu/rsaconf-dev2dev_______________________________________________ > Treebase-devel mailing list > Tre...@li... > https://lists.sourceforge.net/lists/listinfo/treebase-devel |