From: Hilmar L. <hl...@ne...> - 2010-03-18 14:28:25
Thanks Rutger! Vladimir - can you make sure these scripts and Rutger's
documentation below get committed to svn?

	-hilmar

On Mar 18, 2010, at 7:30 AM, Rutger Vos wrote:

> Hi all,
>
> sorry about the late response. Here's how it works (to the extent
> that I've managed to understand MJD's code): there is a "check"
> script. This script needs two arguments: a table name (out of which
> MJD's code creates a perl ORM object) and an ID in that table. The
> script then tries to construct the logically expected subtended object
> hierarchy starting from the focal object. Anything unexpected is
> written to STDERR. The most useful way to use this is to say "check
> Study $studyID". What I've done in the past is to dump all study IDs
> to a file "STUDIES", and then run the following shell script:
>
> #!/bin/bash
> # Run the consistency check for every study ID listed in STUDIES
> studies=`cat STUDIES`
> for study in $studies; do
>     check Study $study 2> $study.err
>     # number of lines written to STDERR for this study
>     logfilesize=`wc -l < $study.err`
>     # -gt compares numerically; ">" inside [[ ]] compares strings
>     if [[ $logfilesize -gt 0 ]]
>     then
>         gzip -9 $study.err
>     else
>         rm $study.err
>     fi
> done
>
> This will create a $studyID.err.gz file for every inconsistent study. On
> closer examination of these, most inconsistencies lead back to only a
> handful of problems, mostly related to incomplete repatriation of
> objects from dummy study 22 to their destination study. It's therefore
> more informative to bin the inconsistencies by category as opposed to
> by study. For this, MJD has written a "digester" script.
> Assuming you have a directory full of gzipped study reports, you can
> then run the following shell script to categorize the reports:
>
> #!/bin/bash
> zips=`ls *.gz`
> for zip in $zips; do
>     gunzip $zip
>     # $study.err.gz -> base=$study.err, dir=$study
>     base=`echo $zip | sed -e 's/\.gz$//'`
>     dir=`echo $base | sed -e 's/\.err$//'`
>     # pass only the lines containing a literal '*' to the digester
>     grep '\*' $base | digester -d $dir
>     gzip -9 $base
>     cd $dir
>     logs=`ls *`
>     # append each file in $dir to the same-named file one level up
>     for log in $logs; do
>         cat $log >> ../$log
>     done
>     cd ../
> done
>
> This will create files such as "tree_references_tls_but_its_no", which
> lists the PhyloTree objects that reference TaxonLabelSet X, whereas
> some of its nodes reference a TaxonLabel that is in TaxonLabelSet Y.
> In all these cases, X is still linked to Study 22 (so not repatriated
> correctly) while the individual labels and their Y are in the right
> place.
>
> By the way, the "gc" script is to be ignored. The idea was that this
> would be a garbage collector that could automatically figure out all
> inconsistencies and fix them. MJD never quite completed it and/or
> worked up the confidence and courage to let it loose on a live
> database.
>
> Hope this helps,
>
> Rutger
>
> On Wed, Mar 17, 2010 at 8:43 PM, Vladimir Gapeyev
> <vla...@du...> wrote:
>>
>> On Mar 17, 2010, at 10:29 AM, Vladimir Gapeyev wrote:
>>
>>> On Mar 17, 2010, at 10:05 AM, Hilmar Lapp wrote:
>>>
>>>> Rutger - where do the consistency tests stand (#2899240)?
>>>> Vladimir is going to try to run those which exist, but I'm not
>>>> sure about the coverage - is it enough to give us any confidence?
>>>
>>> To add, these are the only things I detected that I guess might have
>>> relevance to data consistency checking:
>>> treebase-core/src/main/perl/bin/check
>>> treebase-core/src/main/perl/check/check
>>> treebase-core/src/main/perl/lib/CIPRES/TreeBase
>>
>>
>> Here is what I got.
>>
>> The two check scripts are actually the same. The only thing I could
>> get out of them is printing out contents of an object specified by
>> its class/table name and an ID.
>>
>> There is another script, perl/bin/gc. The wiki description for it is
>> "Garbage collector, prints out orphaned objects (e.g. trees without
>> studies), presumably candidates for deletion." A few excerpts from
>> its printout are below -- I am not sure how to interpret them.
>>
>> Anyone in the know, please point me in the correct direction.
>>
>> --Vladimir
>>
>>
>> [vg34@treebasedb-dev ConsistencyChecks]$ perl/bin/gc
>> Database contains 5392 Analysis items
>> Database contains 5397 AnalysisStep items
>> Database contains 12378 AnalyzedData items
>> Database contains 4579 Matrix items
>> Database contains 236604 MatrixRow items
>> Database contains 6613 PhyloTree items
>> Database contains 557909 PhyloTreeNode items
>> Database contains 2454 Study items
>> Database contains 168318 TaxonLabel items
>> S127 8/8
>> S1801 3/3
>> S71 2/2
>> S1648 2/2
>> S1481 2/2
>> S10309 4/4
>> S10122 2/2
>> S1178 4/4
>> ..... // I suspect it prints out *all* the studies
>> * Analysis 4762
>> * Analysis 4764
>> * Analysis 4821
>> * Analysis 4842
>> * AnalysisStep 4821
>> * Matrix 181
>> * Matrix 182
>> * Matrix 183
>> * Matrix 184
>> * Matrix 185
>> * Matrix 186
>> * Matrix 355
>> * Matrix 367
>> * Matrix 990
>> * Matrix 992
>> * Matrix 993
>> * Matrix 994
>> * Matrix 997
>> * Matrix 998
>> * Matrix 999
>> * Matrix 1000
>> * Matrix 1001
>> * Matrix 1617
>> * Matrix 1618
>> * Matrix 1903
>> * Matrix 2146
>> * Matrix 3702
>> * Matrix 4070
>> * Matrix 4110
>> * Matrix 4130
>> * Matrix 4150
>> * Matrix 4227
>> * Matrix 4280
>> * Matrix 4456
>> * Matrix 4528
>> * Matrix 4778
>> * Matrix 4893
>> * MatrixRow 4091
>> * MatrixRow 4092
>> * MatrixRow 4093
>> * MatrixRow 4094
>> * MatrixRow 4095
>> * MatrixRow 4096
>> * MatrixRow 4097
>> .... //Are these the orphans?
>> These are all Analyses and Matrices from the output, but I skip
>> most MatrixRows, as there are many
>> * MatrixRow 234956
>> * MatrixRow 234957
>> * MatrixRow 234958
>> * MatrixRow 234959
>> * PhyloTree 85
>> * PhyloTree 86
>> * PhyloTree 88
>> * PhyloTree 181
>> .... /// It prints out a lot of PhyloTrees, likely all of them
>> * PhyloTree 6978
>> * PhyloTree 6979
>> * PhyloTree 6980
>> * PhyloTree 6981
>> * PhyloTreeNode 76327
>> * PhyloTreeNode 76328
>> * PhyloTreeNode 76329
>> * PhyloTreeNode 76330
>> * PhyloTreeNode 76331
>> * PhyloTreeNode 76332
>> .....
>> * PhyloTreeNode 76488
>> * PhyloTreeNode 76489
>> * PhyloTreeNode 76490
>> * PhyloTreeNode 153706 //a sharp jump
>> * PhyloTreeNode 153707
>> * PhyloTreeNode 153708
>> * PhyloTreeNode 153709
>> * PhyloTreeNode 153710
>> .....
>> * PhyloTreeNode 559205
>> * PhyloTreeNode 559206
>> * PhyloTreeNode 559207
>> * TaxonLabel 1288
>> * TaxonLabel 1289
>> * TaxonLabel 1290
>> * TaxonLabel 1291
>> * TaxonLabel 1292
>> .......
>> * TaxonLabel 276777
>> * TaxonLabel 276778
>> * TaxonLabel 276779
>> * TaxonLabel 276780
>> * TaxonLabel 276781
>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Download Intel® Parallel Studio Eval
>> Try the new software tools for yourself. Speed compiling, find bugs
>> proactively, and fine-tune applications for parallel performance.
>> See why Intel Parallel Studio got high marks during beta.
>> http://p.sf.net/sfu/intel-sw-dev
>> _______________________________________________
>> Treebase-devel mailing list
>> Tre...@li...
>> https://lists.sourceforge.net/lists/listinfo/treebase-devel
>>
>
>
>
> --
> Dr. Rutger A. Vos
> School of Biological Sciences
> Philip Lyle Building, Level 4
> University of Reading
> Reading
> RG6 6BX
> United Kingdom
> Tel: +44 (0) 118 378 7535
> http://www.nexml.org
> http://rutgervos.blogspot.com

--
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  informatics.nescent.org :
===========================================================