Re: [Treebase-devel] Consistency tests...

Thanks Rutger! Vladimir - can you make sure these scripts and Rutger's  
documentation below get committed to svn?

	-hilmar

On Mar 18, 2010, at 7:30 AM, Rutger Vos wrote:

> Hi all,
>
> Sorry about the late response. Here's how it works (to the extent
> that I've managed to understand MJD's code): there is a "check"
> script. This script needs two arguments: a table name (out of which
> MJD's code creates a Perl ORM object) and an ID in that table. The
> script then tries to construct the logically expected subtended object
> hierarchy starting from the focal object. Anything unexpected is
> written to STDERR. The most useful way to use this is to say "check
> Study $studyID". What I've done in the past is to dump all study IDs
> to a file "STUDIES", and then run the following shell script:
>
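> #!/bin/bash
> # Run "check" on every study ID listed in STUDIES, keeping only non-empty logs.
> studies=$(cat STUDIES)
> for study in $studies; do
> 	check Study "$study" 2> "$study.err"
> 	# Count the lines written to STDERR for this study.
> 	logfilesize=$(wc -l < "$study.err")
> 	# -gt compares numerically; ">" inside [[ ]] compares strings.
> 	if [[ $logfilesize -gt 0 ]]
> 	then
> 		gzip -9 "$study.err"
> 	else
> 		rm "$study.err"
> 	fi
> done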
>
> This will create a $studyID.err.gz file for every inconsistent study. On
> closer examination of these, most inconsistencies lead back to only a
> handful of problems, mostly related to incomplete repatriation of
> objects from dummy study 22 to their destination study. It's therefore
> more informative to bin the inconsistencies by category as opposed to
> by study. For this, MJD has written a "digester" script. Assuming you
> have a directory full of gzipped study reports, you can then run the
> following shell script to categorize the reports:
>
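> #!/bin/bash
> # Bin the flagged lines of every gzipped study report by category.
> for zip in *.gz; do
> 	gunzip "$zip"
> 	base=${zip%.gz}      # strip .gz, leaving $studyID.err
> 	dir=${base%.err}     # strip .err, leaving $studyID
> 	grep '\*' "$base" | digester -d "$dir"
> 	gzip -9 "$base"
> 	# Append each per-study category file to the combined file one level up.
> 	for log in "$dir"/*; do
> 		[[ -e $log ]] || continue
> 		cat "$log" >> "$(basename "$log")"
> 	done
> done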
>
> This will create files such as "tree_references_tls_but_its_no", which
> lists the PhyloTree objects that reference TaxonLabelSet X, whereas
> some of its nodes reference a TaxonLabel that is in TaxonLabelSet Y.
> In all these cases, X is still linked to Study 22 (so not repatriated
> correctly) while the individual labels and their Y are in the right
> place.
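
Once the combined category files exist, counting the lines in each gives a quick ranking of which problems dominate. The following is only a sketch under assumptions that are not stated in the thread: the function name rank_categories is made up, the category files are assumed to be regular files with one entry per line, and anything ending in .gz is assumed to be a study report. Other plain files in the same directory (STUDIES, the scripts themselves) would be counted too, so it is best run in a directory that holds only the reports and their output.

```shell
#!/bin/bash
# Hypothetical helper: print "<line count><TAB><category file>" for every
# plain file in a directory, largest category first.
rank_categories() {
	local dir=${1:-.}
	local f
	for f in "$dir"/*; do
		[[ -f $f ]] || continue        # skip the per-study directories
		[[ $f == *.gz ]] && continue   # skip the compressed study reports
		# $(( ... )) strips the padding some wc implementations add.
		printf '%d\t%s\n' "$(( $(wc -l < "$f") ))" "$(basename "$f")"
	done | sort -rn
}
```

Calling rank_categories with no argument ranks the files in the current directory.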
>
> By the way, the "gc" script is to be ignored. The idea was that this
> would be a garbage collector that could automatically figure out all
> inconsistencies and fix them. MJD never quite completed it and/or
> worked up the confidence and courage to let it loose on a live
> database.
>
> Hope this helps,
>
> Rutger
>
> On Wed, Mar 17, 2010 at 8:43 PM, Vladimir Gapeyev
> <vla...@du...> wrote:
>>
>> On Mar 17, 2010, at 10:29 AM, Vladimir Gapeyev wrote:
>>
>>> On Mar 17, 2010, at 10:05 AM, Hilmar Lapp wrote:
>>>
>>>> Rutger - where do the consistency tests stand (#2899240)? Vladimir is
>>>> going to try to run those which exist, but I'm not sure about the
>>>> coverage - is it enough to give us any confidence?
>>>
>>> To add, these are the only things I detected that I guess might have
>>> relevance to data consistency checking:
>>> treebase-core/src/main/perl/bin/check
>>> treebase-core/src/main/perl/check/check
>>> treebase-core/src/main/perl/lib/CIPRES/TreeBase
>>
>>
>> Here is what I got.
>>
>> The two check scripts are actually the same.  The only thing I could
>> get out of them is printing out the contents of an object specified by
>> its class/table name and an ID.
>>
>> There is another script, perl/bin/gc.  The wiki description for it is
>> "Garbage collector, prints out orphaned objects (e.g. trees without
>> studies), presumably candidates for deletion."  A few excerpts from
>> its printout are below -- I am not sure how to interpret them.
>>
>> Anyone in the know, please point me in the correct direction.
>>
>> --Vladimir
>>
>>
>> [vg34@treebasedb-dev ConsistencyChecks]$ perl/bin/gc
>> Database contains 5392 Analysis items
>> Database contains 5397 AnalysisStep items
>> Database contains 12378 AnalyzedData items
>> Database contains 4579 Matrix items
>> Database contains 236604 MatrixRow items
>> Database contains 6613 PhyloTree items
>> Database contains 557909 PhyloTreeNode items
>> Database contains 2454 Study items
>> Database contains 168318 TaxonLabel items
>> S127 8/8
>> S1801 3/3
>> S71 2/2
>> S1648 2/2
>> S1481 2/2
>> S10309 4/4
>> S10122 2/2
>> S1178 4/4
>> .....    // I suspect it prints out *all* the studies
>> * Analysis 4762
>> * Analysis 4764
>> * Analysis 4821
>> * Analysis 4842
>> * AnalysisStep 4821
>> * Matrix 181
>> * Matrix 182
>> * Matrix 183
>> * Matrix 184
>> * Matrix 185
>> * Matrix 186
>> * Matrix 355
>> * Matrix 367
>> * Matrix 990
>> * Matrix 992
>> * Matrix 993
>> * Matrix 994
>> * Matrix 997
>> * Matrix 998
>> * Matrix 999
>> * Matrix 1000
>> * Matrix 1001
>> * Matrix 1617
>> * Matrix 1618
>> * Matrix 1903
>> * Matrix 2146
>> * Matrix 3702
>> * Matrix 4070
>> * Matrix 4110
>> * Matrix 4130
>> * Matrix 4150
>> * Matrix 4227
>> * Matrix 4280
>> * Matrix 4456
>> * Matrix 4528
>> * Matrix 4778
>> * Matrix 4893
>> * MatrixRow 4091
>> * MatrixRow 4092
>> * MatrixRow 4093
>> * MatrixRow 4094
>> * MatrixRow 4095
>> * MatrixRow 4096
>> * MatrixRow 4097
>> ....     // Are these the orphans?  The Analyses and Matrices above are
>> // all of those in the output, but I skip most MatrixRows, as there are many
>> * MatrixRow 234956
>> * MatrixRow 234957
>> * MatrixRow 234958
>> * MatrixRow 234959
>> * PhyloTree 85
>> * PhyloTree 86
>> * PhyloTree 88
>> * PhyloTree 181
>> .... ///    It prints out a lot of PhyloTrees, likely all of them
>> * PhyloTree 6978
>> * PhyloTree 6979
>> * PhyloTree 6980
>> * PhyloTree 6981
>> * PhyloTreeNode 76327
>> * PhyloTreeNode 76328
>> * PhyloTreeNode 76329
>> * PhyloTreeNode 76330
>> * PhyloTreeNode 76331
>> * PhyloTreeNode 76332
>> .....
>> * PhyloTreeNode 76488
>> * PhyloTreeNode 76489
>> * PhyloTreeNode 76490
>> * PhyloTreeNode 153706        //a sharp jump
>> * PhyloTreeNode 153707
>> * PhyloTreeNode 153708
>> * PhyloTreeNode 153709
>> * PhyloTreeNode 153710
>> .....
>> * PhyloTreeNode 559205
>> * PhyloTreeNode 559206
>> * PhyloTreeNode 559207
>> * TaxonLabel 1288
>> * TaxonLabel 1289
>> * TaxonLabel 1290
>> * TaxonLabel 1291
>> * TaxonLabel 1292
>> .......
>> * TaxonLabel 276777
>> * TaxonLabel 276778
>> * TaxonLabel 276779
>> * TaxonLabel 276780
>> * TaxonLabel 276781
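
One way to test the suspicion that some types are printed in full is to tally the starred lines per type and set them next to the "Database contains N ... items" totals. This is a rough sketch, not part of the TreeBASE scripts: the function name tally_gc_flags and the file name gc.out are made up, and it assumes the complete printout was saved to a file (the excerpt above is elided and annotated, so it would not give meaningful counts).

```shell
#!/bin/bash
# Hypothetical helper: for each object type with "* Type ID" lines in a saved
# gc printout, print "<type><TAB><starred count><TAB><database total>".
# The total is "?" when no "Database contains" line exists for that type.
tally_gc_flags() {
	awk '
		/^Database contains/ { total[$4] = $3 }
		/^\* /               { flagged[$2]++ }
		END {
			for (t in flagged)
				printf "%s\t%d\t%s\n", t, flagged[t], (t in total ? total[t] : "?")
		}
	' "$1" | sort
}
```

For example, after perl/bin/gc > gc.out, running tally_gc_flags gc.out shows at a glance whether the starred count for a type equals its database total.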
>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Download Intel® Parallel Studio Eval
>> Try the new software tools for yourself. Speed compiling, find bugs
>> proactively, and fine-tune applications for parallel performance.
>> See why Intel Parallel Studio got high marks during beta.
>> http://p.sf.net/sfu/intel-sw-dev
>> _______________________________________________
>> Treebase-devel mailing list
>> Tre...@li...
>> https://lists.sourceforge.net/lists/listinfo/treebase-devel
>>
>
>
>
> -- 
> Dr. Rutger A. Vos
> School of Biological Sciences
> Philip Lyle Building, Level 4
> University of Reading
> Reading
> RG6 6BX
> United Kingdom
> Tel: +44 (0) 118 378 7535
> http://www.nexml.org
> http://rutgervos.blogspot.com
>

-- 
===========================================================
: Hilmar Lapp  -:- Durham, NC -:- informatics.nescent.org :
===========================================================