Re: [Treebase-devel] Consistency tests...


Hi all,

sorry about the late response. Here's how it works (to the extent
that I've managed to understand MJD's code): there is a "check"
script. This script needs two arguments: a table name (out of which
MJD's code creates a Perl ORM object) and an ID in that table. The
script then tries to construct the logically expected subtended object
hierarchy starting from the focal object. Anything unexpected is
written to STDERR. The most useful way to use this is to say "check
Study $studyID". What I've done in the past is to dump all study IDs
to a file "STUDIES", and then run the following shell script:

#!/bin/bash
# run "check" on every study ID listed in STUDIES
for study in $(cat STUDIES); do
	check Study "$study" 2> "$study.err"
	# keep a (compressed) log only if check wrote something to STDERR
	if [[ -s $study.err ]]
	then
		gzip -9 "$study.err"
	else
		rm "$study.err"
	fi
done

This will create a $studyID.err.gz file for every inconsistent study. On
closer examination of these, most inconsistencies lead back to only a
handful of problems, mostly related to incomplete repatriation of
objects from dummy study 22 to their destination study. It's therefore
more informative to bin the inconsistencies by category as opposed to
by study. For this, MJD has written a "digester" script. Assuming you
have a directory full of gzipped study reports, you can then run the
following shell script to categorize the reports:

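#!/bin/bash
shopt -s nullglob            # loops are skipped when nothing matches
for zip in *.gz; do
	gunzip "$zip"
	base=${zip%.gz}      # $studyID.err
	dir=${base%.err}     # $studyID
	# only the lines containing an asterisk are fed to the digester
	grep '\*' "$base" | digester -d "$dir"
	gzip -9 "$base"
	# append each per-study category file to the combined file of the
	# same name in the current directory
	for log in "$dir"/*; do
		cat "$log" >> "$(basename "$log")"
	done
done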

This will create files such as "tree_references_tls_but_its_no", which
lists the PhyloTree objects that reference TaxonLabelSet X while some
of their nodes reference a TaxonLabel that is in TaxonLabelSet Y. In
all these cases, X is still linked to Study 22 (so not repatriated
correctly) while the individual labels and their Y are in the right
place.
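
To see at a glance which categories account for most of the reports,
something along these lines might help. It is only a sketch: it
assumes the combined category files hold one line per reported object
and sit in the directory where the script above was run. The function
name "summarize_categories" is made up for illustration and is not
part of MJD's code. Any file other than the *.gz and *.err reports and
the STUDIES list gets counted, so keep the scripts themselves
elsewhere.

```shell
#!/bin/bash
# Sketch only: rank category files by their number of lines.
# summarize_categories is a hypothetical helper, not part of the
# TreeBASE code base.
summarize_categories() {
	local dir=${1:-.}
	local f count
	for f in "$dir"/*; do
		# skip directories (e.g. the per-study digester output)
		[ -f "$f" ] || continue
		# skip the study reports and the list of study IDs
		case "$f" in
			*.gz|*.err|*/STUDIES) continue ;;
		esac
		count=$(wc -l < "$f")
		# $((count)) strips the padding some wc versions add
		printf '%s\t%s\n' "$((count))" "$(basename "$f")"
	done | sort -rn
}
```

Calling "summarize_categories ." then prints one line per category
file, largest first, with the line count and the file name separated
by a tab.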

By the way, the "gc" script is to be ignored. The idea was that this
would be a garbage collector that could automatically figure out all
inconsistencies and fix them. MJD never quite completed it and/or
worked up the confidence and courage to let it loose on a live
database.

Hope this helps,

Rutger

On Wed, Mar 17, 2010 at 8:43 PM, Vladimir Gapeyev
<vla...@du...> wrote:
>
> On Mar 17, 2010, at 10:29 AM, Vladimir Gapeyev wrote:
>
>> On Mar 17, 2010, at 10:05 AM, Hilmar Lapp wrote:
>>
>>> Rutger - where do the consistency tests stand (#2899240)? Vladimir is
>>> going to try to run those which exist, but I'm not sure about the
>>> coverage - is it enough to give us any confidence?
>>
>> To add, these are the only things I detected that I guess might have
>> relevance to data consistency checking:
>> treebase-core/src/main/perl/bin/check
>> treebase-core/src/main/perl/check/check
>> treebase-core/src/main/perl/lib/CIPRES/TreeBase
>
>
> Here is what I got.
>
> The two check scripts are actually the same.  The only thing I could
> get out of them is printing out contents of an object specified by its
> class/table name and an ID.
>
> There is another script, perl/bin/gc.  The wiki description for it is
> "Garbage collector, prints out orphaned objects (e.g. trees without
> studies), presumably candidates for deletion."  A few excerpts from
> its printout are below -- I am not sure how to interpret them.
>
> Anyone in the know, please point me in the correct direction.
>
> --Vladimir
>
>
> [vg34@treebasedb-dev ConsistencyChecks]$ perl/bin/gc
> Database contains 5392 Analysis items
> Database contains 5397 AnalysisStep items
> Database contains 12378 AnalyzedData items
> Database contains 4579 Matrix items
> Database contains 236604 MatrixRow items
> Database contains 6613 PhyloTree items
> Database contains 557909 PhyloTreeNode items
> Database contains 2454 Study items
> Database contains 168318 TaxonLabel items
> S127 8/8
> S1801 3/3
> S71 2/2
> S1648 2/2
> S1481 2/2
> S10309 4/4
> S10122 2/2
> S1178 4/4
> .....    // I suspect it prints out *all* the studies
> * Analysis 4762
> * Analysis 4764
> * Analysis 4821
> * Analysis 4842
> * AnalysisStep 4821
> * Matrix 181
> * Matrix 182
> * Matrix 183
> * Matrix 184
> * Matrix 185
> * Matrix 186
> * Matrix 355
> * Matrix 367
> * Matrix 990
> * Matrix 992
> * Matrix 993
> * Matrix 994
> * Matrix 997
> * Matrix 998
> * Matrix 999
> * Matrix 1000
> * Matrix 1001
> * Matrix 1617
> * Matrix 1618
> * Matrix 1903
> * Matrix 2146
> * Matrix 3702
> * Matrix 4070
> * Matrix 4110
> * Matrix 4130
> * Matrix 4150
> * Matrix 4227
> * Matrix 4280
> * Matrix 4456
> * Matrix 4528
> * Matrix 4778
> * Matrix 4893
> * MatrixRow 4091
> * MatrixRow 4092
> * MatrixRow 4093
> * MatrixRow 4094
> * MatrixRow 4095
> * MatrixRow 4096
> * MatrixRow 4097
> ....     //Are these the orphans?  These are all Analyses and Matrices
> from the output, but I skip most MatrixRows, as there are many
> * MatrixRow 234956
> * MatrixRow 234957
> * MatrixRow 234958
> * MatrixRow 234959
> * PhyloTree 85
> * PhyloTree 86
> * PhyloTree 88
> * PhyloTree 181
> .... ///    It prints out a lot of PhyloTrees, likely all of them
> * PhyloTree 6978
> * PhyloTree 6979
> * PhyloTree 6980
> * PhyloTree 6981
> * PhyloTreeNode 76327
> * PhyloTreeNode 76328
> * PhyloTreeNode 76329
> * PhyloTreeNode 76330
> * PhyloTreeNode 76331
> * PhyloTreeNode 76332
> .....
> * PhyloTreeNode 76488
> * PhyloTreeNode 76489
> * PhyloTreeNode 76490
> * PhyloTreeNode 153706        //a sharp jump
> * PhyloTreeNode 153707
> * PhyloTreeNode 153708
> * PhyloTreeNode 153709
> * PhyloTreeNode 153710
> .....
> * PhyloTreeNode 559205
> * PhyloTreeNode 559206
> * PhyloTreeNode 559207
> * TaxonLabel 1288
> * TaxonLabel 1289
> * TaxonLabel 1290
> * TaxonLabel 1291
> * TaxonLabel 1292
> .......
> * TaxonLabel 276777
> * TaxonLabel 276778
> * TaxonLabel 276779
> * TaxonLabel 276780
> * TaxonLabel 276781
>
> _______________________________________________
> Treebase-devel mailing list
> Tre...@li...
> https://lists.sourceforge.net/lists/listinfo/treebase-devel
>

-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com