Re: [Treebase-devel] Some data problems

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

I think this came about because of an older, incorrect, implementation
of how taxonlabels need to be generated when new data is submitted.
For some time the "logic" was that if a newly submitted string already
exists in the taxonlabel table we re-use that. Obviously this is
incorrect behaviour - the label objects must be scoped within the
study - but it's left some scrambled records which unfortunately need
to be disentangled.

On Fri, Jan 29, 2010 at 5:28 AM, William Piel <wil...@ya...> wrote:
> I've been looking at some problems with our data. One problem is that some taxonlabel records are mapping to more than one study. This should be illegal -- i.e. we are supposed to have a system where taxonlabel records are scoped to the study, not to the database: although a taxonlabel may be shared by several trees and/or several matrices, these trees and matrices cannot belong to different studies.
>
> Using this query:
>
> SELECT DISTINCT pt.study_id
> FROM taxonlabel tl JOIN phylotreenode ptn ON (tl.taxonlabel_id = ptn.taxonlabel_id)
> JOIN phylotree pt ON (ptn.phylotree_id = pt.phylotree_id)
> WHERE tl.study_id <> pt.study_id
> ORDER BY pt.study_id
>
> result: 235, 236, 237, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 412, 912, 1001, 1041, 1224, 1652, 1922, 2369, 2374, 2377
>
> So I find that there are 27 studies that have, in some capacity, sharing 450 or so taxonlabels by way of their trees. This is only 1% of our data, so that's not too bad, but it needs to be investigated. I took a look at them: 235 through 252 are made up of submission that lack a matrix. It's possible that the migration script messes up when it encounters missing matrices. On the other hand, 412, 912, 1001 [...] 1922  seem to look normal (no missing matrix), while 2369, 2374, and 2377 are missing altogether (studies don't exist).
>
> Next, I'll look at the same taxonlabel shared by two or more taxonlabelset records that belong to two different studies:
>
> SELECT DISTINCT tls.study_id
> FROM taxonlabel tl JOIN taxonlabelset_taxonlabel tlstl ON (tl.taxonlabel_id = tlstl.taxonlabel_id)
> JOIN taxonlabelset tls ON (tls.taxonlabelset_id = tlstl.taxonlabelset_id)
> WHERE tl.study_id <> tls.study_id
> ORDER BY tls.study_id
>
> result: 22, 2264.
>
> So, only 2 studies. 22 was used for testing; 2264 is missing (study doesn't exist).
>
> Now I'll do the same, but this time by way of the matrix:
>
> SELECT DISTINCT mx.study_id
> FROM taxonlabel tl JOIN matrixrow mxr ON (tl.taxonlabel_id = mxr.taxonlabel_id)
> JOIN matrix mx ON (mxr.matrix_id = mx.matrix_id)
> WHERE tl.study_id <> mx.study_id
> ORDER BY mx.study_id
>
> result: 2369, 2374
>
> Again, two studies_ids for missing studies.
>
> So it seems to me that some cases are linked to instances of studies that lack matrices; other cases have nothing distinctive; and finally, in cases of deleted studies, various orphaned records are still around with mixed up study_ids.  One solution is to delete all these studies and re-upload them manually, but I guess we need to first get the sequencing sorted out before we add new studies.
>
> bp
>
>
>
>
>
>
>
>
>
> ------------------------------------------------------------------------------
> The Planet: dedicated and managed hosting, cloud storage, colocation
> Stay online with enterprise data centers and the best network in the business
> Choose flexible plans and management services without long-term contracts
> Personal 24x7 support from experience hosting pros just a phone call away.
> http://p.sf.net/sfu/theplanet-com
> _______________________________________________
> Treebase-devel mailing list
> Tre...@li...
> https://lists.sourceforge.net/lists/listinfo/treebase-devel
>

-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com