From: Rutger V. <rut...@gm...> - 2010-01-29 16:42:25
|
I think this came about because of an older, incorrect, implementation of how taxonlabels need to be generated when new data is submitted. For some time the "logic" was that if a newly submitted string already exists in the taxonlabel table we re-use that. Obviously this is incorrect behaviour - the label objects must be scoped within the study - but it's left some scrambled records which unfortunately need to be disentangled. On Fri, Jan 29, 2010 at 5:28 AM, William Piel <wil...@ya...> wrote: > I've been looking at some problems with our data. One problem is that some taxonlabel records are mapping to more than one study. This should be illegal -- i.e. we are supposed to have a system where taxonlabel records are scoped to the study, not to the database: although a taxonlabel may be shared by several trees and/or several matrices, these trees and matrices cannot belong to different studies. > > Using this query: > > SELECT DISTINCT pt.study_id > FROM taxonlabel tl JOIN phylotreenode ptn ON (tl.taxonlabel_id = ptn.taxonlabel_id) > JOIN phylotree pt ON (ptn.phylotree_id = pt.phylotree_id) > WHERE tl.study_id <> pt.study_id > ORDER BY pt.study_id > > result: 235, 236, 237, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 412, 912, 1001, 1041, 1224, 1652, 1922, 2369, 2374, 2377 > > So I find that there are 27 studies that have, in some capacity, sharing 450 or so taxonlabels by way of their trees. This is only 1% of our data, so that's not too bad, but it needs to be investigated. I took a look at them: 235 through 252 are made up of submission that lack a matrix. It's possible that the migration script messes up when it encounters missing matrices. On the other hand, 412, 912, 1001 [...] 1922 seem to look normal (no missing matrix), while 2369, 2374, and 2377 are missing altogether (studies don't exist). > > Next, I'll look at the same taxonlabel shared by two or more taxonlabelset records that belong to two different studies: > > SELECT DISTINCT tls.study_id > FROM taxonlabel tl JOIN taxonlabelset_taxonlabel tlstl ON (tl.taxonlabel_id = tlstl.taxonlabel_id) > JOIN taxonlabelset tls ON (tls.taxonlabelset_id = tlstl.taxonlabelset_id) > WHERE tl.study_id <> tls.study_id > ORDER BY tls.study_id > > result: 22, 2264. > > So, only 2 studies. 22 was used for testing; 2264 is missing (study doesn't exist). > > Now I'll do the same, but this time by way of the matrix: > > SELECT DISTINCT mx.study_id > FROM taxonlabel tl JOIN matrixrow mxr ON (tl.taxonlabel_id = mxr.taxonlabel_id) > JOIN matrix mx ON (mxr.matrix_id = mx.matrix_id) > WHERE tl.study_id <> mx.study_id > ORDER BY mx.study_id > > result: 2369, 2374 > > Again, two studies_ids for missing studies. > > So it seems to me that some cases are linked to instances of studies that lack matrices; other cases have nothing distinctive; and finally, in cases of deleted studies, various orphaned records are still around with mixed up study_ids. One solution is to delete all these studies and re-upload them manually, but I guess we need to first get the sequencing sorted out before we add new studies. > > bp > > > > > > > > > > ------------------------------------------------------------------------------ > The Planet: dedicated and managed hosting, cloud storage, colocation > Stay online with enterprise data centers and the best network in the business > Choose flexible plans and management services without long-term contracts > Personal 24x7 support from experience hosting pros just a phone call away. > http://p.sf.net/sfu/theplanet-com > _______________________________________________ > Treebase-devel mailing list > Tre...@li... > https://lists.sourceforge.net/lists/listinfo/treebase-devel > -- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com |