From: William P. <wil...@ya...> - 2010-01-29 04:29:03
|
I've been looking at some problems with our data. One problem is that some taxonlabel records are mapping to more than one study. This should be illegal -- i.e. we are supposed to have a system where taxonlabel records are scoped to the study, not to the database: although a taxonlabel may be shared by several trees and/or several matrices, these trees and matrices cannot belong to different studies. Using this query: SELECT DISTINCT pt.study_id FROM taxonlabel tl JOIN phylotreenode ptn ON (tl.taxonlabel_id = ptn.taxonlabel_id) JOIN phylotree pt ON (ptn.phylotree_id = pt.phylotree_id) WHERE tl.study_id <> pt.study_id ORDER BY pt.study_id result: 235, 236, 237, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 412, 912, 1001, 1041, 1224, 1652, 1922, 2369, 2374, 2377 So I find that there are 27 studies that have, in some capacity, sharing 450 or so taxonlabels by way of their trees. This is only 1% of our data, so that's not too bad, but it needs to be investigated. I took a look at them: 235 through 252 are made up of submission that lack a matrix. It's possible that the migration script messes up when it encounters missing matrices. On the other hand, 412, 912, 1001 [...] 1922 seem to look normal (no missing matrix), while 2369, 2374, and 2377 are missing altogether (studies don't exist). Next, I'll look at the same taxonlabel shared by two or more taxonlabelset records that belong to two different studies: SELECT DISTINCT tls.study_id FROM taxonlabel tl JOIN taxonlabelset_taxonlabel tlstl ON (tl.taxonlabel_id = tlstl.taxonlabel_id) JOIN taxonlabelset tls ON (tls.taxonlabelset_id = tlstl.taxonlabelset_id) WHERE tl.study_id <> tls.study_id ORDER BY tls.study_id result: 22, 2264. So, only 2 studies. 22 was used for testing; 2264 is missing (study doesn't exist). Now I'll do the same, but this time by way of the matrix: SELECT DISTINCT mx.study_id FROM taxonlabel tl JOIN matrixrow mxr ON (tl.taxonlabel_id = mxr.taxonlabel_id) JOIN matrix mx ON (mxr.matrix_id = mx.matrix_id) WHERE tl.study_id <> mx.study_id ORDER BY mx.study_id result: 2369, 2374 Again, two studies_ids for missing studies. So it seems to me that some cases are linked to instances of studies that lack matrices; other cases have nothing distinctive; and finally, in cases of deleted studies, various orphaned records are still around with mixed up study_ids. One solution is to delete all these studies and re-upload them manually, but I guess we need to first get the sequencing sorted out before we add new studies. bp |
From: Rutger V. <rut...@gm...> - 2010-01-29 16:42:25
|
I think this came about because of an older, incorrect, implementation of how taxonlabels need to be generated when new data is submitted. For some time the "logic" was that if a newly submitted string already exists in the taxonlabel table we re-use that. Obviously this is incorrect behaviour - the label objects must be scoped within the study - but it's left some scrambled records which unfortunately need to be disentangled. On Fri, Jan 29, 2010 at 5:28 AM, William Piel <wil...@ya...> wrote: > I've been looking at some problems with our data. One problem is that some taxonlabel records are mapping to more than one study. This should be illegal -- i.e. we are supposed to have a system where taxonlabel records are scoped to the study, not to the database: although a taxonlabel may be shared by several trees and/or several matrices, these trees and matrices cannot belong to different studies. > > Using this query: > > SELECT DISTINCT pt.study_id > FROM taxonlabel tl JOIN phylotreenode ptn ON (tl.taxonlabel_id = ptn.taxonlabel_id) > JOIN phylotree pt ON (ptn.phylotree_id = pt.phylotree_id) > WHERE tl.study_id <> pt.study_id > ORDER BY pt.study_id > > result: 235, 236, 237, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 412, 912, 1001, 1041, 1224, 1652, 1922, 2369, 2374, 2377 > > So I find that there are 27 studies that have, in some capacity, sharing 450 or so taxonlabels by way of their trees. This is only 1% of our data, so that's not too bad, but it needs to be investigated. I took a look at them: 235 through 252 are made up of submission that lack a matrix. It's possible that the migration script messes up when it encounters missing matrices. On the other hand, 412, 912, 1001 [...] 1922 seem to look normal (no missing matrix), while 2369, 2374, and 2377 are missing altogether (studies don't exist). > > Next, I'll look at the same taxonlabel shared by two or more taxonlabelset records that belong to two different studies: > > SELECT DISTINCT tls.study_id > FROM taxonlabel tl JOIN taxonlabelset_taxonlabel tlstl ON (tl.taxonlabel_id = tlstl.taxonlabel_id) > JOIN taxonlabelset tls ON (tls.taxonlabelset_id = tlstl.taxonlabelset_id) > WHERE tl.study_id <> tls.study_id > ORDER BY tls.study_id > > result: 22, 2264. > > So, only 2 studies. 22 was used for testing; 2264 is missing (study doesn't exist). > > Now I'll do the same, but this time by way of the matrix: > > SELECT DISTINCT mx.study_id > FROM taxonlabel tl JOIN matrixrow mxr ON (tl.taxonlabel_id = mxr.taxonlabel_id) > JOIN matrix mx ON (mxr.matrix_id = mx.matrix_id) > WHERE tl.study_id <> mx.study_id > ORDER BY mx.study_id > > result: 2369, 2374 > > Again, two studies_ids for missing studies. > > So it seems to me that some cases are linked to instances of studies that lack matrices; other cases have nothing distinctive; and finally, in cases of deleted studies, various orphaned records are still around with mixed up study_ids. One solution is to delete all these studies and re-upload them manually, but I guess we need to first get the sequencing sorted out before we add new studies. > > bp > > > > > > > > > > ------------------------------------------------------------------------------ > The Planet: dedicated and managed hosting, cloud storage, colocation > Stay online with enterprise data centers and the best network in the business > Choose flexible plans and management services without long-term contracts > Personal 24x7 support from experience hosting pros just a phone call away. > http://p.sf.net/sfu/theplanet-com > _______________________________________________ > Treebase-devel mailing list > Tre...@li... > https://lists.sourceforge.net/lists/listinfo/treebase-devel > -- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com |
From: Hilmar L. <hl...@ne...> - 2010-01-29 17:01:24
|
On Jan 28, 2010, at 11:28 PM, William Piel wrote: > SELECT DISTINCT tls.study_id > FROM taxonlabel tl JOIN taxonlabelset_taxonlabel tlstl ON > (tl.taxonlabel_id = tlstl.taxonlabel_id) > JOIN taxonlabelset tls ON (tls.taxonlabelset_id = > tlstl.taxonlabelset_id) > WHERE tl.study_id <> tls.study_id > ORDER BY tls.study_id > > result: 22, 2264. > > So, only 2 studies. 22 was used for testing; 2264 is missing (study > doesn't exist). > > Now I'll do the same, but this time by way of the matrix: > > SELECT DISTINCT mx.study_id > FROM taxonlabel tl JOIN matrixrow mxr ON (tl.taxonlabel_id = > mxr.taxonlabel_id) > JOIN matrix mx ON (mxr.matrix_id = mx.matrix_id) > WHERE tl.study_id <> mx.study_id > ORDER BY mx.study_id > > result: 2369, 2374 > > Again, two studies_ids for missing studies. Does this mean that there are dangling foreign keys? Are the foreign key constraints not enforced by the database? I guess that need not necessarily be resolved before the release, but it seems to me that there is a major data cleanup effort necessary at some point, combined with enforcing all possible constraints in the database. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org : =========================================================== |
From: William P. <wil...@ya...> - 2010-01-29 19:55:32
|
On Jan 29, 2010, at 12:01 PM, Hilmar Lapp wrote: > I guess that need not necessarily be resolved before the release, but it seems to me that there is a major data cleanup effort necessary at some point, combined with enforcing all possible constraints in the database. I agree that this cleanup can be done after release -- there are so few studies in question that this can be done by hand. The main thing to check before release is that there are no current bugs that are creating this mis-wiring. Once we have a working sequence and the boolean-bug is fixed, I'll upload a bunch of test datasets and then rerun these queries to check that no new mis-wiring is happening. One thing I noticed is that when I upload a tree to a submission, and then delete the tree, the taxon labels are not deleted. Over time, this will created a clutter of taxon labels that don't belong to any tree or matrix. Should we call this a bug? If so, fixing it will require some effort: upon deletion of a tree or a matrix, the software must know to delete any related taxon labels unless they are being used by another matrix, rowsegment, or tree that still exists in the study. bp |
From: Hilmar L. <hl...@ne...> - 2010-01-29 21:47:13
|
On Jan 29, 2010, at 2:55 PM, William Piel wrote: > upon deletion of a tree or a matrix, the software must know to > delete any related taxon labels unless they are being used by > another matrix, rowsegment, or tree that still exists in the study. If my understanding is correct that foreign key enforcement isn't going to take care of this (due to n:n relationships), this can be done by regularly and automatically run maintenance scripts. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org : =========================================================== |