[Treebase-devel] Some data problems

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

I've been looking at some problems with our data. One problem is that some taxonlabel records are mapping to more than one study. This should be illegal -- i.e. we are supposed to have a system where taxonlabel records are scoped to the study, not to the database: although a taxonlabel may be shared by several trees and/or several matrices, these trees and matrices cannot belong to different studies. 

Using this query:

SELECT DISTINCT pt.study_id
FROM taxonlabel tl JOIN phylotreenode ptn ON (tl.taxonlabel_id = ptn.taxonlabel_id)
JOIN phylotree pt ON (ptn.phylotree_id = pt.phylotree_id)
WHERE tl.study_id <> pt.study_id
ORDER BY pt.study_id

result: 235, 236, 237, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 412, 912, 1001, 1041, 1224, 1652, 1922, 2369, 2374, 2377

So I find that there are 27 studies that have, in some capacity, sharing 450 or so taxonlabels by way of their trees. This is only 1% of our data, so that's not too bad, but it needs to be investigated. I took a look at them: 235 through 252 are made up of submission that lack a matrix. It's possible that the migration script messes up when it encounters missing matrices. On the other hand, 412, 912, 1001 [...] 1922  seem to look normal (no missing matrix), while 2369, 2374, and 2377 are missing altogether (studies don't exist). 

Next, I'll look at the same taxonlabel shared by two or more taxonlabelset records that belong to two different studies:

SELECT DISTINCT tls.study_id
FROM taxonlabel tl JOIN taxonlabelset_taxonlabel tlstl ON (tl.taxonlabel_id = tlstl.taxonlabel_id)
JOIN taxonlabelset tls ON (tls.taxonlabelset_id = tlstl.taxonlabelset_id)
WHERE tl.study_id <> tls.study_id
ORDER BY tls.study_id

result: 22, 2264. 

So, only 2 studies. 22 was used for testing; 2264 is missing (study doesn't exist). 

Now I'll do the same, but this time by way of the matrix:

SELECT DISTINCT mx.study_id
FROM taxonlabel tl JOIN matrixrow mxr ON (tl.taxonlabel_id = mxr.taxonlabel_id)
JOIN matrix mx ON (mxr.matrix_id = mx.matrix_id)
WHERE tl.study_id <> mx.study_id
ORDER BY mx.study_id

result: 2369, 2374

Again, two studies_ids for missing studies. 

So it seems to me that some cases are linked to instances of studies that lack matrices; other cases have nothing distinctive; and finally, in cases of deleted studies, various orphaned records are still around with mixed up study_ids.  One solution is to delete all these studies and re-upload them manually, but I guess we need to first get the sequencing sorted out before we add new studies. 

bp