From: William P. <wil...@ya...> - 2010-03-13 22:19:08
|
This is mainly for Vladimir: I think I've found the problem with the migration data -- as mentioned before, the migrated data fail to display taxon mapping and fail to download trees. To investigate this, I selected two taxonlabel_ids, 4013 and 271332. I selected these because they both correspond to the taxon label string 'Agaricus bisporus', and in both cases they are correctly mapped to the TI (both have a taxonvariant_id of 1420). The only difference is that one is found in study_id 1959 (from an older migration) while the other is found in study_id 10294 (migrated in the last batch). We have a table called taxonlabelset and a many-to-many joiner table called taxonlabelset_taxonlabel. The taxonlabelset table roughly represents the "BEGIN TAXA" block in the NEXUS format: an ordered list of taxonlabels that is referred to by subsequent "BEGIN TREE" blocks, "BEGIN DATA" blocks, or "BEGIN CHARACTERS" blocks. When I run this query: SELECT study_id, taxonlabelset_id, taxonlabel_id, taxonlabel_order FROM taxonlabelset_taxonlabel JOIN taxonlabelset USING (taxonlabelset_id) WHERE taxonlabel_id IN (271332, 4013) I get the output below, indicating that these two taxonlabel_ids are found in many taxonlabelsets. Unfortunately, most of them are bogus, orphaned taxonlabelset records that seem to be created in great abundance -- these are the ones with study_id 22 and study_id 10215. I gather that study_id 22 and 10215 serve as temporary parking spots for migrated data, but unfortunately not all records get deleted after the data are moved to their proper study_ids. You'll notice, that taxonlabel_id 4013 belongs to taxonlabelset_id 13428. This taxonlabelset_id is referenced by a matrix record and a treeblock record. Which is good: it shows that taxonlabel_id 4013 belongs to a taxonlabelset that is referenced by a treeblock and a matrix, which presumably are both linked in the same analysis step. i.e., the taxonlabels in the trees match with the taxonlabels in the matrices that these trees are derived from. With respect to taxonlabel_id 271332, we have a different story. On the one hand, taxonlabel_id 271332 is found in a bunch of orphaned records linked to the bogus study_id 10215. Unfortunately, unlike taxonlabel_id 4013, taxonlabel_id 271332 does not belong to a single taxonlabelset that references study_id 10294. Now, study_id 10294 does have a matrix that lists taxonlabelset_id 15812, but in the taxonlabelset table, taxonlabelset_id 15812 references the bogus study_id. So one correction is the following: FOR ALL taxonlabelset RECORDS THAT REFERENCE study_id 10215, CHANGE THEIR study_id TO MATCH THE study_id REFERENCED IN THE matrix RECORDS THAT USE THE SAME taxonlabelset_id. But unfortunately we're not done. When I examined the treeblock table, I noticed that the treeblock record that is related to taxonlabel_id 271332 has NULL for taxonlabelset_id. This surely explains why trees can't be downloaded. To fix this, we need to update the treeblock table with the same taxonlabelset_id as the matrix (or one of the matrices) that this treeblock is related to by way of an analysis step linkage. This is because we need to pick a set of taxon labels that is compatible with the block of trees, and trees can only belong to the same analysis step as a matrix if their taxon labels are compatible. So... I think that this query provides a list of each treeblock_id and a taxonlabelset_id from a corresponding matrix that is joined to this treeblock by way of an analysisstep: SELECT DISTINCT ON (tb.treeblock_id) tb.treeblock_id, mx.taxonlabelset_id FROM matrix mx JOIN analyzeddata am ON (mx.matrix_id = am.matrix_id) JOIN analyzeddata at ON (am.analysisstep_id = at.analysisstep_id) JOIN phylotree pt ON (at.phylotree_id = pt.phylotree_id) JOIN treeblock tb ON (pt.treeblock_id = tb.treeblock_id) WHERE tb.taxonlabelset_id IS NULL We should take this list and use the first column to update the treeblock table with the taxonlabelset_id values in the second column. regards, Bill study_id, taxonlabelset_id, taxonlabel_id, taxonlabel_order ------------------------------- 22 181 4013 68 22 226 4013 35 22 403 4013 4 22 404 4013 24 22 413 4013 20 22 544 4013 20 22 586 4013 20 22 787 4013 20 22 872 4013 122 22 874 4013 65 22 875 4013 0 22 1003 4013 52 22 1692 4013 26 22 2300 4013 2 22 2357 4013 8 22 2358 4013 82 22 2401 4013 25 22 2463 4013 51 22 2510 4013 71 22 2590 4013 208 22 2663 4013 20 22 2720 4013 22 22 2904 4013 175 22 2927 4013 20 22 2942 4013 20 22 3654 4013 20 22 3898 4013 13 22 4013 4013 20 22 4748 4013 44 22 4830 4013 20 22 5498 4013 35 22 5499 4013 35 22 5500 4013 35 22 5647 4013 59 22 5921 4013 54 22 5963 4013 181 22 6606 4013 30 22 6920 4013 16 22 7549 4013 35 22 7652 4013 8 22 7654 4013 107 22 7655 4013 107 22 7691 4013 107 22 7700 4013 73 22 7701 4013 73 22 7702 4013 73 22 7703 4013 73 22 7829 4013 51 22 7830 4013 29 22 7831 4013 29 22 7840 4013 23 22 8126 4013 24 22 8413 4013 44 22 9045 4013 53 22 9046 4013 53 22 9058 4013 143 22 9089 4013 19 22 9112 4013 107 22 9113 4013 136 22 9114 4013 11 22 9115 4013 11 10215 15687 271332 20 10215 15812 271332 54 10215 15854 271332 20 10215 15927 271332 80 1959 13428 4013 20 |