From: William P. <wil...@ya...> - 2010-01-22 19:58:18
|
Hi all, I'm a bit confused by an issue that has cropped up. The data model contains two tables, study and submission in which the submission table contains a study_id column as FK. I cannot fathom why one study could have two or more submissions, so I think we are to assume a one-to-one connection between these tables. I'm not clear why we need a separate study and submission table at all -- possibly it is because an early iteration envisioned a scenario where new submissions are tracked with a submission_id, and the submitting person is not informed of the study_id until the data get vetted. Or perhaps we envisioned that the editing of a new submission gets taken over by a co-author on the same existing study_id record. At any rate, for whatever reason, we have study_ids and submission_ids to contend with. To date, all migrated data, and all new test submissions created at SDSC, have their study_id equal to their submission_id. So up until now, I hadn't really noticed an issue because study number and submission number were completely interchangeable. I had assumed that when a new submission is created, and the submission_id says "1234", then if it ever gets published, the study_id will also be "1234". Now, however, this has changed -- the two ids have become out of sync. And once we change over to a common sequence table, there is no hope of the two being in sync ever again. In terms of database integrity, there's probably no problem. In terms of confusion with our users, however, we may have problems. For example, I think it was Rutger who kindly added a radio button to the searchBySubmissionID.html page so that an Admin person can find a submission/study to edit it (see attached image). The radio button seems to say "search by TreeBASE2 Id" -- but in fact, it is really searching by submission_id, which happens to work for 99% of studies (because the two ids are in sync), but will not work from hereafter. For this to be useful, the dialog box should be changed to offer three choices: TreeBASE2 study_id, TreeBASE2 submission_id, and TreeBASE1 legacy study_id. When a user creates a new submission record, a new study record is also instantly created. The submission summary page lists the submission_id (e.g. 10050, below) as well as the study_id imbedded in the "right-click and copy me" link (in this case "S10060" -- out of sync by a small but significant value from 10050). When an editor clicks the "Contact Submitter" link, an email with the subject header "TreeBASE Submission S10050" is generated on a local email client. So, in all email discussions between submitter and editor, the submission_id is being used to make it clear what the discussion is about. The risk here is that some submitters will go ahead and cite the meaningless "S10050" number in their manuscripts instead of the "S10060" number (or a phylows URL containing that number). I guess we have a couple of solutions: 1. mechanically make it so that submission_id and study_id are always in sync, thereby making them interchangeable and the whole issue moot. The problem is that this may be difficult to fix without having to change lots of code. 2. hide the submission_id from *everyone* -- submitters, users, and editors alike. Wherever the submission_id is shown, have it show the study_id instead. In email discussions between submitter and editor, only the study_id is cited. The downside is that this will probably take a lot of work. 3. fix the searchBySubmissionID.html page so that there are three radio buttons to choose from (study_id, submission_id, and legacy_id). Then add a "Study Accession URL" to the submission summary page (in addition to the Reviewer Access URL) so that the user knows exactly what to cite and does not confuse submission_id with study_id. I guess I vote for solution 3, since that requires the least amount of coding. The risk is that users will contact editors with questions like "what happened to the data in S1234" and we won't know if they mean submission S1234 or study S1234 -- the result is an additional back-and-forth of emails to clarify. What might mitigate this possibility, is if everywhere that the submission_id appears on our web pages, we write it like "Sub1234" -- making it more likely that in future communication we know what the integer means. Bill This is the Search Submission page which is used by editors and admin users. Currently the "TreeBASE2 Id" actually searches the submission_id. Either both ids should be totally in sync or we should add another radio button to distinguish the two. Summary for current study page for submitting users and for admin editors. Here the reviewer access URL correctly uses the study_id, whereas the submission_id is presented to the user. Either both ids should be totally in sync, or we should be clear on what the final "Study Accession URL" will look like (with the study_id embedded therein). |
From: youjun g. <you...@ya...> - 2010-01-27 16:45:35
|
Dear All, I am investigating the cause of cascade delete failed on submission 22. There are two issues that confuse me. 1. Why we need table sub_taxonlabel, sub_treeblock and sub_matrix (a typical many to many setup)? but the relationship between taxonlabel, treeblock and matrix to submission is many to one. this three unnecessary bridge tables can hurt the performance. 2. The deletion of a phylotree node phylotreenode table have 3 foreign keys point to the same table parent_id, child_id and sibling_id. To delete a node all those foreign key value which point to the deleting node need be set to "null" first, otherwise the postgresql will complain. When delete all the node in a tree, TB code issued two lines of sql: > "update phylotree set rootNode_id = null where phylotree_id = :treeID"; > "delete from phylotreenode where phylotree_id = :treeID"; > This should not work unless the foreigns keys allow cascade delete. YOujun |
From: youjun g. <you...@ya...> - 2010-01-28 15:20:39
|
Dear All, Just for you info, Based on the test results of Bill and I, when delete a group of phylotreenode, it will not cause foreign key constrain violation, as long as after deletion there is no broken foreign in finish state. This means that deletion of a tree have to be atomic. In table phylotree, many trees do not have a study_id value, but their phylotreenode related to study 22 via table taxonlabel. Their phylotree_id are: 1129,1130,1131,1132, 2333,2334,2335,2336,2337,2338, 2339,2340,2341,2343,2556,2557,2686,2726,2727,2787,2788,2789,2790, 3446,3671,3766,3767,3901,3902,3903,3904,3905,3906,3907,3908,3909,3910,3911,3912, 4062,4063, 5705,5706,5707,5708,5720,5721,5921,5941,5981,160000022341 I will delete those trees because their foreign key constrain prevent me from cleaning up about 10,000 dummy taxonlabel records related to submission 22 (study 22 and submission 22 happen to be the same). Youjun |
From: Vladimir G. <vla...@du...> - 2010-01-28 16:24:47
|
On Jan 28, 2010, at 10:20 AM, youjun guo wrote: > In table phylotree, many trees do not have a study_id value, but > their phylotreenode related to study 22 via table taxonlabel. Their > phylotree_id are: > > 1129,1130,1131,1132, 2333,2334,2335,2336,2337,2338, > 2339,2340,2341,2343,2556,2557,2686,2726,2727,2787,2788,2789,2790 > , > 3446,3671,3766,3767,3901,3902,3903,3904,3905,3906,3907,3908,3909,3910,3911,3912 > , 4062,4063, > 5705,5706,5707,5708,5720,5721,5921,5941,5981,160000022341 > > I will delete those trees because their foreign key constrain > prevent me from cleaning up about 10,000 dummy taxonlabel records > related to submission 22 (study 22 and submission 22 happen to be > the same). Youjun, Bill, Thanks for the notice! After you iron out all the wrinkles of this study22 cleanup on the treebasedev instance, please apply the cleanup to the treebasestage instance. All, treebasestage is an instance restored from a backup of treebasedev done around Dec 8. This is the instance where we will construct the new "pristine" copy of the data. The steps that I propose are: (1) Apply all known clean-ups. Study22 is one. I recall there are a couple more, but I need to check past emails to remember what they are. If anyone would like to take charge of that, that would be great. I do think, though, that any clean-up task should be first debugged and tested on treebasedev. (2) Apply hibernate_sequence fix. It appears we are still figuring out what the fix should be. Among other considerations, it may depend on the outcome of testing the migration procedures. (3) Apply Dec 2009 migration. That's after I finish testing it on my local (initially empty) instance and, maybe, on treebasedev. (4) Inspect it by hand to determine if more cleanup is needed and fix any problems. (5) Apply Jan 2010 migration. (6) Restore from treebasestage to the "real" production instance. (And to treebasedev as well.) It is possible that steps (2) and (3) may need to be swapped (depending on how the migration tools behave). I am also not sure about the order of (4) and (5). While this is going on with treebasestage, Jon will use the same Dec 8 backup to bring on-line the real production instance. He (and anyone else interested) can use it, in parallel, for testing any configuration, performance, availability, etc. issues that may come up. The aim is to have the production fully operational, so that the only remaining tasks on Day 0 would be [1] restoring DB from treebasestage and [2] dropping the appropriate WAR. Please raise your concerns if anything appears off with this data migration plan. --Vladimir |