Re: [Gdpdm-devel] New accession identifier in GDPDM
From: Ken Youens-C. <ky...@gm...> - 2005-11-09 15:27:43
On Nov 3, 2005, at 9:57 AM, Peter Bradbury wrote:

> In response to Ken's suggestion that new identifiers be added to
> GDPDM, my feeling is that it is unnecessary at best and may actually
> make the task of database maintenance more complicated, not less.
> One of the issues that Ken raises is how to tell whether a proposed
> addition to the database already exists or is new. That is indeed a
> difficult issue to address for several reasons. However, the
> information needed to answer the question already exists in the
> database.

The information does not currently exist as an explicit part of the
schema definition. I'm asking that each table have a UNIQUE value
apart from its primary key.

> Adding another field that simply combines or duplicates existing
> information doesn't help and violates good database design
> principles.

I don't believe most accessions duplicate data. E.g., a GenBank
accession such as "D50439" doesn't duplicate anything about the marker
"P108"; in fact, it quite easily helps differentiate it from any other
marker that might happen to be named "P108." Further, it allows or
even encourages the duplication of marker names, which is a Good Thing
(tm) in that it allows the database to reflect the reality that
biologists sometimes choose the same name for different things. If we
were to put a UNIQUE constraint on marker names, we probably wouldn't
be able to put in some rice SSRs, e.g.:

http://dev.gramene.org/db/markers/marker_view?marker_name=rm1003
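
Here's a minimal sketch of what I'm asking for, using a marker table
as the illustration. (The exact table and column names are made up
for the example; the "accession" column is the proposed addition, not
part of GDPDM today.)

  -- The surrogate key stays the PK for joins; the accession is the
  -- stable, meaningful identifier; marker names remain free to
  -- duplicate.
  CREATE TABLE div_marker (
    div_marker_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
    accession      VARCHAR(50)  NOT NULL,  -- e.g., 'D50439'
    marker_name    VARCHAR(50)  NOT NULL,  -- e.g., 'P108'; dups OK
    PRIMARY KEY (div_marker_id),
    UNIQUE KEY (accession)     -- deliberately NOT on marker_name
  );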
> That does not, however, mean that database tables that specifically
> support the upload tools should not be created.

I don't believe it's necessary to add new tables, just UNIQUE
constraints.

> They would not be part of GDPDM but part of the upload tool
> application. I would suggest keeping them in a separate schema.
> During development they could even be on a different machine. That
> would provide complete freedom to use text fields to represent FK's,
> deal with security issues that may require storage of information
> about users, etc.

I believe this suggestion is actually the one where data would be
duplicated, only it would be in extra tables, possibly in external
databases. Not being an integral part of the schema, it would also
not help me integrate the GDPDM data into the rest of Gramene; e.g.,
how do I easily link into species or germplasm data without relying
on knowing the primary key value (which might change)? That is, from
our marker or comparative maps, I want to show the user more about
what we're calling "rice," which is actually "Oryza sativa japonica,"
so I want to link to a GDPDM viewer using some value other than "42,"
something with a modicum of meaning like "osat" and something which I
know won't change if the data gets updated or reloaded.

> One thing missing from GDPDM that we might want to consider adding
> is information about which combination of fields in each table
> determines a unique record. If all those fields do not match, then
> a record is new. If all those fields match, then any other fields
> that may be different get updated (if the user has permission). I
> can think of a few different ways of implementing rules about
> uniqueness of combinations of fields, as follows:
> 1. Determine a set of rules specific to the upload tools. The
> upload tools will handle data in that way, but nothing else has to.
> 2. Formally document those rules as part of GDPDM.
> 3. Create unique indexes in GDPDM implementing those rules.

I'm a big proponent of automatic code generation when it makes sense.
For this project, given the number of tables, I've definitely taken
this approach and have therefore spent a lot of time agonizing over
the structure of the schema, e.g., field order and such. I believe my
problems would be solved by the addition of UNIQUE constraints in the
database, be they on single or multiple fields. So, I believe your #3
suggestion takes care of 1 and 2, too.

However, I'm on the fence about relying solely on multiple fields to
uniquely identify a record. I know this would work, e.g., a species
could be said to be a unique combination of genus, species,
subspecies, subtaxa, race, and population, but wouldn't it be easier
to just say that a new "accession" field uniquely identifies the
record? It's the difference between using six fields and using one.
The sheer number, plus the factor of human error in misspelling just
one of those fields, raises the distinct probability that data will
be duplicated. Further, it makes my earlier "linking in" problem more
problematic. E.g., would the URL to get to a specific taxonomy record
have to look like this?

/db/diversity_view?action=view&object=div_taxonomy&genus=Oryza&species=sativa&subspecies=japonica&subtaxa=temperate&race=foo&population=bar

Wouldn't this be a lot easier to generate?

/db/diversity_view?action=view&object=div_taxonomy&accession=osat

> It probably makes sense to do 1 first, then once there is a working
> rule set, consider formally documenting and implementing them as
> indexes.

Yes, I agree completely. Either way, we're talking about a schema
change, though adding constraints is far less intrusive than adding
fields. Then again, if you try to add UNIQUE constraints to fields
that already contain duplicated data, it will be a pretty big burden
to fix the data, too, I guess. Adding a new "accession" field could
be easy enough in that, to get started, you could use a script to
assign the PK value to the accession and then let the curator put
something more meaningful there later. This would work with databases
that currently have duplication.
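
That bootstrap could be as simple as the following sketch, using the
taxonomy table as the example. (MySQL-flavored; again, the
"accession" column is the proposed addition, and the names are
illustrative.)

  -- Add the column, seed it from the PK so every existing row is
  -- unique, then lock it down. Curators can later replace the
  -- placeholder values with something meaningful, e.g., 'osat'.
  ALTER TABLE div_taxonomy ADD COLUMN accession VARCHAR(50);
  UPDATE div_taxonomy SET accession = CAST(div_taxonomy_id AS CHAR);
  ALTER TABLE div_taxonomy ADD UNIQUE KEY (accession);

  -- The multiple-field alternative discussed above would instead be:
  -- ALTER TABLE div_taxonomy ADD UNIQUE KEY
  --   (genus, species, subspecies, subtaxa, race, population);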
> Another really sticky problem is you do not want to create new
> germplasm or other types of records just because a user spelled
> something in one of the key fields wrong. For example, B73Ht, B73H,
> B73 Ht, and B73_Ht could all be reasonable ways to represent the
> same accession. The only solution that I came up with was to
> provide the user with a way to look up information from the
> database. For example, show me all the records that start with B7.

I believe this is exactly the kind of problem that could be solved by
the addition of an accession plus careful data curation. Regardless
of how the germplasm's name was put into the database, a group would
know that they've assigned it the accession "b73ht" and would just
use that. (A sketch of such a lookup is at the end of this message.)

> A third item, which I assume is a major driver behind Ken's
> suggestion, is that when adding records to a table, a user must
> establish the correct references to linked tables. In other words,
> the user either has to enter the correct FK's or provide enough
> information that they can be correctly determined. I take Ken's
> suggestion as using a text field or mnemonic to represent the FK's.
> In writing the Excel upload tool, I actually made use of all three
> methods where each seemed to make sense.
>
> To upload germplasm records, I used a single spreadsheet (or view)
> to upload data to the div_passport, div_stock, and div_stock_parent
> tables. All the individual fields necessary for determining some of
> the FK's were there. For taxonomy_id, I required that the user look
> up the correct id from the taxonomy table and enter the actual FK
> value. That works because it is likely an entire upload will use
> only a single value. For each row in the germplasm table, the user
> enters a UserTaxaName. The UserTaxaName is used in other
> spreadsheets to refer to that germplasm entry. In the upload
> process, it is used to look up a div_stock_id in the germplasm
> spreadsheet. Essentially, the Excel file stores a set of text
> mnemonics for FK's specific to that Excel workbook. So, it's not
> much of a stretch to use a database table connected with the upload
> tools to store information like that, either globally for all users
> or on a per-user basis.

To quote an ex-President, "I feel your pain" on the Excel stuff. I,
too, am letting Isaak start from an Excel file to organize the data
[1], though he's having to keep up only with the PK values, which he
has to assign, in order to relate the data in the different
worksheets. I generate the Excel file directly from the schema
definition in order to ensure that all the identifiers are present
and properly spelled, so there are no helper fields like the ones you
have created. You are correct that one of the things I'm shooting for
is to add mnemonic devices for more easily relating data. I don't
intend my Excel "interface" to be my last effort [2], so perhaps this
won't always be the most important point for me. However, the
"linking in" question would continue to be resolved by the addition
of such a mnemonic.

I believe my suggestion is in the spirit of Einstein's maxim that
"things should be made as simple as possible -- but no simpler."
Adding just one field that always must be unique is easier than
deciding which combination, if any, of other fields is unique. For
those disinclined to use an accession, it could automatically be
assigned the PK value of the record, so it would be no burden. What I
need is for the schema itself to be explicit about what uniquely
identifies records, so I believe the schema must be changed to
address this issue either way.

ky

[1] FWIW, the Excel file is transformed into a structured text file
(S-expressions, indented text, or XML) before being fed to my upload
tool, removing any dependence on Bill Gates or his efforts to make
different versions of Excel incompatible and to force everyone to
upgrade to the latest version.

[2] I'm investigating a "Ruby on Rails"-like framework for Perl
called "Catalyst" that should be able to build on the code classes
I've already generated for my import and viewing tools.
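
P.S. The lookup sketch promised above -- a hypothetical query against
the div_stock table, where "accession" is the proposed column:

  -- Today, linking in means depending on a numeric PK that may
  -- change when the data is updated or reloaded:
  SELECT * FROM div_stock WHERE div_stock_id = 42;

  -- With the proposed column, the link is stable and meaningful:
  SELECT * FROM div_stock WHERE accession = 'b73ht';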