Re: [Gdpdm-devel] New accession identifier in GDPDM
From: Ken Youens-C. <ky...@gm...> - 2005-11-09 15:27:43
On Nov 3, 2005, at 9:57 AM, Peter Bradbury wrote:

> In response to Ken's suggestion that new identifiers be added to
> GDPDM, my feeling is that it is unnecessary at best and may actually
> make the task of database maintenance more complicated, not less.
> One of the issues that Ken raises is how to tell whether a proposed
> addition to the database already exists or is new. That is indeed a
> difficult issue to address for several reasons. However, the
> information needed to answer the question already exists in the
> database.

The information does not currently exist as an explicit part of the
schema definition. I'm asking that each table have a UNIQUE value
apart from its primary key.

> Adding another field that simply combines or duplicates existing
> information doesn't help and violates good database design
> principles.

I don't believe most accessions duplicate data. E.g., a GenBank
accession such as "D50439" doesn't duplicate anything about the marker
"P108"; in fact, it quite easily helps differentiate it from any other
marker that might happen to be named "P108." Further, it allows or
even encourages the duplication of marker names, which is a Good Thing
(tm) in that it allows the database to reflect the reality that
biologists sometimes choose the same name for different things. If we
were to put a UNIQUE constraint on marker names, we probably wouldn't
be able to put in some rice SSRs, e.g.:

http://dev.gramene.org/db/markers/marker_view?marker_name=rm1003
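
Here's a minimal sketch of what I'm asking for, using a marker table
as the illustration. (The exact table and column names are made up
for the example; the "accession" column is the proposed addition, not
part of GDPDM today.)

  -- The surrogate key stays the PK for joins; the accession is the
  -- stable, meaningful identifier; marker names remain free to
  -- duplicate.
  CREATE TABLE div_marker (
    div_marker_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
    accession      VARCHAR(50)  NOT NULL,  -- e.g., 'D50439'
    marker_name    VARCHAR(50)  NOT NULL,  -- e.g., 'P108'; dups OK
    PRIMARY KEY (div_marker_id),
    UNIQUE KEY (accession)     -- deliberately NOT on marker_name
  );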
> That does not, however, mean that database tables that specifically
> support the upload tools should not be created.

I don't believe it's necessary to add new tables, just UNIQUE
constraints.

> They would not be part of GDPDM but part of the upload tool
> application. I would suggest keeping them in a separate schema.
> During development they could even be on a different machine. That
> would provide complete freedom to use text fields to represent FK's,
> deal with security issues that may require storage of information
> about users, etc.

I believe this suggestion is actually the one where data would be
duplicated, only it would be in extra tables, possibly in external
databases. Not being an integral part of the schema, it would also
not help me integrate the GDPDM data into the rest of Gramene; e.g.,
how do I easily link into species or germplasm data without relying
on knowing the primary key value (which might change)? That is, from
our marker or comparative maps, I want to show the user more about
what we're calling "rice," which is actually "Oryza sativa japonica,"
so I want to link to a GDPDM viewer using some value other than "42,"
something with a modicum of meaning like "osat" and something which I
know won't change if the data gets updated or reloaded.

> One thing missing from GDPDM that we might want to consider adding
> is information about which combination of fields in each table
> determines a unique record. If all those fields do not match, then
> a record is new. If all those fields match, then any other fields
> that may be different get updated (if the user has permission). I
> can think of a few different ways of implementing rules about
> uniqueness of combinations of fields, as follows:
> 1. Determine a set of rules specific to the upload tools. The
> upload tools will handle data in that way, but nothing else has to.
> 2. Formally document those rules as part of GDPDM.
> 3. Create unique indexes in GDPDM implementing those rules.

I'm a big proponent of automatic code generation when it makes sense.
For this project, given the number of tables, I've definitely taken
this approach and have therefore spent a lot of time agonizing over
the structure of the schema, e.g., field order and such. I believe my
problems would be solved by the addition of UNIQUE constraints in the
database, be they on single or multiple fields. So, I believe your #3
suggestion takes care of 1 and 2, too.

However, I'm on the fence about relying solely on multiple fields to
uniquely identify a record. I know this would work, e.g., a species
could be said to be a unique combination of genus, species,
subspecies, subtaxa, race, and population, but wouldn't it be easier
to just say that a new "accession" field uniquely identifies the
record? It's the difference between using six fields and using one.
The sheer number, plus the factor of human error in misspelling just
one of those fields, raises the distinct probability that data will
be duplicated. Further, it makes my earlier "linking in" problem more
problematic. E.g., would the URL to get to a specific taxonomy record
have to look like this?

/db/diversity_view?action=view&object=div_taxonomy&genus=Oryza&species=sativa&subspecies=japonica&subtaxa=temperate&race=foo&population=bar

Wouldn't this be a lot easier to generate?

/db/diversity_view?action=view&object=div_taxonomy&accession=osat

> It probably makes sense to do 1 first, then once there is a working
> rule set, consider formally documenting and implementing them as
> indexes.

Yes, I agree completely. Either way, we're talking about a schema
change, though adding constraints is far less intrusive than adding
fields. Then again, if you try to add UNIQUE constraints to fields
that already contain duplicated data, it will be a pretty big burden
to fix the data, too, I guess. Adding a new "accession" field could
be easy enough in that, to get started, you could use a script to
assign the PK value to the accession and then let the curator put
something more meaningful there later. This would work with databases
that currently have duplication.
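
That bootstrap could be as simple as the following sketch, using the
taxonomy table as the example. (MySQL-flavored; again, the
"accession" column is the proposed addition, and the names are
illustrative.)

  -- Add the column, seed it from the PK so every existing row is
  -- unique, then lock it down. Curators can later replace the
  -- placeholder values with something meaningful, e.g., 'osat'.
  ALTER TABLE div_taxonomy ADD COLUMN accession VARCHAR(50);
  UPDATE div_taxonomy SET accession = CAST(div_taxonomy_id AS CHAR);
  ALTER TABLE div_taxonomy ADD UNIQUE KEY (accession);

  -- The multiple-field alternative discussed above would instead be:
  -- ALTER TABLE div_taxonomy ADD UNIQUE KEY
  --   (genus, species, subspecies, subtaxa, race, population);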
> Another really sticky problem is you do not want to create new
> germplasm or other types of records just because a user spelled
> something in one of the key fields wrong. For example, B73Ht, B73H,
> B73 Ht, and B73_Ht could all be reasonable ways to represent the
> same accession. The only solution that I came up with was to
> provide the user with a way to look up information from the
> database. For example, show me all the records that start with B7.

I believe this is exactly the kind of problem that could be solved by
the addition of an accession plus careful data curation. Regardless
of how the germplasm's name was put into the database, a group would
know that they've assigned it the accession "b73ht" and would just
use that. (A sketch of such a lookup is at the end of this message.)

> A third item, which I assume is a major driver behind Ken's
> suggestion, is that when adding records to a table, a user must
> establish the correct references to linked tables. In other words,
> the user either has to enter the correct FK's or provide enough
> information that they can be correctly determined. I take Ken's
> suggestion as using a text field or mnemonic to represent the FK's.
> In writing the Excel upload tool, I actually made use of all three
> methods where each seemed to make sense.
>
> To upload germplasm records, I used a single spreadsheet (or view)
> to upload data to the div_passport, div_stock, and div_stock_parent
> tables. All the individual fields necessary for determining some of
> the FK's were there. For taxonomy_id, I required that the user look
> up the correct id from the taxonomy table and enter the actual FK
> value. That works because it is likely an entire upload will use
> only a single value. For each row in the germplasm table, the user
> enters a UserTaxaName. The UserTaxaName is used in other
> spreadsheets to refer to that germplasm entry. In the upload
> process, it is used to look up a div_stock_id in the germplasm
> spreadsheet. Essentially, the Excel file stores a set of text
> mnemonics for FK's specific to that Excel workbook. So, it's not
> much of a stretch to use a database table connected with the upload
> tools to store information like that, either globally for all users
> or on a per-user basis.

To quote an ex-President, "I feel your pain" on the Excel stuff. I,
too, am letting Isaak start from an Excel file to organize the data
[1], though he's having to keep up only with the PK values, which he
has to assign, in order to relate the data in the different
worksheets. I generate the Excel file directly from the schema
definition in order to ensure that all the identifiers are present
and properly spelled, so there are no helper fields like the ones you
have created. You are correct that one of the things I'm shooting for
is to add mnemonic devices for more easily relating data. I don't
intend my Excel "interface" to be my last effort [2], so perhaps this
won't always be the most important point for me. However, the
"linking in" question would continue to be resolved by the addition
of such a mnemonic.

I believe my suggestion is in the spirit of Einstein's maxim that
"things should be made as simple as possible -- but no simpler."
Adding just one field that always must be unique is easier than
deciding which combination, if any, of other fields is unique. For
those disinclined to use an accession, it could automatically be
assigned the PK value of the record, so it would be no burden. What I
need is for the schema itself to be explicit about what uniquely
identifies records, so I believe the schema must be changed to
address this issue either way.

ky

[1] FWIW, the Excel file is transformed into a structured text file
(S-expressions, indented text, or XML) before being fed to my upload
tool, removing any dependence on Bill Gates or his efforts to make
different versions of Excel incompatible and to force everyone to
upgrade to the latest version.

[2] I'm investigating a "Ruby on Rails"-like framework for Perl
called "Catalyst" that should be able to build on the code classes
I've already generated for my import and viewing tools.
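
P.S. The lookup sketch promised above -- a hypothetical query against
the div_stock table, where "accession" is the proposed column:

  -- Today, linking in means depending on a numeric PK that may
  -- change when the data is updated or reloaded:
  SELECT * FROM div_stock WHERE div_stock_id = 42;

  -- With the proposed column, the link is stable and meaningful:
  SELECT * FROM div_stock WHERE accession = 'b73ht';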