Re: [Gusdev-gusdev] unique sequence designator for DoTS.NASequence

Terry, Bindu-

Bindu Gajria wrote:
> You are right about the fact that
> (taxon_id, external_database_release_id, source_id) is being used
> to attempt retrieval of the sequence.

As I think Steve mentioned, whenever we want to directly reference an
entry in an external database from within GUS, we use the following pair
of attributes in the referencing table (ExternalNASequence in this case):

external_database_release_id
source_id

The naming is obviously bad, because there's nothing to indicate that
the two columns are related in any way.  We have adopted the convention
that this pair of attributes should *uniquely* identify a single entry
in the referenced database.  In practice, however, nothing enforces
this convention, and so it has likely been violated in our current
database.  Here's an example where there appear to be duplicate rows
for a given (external_database_release_id, source_id) pair (which may
or may not actually violate this specific convention, depending on
whether the two sequences in question really are the same):

SQL> select external_database_release_id from dots.externalnasequence
       where source_id = 'AJ276847';

EXTERNAL_DATABASE_RELEASE_ID
----------------------------
			   2
			   2
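
Since nothing enforces the convention, one option is for a loader to
check for repeated pairs itself.  The following is only a rough sketch
of that idea, not code from any GUS plugin: the rows are made-up
stand-ins for what a query against dots.ExternalNASequence might
return, and find_duplicate_refs is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical rows, shaped like (external_database_release_id, source_id)
# pairs fetched from dots.ExternalNASequence.  AJ276847 is the example from
# the query above; the third row is invented for contrast.
my @rows = (
    { external_database_release_id => 2, source_id => 'AJ276847' },
    { external_database_release_id => 2, source_id => 'AJ276847' },
    { external_database_release_id => 2, source_id => 'AJ000001' },
);

# Return every "release_id<TAB>source_id" key that occurs more than once.
sub find_duplicate_refs {
    my @refs = @_;
    my %count;
    for my $row (@refs) {
        my $key = join "\t", $row->{external_database_release_id}, $row->{source_id};
        $count{$key}++;
    }
    my @dups = grep { $count{$_} > 1 } keys %count;
    return sort { $a cmp $b } @dups;
}

for my $dup (find_duplicate_refs(@rows)) {
    my ($release_id, $source_id) = split /\t/, $dup;
    print "duplicate: external_database_release_id=$release_id source_id=$source_id\n";
}
```

A repeated pair found this way is only a candidate violation; as noted
above, the two rows may or may not describe the same sequence.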

In the above SQL you'll also note that we're not using the full GenBank
accession number (e.g. AJ276847.1), which means that to find the correct
GenBank entry you have to do one of two things:

1. Look at the NCBI "GI", stored (also by convention) in the
    secondary_identifier column.  (In my example above one of the
    two sequences doesn't have a secondary_identifier.)
2. Use the GenBank release number (133) stored in the
    sres.ExternalDatabaseRelease table, to determine which version of
    AJ276847 was the one included in GenBank release 133.

#2 is extremely inconvenient and so, although we may be sticking to the
letter of our convention here (that external_database_release_id and
source_id alone uniquely identify the referenced entry), we're not doing
a good job of following its spirit.

> Code commented out in the plugin, before attempting retrieval
> of the externalNASequence, is :
> 
>   # NOTE: for P_yoelii, source_id is of the form: chrPyl_(\d\d\d\d\d)
>   #       so, source_id needs to be cushioned with 0s (zeroes)
>   #my $tmpStr = $T->{ASSEMBLY}->{ASMBL_ID}->{content};
>   #while (length ($tmpStr) < 5) { $tmpStr = '0'.$tmpStr; }
>   #$enaSeq{source_id} = 'chrPyl_' . $tmpStr;
> 
> The comment is *not* because plugin is still being worked on. It
> exists as the source_id is being constructed from such a line in
> the P_yoelii XML file:
> 
> <ASMBL_ID CLONE_NAME = "MALPY00111">111</ASMBL_ID>
> 
> source_id here is chrPyl_00111 in GUS, and so needed the different
> (than the P falciparum XML files) mechanism for the source_id

Are you saying that to run the plugin on yoelii one has to uncomment
the above code?  If so then I agree with Terry's assessment that the
code is still under construction; at the very least it should be
possible to check (or require that the user specify) the taxon/species
of the data to be loaded, and then select a method of constructing the
source_id based on that.  Making the user comment or uncomment code
in order to get a program to work is not a good practice in general
(though I'm certainly guilty of doing it in code that only I use).
It's an even worse practice in the GUS plugin framework, because
whenever you make any change to the plugin the system forces you to
update the AlgorithmImplementation entry in the database.  Alternating
between falciparum and yoelii runs would therefore produce a long
stream of apparent updates to the plugin (in the AlgorithmImplementation
table), even though nothing has really changed except the
commenting/uncommenting of this small piece of code.
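
To make the suggestion concrete, here is a minimal sketch of picking
the source_id rule from the species rather than by editing the code.
It is not the plugin's actual code: the table %source_id_builder, the
sub build_source_id, and the species keys are all hypothetical, and
the falciparum rule in particular is a placeholder, since this thread
does not say how those ids are built.

```perl
use strict;
use warnings;

# Hypothetical dispatch table: species name => code that builds a source_id
# from the ASMBL_ID text in the TIGR XML.
my %source_id_builder = (
    # From the thread: yoelii ids in GUS look like chrPyl_00111.
    'Plasmodium yoelii' => sub {
        my ($asmbl_id) = @_;
        return sprintf('chrPyl_%05d', $asmbl_id);
    },
    # ASSUMPTION: the falciparum rule is not given in this thread; passing
    # the ASMBL_ID through unchanged is only a placeholder.
    'Plasmodium falciparum' => sub {
        my ($asmbl_id) = @_;
        return $asmbl_id;
    },
);

# Look up the rule for a species and apply it; die on an unknown species
# instead of silently loading a badly formed source_id.
sub build_source_id {
    my ($species, $asmbl_id) = @_;
    my $builder = $source_id_builder{$species}
        or die "No source_id rule defined for species '$species'\n";
    return $builder->($asmbl_id);
}

print build_source_id('Plasmodium yoelii', 111), "\n";   # chrPyl_00111
```

The species could come from a required plugin argument or from the data
being loaded; either way the plugin source stays the same between runs.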

(p.s. a single "sprintf" would be a much more concise way of generating
the 0-padded yoelii id than the while loop.)
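
For comparison, a small sketch of the two approaches side by side.  The
sub names are made up for illustration, and both assume the ASMBL_ID is
a non-negative integer.

```perl
use strict;
use warnings;

# The padding loop from the commented-out plugin code, wrapped in a sub.
sub pad_with_loop {
    my ($tmpStr) = @_;
    while (length($tmpStr) < 5) { $tmpStr = '0' . $tmpStr; }
    return 'chrPyl_' . $tmpStr;
}

# Same result with one sprintf: %05d zero-pads an integer to width 5.
sub pad_with_sprintf {
    my ($asmbl_id) = @_;
    return sprintf('chrPyl_%05d', $asmbl_id);
}

# Print both results for a few ids so they can be compared by eye.
for my $id (7, 111, 12345) {
    print pad_with_loop($id), ' ', pad_with_sprintf($id), "\n";
}
```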

>> It's not clear to me that the intent of source_id is
>> to uniquely identify the sequence. Is that so?
> 
> yes, that is so, along with taxon_id and external_database_release_id.

Strike taxon_id from the above statement.

>> In the TIGR XML, the CLONE_NAME and ASMBL_ID are two alternative
>> candidates for uniquely naming a sequence. TIGR documents CLONE_NAME
>> for internal use and ASMBL_ID for external use in uniquely identifying
>> a sequence.

In that case we would use ASMBL_ID as the source_id and CLONE_NAME as
the secondary_identifier.  We typically only use the GUS "name" column
if we have a third id or name that hasn't been stored in either the
source_id or the secondary_identifier.  When loading GenBank entries we
use the "name" column for the locus_id.
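
As a rough sketch of that mapping, using the yoelii element quoted
earlier: the nested hash below is hand-built to mirror the shape the
plugin code reads from ($T->{ASSEMBLY}->{ASMBL_ID}->{content}).  That
the CLONE_NAME attribute appears as a sibling key of 'content' is an
assumption about the XML parser, and external_ref_attributes is a
hypothetical helper name.

```perl
use strict;
use warnings;

# Hand-built stand-in for the parsed element
#   <ASMBL_ID CLONE_NAME = "MALPY00111">111</ASMBL_ID>
# ASSUMPTION: the attribute sits next to 'content' under its own key.
my $T = {
    ASSEMBLY => {
        ASMBL_ID => { CLONE_NAME => 'MALPY00111', content => '111' },
    },
};

# Map the two TIGR identifiers onto the GUS attribute pair.
sub external_ref_attributes {
    my ($asmbl) = @_;
    return (
        source_id            => $asmbl->{content},     # id documented for external use
        secondary_identifier => $asmbl->{CLONE_NAME},  # id documented for internal use
    );
}

my %enaSeq = external_ref_attributes($T->{ASSEMBLY}->{ASMBL_ID});
print "$_ = $enaSeq{$_}\n" for sort keys %enaSeq;
```

Any species-specific formatting of the source_id (such as the chrPyl_
prefix and zero padding used for yoelii) would be applied on top of
this.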

Jonathan

p.s. The above discussion covers direct links to external databases.  We
also have indirect links, which look like this:

dots.NAFeature -> dots.NADbRefNAFeature -> sres.DbRef -> [external db]

Typically the direct links are used when there is a clear 1-1 relationship
between an entry in GUS and an entry in an external database, and the usual
interpretation of such a link is that the GUS entry was *loaded from* the
referenced external database entry.  The links that go through DbRef, on
the other hand, need not be 1-1, and the semantics are more varied (i.e.
this thing links to that one, but they're not necessarily the same, nor
is one necessarily loaded from or derived from the other).