Thread: [Gusdev-gusdev] unique sequence designator for DoTS.NASequence

From: Terry C. <tw...@cs...> - 2003-07-21 04:17:09

Hi Bindu and others,  
You might have seen that I made a few minor
changes to 
   GUS::Common::Plugin::LoadGeneFeaturesFromXML::makeChromosome()
to get the TIGR Arabidopsis genome uploaded into GUS working 
because the sequence identification mapping is not yet worked out.
I have some questions and comments about this.

In the plugin at makeChromosome() there is a query performed 
to determine whether the submitted sequence is in the db. 
The approach that is in place uses a query selecting entries 
based on attribute values for:
   i) taxonomy, ii) external_release_id, and iii) source_id
There is some code commented out here, since the plugin 
is under construction.

It's not clear to me that the intent of source_id is
to uniquely identify the sequence. Is that so?
Presently, source_id is constructed using the hardwiring
in the mapping         vvvv that we're working on.
      source_id = ASMBL_ID_2_stringMap + ASMBL_ID
The source_id above is a unique name by virtue of ASMBL_ID.

On the other hand, NAME is an attractive self-documenting GUS 
attribute that tempts me to use it for the query checking for novelty.
In the plugin, NAME was/is set to CLONE_NAME, which seems fine,
although with the caveat that TIGR labels CLONE_NAME as being for
their internal use.

In the TIGR XML, the CLONE_NAME and ASMBL_ID are two alternative candidates 
for uniquely naming a sequence. TIGR documents CLONE_NAME for internal
use and ASMBL_ID for external use in uniquely identifying a sequence.

I punted for now by adding the condition to the query
   $ena_gus->setName($T->{ASSEMBLY}->{HEADER}->{CLONE_NAME})
i.e., the attributes
   i) taxonomy, ii) external_release_id, and iii) CLONE_NAME

What do you think? 

Terry

Re: [Gusdev-gusdev] unique sequence designator for DoTS.NASequence

From: Steve F. <st...@pc...> - 2003-07-21 12:52:23

terry-


in general the pair (external_database_release_id, source_id) should be
unique, provided, of course, that the source_id is a primary key in the
external db.

steve


Re: [Gusdev-gusdev] unique sequence designator for DoTS.NASequence

From: Bindu G. <bi...@sa...> - 2003-07-21 14:52:17

Terry -

At 23:17 -0500 7/20/03, Terry Clark wrote:
>Hi Bindu and others, 
>You might have seen that I made a few minor
>changes to
>    GUS::Common::Plugin::LoadGeneFeaturesFromXML::makeChromosome()
>to get the TIGR Arabidopsis genome uploaded into GUS working
>because the sequence identification mapping is not yet worked out.
>I have some questions and comments about this.
>
>In the plugin at makeChromosome() there is a query performed
>to determine whether the submitted sequence is in the db.
>The approach that is in place uses a query selecting entries
>based on attributes values for:
>    i) taxonomy, ii) external_release_id, and iii) source_id
>There is some code commented out here, since the plugin
>is under construction.

You are right about the fact that
(taxon_id, external_database_release_id, source_id) is being used
to attempt retrieval of the sequence.

   # NOTE: for falciparum XML, map_asmbl_id_to_source_id method was used to set
   #       the source_id appropriately.
   $enaSeq{source_id} = $self->map_asmbl_id_to_source_id($T);

... i.e. picking source_id from TIGR XML


Code commented out in the plugin, before attempting retrieval
of the externalNASequence, is :

   # NOTE: for P_yoelii, source_id is of the form: chrPyl_(\d\d\d\d\d)
   #       so, source_id needs to be cushioned with 0s (zeroes)
   #my $tmpStr = $T->{ASSEMBLY}->{ASMBL_ID}->{content};
   #while (length ($tmpStr) < 5) { $tmpStr = '0'.$tmpStr; }
   #$enaSeq{source_id} = 'chrPyl_' . $tmpStr;

The code is *not* commented out because the plugin is still being
worked on. It is there because, for P_yoelii, the source_id is
constructed from a line like this in the XML file:

<ASMBL_ID CLONE_NAME = "MALPY00111">111</ASMBL_ID>

The source_id here is chrPyl_00111 in GUS, so it needed a different
mechanism for building the source_id than the P falciparum XML files.


>It's not clear to me that the intent of source_id is
>to uniquely identify the sequence. Is that so?

yes, that is so, along with taxon_id and external_database_release_id.

>Presently, source_id is constructed from with the hardwiring
>in the mapping         vvvv that we're working on.
>       source_id = ASMBL_ID_2_stringMap + ASMBL_ID
>The source_id above is a unique name by virtue of ASMBL_ID.
>
>On the other hand, NAME is an attractive self-documenting GUS
>attribute that tempts me to use it for the query checking for novelty.
>In the plugin, NAME was/is set to CLONE_NAME, which seems fine,
>although with the caveat for TIGR, this is labeled as for their
>internal use.
>
>In the TIGR XML, the CLONE_NAME and ASMBL_ID are two alternative candidates
>for uniquely naming a sequence. TIGR documents CLONE_NAME for internal
>use and ASMBL_ID for external use in uniquely identifying a sequence.
>
>I punted for now by adding the condition to the query
>    $ena_gus->setName($T->{ASSEMBLY}->{HEADER}->{CLONE_NAME})
>i.e., the attributes
>    i) taxonomy, ii) external_release_id, and iii) CLONE_NAME
>
>What do you think?

Perhaps this will work well. I have only looked at XML files of
falciparum and yoelii so far, and that too, only 1 revision of
these. So, not much experience with all this.

Else, we should think of a strategy (a method within the plugin)
to modify source_id after collecting field contents of some XML
tag/attribute, and having specified some choice parameters. IMHO
this doesn't sound like a good way to go for various reasons.


Bindu




Re: [Gusdev-gusdev] unique sequence designator for DoTS.NASequence

From: Jonathan C. <cra...@pc...> - 2003-07-22 15:07:31

Terry, Bindu-

Bindu Gajria wrote:
> You are right about the fact that
> (taxon_id, external_database_release_id, source_id) is being used
> to attempt retrieval of the sequence.

As I think Steve mentioned, whenever we want to directly reference an
entry in an external database from within GUS, we use the following pair
of attributes in the referencing table (ExternalNASequence in this case):

external_database_release_id
source_id

The naming is obviously bad, because there's nothing to indicate that
the two columns are related in any way.  We have adopted the convention
that this pair of attributes should *uniquely* identify a single entry
in the referenced database.  In practice, however, there is nothing to
enforce this convention, and so it has likely been violated in our
current database.  Here's an example where there appear to be duplicate
rows for a given external_database_release_id, source_id (which actually
may or may not be a violation of this specific convention, depending on
whether the two sequences in question really are the same):

SQL> select external_database_release_id from dots.externalnasequence
       where source_id = 'AJ276847';

EXTERNAL_DATABASE_RELEASE_ID
----------------------------
                           2
                           2

In the above SQL you'll also note that we're not using the full GenBank
accession number (e.g. AJ276847.1), which means that to find the correct
GenBank entry you have to do one of two things:

1. Look at the NCBI "GI", stored (also by convention) in the
    secondary_identifier column.  (In my example above one of the
    two sequences doesn't have a secondary_identifier.)
2. Use the GenBank release number (133) stored in the
    sres.ExternalDatabaseRelease table, to determine which version of
    AJ276847 was the one included in GenBank release 133.

#2 is extremely inconvenient and so, although we may be sticking to the
letter of our convention here (that external_database_release_id and
source_id alone uniquely identify the referenced entry), we're not doing
a good job of following its spirit.
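
Since nothing enforces the convention, a check for repeated
(external_database_release_id, source_id) pairs has to be done by hand.
A minimal Perl sketch of such a check follows. It assumes the rows have
already been fetched into memory as hash references; the sub name
find_duplicate_pairs and the source_id 'X00001' are hypothetical and
only there for illustration.

```perl
use strict;
use warnings;

# Hypothetical in-memory rows; in practice these would be fetched from
# dots.ExternalNASequence. 'X00001' is a made-up source_id.
my @rows = (
    { external_database_release_id => 2, source_id => 'AJ276847' },
    { external_database_release_id => 2, source_id => 'AJ276847' },
    { external_database_release_id => 2, source_id => 'X00001'   },
);

# Count rows per (external_database_release_id, source_id) pair and
# return the tab-joined pairs that occur more than once.
sub find_duplicate_pairs {
    my (@rows) = @_;
    my %count;
    for my $row (@rows) {
        my $key = join "\t",
            $row->{external_database_release_id}, $row->{source_id};
        $count{$key}++;
    }
    my @dups = grep { $count{$_} > 1 } sort keys %count;
    return @dups;
}

print "duplicate pair: $_\n" for find_duplicate_pairs(@rows);
```

A pair reported here is only a candidate violation: as noted above,
whether it is a real one depends on whether the rows describe the same
sequence.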

> Code commented out in the plugin, before attempting retrieval
> of the externalNASequence, is :
> 
>   # NOTE: for P_yoelii, source_id is of the form: chrPyl_(\d\d\d\d\d)
>   #       so, source_id needs to be cushioned with 0s (zeroes)
>   #my $tmpStr = $T->{ASSEMBLY}->{ASMBL_ID}->{content};
>   #while (length ($tmpStr) < 5) { $tmpStr = '0'.$tmpStr; }
>   #$enaSeq{source_id} = 'chrPyl_' . $tmpStr;
> 
> The comment is *not* because plugin is still being worked on. It
> exists as the source_id is being constructed from such a line in
> the P_yoelii XML file:
> 
> <ASMBL_ID CLONE_NAME = "MALPY00111">111</ASMBL_ID>
> 
> source_id here is chrPyl_00111 in GUS, and so needed the different
> (than the P falciparum XML files) mechanism for the source_id

Are you saying that to run the plugin on yoelii one has to uncomment
the above code?  If so then I agree with Terry's assessment that the
code is still under construction; at the very least it should be
possible to check (or require that the user specify) the taxon/species
of the data to be loaded, and then select a method of constructing the
source_id based on that.  Making the user comment or uncomment code
in order to get a program to work is not a good practice in general
(though I'm certainly guilty of doing it in code that only I use)
and it's an even worse practice in the GUS plugin framework, because
when you make any change to the plugin the system will force you to
update the AlgorithmImplementation entry in the database.  Therefore
alternately running the plugin on falciparum and yoelii would result
in a long stream of apparent updates to the plugin (in the
AlgorithmImplementation table) even though in reality nothing has
changed except the commenting/uncommenting of this small piece of
code.
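
One way to select the source_id rule from the species, rather than by
commenting code in and out, is a dispatch table. The sketch below is
only an illustration: the species keys, the table name and
build_source_id are hypothetical, and the P_falciparum entry is a
pass-through placeholder, because the real rule lives in
map_asmbl_id_to_source_id, which is not shown in this thread.

```perl
use strict;
use warnings;

# Hypothetical dispatch table: one source_id builder per species key.
my %source_id_builder = (
    # Rule taken from the commented-out plugin code: pad to 5 digits.
    P_yoelii => sub {
        my ($asmbl_id) = @_;
        my $padded = $asmbl_id;
        $padded = '0' . $padded while length($padded) < 5;
        return 'chrPyl_' . $padded;
    },
    # Placeholder only: stands in for map_asmbl_id_to_source_id.
    P_falciparum => sub {
        my ($asmbl_id) = @_;
        return $asmbl_id;
    },
);

# Look up the rule for a species and fail loudly if none is registered.
sub build_source_id {
    my ($species, $asmbl_id) = @_;
    my $builder = $source_id_builder{$species}
        or die "no source_id rule for species '$species'\n";
    return $builder->($asmbl_id);
}

print build_source_id('P_yoelii', '111'), "\n";   # prints chrPyl_00111
```

With this shape the plugin source stays unchanged between runs, so
switching species would not by itself count as a change to the plugin.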

(p.s. a single "sprintf" would be a much more concise way of generating
the 0-padded yoelii id than the while loop.)
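
For illustration, the sprintf form could look like the sketch below. The
sub name yoelii_source_id is hypothetical, and the sketch assumes that
ASMBL_ID is always a non-negative integer.

```perl
use strict;
use warnings;

# Zero-pad the ASMBL_ID to five digits and add the chrPyl_ prefix.
# Assumes $asmbl_id is a non-negative integer.
sub yoelii_source_id {
    my ($asmbl_id) = @_;
    return sprintf('chrPyl_%05d', $asmbl_id);
}

print yoelii_source_id(111), "\n";   # prints chrPyl_00111
```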

>> It's not clear to me that the intent of source_id is
>> to uniquely identify the sequence. Is that so?
> 
> yes, that is so, along with taxon_id and external_database_release_id.

Strike taxon_id from the above statement.

>> In the TIGR XML, the CLONE_NAME and ASMBL_ID are two alternative 
>> candidates
>> for uniquely naming a sequence. TIGR documents CLONE_NAME for internal
>> use and ASMBL_ID for external use in uniquely identifying a sequence.

In that case we would use ASMBL_ID as the source_id and CLONE_NAME as
the secondary_identifier.  We typically only use the GUS "name" column
if we have a third id or name that hasn't been stored in either the
source_id or the secondary_identifier.  When loading GenBank entries we
use it for the locus_id.
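
The mapping described above can be sketched as follows. The layout of $T
is an assumption pieced together from the accessors quoted earlier in
this thread, the values come from the yoelii example line, and the sub
name sequence_identifiers is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical parsed-XML fragment, shaped like the accessors quoted in
# this thread; the real structure may differ between TIGR XML files.
my $T = {
    ASSEMBLY => {
        ASMBL_ID => { content    => '111' },
        HEADER   => { CLONE_NAME => 'MALPY00111' },
    },
};

# ASMBL_ID (TIGR's external id) goes to source_id, and CLONE_NAME (their
# internal id) goes to secondary_identifier. The name column is left
# unset because there is no third identifier in this example.
sub sequence_identifiers {
    my ($T) = @_;
    return {
        source_id            => $T->{ASSEMBLY}{ASMBL_ID}{content},
        secondary_identifier => $T->{ASSEMBLY}{HEADER}{CLONE_NAME},
    };
}

my $ids = sequence_identifiers($T);
print "$_ = $ids->{$_}\n" for sort keys %$ids;
```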

Jonathan

p.s. The above discussion covers direct links to external databases.  We
also have indirect links, which look like this:

dots.NAFeature -> dots.NADbRefNAFeature -> sres.DbRef -> [external db]

Typically the direct links are used when there is a clear 1-1 relationship
between an entry in GUS and an entry in an external database, and the usual
interpretation of such a link is that the GUS entry was *loaded from* the
referenced external database entry.  The links that go through DbRef, on
the other hand, need not be 1-1, and the semantics are more varied (i.e.
this thing links to that one, but they're not necessarily the same, and
nor is one necessarily loaded from or derived from the other.)