[Gusdev-gusdev] unique sequence designator for DoTS.NASequence

Hi Bindu and others,
You might have seen that I made a few minor changes to
   GUS::Common::Plugin::LoadGeneFeaturesFromXML::makeChromosome()
to get the upload of the TIGR Arabidopsis genome into GUS working,
since the sequence identification mapping is not yet worked out.
I have some questions and comments about this.

In the plugin at makeChromosome() there is a query performed 
to determine whether the submitted sequence is in the db. 
The approach in place uses a query that selects entries
based on the attribute values for:
   i) taxonomy, ii) external_release_id, and iii) source_id
There is some code commented out here, since the plugin 
is under construction.

It's not clear to me that the intent of source_id is
to uniquely identify the sequence. Is that so?
Presently, source_id is constructed using the hardwired
mapping (ASMBL_ID_2_stringMap) that we're working on:
      source_id = ASMBL_ID_2_stringMap + ASMBL_ID
The source_id above is a unique name by virtue of ASMBL_ID.
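
To make that construction concrete, here is a minimal sketch in Python
(not the plugin's Perl). The prefix value "asmbl_" is a made-up stand-in
for whatever ASMBL_ID_2_stringMap actually holds, and the ASMBL_ID values
are invented. It only illustrates that prepending a fixed string keeps
source_id as unique as ASMBL_ID itself.

```python
# Hypothetical stand-in for the hardwired ASMBL_ID_2_stringMap value;
# the real string used by the plugin is not shown in this message.
ASMBL_ID_PREFIX = "asmbl_"


def make_source_id(asmbl_id, prefix=ASMBL_ID_PREFIX):
    """Concatenate the fixed prefix and the ASMBL_ID, as in the mapping above."""
    return prefix + str(asmbl_id)


# Invented ASMBL_ID values, for illustration only.
asmbl_ids = ["1001", "1002", "1003"]
source_ids = [make_source_id(a) for a in asmbl_ids]

# A fixed prefix cannot make two different ASMBL_IDs collide.
assert len(set(source_ids)) == len(set(asmbl_ids))
```

This holds only while the prefix stays the same for every sequence in a
given release; if the mapping ever varied per sequence, uniqueness would
have to be rechecked.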

On the other hand, NAME is an attractive self-documenting GUS 
attribute that tempts me to use it for the query checking for novelty.
In the plugin, NAME was/is set to CLONE_NAME, which seems fine,
with the caveat that TIGR labels CLONE_NAME as being for their
internal use.

In the TIGR XML, the CLONE_NAME and ASMBL_ID are two alternative candidates 
for uniquely naming a sequence. TIGR documents CLONE_NAME for internal
use and ASMBL_ID for external use in uniquely identifying a sequence.

I punted for now by adding this condition to the query:
   $ena_gus->setName($T->{ASSEMBLY}->{HEADER}->{CLONE_NAME})
i.e., the query now checks the attributes
   i) taxonomy, ii) external_release_id, and iii) CLONE_NAME
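
The two alternatives for the novelty check can be compared with a small
sketch in Python (not the plugin's Perl, and not the GUS object layer).
The record layout, the field names "taxon", "external_release_id",
"source_id" and "name", and all values are assumptions made up for
illustration. The only point is that the third attribute of the lookup
is a choice between source_id and NAME.

```python
def find_existing(records, taxon, external_release_id, key_attr, key_value):
    """Return the records that match taxon, release, and the chosen
    identifying attribute (key_attr is 'source_id' or 'name' here)."""
    return [
        r for r in records
        if r["taxon"] == taxon
        and r["external_release_id"] == external_release_id
        and r[key_attr] == key_value
    ]


# Invented rows standing in for sequence entries already in the db.
records = [
    {"taxon": 1, "external_release_id": 10,
     "source_id": "asmbl_1001", "name": "CLONE_A"},
]

# Lookup keyed on NAME (set from CLONE_NAME), as in the punt above.
by_name = find_existing(records, 1, 10, "name", "CLONE_A")

# Lookup keyed on source_id (built from ASMBL_ID) for a new sequence.
by_source_id = find_existing(records, 1, 10, "source_id", "asmbl_9999")

is_novel = not by_source_id  # an empty result means the sequence is new
```

Either key works mechanically; the question is which identifier TIGR
guarantees to be stable and unique for outside users.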

What do you think? 

Terry