From: Terry C. <tw...@cs...> - 2003-07-21 04:17:09
Hi Bindu and others,

You might have seen that I made a few minor changes to GUS::Common::Plugin::LoadGeneFeaturesFromXML::makeChromosome() to get the upload of the TIGR Arabidopsis genome into GUS working, because the sequence identification mapping is not yet worked out. I have some questions and comments about this.

In the plugin, makeChromosome() performs a query to determine whether the submitted sequence is already in the db. The approach in place selects entries based on the attribute values for:

i) taxonomy,
ii) external_release_id, and
iii) source_id

Some of the code here is commented out, since the plugin is under construction.

It's not clear to me that the intent of source_id is to uniquely identify the sequence. Is that so? Presently, source_id is constructed with the hardwired mapping (ASMBL_ID_2_stringMap below) that we're working on:

    source_id = ASMBL_ID_2_stringMap + ASMBL_ID

The source_id above is a unique name by virtue of ASMBL_ID.

On the other hand, NAME is an attractive, self-documenting GUS attribute that tempts me to use it in the query that checks for novelty. In the plugin, NAME was/is set to CLONE_NAME, which seems fine, with the caveat that TIGR labels CLONE_NAME as being for their internal use.

In the TIGR XML, CLONE_NAME and ASMBL_ID are two alternative candidates for uniquely naming a sequence. TIGR documents CLONE_NAME as being for internal use and ASMBL_ID as being for external use in uniquely identifying a sequence.

I punted for now by adding this condition to the query:

    $ena_gus->setName($T->{ASSEMBLY}->{HEADER}->{CLONE_NAME})

i.e., the query now uses the attributes

i) taxonomy,
ii) external_release_id, and
iii) CLONE_NAME

What do you think?

Terry
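
P.S. To make the two options concrete, here is a minimal sketch of the two candidate lookup keys, written as plain Perl hashes rather than GUS objects. It is only an illustration under stated assumptions: the location of ASMBL_ID in the parsed XML, the prefix string, the hash key names, the helper subs, and all the values are hypothetical stand-ins, not the plugin's actual code or confirmed GUS column names. Only the $T->{ASSEMBLY}->{HEADER}->{CLONE_NAME} path comes from the plugin.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsed TIGR XML. The CLONE_NAME path is the
# one used in the plugin; where ASMBL_ID sits is an assumption.
my $T = {
    ASSEMBLY => {
        ASMBL_ID => '12345',                     # made-up value
        HEADER   => { CLONE_NAME => 'F1A2' },    # made-up value
    },
};

# Placeholder for the hardwired string from the mapping still being worked out.
my $ASMBL_ID_PREFIX = 'TIGR_ASMBL_';

# Option A: identify the sequence by source_id = prefix + ASMBL_ID.
sub key_by_source_id {
    my ($t, $taxonomy, $ext_release_id) = @_;
    return {
        taxonomy            => $taxonomy,
        external_release_id => $ext_release_id,
        source_id           => $ASMBL_ID_PREFIX . $t->{ASSEMBLY}{ASMBL_ID},
    };
}

# Option B (the current stopgap): identify the sequence by NAME = CLONE_NAME.
sub key_by_clone_name {
    my ($t, $taxonomy, $ext_release_id) = @_;
    return {
        taxonomy            => $taxonomy,
        external_release_id => $ext_release_id,
        name                => $t->{ASSEMBLY}{HEADER}{CLONE_NAME},
    };
}

# Example values only: 3702 as the taxon, 1 as the external release.
my $by_id   = key_by_source_id($T, 3702, 1);
my $by_name = key_by_clone_name($T, 3702, 1);

print "Option A: source_id = $by_id->{source_id}\n";
print "Option B: name = $by_name->{name}\n";
```

The two keys differ only in the third attribute, so switching between them later should be a one-line change in the query, whichever identifier we settle on.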