From: Bindu G. <bi...@sa...> - 2003-07-21 14:52:17
Terry -

At 23:17 -0500 7/20/03, Terry Clark wrote:
>Hi Bindu and others,
>You might have seen that I made a few minor
>changes to
>   GUS::Common::Plugin::LoadGeneFeaturesFromXML::makeChromosome()
>to get the TIGR Arabidopsis genome uploaded into GUS working
>because the sequence identification mapping is not yet worked out.
>I have some questions and comments about this.
>
>In the plugin at makeChromosome() there is a query performed
>to determine whether the submitted sequence is in the db.
>The approach that is in place uses a query selecting entries
>based on attributes values for:
>   i) taxonomy, ii) external_release_id, and iii) source_id
>There is some code commented out here, since the plugin
>is under construction.

You are right that (taxon_id, external_database_release_id, source_id)
is used to attempt retrieval of the sequence.

  # NOTE: for falciparum XML, map_asmbl_id_to_source_id method was used to set
  # the source_id appropriately.
  $enaSeq{source_id} = $self->map_asmbl_id_to_source_id($T);

... i.e. picking source_id from the TIGR XML.

The code commented out in the plugin, before attempting retrieval of the
externalNASequence, is:

  # NOTE: for P_yoelii, source_id is of the form: chrPyl_(\d\d\d\d\d)
  # so, source_id needs to be cushioned with 0s (zeroes)
  #my $tmpStr = $T->{ASSEMBLY}->{ASMBL_ID}->{content};
  #while (length ($tmpStr) < 5) { $tmpStr = '0'.$tmpStr; }
  #$enaSeq{source_id} = 'chrPyl_' . $tmpStr;

That code is commented out *not* because the plugin is still being worked
on. It exists because, for P_yoelii, the source_id is constructed from a
line like this in the XML file:

  <ASMBL_ID CLONE_NAME = "MALPY00111">111</ASMBL_ID>

The source_id here is chrPyl_00111 in GUS, so it needed a different
mechanism for the source_id than the P falciparum XML files.

>It's not clear to me that the intent of source_id is
>to uniquely identify the sequence. Is that so?

Yes, that is so, along with taxon_id and external_database_release_id.
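
For reference, the zero-padding done by the commented-out P_yoelii lines could also be written with sprintf. This is only a sketch: make_source_id is a hypothetical helper, not something that exists in the plugin, and it assumes the ASMBL_ID content is a plain non-negative integer (as in the "111" example).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LoadGeneFeaturesFromXML): build a
# source_id by left-padding a numeric ASMBL_ID with zeroes to a fixed
# width and prepending a prefix. Assumes $asmbl_id is a non-negative
# integer; IDs already longer than $width are left unpadded.
sub make_source_id {
    my ($prefix, $asmbl_id, $width) = @_;
    return sprintf('%s%0*d', $prefix, $width, $asmbl_id);
}

# ASMBL_ID "111" from the P_yoelii XML becomes chrPyl_00111.
print make_source_id('chrPyl_', 111, 5), "\n";
```

Taking the prefix and width as arguments is just one way to avoid hardwiring 'chrPyl_' and 5 for each organism; whether that belongs in the plugin is a separate question.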

>Presently, source_id is constructed from with the hardwiring
>in the mapping vvvv that we're working on.
>   source_id = ASMBL_ID_2_stringMap + ASMBL_ID
>The source_id above is a unique name by virtue of ASMBL_ID.
>
>On the other hand, NAME is an attractive self-documenting GUS
>attribute that tempts me to use it for the query checking for novelty.
>In the plugin, NAME was/is set to CLONE_NAME, which seems fine,
>although with the caveat for TIGR, this is labeled as for their
>internal use.
>
>In the TIGR XML, the CLONE_NAME and ASMBL_ID are two alternative candidates
>for uniquely naming a sequence. TIGR documents CLONE_NAME for internal
>use and ASMBL_ID for external use in uniquely identifying a sequence.
>
>I punted for now by adding the condition to the query
>   $ena_gus->setName($T->{ASSEMBLY}->{HEADER}->{CLONE_NAME})
>i.e., the attributes
>   i) taxonomy, ii) external_release_id, and iii) CLONE_NAME
>
>What do you think?

Perhaps this will work well. I have only looked at the XML files of
falciparum and yoelii so far, and only one revision of each, so I do not
have much experience with all this.

Otherwise, we should think of a strategy (a method within the plugin) to
modify source_id after collecting the field contents of some XML
tag/attribute, given some choice parameters. IMHO this doesn't sound like
a good way to go, for various reasons.

Bindu

>Terry