From: Terry C. <tw...@cs...> - 2003-07-21 04:17:09
|
Hi Bindu and others,

You might have seen that I made a few minor changes to
  GUS::Common::Plugin::LoadGeneFeaturesFromXML::makeChromosome()
to get the TIGR Arabidopsis genome uploaded into GUS working,
because the sequence identification mapping is not yet worked out.
I have some questions and comments about this.

In the plugin at makeChromosome() there is a query performed
to determine whether the submitted sequence is in the db.
The approach that is in place uses a query selecting entries
based on attribute values for:
  i) taxonomy, ii) external_release_id, and iii) source_id
There is some code commented out here, since the plugin
is under construction.

It's not clear to me that the intent of source_id is
to uniquely identify the sequence. Is that so?
Presently, source_id is constructed with the hardwired
mapping below, which we're still working on:
  source_id = ASMBL_ID_2_stringMap + ASMBL_ID
The source_id above is a unique name by virtue of ASMBL_ID.

On the other hand, NAME is an attractive self-documenting GUS
attribute that tempts me to use it for the query checking for novelty.
In the plugin, NAME was/is set to CLONE_NAME, which seems fine,
although with the caveat that TIGR labels this as being for their
internal use.

In the TIGR XML, the CLONE_NAME and ASMBL_ID are two alternative candidates
for uniquely naming a sequence. TIGR documents CLONE_NAME for internal
use and ASMBL_ID for external use in uniquely identifying a sequence.

I punted for now by adding the condition to the query
  $ena_gus->setName($T->{ASSEMBLY}->{HEADER}->{CLONE_NAME})
i.e., the attributes
  i) taxonomy, ii) external_release_id, and iii) CLONE_NAME

What do you think?

Terry |
From: Steve F. <st...@pc...> - 2003-07-21 12:52:23
|
terry-

in general external_database_release_id and source_id should be unique, if, of course, the source_id is a primary key for the external db.

steve |
From: Bindu G. <bi...@sa...> - 2003-07-21 14:52:17
|
Terry -

At 23:17 -0500 7/20/03, Terry Clark wrote:
>Hi Bindu and others,
>You might have seen that I made a few minor
>changes to
>  GUS::Common::Plugin::LoadGeneFeaturesFromXML::makeChromosome()
>to get the TIGR Arabidopsis genome uploaded into GUS working
>because the sequence identification mapping is not yet worked out.
>I have some questions and comments about this.
>
>In the plugin at makeChromosome() there is a query performed
>to determine whether the submitted sequence is in the db.
>The approach that is in place uses a query selecting entries
>based on attributes values for:
>  i) taxonomy, ii) external_release_id, and iii) source_id
>There is some code commented out here, since the plugin
>is under construction.

You are right that
(taxon_id, external_database_release_id, source_id) is being used
to attempt retrieval of the sequence.

  # NOTE: for falciparum XML, map_asmbl_id_to_source_id method was used to set
  # the source_id appropriately.
  $enaSeq{source_id} = $self->map_asmbl_id_to_source_id($T);

... i.e. picking source_id from TIGR XML

The code commented out in the plugin, before attempting retrieval
of the externalNASequence, is:

  # NOTE: for P_yoelii, source_id is of the form: chrPyl_(\d\d\d\d\d)
  # so, source_id needs to be cushioned with 0s (zeroes)
  #my $tmpStr = $T->{ASSEMBLY}->{ASMBL_ID}->{content};
  #while (length ($tmpStr) < 5) { $tmpStr = '0'.$tmpStr; }
  #$enaSeq{source_id} = 'chrPyl_' . $tmpStr;

It is commented out *not* because the plugin is still being worked on.
It exists because, for P_yoelii, the source_id is constructed from a line
like this in the XML file:

  <ASMBL_ID CLONE_NAME = "MALPY00111">111</ASMBL_ID>

The source_id here is chrPyl_00111 in GUS, and so it needed a different
mechanism for the source_id than the P falciparum XML files.

>It's not clear to me that the intent of source_id is
>to uniquely identify the sequence. Is that so?

yes, that is so, along with taxon_id and external_database_release_id.
>Presently, source_id is constructed from with the hardwiring
>in the mapping vvvv that we're working on.
>  source_id = ASMBL_ID_2_stringMap + ASMBL_ID
>The source_id above is a unique name by virtue of ASMBL_ID.
>
>On the other hand, NAME is an attractive self-documenting GUS
>attribute that tempts me to use it for the query checking for novelty.
>In the plugin, NAME was/is set to CLONE_NAME, which seems fine,
>although with the caveat for TIGR, this is labeled as for their
>internal use.
>
>In the TIGR XML, the CLONE_NAME and ASMBL_ID are two alternative candidates
>for uniquely naming a sequence. TIGR documents CLONE_NAME for internal
>use and ASMBL_ID for external use in uniquely identifying a sequence.
>
>I punted for now by adding the condition to the query
>  $ena_gus->setName($T->{ASSEMBLY}->{HEADER}->{CLONE_NAME})
>i.e., the attributes
>  i) taxonomy, ii) external_release_id, and iii) CLONE_NAME
>
>What do you think?

Perhaps this will work well. I have only looked at XML files of
falciparum and yoelii so far, and only one revision of each, so
I don't have much experience with all this.

Otherwise, we should think of a strategy (a method within the plugin) to
modify source_id after collecting field contents of some XML
tag/attribute, and having specified some choice parameters. IMHO
this doesn't sound like a good way to go, for various reasons.

Bindu

>Terry |
From: Jonathan C. <cra...@pc...> - 2003-07-22 15:07:31
|
Terry, Bindu-

Bindu Gajria wrote:

> You are right about the fact that
> (taxon_id, external_database_release_id, source_id) is being used
> to attempt retrieval of the sequence.

As I think Steve mentioned, whenever we want to directly reference an
entry in an external database from within GUS, we use the following
pair of attributes in the referencing table (ExternalNASequence in this
case):

  external_database_release_id
  source_id

The naming is obviously bad, because there's nothing to indicate that the
two columns are related in any way. We have adopted the convention that
this pair of attributes should *uniquely* identify a single entry in the
referenced database. In practice, however, there is nothing to enforce this
convention, and so it has likely been violated in our current database.
Here's an example where there appear to be duplicate rows for a given
external_database_release_id, source_id (which actually may or may not be
a violation of this specific convention, depending on whether the two
sequences in question really are the same):

SQL> select external_database_release_id from dots.externalnasequence where source_id = 'AJ276847';

EXTERNAL_DATABASE_RELEASE_ID
----------------------------
                           2
                           2

In the above SQL you'll also note that we're not using the full GenBank
accession number (e.g. AJ276847.1), which means that to find the correct
GenBank entry you have to do one of two things:

1. Look at the NCBI "GI", stored (also by convention) in the
   secondary_identifier column. (In my example above one of the two
   sequences doesn't have a secondary_identifier.)
2. Use the GenBank release number (133) stored in the
   sres.ExternalDatabaseRelease table, to determine which version of
   AJ276847 was the one included in GenBank release 133.
#2 is extremely inconvenient and so, although we may be sticking to the
letter of our convention here (that external_database_release_id and
source_id alone uniquely identify the referenced entry), we're not doing
a good job of following its spirit.

> Code commented out in the plugin, before attempting retrieval
> of the externalNASequence, is :
>
>   # NOTE: for P_yoelii, source_id is of the form: chrPyl_(\d\d\d\d\d)
>   # so, source_id needs to be cushioned with 0s (zeroes)
>   #my $tmpStr = $T->{ASSEMBLY}->{ASMBL_ID}->{content};
>   #while (length ($tmpStr) < 5) { $tmpStr = '0'.$tmpStr; }
>   #$enaSeq{source_id} = 'chrPyl_' . $tmpStr;
>
> The comment is *not* because plugin is still being worked on. It
> exists as the source_id is being constructed from such a line in
> the P_yoelii XML file:
>
>   <ASMBL_ID CLONE_NAME = "MALPY00111">111</ASMBL_ID>
>
> source_id here is chrPyl_00111 in GUS, and so needed the different
> (than the P falciparum XML files) mechanism for the source_id

Are you saying that to run the plugin on yoelii one has to uncomment the
above code? If so then I agree with Terry's assessment that the code is
still under construction; at the very least it should be possible to check
(or require that the user specify) the taxon/species of the data to be
loaded, and then select a method of constructing the source_id based on
that. Making the user comment or uncomment code in order to get a program
to work is not a good practice in general (though I'm certainly guilty of
doing it in code that only I use) and it's an even worse practice in the
GUS plugin framework, because when you make any change to the plugin the
system will force you to update the AlgorithmImplementation entry in the
database.
Therefore alternately running the plugin on falciparum and yoelii
would result in a long stream of apparent updates to the plugin (in the
AlgorithmImplementation table) even though in reality nothing has changed
except the commenting/uncommenting of this small piece of code.

(p.s. a single "sprintf" would be a much more concise way of generating
the 0-padded yoelii id than the while loop.)

>> It's not clear to me that the intent of source_id is
>> to uniquely identify the sequence. Is that so?
>
> yes, that is so, along with taxon_id and external_database_release_id.

Strike taxon_id from the above statement.

>> In the TIGR XML, the CLONE_NAME and ASMBL_ID are two alternative
>> candidates
>> for uniquely naming a sequence. TIGR documents CLONE_NAME for internal
>> use and ASMBL_ID for external use in uniquely identifying a sequence.

In that case we would use ASMBL_ID as the source_id and CLONE_NAME as the
secondary_identifier. We typically only use the GUS "name" column if we
have a third id or name that hasn't been stored in either the source_id or
the secondary_identifier. When loading GenBank entries we use it for the
locus_id.

Jonathan

p.s. The above discussion covers direct links to external databases. We
also have indirect links, which look like this:

  dots.NAFeature -> dots.NADbRefNAFeature -> sres.DbRef -> [external db]

Typically the direct links are used when there is a clear 1-1 relationship
between an entry in GUS and an entry in an external database, and the usual
interpretation of such a link is that the GUS entry was *loaded from* the
referenced external database entry. The links that go through DbRef, on the
other hand, need not be 1-1, and the semantics are more varied (i.e. this
thing links to that one, but they're not necessarily the same, and nor is
one necessarily loaded from or derived from the other.) |
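
A minimal sketch of the two suggestions in the message above: choosing the source_id rule by species instead of commenting code in and out, and zero-padding the yoelii id with a single sprintf. It also maps ASMBL_ID to source_id and CLONE_NAME to secondary_identifier as proposed. The names %source_id_builder and build_ena_attributes are hypothetical and not part of the plugin, and the P_falciparum entry is only a pass-through placeholder, since the logic of the plugin's map_asmbl_id_to_source_id method is not shown in this thread.

```perl
use strict;
use warnings;

# Hypothetical dispatch table: one source_id rule per species.
# Only the P_yoelii format (chrPyl_ + 5-digit zero-padded ASMBL_ID)
# comes from the thread; the P_falciparum rule is a placeholder.
my %source_id_builder = (
    P_yoelii     => sub { my ($asmbl_id) = @_; return sprintf('chrPyl_%05d', $asmbl_id) },
    P_falciparum => sub { my ($asmbl_id) = @_; return $asmbl_id },
);

# Build the identifying attributes for an ExternalNASequence row.
# ASMBL_ID feeds source_id; CLONE_NAME goes to secondary_identifier.
sub build_ena_attributes {
    my ($species, $asmbl_id, $clone_name) = @_;
    my $builder = $source_id_builder{$species}
        or die "No source_id rule for species '$species'\n";
    return {
        source_id            => $builder->($asmbl_id),
        secondary_identifier => $clone_name,
    };
}

# Values taken from the P_yoelii XML line quoted in the thread.
my $attrs = build_ena_attributes('P_yoelii', 111, 'MALPY00111');
print "$attrs->{source_id} $attrs->{secondary_identifier}\n";
# prints "chrPyl_00111 MALPY00111"
```

With a table like this the species could be supplied as a plugin argument, so switching between falciparum and yoelii would not change the plugin source and would not trigger a new AlgorithmImplementation entry. Note that %05d assumes a numeric ASMBL_ID, which holds for the example in the thread.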