From: Jonathan C. <cra...@pc...> - 2003-07-22 15:07:31
Terry, Bindu-

Bindu Gajria wrote:
> You are right about the fact that
> (taxon_id, external_database_release_id, source_id) is being used
> to attempt retrieval of the sequence.

As I think Steve mentioned, whenever we want to directly reference an
entry in an external database from within GUS, we use the following
pair of attributes in the referencing table (ExternalNASequence in this
case):

  external_database_release_id
  source_id

The naming is obviously bad, because there's nothing to indicate that
the two columns are related in any way. We have adopted the convention
that this pair of attributes should *uniquely* identify a single entry
in the referenced database. In practice, however, there is nothing to
enforce this convention, and so it has likely been violated in our
current database. Here's an example where there appear to be duplicate
rows for a given external_database_release_id, source_id (which
actually may or may not be a violation of this specific convention,
depending on whether the two sequences in question really are the
same):

  SQL> select external_database_release_id
       from dots.externalnasequence
       where source_id = 'AJ276847';

  EXTERNAL_DATABASE_RELEASE_ID
  ----------------------------
                             2
                             2

In the above SQL you'll also note that we're not using the full GenBank
accession number (e.g. AJ276847.1), which means that to find the
correct GenBank entry you have to do one of two things:

1. Look at the NCBI "GI", stored (also by convention) in the
   secondary_identifier column. (In my example above one of the two
   sequences doesn't have a secondary_identifier.)

2. Use the GenBank release number (133) stored in the
   sres.ExternalDatabaseRelease table, to determine which version of
   AJ276847 was the one included in GenBank release 133.

#2 is extremely inconvenient and so, although we may be sticking to the
letter of our convention here (that external_database_release_id and
source_id alone uniquely identify the referenced entry), we're not
doing a good job of following its spirit.

> Code commented out in the plugin, before attempting retrieval
> of the externalNASequence, is :
>
> # NOTE: for P_yoelii, source_id is of the form: chrPyl_(\d\d\d\d\d)
> # so, source_id needs to be cushioned with 0s (zeroes)
> #my $tmpStr = $T->{ASSEMBLY}->{ASMBL_ID}->{content};
> #while (length ($tmpStr) < 5) { $tmpStr = '0'.$tmpStr; }
> #$enaSeq{source_id} = 'chrPyl_' . $tmpStr;
>
> The comment is *not* because plugin is still being worked on. It
> exists as the source_id is being constructed from such a line in
> the P_yoelii XML file:
>
> <ASMBL_ID CLONE_NAME = "MALPY00111">111</ASMBL_ID>
>
> source_id here is chrPyl_00111 in GUS, and so needed the different
> (than the P falciparum XML files) mechanism for the source_id

Are you saying that to run the plugin on yoelii one has to uncomment
the above code? If so then I agree with Terry's assessment that the
code is still under construction; at the very least it should be
possible to check (or require that the user specify) the taxon/species
of the data to be loaded, and then select a method of constructing the
source_id based on that. Making the user comment or uncomment code in
order to get a program to work is not a good practice in general
(though I'm certainly guilty of doing it in code that only I use) and
it's an even worse practice in the GUS plugin framework, because when
you make any change to the plugin the system will force you to update
the AlgorithmImplementation entry in the database.
Therefore alternately running the plugin on falciparum and yoelii would
result in a long stream of apparent updates to the plugin (in the
AlgorithmImplementation table) even though in reality nothing has
changed except the commenting/uncommenting of this small piece of code.

(p.s. a single "sprintf" would be a much more concise way of generating
the 0-padded yoelii id than the while loop.)

>> It's not clear to me that the intent of source_id is
>> to uniquely identify the sequence. Is that so?
>
> yes, that is so, along with taxon_id and external_database_release_id.

Strike taxon_id from the above statement.

>> In the TIGR XML, the CLONE_NAME and ASMBL_ID are two alternative
>> candidates
>> for uniquely naming a sequence. TIGR documents CLONE_NAME for internal
>> use and ASMBL_ID for external use in uniquely identifying a sequence.

In that case we would use ASMBL_ID as the source_id and CLONE_NAME as
the secondary_identifier. We typically only use the GUS "name" column
if we have a third id or name that hasn't been stored in either the
source_id or the secondary_identifier. When loading GenBank entries we
use it for the locus_id.

Jonathan

p.s. The above discussion covers direct links to external databases.
We also have indirect links, which look like this:

  dots.NAFeature -> dots.NADbRefNAFeature -> sres.DbRef -> [external db]

Typically the direct links are used when there is a clear 1-1
relationship between an entry in GUS and an entry in an external
database, and the usual interpretation of such a link is that the GUS
entry was *loaded from* the referenced external database entry. The
links that go through DbRef, on the other hand, need not be 1-1, and
the semantics are more varied (i.e. this thing links to that one, but
they're not necessarily the same, and nor is one necessarily loaded
from or derived from the other.)
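
To make the sprintf remark and the "select a method based on the
species" suggestion concrete, here is a minimal sketch in Perl. It is
not the plugin's actual code. The %SOURCE_ID_BUILDER table, the
build_source_id function, and the species keys are all hypothetical
names, and the falciparum rule (ASMBL_ID used unchanged) is purely a
placeholder assumption. Only the yoelii format (chrPyl_ followed by the
ASMBL_ID zero-padded to 5 digits) comes from the quoted plugin code.

```perl
use strict;
use warnings;

# Hypothetical lookup table: one source_id builder per species.
# Only the yoelii format is taken from the thread; the falciparum
# entry is a placeholder assumption, not the plugin's real rule.
my %SOURCE_ID_BUILDER = (
    # sprintf '%05d' zero-pads the numeric ASMBL_ID to 5 digits,
    # replacing the while loop in the commented-out code.
    yoelii     => sub { sprintf('chrPyl_%05d', $_[0]) },
    falciparum => sub { $_[0] },    # ASSUMPTION: ASMBL_ID used as-is
);

# Build the source_id for a given species, failing loudly if no
# construction rule has been registered for that species.
sub build_source_id {
    my ($species, $asmbl_id) = @_;
    my $builder = $SOURCE_ID_BUILDER{$species}
        or die "No source_id rule for species '$species'\n";
    return $builder->($asmbl_id);
}

print build_source_id('yoelii', 111), "\n";    # prints chrPyl_00111
```

With something along these lines the species could be passed in as a
plugin argument, so switching between falciparum and yoelii would not
require editing the plugin source (and so would not trigger a new
AlgorithmImplementation entry).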