Re: [XMLPipeDB-developer] GenMAPP multitaxon support - CMSI 486T

Greetings,

Sorry for the delay.  I wasn't able to walk through the relevant code until this evening.

As Kam said, GOA serves as the link between the UniProt and GO IDs.  The export code essentially determines which GO IDs get exported by using GOA to see which GO IDs are associated with an exported UniProt ID.  The populateUniprotGoTableFromSQL method, in its current form, extracts the GO association records that match the given taxon ID, then exports, as UniProt-GO pairs, the GO and UniProt IDs referenced within each of those records.  The processing that follows is then based on the GO IDs that got exported, and that's how the current code avoids exporting the entire list of GO terms.

The operative query is on the second line of populateUniprotGoTableFromSQL:

	String uniProtAndGOIDSQL = "select db_object_id, go_id, evidence_code, with_or_from from goa where db like '%UniProt%' and taxon = 'taxon:" + taxon + "'"; 

In plain English, this selects the GOA records whose database is UniProt and whose taxon ID is the given taxon.  An additional condition is added for the "aspect" (All, Component, Function, or Process) that is to be exported.  This is another reduction filter, to further shrink the number of exported GO terms and thus avoid MAPPFinder issues later on.

Given this, the proper expansion here is to change the single-taxon predicate into one that matches any of several taxon IDs.  That is, this method can be changed to accept a collection or array of taxon IDs, and the base query should then be changed so that it accepts any taxon from that collection.  More or less, you want:

    private void populateUniprotGoTableFromSQL(char chosenAspect, int[] taxons) throws SQLException {

...then, instead of the single string, you want to iterate through the taxon IDs:

    StringBuilder baseQueryBuilder = new StringBuilder("select db_object_id, go_id, evidence_code, with_or_from from goa where db like '%UniProt%'");
    // Append one "taxon = 'taxon:<id>'" condition per taxon ID, joined by "or".
    boolean first = true;
    for (int taxon: taxons) {
        baseQueryBuilder.append(first ? " and (" : " or ");
        baseQueryBuilder
            .append("taxon = 'taxon:")
            .append(taxon).append("'");
        first = false;
    }
    // Close the parenthesis only if it was opened (i.e., taxons was not empty).
    if (!first) {
        baseQueryBuilder.append(")");
    }

...and so on.  I just sort of rattled this off so there may be little glitches, but anyway this is just to give you an overall idea.
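
If you would rather not concatenate the taxon IDs into the SQL text, the same condition can be written as an "in" list with one placeholder per taxon ID and then bound through a PreparedStatement.  The following is only a rough sketch of that idea, not what the current code does; the class name GoaTaxonQuery and both of its methods are made up for illustration:

```java
import java.util.StringJoiner;

class GoaTaxonQuery {
    // Hypothetical helper: builds the GOA query text with one "?" placeholder
    // per taxon ID, so the taxon values can be bound instead of concatenated.
    static String buildQuery(int taxonCount) {
        if (taxonCount < 1) {
            throw new IllegalArgumentException("at least one taxon ID is required");
        }
        StringJoiner placeholders = new StringJoiner(", ", "(", ")");
        for (int i = 0; i < taxonCount; i++) {
            placeholders.add("?");
        }
        return "select db_object_id, go_id, evidence_code, with_or_from from goa"
            + " where db like '%UniProt%' and taxon in " + placeholders;
    }

    // Hypothetical helper: the values to bind, in order, using the same
    // 'taxon:<id>' format that the existing query compares against.
    static String[] taxonValues(int[] taxons) {
        String[] values = new String[taxons.length];
        for (int i = 0; i < taxons.length; i++) {
            values[i] = "taxon:" + taxons[i];
        }
        return values;
    }
}
```

Each string from taxonValues would then be bound with PreparedStatement.setString at positions 1, 2, and so on.  The aspect condition would still need to be appended as it is today.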

Put another way, no, you do not need to call this method once per taxon ID.  Instead, you can still call it once; the multiple taxon IDs are handled by the condition that selects the GO terms to be exported (based on the available GOA records, which as you may recall are loaded from .goa files).

As a side note, right here you have an opportunity for a little sanity check regarding the content of the relational database: GO terms will only be exported if GOA records for the desired taxon IDs have been imported into the database.  So, as a pre-flight check, one can see if there are any GOA records at all for each chosen taxon ID.  If there are none, then the .goa file for that species needs to be imported into the relational database.
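
To make that pre-flight check concrete, here is a minimal sketch.  It assumes the same goa table and 'taxon:<id>' format as the query above; the class GoaPreflightCheck and its method names are hypothetical and do not exist in the current code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class GoaPreflightCheck {
    // Hypothetical: returns true if the goa table has at least one record
    // for the given taxon ID.
    static boolean hasGoaRecords(Connection connection, int taxon) throws SQLException {
        String sql = "select count(*) from goa where taxon = ?";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, "taxon:" + taxon);
            try (ResultSet results = statement.executeQuery()) {
                return results.next() && results.getInt(1) > 0;
            }
        }
    }

    // Hypothetical: collects the taxon IDs for which the supplied check finds
    // no GOA records; these are the species whose .goa file still needs importing.
    static List<Integer> taxonsWithoutRecords(int[] taxons, IntPredicate hasRecords) {
        List<Integer> missing = new ArrayList<>();
        for (int taxon : taxons) {
            if (!hasRecords.test(taxon)) {
                missing.add(taxon);
            }
        }
        return missing;
    }
}
```

The check is passed in as a predicate so that the list logic can be exercised without a database; in the export code the predicate would wrap hasGoaRecords and deal with its SQLException.  If the returned list is not empty, the export could stop and report which .goa files are missing.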

Hope this helps...

John David N. Dionisio, PhD
Associate Professor, Computer Science
Loyola Marymount University

On Aug 4, 2011, at 1:00 PM, Kam Dahlquist wrote:

> Hi,
> 
> Dondi will have to chime in on this, but I think this is where things are going to get tricky.
> 
> The final gdb does not actually contain the entire GO, it gets trimmed somehow based on the GO associations for a particular species.  This is because MAPPFinder cannot handle loading the entire GO.  Since there is some type of species-specific trimming going on, it's quite possible that this will need to iterate.
> 
> However, I don't have the foggiest idea of how this works, so Dondi will have to chime in.
> 
> Best,
> Kam
> 
> At 12:09 AM 8/4/2011, you wrote:
>> Wednesday 8/3/11 progress:
>>  
>> 1. After following the ExportPanel1.java ground zero code of: databaseProfile.setSelectedSpeciesProfile( selectedProfile );
>>  
>> I found the method in DatabaseProfile.java plus a getter method;
>> SpeciesProfile setSelectedSpeciesProfile( speciesProfile ) and SpeciesProfile getSelectedSpeciesProfile( speciesProfile )
>>  
>> I created two new methods that each handle List<Object> of SpeciesProfiles argument instead of a single SpeciesProfile; setSelectedSpeciesProfiles and getSelectedSpeciesProfiles.
>>  
>> This enabled the ExportPanel1 ground zero code to become: databaseProfile.setSelectedSpeciesProfiles(selectedSpecies);
>>  
>> 2. public static void export() on line 104 in ExportToGenMAPP.java
>>  
>> On line 107 ExportGoData is instantiated which I found in ExportGoData.java and calls a method: public void export(char chosenAspect, int taxon). 
>> 
>> Within export, taxon id is required for another method: private void populateGoTables(char chosenAspect, int taxon).
>> 
>> Within populateGoTables, taxon id is required for another method: private void populateUniprotGoTableFromSQL( char chosenAspect, int taxon).
>> 
>> But, if the export to GDB process starts off with exporting GO data, doesn't it only need to do that once no matter how many species are selected? As you probably realize, I'm leading towards not having to iterate through this for each taxon id if possible.
>> 
>> Also, how does the export actually work? How are GO ids and UniProt ids related within the table?
>> 
>> Thanks!
>> 
>> Richard
>>  
>> 