Re: [XMLPipeDB-developer] GenMAPP multitaxon support - CMSI 486T
                
      From: Richard B. <rbr...@gm...> - 2011-08-12 16:25:28
      
Saw your comments, some silly mistakes you caught.
I re-ran the export using your changes, checked original row counts and they
were identical to the previous two exports. The file size was also near
identical to the multi-species aware export at 21.0 MB (22,036,480 bytes).
I'm stumped on this but did confirm that GO data was processed.
Richard
On Thu, Aug 11, 2011 at 3:21 PM, John David N. Dionisio <do...@lm...> wrote:
> Greetings,
>
> Sorry for the delay; stuff kept interceding through last night and today.
>
> I did finally manage to look at your latest commits then committed back
> some tweaks and comments.  Look in particular at the stuff I mentioned
> regarding taxonIDs.  If things aren't self-explanatory, just holler.
>
> I also scanned your changes to see if anything might have affected the file
> size.  Nothing jumped at me, unfortunately.  Did you do the "Process GO
> Data" step prior to exporting?  That's the only [highly slim] lead that I
> could think of.
>
> John David N. Dionisio, PhD
> Associate Professor, Computer Science
> Loyola Marymount University
>
>
>  On Aug 10, 2011, at 10:07 PM, Richard Brous wrote:
>
> > OK, will hold up until we discuss further...
> >
> >
> >
> > **On a side note I exported two M. smegmatis gdb's to see how they compared and found the following:
> >
> > pre multispecies aware build b64: gdb filesize = 24.5 MB (25,792,512 bytes)
> >
> > multispecies current working copy: gdb filesize = 20.8 MB (21,905,408 bytes)
> >
> > BUT when I checked original row counts for each they were identical... I even checked a second time to be sure...
> >
> > See attached spreadsheet and I'll post the files on the biodb wiki as well in a few.
> >
> >
> > I'm highly suspicious of the file sizes being that different with identical original row counts.
> >
> > Dondi would you take a look at my committed code to see if there are glaring issues I am not aware of?
> >
> > Thanks.
> >
> > Richard
> >
> > On Wed, Aug 10, 2011 at 9:45 PM, John David N. Dionisio <do...@lm...> wrote:
> > Greetings,
> >
> > I think we have to turn to Dr. Dahlquist's GenMAPP knowledge here to get the definitive answer.  I see two choices:
> >
> > - The Info table should have one record for each species that the .gdb holds, in which case the change you need is to wrap that single submit call inside a loop, so that submit is called once for each chosen species.
> >
> > - The Info table should always have one record, and if the .gdb holds multiple species, the "Species" column should be some concatenation of multiple species names.  In this case, you would still call submit only once, but the value you send into the "Species" column is some accumulation of all chosen species names.
> >
> > Admittedly I don't know which way is right (I assumed the former as of our Tuesday meeting, but on further examination I'm no longer quite so sure).
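
[Editor's sketch] The two choices above can be compared side by side. This is a minimal Java sketch, not the project's code: the class name `InfoRowSketch`, both method names, the reduced column set (only "Owner" and "Species"), and the separator are all assumptions made for illustration. The row layout mirrors the `String[][]` column/value pairs passed to `submit` elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; only the column/value pair layout comes from the thread.
class InfoRowSketch {

    // Choice 1: one Info row per species. Each element of the returned list
    // would be handed to its own submit call inside a loop.
    static List<String[][]> rowsPerSpecies(String owner, List<String> speciesNames) {
        List<String[][]> rows = new ArrayList<>();
        for (String name : speciesNames) {
            rows.add(new String[][] { { "Owner", owner }, { "Species", name } });
        }
        return rows;
    }

    // Choice 2: a single Info row whose "Species" value joins every name.
    // The separator is an assumption; the thread leaves the format open.
    static String[][] singleRow(String owner, List<String> speciesNames, String separator) {
        return new String[][] {
            { "Owner", owner },
            { "Species", String.join(separator, speciesNames) }
        };
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Species A", "Species B"); // placeholder names
        System.out.println(rowsPerSpecies("owner", names).size() + " rows under choice 1");
        System.out.println(singleRow("owner", names, "|")[1][1]);
    }
}
```

Either way the surrounding export code stays the same; only the number of submit calls and the content of the "Species" value differ, which is why the question to Kam decides the design.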
> >
> > For Kam --- what does GenMAPP expect to see in the Info table if the opened .gdb contains multiple species?
> >
> > John David N. Dionisio, PhD
> > Associate Professor, Computer Science
> > Loyola Marymount University
> >
> >
> > On Aug 10, 2011, at 9:36 PM, Richard Brous wrote:
> >
> > > OK, continued to review ExportToGenMAPP and dug into the creation of the first TableManager tmA on line 118.
> > >
> > > In reading through the method, my understanding is that it creates a new TableManager based on the selectedDatabaseProfile (which is UniProt).
> > >
> > > This is performed by the method getInfoTableManager(), which then calls the method submit(String tableName, QueryType queryType, String[][] columnNamesToValues);
> > >
> > > the code is as follows:
> > >
> > > tableManager.submit("Info", QueryType.insert, new String[][] {
> > >     { "Owner", owner },
> > >     { "Version", new SimpleDateFormat("yyyyMMdd").format(version) },
> > >     { "MODSystem", modSystem },
> > >     { "Species", speciesProfile.getSpeciesName() },
> > >     { "Modify", new SimpleDateFormat("yyyyMMdd").format(modify) },
> > >     { "DisplayOrder", displayOrder },
> > >     { "Notes", notes }
> > > });
> > >
> > >
> > > The modification of this line centers on { "Species", speciesProfile.getSpeciesName() }, since it originally processed a single species.
> > >
> > > So now I need to populate the arguments with the species contained within selectedDatabaseProfile.selectedSpeciesProfiles.
> > >
> > > I think I'll start with the baseArgument up to MODSystem, then append as many species as necessary, and then cap off the end with the rest starting at Modify (similar to your approach in ExportGoData, populateUniprotGoTableFromSQL(char chosenAspect, List<Integer> taxonIds), line 513).
> > >
> > > Please let me know if this approach or analysis is off track.
> > >
> > > Thanks!
> > >
> > > Richard
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Aug 10, 2011 at 5:50 PM, Richard Brous <rbr...@gm...> wrote:
> > > Updated repository to include all Gene Ontology changes discussed during our meeting yesterday.
> > >
> > > Digging into TableManager next.
> > >
> > > Richard
> > >
> > > On Fri, Aug 5, 2011 at 10:06 AM, Richard Brous <rbr...@gm...> wrote:
> > > whew... thanks for the detailed reply. I will digest this a bit and get back to you with further questions.
> > >
> > > rb
> > >
> > > On Thu, Aug 4, 2011 at 11:18 PM, John David N. Dionisio <do...@lm...> wrote:
> > > Greetings,
> > >
> > > Sorry for the delay.  I wasn't able to walk through the relevant code until this evening.
> > >
> > > As Kam said, GOA serves as the link between the UniProt and GO IDs.  It essentially determines which GO IDs get exported by using GOA to see which GO IDs are associated with an exported UniProt ID.  The populateUniprotGoTableFromSQL, in its current form, extracts the GO association records that match the given taxon ID then exports, as UniProt-GO pairs, the GO and UniProt IDs referenced within that GO association record.  Processing that follows this is then based on the GO IDs that got exported --- and that's how the current code avoids exporting the entire list of GO terms.
> > >
> > > The operative query is on the second line of populateUniprotGoTableFromSQL:
> > >
> > >        String uniProtAndGOIDSQL = "select db_object_id, go_id, evidence_code, with_or_from from goa where db like '%UniProt%' and taxon = 'taxon:" + taxon + "'";
> > >
> > > In plain English, this selects the GOA records whose database is UniProt and whose taxon ID is the given taxon.  An additional condition is added for the "aspect" (All, Component, Function, or Process) that is to be exported.  This is another reduction filter, to further shrink the number of exported GO terms and thus avoid MAPPFinder issues later on.
> > >
> > > Given this, the proper expansion here is to change the taxon predicate to a multiple predicate.  That is, this method can be changed to now accept a collection or array of taxon IDs, and the base query should then be changed so that it accepts any taxon from that collection.  More or less, you want:
> > >
> > >    private void populateUniprotGoTableFromSQL(char chosenAspect, int[] taxons) throws SQLException {
> > >
> > > ...then, instead of the single string, you want to iterate through the taxon IDs:
> > >
> > >    StringBuilder baseQueryBuilder = new StringBuilder("select db_object_id, go_id, evidence_code, with_or_from from goa where db like '%UniProt%'");
> > >    boolean first = true;
> > >    for (int taxon: taxons) {
> > >        // The first taxon opens the group; later ones are joined with "or".
> > >        baseQueryBuilder.append(first ? " and (" : " or ");
> > >        baseQueryBuilder
> > >            .append("taxon = 'taxon:")
> > >            .append(taxon).append("'");
> > >        first = false;
> > >    }
> > >    // Close the group only if at least one taxon opened it.
> > >    if (!first) {
> > >        baseQueryBuilder.append(")");
> > >    }
> > >
> > > ...and so on.  I just sort of rattled this off so there may be little glitches, but anyway this is just to give you an overall idea.
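
[Editor's sketch] The same multi-taxon predicate can also be written as an `in (...)` list with bind placeholders instead of a chain of `or` terms. This is a hedged alternative, not what the thread's code does: the class `TaxonQuerySketch` and its two methods are hypothetical, and only the base query text and the `taxon:<id>` value form are taken from the message above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that builds the query text and the values to bind.
class TaxonQuerySketch {

    // Base query as quoted in the thread.
    static final String BASE = "select db_object_id, go_id, evidence_code, with_or_from "
            + "from goa where db like '%UniProt%'";

    // Appends "and taxon in (?, ?, ...)" with one placeholder per taxon ID.
    static String buildQuery(List<Integer> taxonIds) {
        if (taxonIds.isEmpty()) {
            throw new IllegalArgumentException("at least one taxon ID is required");
        }
        String placeholders = String.join(", ", Collections.nCopies(taxonIds.size(), "?"));
        return BASE + " and taxon in (" + placeholders + ")";
    }

    // Values to bind in placeholder order, using the 'taxon:<id>' form.
    static List<String> bindValues(List<Integer> taxonIds) {
        List<String> values = new ArrayList<>();
        for (int id : taxonIds) {
            values.add("taxon:" + id);
        }
        return values;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(111, 222); // placeholder taxon IDs
        System.out.println(buildQuery(ids));
        System.out.println(bindValues(ids));
    }
}
```

With a `java.sql.PreparedStatement` created from `buildQuery(ids)`, each value would be bound with `setString(i + 1, values.get(i))`. Rejecting an empty list up front avoids emitting a query with an empty group.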
> > >
> > > Put another way, no, you do not need to iterate this method for each taxon ID.  Instead, you can still call this method once, with the multiplicity of taxon IDs emerging in terms of the actual condition used for selecting the GO terms to be exported (based on the available GOA records, which as you may recall are loaded from .goa files).
> > >
> > > As a side note, right here you have an opportunity for a little sanity check regarding the content of the relational database: GO terms will only be exported if GOA records for the desired taxon IDs have been imported into the database.  So, as a pre-flight check, one can see if there are any GOA records at all for each chosen taxon ID.  If there are none, then the .goa file for that species needs to be imported into the relational database.
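
[Editor's sketch] The pre-flight check described above might look like the following. Everything here is an assumption for illustration: the class `GoaPreflightSketch`, the method `taxonsMissingGoa`, and the `recordCount` callback, which stands in for running a count query against the `goa` table so the sketch runs without a database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntToLongFunction;

// Hypothetical pre-flight helper; not part of the project's code.
class GoaPreflightSketch {

    // Count query one might run per taxon ID, binding a 'taxon:<id>' string.
    // Table and column names follow the query quoted in the thread.
    static final String COUNT_SQL = "select count(*) from goa where taxon = ?";

    // Returns the taxon IDs that have no GOA records at all.
    // recordCount stands in for executing COUNT_SQL for a given taxon ID.
    static List<Integer> taxonsMissingGoa(List<Integer> taxonIds, IntToLongFunction recordCount) {
        List<Integer> missing = new ArrayList<>();
        for (int id : taxonIds) {
            if (recordCount.applyAsLong(id) == 0) {
                missing.add(id);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder IDs; pretend only taxon 222 has no imported GOA records.
        List<Integer> missing = taxonsMissingGoa(Arrays.asList(111, 222, 333),
                id -> id == 222 ? 0L : 42L);
        System.out.println("Import .goa files for taxon IDs: " + missing);
    }
}
```

A non-empty result would mean the export should stop, or at least warn, before building a .gdb with no GO data for those species.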
> > >
> > > Hope this helps...
> > >
> > > John David N. Dionisio, PhD
> > > Associate Professor, Computer Science
> > > Loyola Marymount University
> > >
> > >
> > > On Aug 4, 2011, at 1:00 PM, Kam Dahlquist wrote:
> > >
> > > > Hi,
> > > >
> > > > Dondi will have to chime in on this, but I think this is where things are going to get tricky.
> > > >
> > > > The final gdb does not actually contain the entire GO, it gets trimmed somehow based on the GO associations for a particular species.  This is because MAPPFinder cannot handle loading the entire GO.  Since there is some type of species-specific trimming going on, it's quite possible that this will need to iterate.
> > > >
> > > > However, I don't have the foggiest idea of how this works, so Dondi will have to chime in.
> > > >
> > > > Best,
> > > > Kam
> > > >
> > > > At 12:09 AM 8/4/2011, you wrote:
> > > >> Wednesday 8/3/11 progress:
> > > >>
> > > >> 1. After following the ExportPanel1.java ground zero code of: databaseProfile.setSelectedSpeciesProfile( selectedProfile );
> > > >>
> > > >> I found the method in DatabaseProfile.java plus a getter method: SpeciesProfile setSelectedSpeciesProfile( speciesProfile ) and SpeciesProfile getSelectedSpeciesProfile( speciesProfile )
> > > >>
> > > >> I created two new methods that each handle a List<Object> of SpeciesProfiles instead of a single SpeciesProfile: setSelectedSpeciesProfiles and getSelectedSpeciesProfiles.
> > > >>
> > > >> This enabled the ExportPanel1 ground zero code to become: databaseProfile.setSelectedSpeciesProfiles(selectedSpecies);
> > > >>
> > > >> 2. public static void export() on line 104 in ExportToGenMAPP.java
> > > >>
> > > >> On line 107 ExportGoData is instantiated, which I found in ExportGoData.java and which calls a method: public void export(char chosenAspect, int taxon).
> > > >>
> > > >> Within export, taxon id is required for another method: private void populateGoTables(char chosenAspect, int taxon).
> > > >>
> > > >> Within populateGoTables, taxon id is required for another method: private void populateUniprotGoTableFromSQL( char chosenAspect, int taxon).
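
[Editor's sketch] Item 1 above adds list-based accessors typed as `List<Object>`. A typed list is one possible refinement, and it also gives a natural place to collect the taxon IDs that the methods in item 2 need. The classes below are stripped-down, hypothetical stand-ins, not the project's `SpeciesProfile` and `DatabaseProfile`; in particular, the `getTaxonId` accessor and the `getSelectedTaxonIds` helper are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the project's SpeciesProfile.
class SpeciesProfileSketch {
    private final String speciesName;
    private final int taxonId; // assumed field; the real class may store this differently

    SpeciesProfileSketch(String speciesName, int taxonId) {
        this.speciesName = speciesName;
        this.taxonId = taxonId;
    }

    String getSpeciesName() { return speciesName; }
    int getTaxonId() { return taxonId; }
}

// Hypothetical stand-in for the project's DatabaseProfile.
class DatabaseProfileSketch {
    private List<SpeciesProfileSketch> selectedSpeciesProfiles = new ArrayList<>();

    // Copies the incoming list so later changes by the caller do not leak in.
    void setSelectedSpeciesProfiles(List<SpeciesProfileSketch> profiles) {
        this.selectedSpeciesProfiles = new ArrayList<>(profiles);
    }

    // Read-only view; no casting needed at the call site.
    List<SpeciesProfileSketch> getSelectedSpeciesProfiles() {
        return Collections.unmodifiableList(selectedSpeciesProfiles);
    }

    // Collects taxon IDs so one export call can receive all of them.
    List<Integer> getSelectedTaxonIds() {
        List<Integer> ids = new ArrayList<>();
        for (SpeciesProfileSketch profile : selectedSpeciesProfiles) {
            ids.add(profile.getTaxonId());
        }
        return ids;
    }

    public static void main(String[] args) {
        DatabaseProfileSketch db = new DatabaseProfileSketch();
        db.setSelectedSpeciesProfiles(Arrays.asList(
                new SpeciesProfileSketch("Species A", 111),   // placeholder values
                new SpeciesProfileSketch("Species B", 222)));
        System.out.println(db.getSelectedTaxonIds());
    }
}
```

Typing the list as the profile class rather than `Object` lets the compiler catch a wrong element type before the export runs.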
> > > >>
> > > >> But, if the export to GDB process starts off with exporting GO data, doesn't it only need to do that once no matter how many species are selected? As you probably realize, I'm leading towards not having to iterate through this for each taxon id if possible.
> > > >>
> > > >> Also, how does the export actually work? How are GO ids and UniProt ids related within the table?
> > > >>
> > > >> Thanks!
> > > >>
> > > >> Richard
> > > >>
> > > >>
> > >
> > >
> > >
> ------------------------------------------------------------------------------
> > > BlackBerry® DevCon Americas, Oct. 18-20, San Francisco, CA
> > > The must-attend event for mobile developers. Connect with experts.
> > > Get tools for creating Super Apps. See the latest technologies.
> > > Sessions, hands-on labs, demos & much more. Register early & save!
> > > http://p.sf.net/sfu/rim-blackberry-1
> > > _______________________________________________
> > > xmlpipedb-developer mailing list
> > > xml...@li...
> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> > >
> > >
> > >
> >
> >
> >
> >
> > <CompareOriginalRowCounts.xlsx><ATT00001..txt><ATT00002..txt>
>
>
>
>