Re: [XMLPipeDB-developer] GenMAPP multitaxon support - CMSI 486T
From: Richard B. <rbr...@gm...> - 2011-08-12 16:25:28
Saw your comments; you caught some silly mistakes.

I re-ran the export using your changes and checked the original row counts; they were identical to those of the previous two exports. The file size was also nearly identical to the multi-species aware export, at 21.0 MB (22,036,480 bytes).

I'm stumped on this, but I did confirm that GO data was processed.

Richard

On Thu, Aug 11, 2011 at 3:21 PM, John David N. Dionisio <do...@lm...> wrote:
> Greetings,
>
> Sorry for the delay; stuff kept interceding through last night and today.
>
> I did finally manage to look at your latest commits then committed back some tweaks and comments. Look in particular at the stuff I mentioned regarding taxonIDs. If things aren't self-explanatory, just holler.
>
> I also scanned your changes to see if anything might have affected the file size. Nothing jumped out at me, unfortunately. Did you do the "Process GO Data" step prior to exporting? That's the only [highly slim] lead that I could think of.
>
> John David N. Dionisio, PhD
> Associate Professor, Computer Science
> Loyola Marymount University
>
>
> On Aug 10, 2011, at 10:07 PM, Richard Brous wrote:
>
> > OK, will hold up until we discuss further...
> >
> > **On a side note I exported two Msmegmatis gdb's to see how they compared and found the following:
> >
> > pre multispecies aware build b64: gdb filesize = 24.5 MB (25,792,512 bytes)
> >
> > multispecies current working copy: gdb filesize = 20.8 MB (21,905,408 bytes)
> >
> > BUT when I checked original row counts for each they were identical... I even checked a second time to be sure...
> >
> > See attached spreadsheet and I'll post the files on the biodb wiki as well in a few.
> >
> > I'm highly suspicious of the file sizes being that different with identical original row counts.
> >
> > Dondi would you take a look at my committed code to see if there are glaring issues I am not aware of?
> >
> > Thanks.
> >
> > Richard
> >
> > On Wed, Aug 10, 2011 at 9:45 PM, John David N. Dionisio <do...@lm...> wrote:
> > Greetings,
> >
> > I think we have to turn to Dr. Dahlquist's GenMAPP knowledge here to get the definitive answer. I see two choices:
> >
> > - The Info table should have one record for each species that the .gdb holds, in which case the change you need is to wrap that single submit call inside a loop, so that submit is called once for each chosen species.
> >
> > - The Info table should always have one record, and if the .gdb holds multiple species, the "Species" column should be some concatenation of multiple species names. In this case, you would still call submit only once, but the value you send into the "Species" column is some accumulation of all chosen species names.
> >
> > Admittedly I don't know which way is right (I assumed the former as of our Tuesday meeting, but on further examination I'm no longer quite so sure).
> >
> > For Kam --- what does GenMAPP expect to see in the Info table if the opened .gdb contains multiple species?
> >
> > John David N. Dionisio, PhD
> > Associate Professor, Computer Science
> > Loyola Marymount University
> >
> >
> > On Aug 10, 2011, at 9:36 PM, Richard Brous wrote:
> >
> > > OK, continued to review ExportToGenMAPP and dug into the creation of the first TableManager tmA on line 118.
> > >
> > > In reading through the method, my understanding is that it creates a new TableManager based on the selectedDatabaseProfile (which is UniProt).
> > >
> > > This is performed by the method getInfoTableManager(), which then calls the method submit(String tableName, QueryType queryType, String[][] columnNamesToValues);
> > >
> > > The code is as follows:
> > >
> > > tableManager.submit("Info", QueryType.insert, new String[][] {
> > >     { "Owner", owner },
> > >     { "Version", new SimpleDateFormat("yyyyMMdd").format(version) },
> > >     { "MODSystem", modSystem },
> > >     { "Species", speciesProfile.getSpeciesName() },
> > >     { "Modify", new SimpleDateFormat("yyyyMMdd").format(modify) },
> > >     { "DisplayOrder", displayOrder },
> > >     { "Notes", notes }
> > > });
> > >
> > > The modification of this line centers on { "Species", speciesProfile.getSpeciesName() }, since it originally processed a single species.
> > >
> > > So now I need to populate the arguments with the species contained within selectedDatabaseprofile.selectedSpeciesProfiles.
> > >
> > > I think I'll start with the baseArgument up to MODSystem, then append as many species as necessary, and then cap off the end with the rest starting at Modify (similar to your approach in ExportGoData, populateUniprotGoTableFromSQL(char chosenAspect, List<Integer> taxonIds), line 513).
> > >
> > > Please let me know if this approach or analysis is off track.
> > >
> > > Thanks!
> > >
> > > Richard
> > >
> > >
> > > On Wed, Aug 10, 2011 at 5:50 PM, Richard Brous <rbr...@gm...> wrote:
> > > Updated repository to include all Gene Ontology changes discussed during our meeting yesterday.
> > >
> > > Digging into TableManager next.
> > >
> > > Richard
> > >
> > > On Fri, Aug 5, 2011 at 10:06 AM, Richard Brous <rbr...@gm...> wrote:
> > > whew... thanks for the detailed reply. I will digest this a bit and get back to you with further questions.
> > >
> > > rb
> > >
> > > On Thu, Aug 4, 2011 at 11:18 PM, John David N. Dionisio <do...@lm...> wrote:
> > > Greetings,
> > >
> > > Sorry for the delay. I wasn't able to walk through the relevant code until this evening.
> > >
> > > As Kam said, GOA serves as the link between the UniProt and GO IDs. It essentially determines which GO IDs get exported by using GOA to see which GO IDs are associated with an exported UniProt ID. The populateUniprotGoTableFromSQL, in its current form, extracts the GO association records that match the given taxon ID then exports, as UniProt-GO pairs, the GO and UniProt IDs referenced within that GO association record. Processing that follows this is then based on the GO IDs that got exported --- and that's how the current code avoids exporting the entire list of GO terms.
> > >
> > > The operative query is on the second line of populateUniprotGoTableFromSQL:
> > >
> > > String uniProtAndGOIDSQL = "select db_object_id, go_id, evidence_code, with_or_from from goa where db like '%UniProt%' and taxon = 'taxon:" + taxon + "'";
> > >
> > > In plain English, this selects the GOA records whose database is UniProt and whose taxon ID is the given taxon. An additional condition is added for the "aspect" (All, Component, Function, or Process) that is to be exported. This is another reduction filter, to further shrink the number of exported GO terms and thus avoid MAPPFinder issues later on.
> > >
> > > Given this, the proper expansion here is to change the taxon predicate to a multiple predicate. That is, this method can be changed to now accept a collection or array of taxon IDs, and the base query should then be changed so that it accepts any taxon from that collection. More or less, you want:
> > >
> > > private void populateUniprotGoTableFromSQL(char chosenAspect, int[] taxons) throws SQLException {
> > >
> > > ...then, instead of the single string, you want to iterate through the taxon IDs:
> > >
> > > StringBuilder baseQueryBuilder = new StringBuilder("select db_object_id, go_id, evidence_code, with_or_from from goa where db like '%UniProt%'");
> > > boolean first = true;
> > > for (int taxon: taxons) {
> > >     baseQueryBuilder.append(first ? " and (" : " or ");
> > >     baseQueryBuilder
> > >         .append("taxon = 'taxon:")
> > >         .append(taxon).append("'");
> > >     first = false;
> > > }
> > > baseQueryBuilder.append(")");
> > >
> > > ...and so on. I just sort of rattled this off so there may be little glitches, but anyway this is just to give you an overall idea.
> > >
> > > Put another way, no, you do not need to iterate this method for each taxon ID. Instead, you can still call this method once, with the multiplicity of taxon IDs emerging in terms of the actual condition used for selecting the GO terms to be exported (based on the available GOA records, which as you may recall are loaded from .goa files).
> > >
> > > As a side note, right here you have an opportunity for a little sanity check regarding the content of the relational database: GO terms will only be exported if GOA records for the desired taxon IDs have been imported into the database. So, as a pre-flight check, one can see if there are any GOA records at all for each chosen taxon ID. If there are none, then the .goa file for that species needs to be imported into the relational database.
> > >
> > > Hope this helps...
> > >
> > > John David N. Dionisio, PhD
> > > Associate Professor, Computer Science
> > > Loyola Marymount University
> > >
> > >
> > > On Aug 4, 2011, at 1:00 PM, Kam Dahlquist wrote:
> > >
> > > > Hi,
> > > >
> > > > Dondi will have to chime in on this, but I think this is where things are going to get tricky.
> > > >
> > > > The final gdb does not actually contain the entire GO, it gets trimmed somehow based on the GO associations for a particular species. This is because MAPPFinder cannot handle loading the entire GO. Since there is some type of species-specific trimming going on, it's quite possible that this will need to iterate.
> > > >
> > > > However, I don't have the foggiest idea of how this works, so Dondi will have to chime in.
> > > >
> > > > Best,
> > > > Kam
> > > >
> > > > At 12:09 AM 8/4/2011, you wrote:
> > > >> Wednesday 8/3/11 progress:
> > > >>
> > > >> 1. After following the ExportPanel1.java ground zero code of: databaseProfile.setSelectedSpeciesProfile( selectedProfile );
> > > >>
> > > >> I found the method in DatabaseProfile.java plus a getter method; SpeciesProfile setSelectedSpeciesProfile( speciesProfile ) and SpeciesProfile getSelectedSpeciesProfile( speciesProfile )
> > > >>
> > > >> I created two new methods that each handle List<Object> of SpeciesProfiles argument instead of a single SpeciesProfile; setSelectedSpeciesProfiles and getSelectedSpeciesProfiles.
> > > >>
> > > >> This enabled the ExportPanel1 ground zero code to become: databaseProfile.setSelectedSpeciesProfiles(selectedSpecies);
> > > >>
> > > >> 2. public static void export() on line 104 in ExportToGenMAPP.java
> > > >>
> > > >> On line 107 ExportGoData is instantiated which I found in ExportGoData.java and calls a method: public void export(char chosenAspect, int taxon).
> > > >>
> > > >> Within export, taxon id is required for another method: private void populateGoTables(char chosenAspect, int taxon).
> > > >>
> > > >> Within populateGoTables, taxon id is required for another method: private void populateUniprotGoTableFromSQL( char chosenAspect, int taxon).
> > > >>
> > > >> But, if the export to GDB process starts off with exporting GO data, doesn't it only need to do that once no matter how many species are selected? As you probably realize, I'm leading towards not having to iterate through this for each taxon id if possible.
> > > >>
> > > >> Also, how does the export actually work? How are GO ids and UniProt ids related within the table?
> > > >>
> > > >> Thanks!
> > > >>
> > > >> Richard
> > > >>
> > > > <ATT00001..txt><ATT00002..txt>
> > >
> > > ------------------------------------------------------------------------------
> > > BlackBerry® DevCon Americas, Oct. 18-20, San Francisco, CA
> > > The must-attend event for mobile developers. Connect with experts.
> > > Get tools for creating Super Apps. See the latest technologies.
> > > Sessions, hands-on labs, demos & much more. Register early & save!
> > > http://p.sf.net/sfu/rim-blackberry-1
> > > _______________________________________________
> > > xmlpipedb-developer mailing list
> > > xml...@li...
> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> > >
> > > <ATT00001..txt><ATT00002..txt>
> >
> > ------------------------------------------------------------------------------
> > Get a FREE DOWNLOAD! and learn more about uberSVN rich system,
> > user administration capabilities and model configuration. Take
> > the hassle out of deploying and managing Subversion and the
> > tools developers use with it.
> > http://p.sf.net/sfu/wandisco-dev2dev
> > _______________________________________________
> > xmlpipedb-developer mailing list
> > xml...@li...
> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >
> > <CompareOriginalRowCounts.xlsx><ATT00001..txt><ATT00002..txt>
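
For reference, the multi-taxon predicate discussed in this thread can be sketched as a small standalone helper. This is only an illustration under stated assumptions, not the code that was committed to XMLPipeDB: the class name MultiTaxonGoaQuery, the method name buildQuery, and the taxon IDs 111 and 222 are hypothetical placeholders, and only the base query text is taken from the message quoted above. The sketch also guards against an empty taxon list, which would otherwise leave a dangling ")" at the end of the query.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper; the name and structure are assumptions for illustration only.
class MultiTaxonGoaQuery {

    // Base query text as quoted in the thread.
    static final String BASE_QUERY =
        "select db_object_id, go_id, evidence_code, with_or_from from goa where db like '%UniProt%'";

    // Builds one query that matches GOA records for any of the given taxon IDs.
    static String buildQuery(List<Integer> taxonIds) {
        if (taxonIds == null || taxonIds.isEmpty()) {
            // Without this guard the predicate would be an empty, malformed "()" group.
            throw new IllegalArgumentException("At least one taxon ID is required");
        }
        // Joins the per-taxon conditions with " or " and wraps them in " and ( ... )".
        StringJoiner predicate = new StringJoiner(" or ", " and (", ")");
        for (int taxonId : taxonIds) {
            // Taxon IDs are ints, so concatenating them cannot inject SQL.
            predicate.add("taxon = 'taxon:" + taxonId + "'");
        }
        return BASE_QUERY + predicate.toString();
    }

    public static void main(String[] args) {
        // 111 and 222 are placeholder taxon IDs, not real species.
        System.out.println(buildQuery(Arrays.asList(111, 222)));
    }
}
```

The method is still called once per export, as suggested in the thread; the number of species only changes how many "taxon = ..." conditions end up inside the parenthesized group.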