[XMLPipeDB-developer] file size discrepancies

Hi,

This is where ye olde visual inspection comes in handy.  If you open 
up the tables in the database, you will see that the species field is 
blank for all of the systems tables (UniProt, RefSeq, etc.), the Info 
table, and the Systems table itself (for OrderedLocusNames).

Because that data is missing for thousands of records, the export 
comes out smaller, which explains the file size discrepancy.
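
Row counts alone cannot catch this kind of problem, since a row with an empty field still counts as a row. As a rough illustration only, a per-column blank count could complement the row-count comparison. The class and method names below are hypothetical and not part of XMLPipeDB, and the sample values are made up:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of XMLPipeDB: counts blank values in one column.
class BlankFieldCheck {
    /** Returns how many values are null or contain only whitespace. */
    static long countBlank(List<String> columnValues) {
        return columnValues.stream()
            .filter(v -> v == null || v.trim().isEmpty())
            .count();
    }

    public static void main(String[] args) {
        // Made-up Species column values from two exports with equal row counts.
        List<String> olderExport = Arrays.asList("Example species", "Example species");
        List<String> newerExport = Arrays.asList("", null);
        System.out.println("older blanks: " + countBlank(olderExport));
        System.out.println("newer blanks: " + countBlank(newerExport));
    }
}
```

Both sample lists have two rows, yet only the second reports blanks, which is the situation described above.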

Best,
Kam

At 09:25 AM 8/12/2011, you wrote:
>Saw your comments; you caught some silly mistakes.
>
>I re-ran the export using your changes, checked original row counts 
>and they were identical to the previous two exports. The file size 
>was also near identical to the multi-species aware export at 21.0 MB 
>(22,036,480 bytes). I'm stumped on this but did confirm that GO data 
>was processed.
>
>Richard
>
>On Thu, Aug 11, 2011 at 3:21 PM, John David N. Dionisio 
><do...@lm...> wrote:
>Greetings,
>
>Sorry for the delay; stuff kept interceding through last night and today.
>
>I did finally manage to look at your latest commits then committed 
>back some tweaks and comments.  Look in particular at the stuff I 
>mentioned regarding taxonIDs.  If things aren't self-explanatory, just holler.
>
>I also scanned your changes to see if anything might have affected 
>the file size.  Nothing jumped at me, unfortunately.  Did you do the 
>"Process GO Data" step prior to exporting?  That's the only [highly 
>slim] lead that I could think of.
>
>John David N. Dionisio, PhD
>Associate Professor, Computer Science
>Loyola Marymount University
>
>
>On Aug 10, 2011, at 10:07 PM, Richard Brous wrote:
>
> > OK, will hold up until we discuss further...
> >
> >
> >
> > **On a side note I exported two M. smegmatis gdbs to see how they 
> compared and found the following:
> >
> > pre multispecies aware build b64: gdb filesize = 24.5 MB (25,792,512 bytes)
> >
> > multispecies current working copy: gdb filesize = 20.8 MB 
> (21,905,408 bytes)
> >
> > BUT when I checked original row counts for each they were 
> identical... I even checked a second time to be sure...
> >
> > See attached spreadsheet and I'll post the files on the biodb 
> wiki as well in a few.
> >
> >
> > I'm highly suspicious of the file sizes being that different with 
> identical original row counts.
> >
> > Dondi would you take a look at my committed code to see if there 
> are glaring issues I am not aware of?
> >
> > Thanks.
> >
> > Richard
> >
> > On Wed, Aug 10, 2011 at 9:45 PM, John David N. Dionisio 
> <do...@lm...> wrote:
> > Greetings,
> >
> > I think we have to turn to Dr. Dahlquist's GenMAPP knowledge here 
> to get the definitive answer.  I see two choices:
> >
> > - The Info table should have one record for each species that the 
> .gdb holds, in which case the change you need is to wrap that 
> single submit call inside a loop, so that submit is called once for 
> each chosen species.
> >
> > - The Info table should always have one record, and if the .gdb 
> holds multiple species, the "Species" column should be some 
> concatenation of multiple species names.  In this case, you would 
> still call submit only once, but the value you send into the 
> "Species" column is some accumulation of all chosen species names.
> >
> > Admittedly I don't know which way is right (I assumed the former 
> as of our Tuesday meeting, but on further examination I'm no longer 
> quite so sure).
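
The two choices above can be sketched side by side. This is only an illustration under stated assumptions: `InfoRowSketch` and its methods are hypothetical, only the "Owner" and "Species" columns are shown (the real call passes more), and the separator in the second choice is a placeholder, since the thread does not say what GenMAPP expects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real submit call takes more columns than shown here.
class InfoRowSketch {
    /** Builds one columnNamesToValues array with the given Species value. */
    static String[][] infoRow(String owner, String species) {
        return new String[][] { { "Owner", owner }, { "Species", species } };
    }

    /** Choice 1: one Info record per species, so submit would run once per row. */
    static List<String[][]> onePerSpecies(String owner, List<String> speciesNames) {
        List<String[][]> rows = new ArrayList<>();
        for (String name : speciesNames) {
            rows.add(infoRow(owner, name));
        }
        return rows;
    }

    /** Choice 2: a single Info record whose Species value joins all names. */
    static String[][] singleConcatenated(String owner, List<String> speciesNames,
            String separator) {
        // The separator is an assumption; GenMAPP's expected format is unknown here.
        return infoRow(owner, String.join(separator, speciesNames));
    }
}
```

With the first choice, each element of the returned list would feed one `submit` call; with the second, `submit` is still called exactly once.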
> >
> > For Kam --- what does GenMAPP expect to see in the Info table if 
> the opened .gdb contains multiple species?
> >
> > John David N. Dionisio, PhD
> > Associate Professor, Computer Science
> > Loyola Marymount University
> >
> >
> > On Aug 10, 2011, at 9:36 PM, Richard Brous wrote:
> >
> > > OK, continued to review ExportToGenMAPP and dug into the 
> creation of the first TableManager tmA on line 118.
> > >
> > > In reading through the method, my understanding is that it 
> creates a new TableManager based on the selectedDatabaseProfile 
> (which is UniProt).
> > >
> > > This is performed by the method getInfoTableManager() which 
> then calls method submit(String tableName, QueryType queryType, 
> String[][] columnNamesToValues);
> > >
> > > the code is as follows:
> > >
> > > > tableManager.submit("Info", QueryType.insert, new String[][] {
> > > >     { "Owner", owner },
> > > >     { "Version", new SimpleDateFormat("yyyyMMdd").format(version) },
> > > >     { "MODSystem", modSystem },
> > > >     { "Species", speciesProfile.getSpeciesName() },
> > > >     { "Modify", new SimpleDateFormat("yyyyMMdd").format(modify) },
> > > >     { "DisplayOrder", displayOrder },
> > > >     { "Notes", notes } });
> > >
> > >
> > > The modification of this line centers on { "Species", 
> speciesProfile.getSpeciesName() }, since it originally processed a 
> single species.
> > >
> > > > So now I need to populate the arguments with the species 
> contained within selectedDatabaseProfile.selectedSpeciesProfiles.
> > >
> > > > I think I'll start with the baseArgument up to MODSystem, then 
> append as many species as necessary, and then cap off the end with 
> the rest starting at Modify (similar to your approach in 
> ExportGoData, populateUniprotGoTableFromSQL(char chosenAspect, 
> List<Integer> taxonIds), line 513).
> > >
> > > Please let me know if this approach or analysis is off track.
> > >
> > > Thanks!
> > >
> > > Richard
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Aug 10, 2011 at 5:50 PM, Richard Brous 
> <rbr...@gm...> wrote:
> > > Updated repository to include all Gene Ontology changes 
> discussed during our meeting yesterday.
> > >
> > > Digging into TableManager next.
> > >
> > > Richard
> > >
> > > On Fri, Aug 5, 2011 at 10:06 AM, Richard Brous 
> <rbr...@gm...> wrote:
> > > whew... thanks for the detailed reply. I will digest this a bit 
> and get back to you with further questions.
> > >
> > > rb
> > >
> > > On Thu, Aug 4, 2011 at 11:18 PM, John David N. Dionisio 
> <do...@lm...> wrote:
> > > Greetings,
> > >
> > > Sorry for the delay.  I wasn't able to walk through the 
> relevant code until this evening.
> > >
> > > As Kam said, GOA serves as the link between the UniProt and GO 
> IDs.  It essentially determines which GO IDs get exported by using 
> GOA to see which GO IDs are associated with an exported UniProt 
> ID.  The populateUniprotGoTableFromSQL method, in its current form, 
> extracts the GO association records that match the given taxon ID 
> then exports, as UniProt-GO pairs, the GO and UniProt IDs 
> referenced within that GO association record.  Processing that 
> follows this is then based on the GO IDs that got exported --- and 
> that's how the current code avoids exporting the entire list of GO terms.
> > >
> > > The operative query is on the second line of 
> populateUniprotGoTableFromSQL:
> > >
> > >        String uniProtAndGOIDSQL = "select db_object_id, go_id, "
> > >            + "evidence_code, with_or_from from goa where db like '%UniProt%' "
> > >            + "and taxon = 'taxon:" + taxon + "'";
> > >
> > > In plain English, this selects the GOA records whose database 
> is UniProt and whose taxon ID is the given taxon.  An additional 
> condition is added for the "aspect" (All, Component, Function, or 
> Process) that is to be exported.  This is another reduction filter, 
> to further shrink the number of exported GO terms and thus avoid 
> MAPPFinder issues later on.
> > >
> > > Given this, the proper expansion here is to change the taxon 
> predicate to a multiple predicate.  That is, this method can be 
> changed to now accept a collection or array of taxon IDs, and the 
> base query should then be changed so that it accepts any taxon from 
> that collection.  More or less, you want:
> > >
> > >    private void populateUniprotGoTableFromSQL(char chosenAspect, int[] taxons) throws SQLException {
> > >
> > > ...then, instead of the single string, you want to iterate 
> through the taxon IDs:
> > >
> > >    StringBuilder baseQueryBuilder = new StringBuilder(
> > >        "select db_object_id, go_id, evidence_code, with_or_from"
> > >        + " from goa where db like '%UniProt%'");
> > >    boolean first = true;
> > >    for (int taxon: taxons) {
> > >        baseQueryBuilder.append(first ? " and (" : " or ");
> > >        baseQueryBuilder
> > >            .append("taxon = 'taxon:")
> > >            .append(taxon).append("'");
> > >        first = false;
> > >    }
> > >    if (!first) {
> > >        // Close the group only if at least one taxon opened it.
> > >        baseQueryBuilder.append(")");
> > >    }
> > >
> > > ...and so on.  I just sort of rattled this off so there may be 
> little glitches, but anyway this is just to give you an overall idea.
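
As an alternative to chaining `or` conditions in the loop above, the same multi-taxon filter could be written with a SQL `in` list. This is a swapped-in formulation, not the code from the repository. The class and method names are hypothetical, and rejecting an empty array is an assumption made here so that the query can never silently match every taxon.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch of the multi-taxon base query using an "in" list.
class TaxonQuerySketch {
    static final String BASE =
        "select db_object_id, go_id, evidence_code, with_or_from"
        + " from goa where db like '%UniProt%'";

    /** Builds the base query restricted to the given taxon IDs. */
    static String buildBaseQuery(int[] taxons) {
        if (taxons.length == 0) {
            // Assumption: an unfiltered query is never wanted, so fail loudly.
            throw new IllegalArgumentException("at least one taxon ID is required");
        }
        // Each ID becomes a quoted 'taxon:<id>' literal, matching the goa format.
        String inList = IntStream.of(taxons)
            .mapToObj(t -> "'taxon:" + t + "'")
            .collect(Collectors.joining(", "));
        return BASE + " and taxon in (" + inList + ")";
    }

    public static void main(String[] args) {
        // 1 and 2 are placeholder IDs, not real taxon IDs from the project.
        System.out.println(buildBaseQuery(new int[] { 1, 2 }));
    }
}
```

For the placeholder IDs 1 and 2, the query ends in `and taxon in ('taxon:1', 'taxon:2')`. Because the IDs are `int` values, building the literal by concatenation does not open the door to arbitrary text in the query.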
> > >
> > > Put another way, no, you do not need to iterate this method for 
> each taxon ID.  Instead, you can still call this method once, with 
> the multiplicity of taxon IDs emerging in terms of the actual 
> condition used for selecting the GO terms to be exported (based on 
> the available GOA records, which as you may recall are loaded from .goa files).
> > >
> > > As a side note, right here you have an opportunity for a little 
> sanity check regarding the content of the relational database: GO 
> terms will only be exported if GOA records for the desired taxon 
> IDs have been imported into the database.  So, as a pre-flight 
> check, one can see if there are any GOA records at all for each 
> chosen taxon ID.  If there are none, then the .goa file for that 
> species needs to be imported into the relational database.
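
The pre-flight check described above might be sketched as follows. Everything here is hypothetical: the class name, the method, and the idea of passing the record counter in as a function. In practice the counter would presumably wrap a query along the lines of `select count(*) from goa where taxon = 'taxon:<id>'`; it is injected here so the logic can be read and exercised without a database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToLongFunction;

// Hypothetical pre-flight check: which chosen taxon IDs have no GOA records?
class GoaPreflightSketch {
    /**
     * Returns the taxon IDs for which the counter reports zero GOA records.
     * The counter stands in for a count query against the goa table.
     */
    static List<Integer> taxonsMissingGoa(int[] taxons, IntToLongFunction goaRecordCount) {
        List<Integer> missing = new ArrayList<>();
        for (int taxon : taxons) {
            if (goaRecordCount.applyAsLong(taxon) == 0) {
                missing.add(taxon);
            }
        }
        return missing;
    }
}
```

A non-empty result would mean that the .goa file for each listed species still needs to be imported before exporting.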
> > >
> > > Hope this helps...
> > >
> > > John David N. Dionisio, PhD
> > > Associate Professor, Computer Science
> > > Loyola Marymount University
> > >
> > >
> > > On Aug 4, 2011, at 1:00 PM, Kam Dahlquist wrote:
> > >
> > > > Hi,
> > > >
> > > > Dondi will have to chime in on this, but I think this is 
> where things are going to get tricky.
> > > >
> > > > The final gdb does not actually contain the entire GO, it 
> gets trimmed somehow based on the GO associations for a particular 
> species.  This is because MAPPFinder cannot handle loading the 
> entire GO.  Since there is some type of species-specific trimming 
> going on, it's quite possible that this will need to iterate.
> > > >
> > > > However, I don't have the foggiest idea of how this works, so 
> Dondi will have to chime in.
> > > >
> > > > Best,
> > > > Kam
> > > >
> > > > At 12:09 AM 8/4/2011, you wrote:
> > > >> Wednesday 8/3/11 progress:
> > > >>
> > > >> 1. After following the ExportPanel1.java ground zero code 
> of: databaseProfile.setSelectedSpeciesProfile( selectedProfile );
> > > >>
> > > >> I found the method in DatabaseProfile.java plus a getter method;
> > > >> SpeciesProfile setSelectedSpeciesProfile( speciesProfile ) 
> and SpeciesProfile getSelectedSpeciesProfile( speciesProfile )
> > > >>
> > > >> I created two new methods that each handle List<Object> of 
> SpeciesProfiles argument instead of a single SpeciesProfile; 
> setSelectedSpeciesProfiles and getSelectedSpeciesProfiles.
> > > >>
> > > >> This enabled the ExportPanel1 ground zero code to become: 
> databaseProfile.setSelectedSpeciesProfiles(selectedSpecies);
> > > >>
> > > >> 2. public static void export() on line 104 in ExportToGenMAPP.java
> > > >>
> > > >> On line 107 ExportGoData is instantiated which I found in 
> ExportGoData.java and calls a method: public void export(char 
> chosenAspect, int taxon).
> > > >>
> > > >> Within export, taxon id is required for another method: 
> private void populateGoTables(char chosenAspect, int taxon).
> > > >>
> > > >> Within populateGoTables, taxon id is required for another 
> method: private void populateUniprotGoTableFromSQL( char 
> chosenAspect, int taxon).
> > > >>
> > > >> But, if the export to GDB process starts off with exporting 
> GO data, doesn't it only need to do that once no matter how many 
> species are selected? As you probably realize, I'm leading towards 
> not having to iterate through this for each taxon id if possible.
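
For reference, the list-based accessors from item 1 could look roughly like the sketch below. The two classes are stripped-down stand-ins for the real `DatabaseProfile` and `SpeciesProfile`, which carry far more state. Typing the list as `List<SpeciesProfile>` rather than `List<Object>`, and copying the list on the way in, are assumptions made here, not what the committed code does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in for the real SpeciesProfile; only the name is modeled.
class SpeciesProfile {
    private final String speciesName;

    SpeciesProfile(String speciesName) {
        this.speciesName = speciesName;
    }

    String getSpeciesName() {
        return speciesName;
    }
}

// Stand-in for the real DatabaseProfile; only the selection is modeled.
class DatabaseProfile {
    private List<SpeciesProfile> selectedSpeciesProfiles = new ArrayList<>();

    /** Replaces the current selection with a copy of the given list. */
    void setSelectedSpeciesProfiles(List<SpeciesProfile> profiles) {
        this.selectedSpeciesProfiles = new ArrayList<>(profiles);
    }

    /** Returns a read-only view of the selected species profiles. */
    List<SpeciesProfile> getSelectedSpeciesProfiles() {
        return Collections.unmodifiableList(selectedSpeciesProfiles);
    }
}
```

A typed list lets callers such as the export code read `getSpeciesName()` without casting each element.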
> > > >>
> > > >> Also, how does the export actually work? How are GO ids and 
> UniProt ids related within the table?
> > > >>
> > > >> Thanks!
> > > >>
> > > >> Richard
> > > >>
> > > >>
> > > > <ATT00001..txt><ATT00002..txt>
> > >
> > >
> > > _______________________________________________
> > > xmlpipedb-developer mailing list
> > > xml...@li...
> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> > >
> > >
> > >
> >
> >
> >
> > <CompareOriginalRowCounts.xlsx>
>
>