xmlpipedb-developer Mailing List for XMLPipeDB (Page 14)
Brought to you by:
kdahlquist,
zugzugglug
You can subscribe to this list here.
2009 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
(16) |
Sep
|
Oct
(9) |
Nov
(3) |
Dec
(6) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2010 |
Jan
(2) |
Feb
(8) |
Mar
|
Apr
(22) |
May
(1) |
Jun
|
Jul
|
Aug
(3) |
Sep
(32) |
Oct
(2) |
Nov
|
Dec
|
2011 |
Jan
|
Feb
(60) |
Mar
(42) |
Apr
(35) |
May
(17) |
Jun
(2) |
Jul
(23) |
Aug
(72) |
Sep
(15) |
Oct
(10) |
Nov
(14) |
Dec
(4) |
2012 |
Jan
(6) |
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
2013 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
(11) |
Dec
|
2014 |
Jan
(1) |
Feb
(12) |
Mar
(14) |
Apr
(8) |
May
|
Jun
(14) |
Jul
(2) |
Aug
|
Sep
(5) |
Oct
(6) |
Nov
|
Dec
|
2015 |
Jan
|
Feb
(5) |
Mar
(2) |
Apr
|
May
|
Jun
(2) |
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
From: Richard B. <rbr...@gm...> - 2011-02-23 19:46:46
|
Great! Now the thumbs up from Dondi and we can get cracking. I'll dig into sql syntax for negation of a specific tuple while we wait for his response. Richard On Wed, Feb 23, 2011 at 11:37 AM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > That sounds good to me. > > Best, > Dr. D > > > At 11:16 AM 2/23/2011, you wrote: > > Hmm... possible that the method to split up records has been globally > implemented for export from Postgres? > > Since the records are automatically splitting when exported from postgres > to gdb file, all we need to do is exclude Rv3346/55c 'ORF' from "select > count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') > and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';" and we should be set. > > That seem correct Dondi? > > Richard > > On Wed, Feb 23, 2011 at 11:04 AM, Kam Dahlquist <kda...@lm...> > wrote: > Hi, > > I checked these out. They all should be separated and kept as separate > OrderedLocusNames records. In fact, the gdb already has them separated into > individual records, so it is only the tally for Postgres that is off, they > were correctly exported into the gdb. > > Best, > Dr. D > > > At 07:31 PM 2/22/2011, Richard Brous wrote: > > OK I found the culprits: > > There are 3 ordered locus gene ID's that have slashes. As was previously > discussed it seems XML Match is counting each one twice which would inflate > the count by 3. > > Rv2561/Rv2562 'ordered locus' > Rv2880c/Rv2879c 'ordered locus' > Rv3021c/Rv3022c 'ordered locus' > > And as was previously noted, Rv3346/55c 'ORF' should be excluded. > > This will bring the both XML and sql db queries counts in sync at 4057. > > By my math: > > ------------------------------------------------------------------------------------------------------------ > XML tags read by Match: > 4066 unique matches - 6 genes excluded by Dr. D - 3 duplicates caused by > slashes = total 4057 genes > > ------------------------------------------------------------------------------------------------------------ > sql query of the db tables: > 3988 ordered locus + 69 ORF = total 4057 genes > ------------------------------------------------------------------------------------------------------------ > > > TO MOVE FORWARD: > > 1. We need to decide how to address the slash ID's (should we keep and > split into separate tuples or should they be excluded) > > 2. Adjust the queries to reflect what is needed and test with raw sql. > > 3. Update the queries on > MycobacteriumTuberculosisUniProtSpeciesProfile.java > > 4. Export a new Mtb gene database for testing. > > > Richard > > > > On Tue, Feb 22, 2011 at 5:12 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > I don't know why the numbers are off. I think the only way to find out is > to get the results of the Postgres query and line it up next to the match > results and see what is different. If you can send me the lists of IDs from > the Postgres query and the match results, I can check them. > Best, Dr. D > > At 04:57 PM 2/22/2011, you wrote: > > Here is where I am on the numbers: ------------------------------------------------------------------------------------------------------------ > XML tags read by Match: 4066 unique matches - 6 genes excluded by Dr. D = > total 4060 genes ------------------------------------------------------------------------------------------------------------ > sql query of the db tables: 3988 ordered locus + 69 ORF = total 4057 genes > ------------------------------------------------------------------------------------------------------------ > So by my count we are off by 3 genes Maybe XML Match is counting slashed > genes as 2 separate genes? IE. so if it encountered 3 slashed genes it > would in effect be counting those 3 as 6? Richard > On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, I compared this list of gene IDs with the list I had on the Testing > Report on the wiki and found the following. Your list has 70 IDs. 69 of > them are identical to what is in my list of IDs in the Testing report. Indeed, > the only difference is the one with the slash. "Rv3346/55c" I looked this > ID up at UniProt.org as Rv3346. It appears in the record for UniProt ID > O50384. For that UniProt record, the ID referred to in the OrderedLocus tag > is Rv3355c which does appear in the gdb already. Looking up Rv3346 on > Tuberculist and the Stanford TB database, Rv3346 is not a real gene ID. So > this entire record with the slash in it needs to be excluded and all the > rest of them need to be captured. > I think then, that the numbers should add up correctly, is this true? Best, > Dr. D > At 01:09 PM 2/22/2011, Richard Brous wrote: > > attachement now included... On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous > <rbr...@gm...> wrote: Here is an export of the genes found using: > select * from genenametype where type = 'ORF' and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also attached as a csv file. > 647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 > 5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 > 647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 > 647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 > 647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 > 647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 > 211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 > 648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 > 648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 > 648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 > 648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 > 648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 > 649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 > 649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 > 649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 > 649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 > 650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 > 174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 > 174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 > 312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 > 331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 > 445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 > 621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 > 622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 > 622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 > 622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 > 622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 > 623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 > 623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 > 623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 > 623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 > 624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 > 624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 > 625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 > 625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 > 626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 > 626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 > 626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 > 626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 > 627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 > 627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 > 627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 > 627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 > 627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 > 627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 > 627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 > 628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 > 629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 > 629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 > 629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 > 630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 > 630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 > 631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 > 632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 > 632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 > 633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 > 633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 > 645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 > 645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 > 645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 > 645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 > 645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 > 646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 > 646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 > 646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 > 646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 > 646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 > 646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 > 647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 > 647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 *****The > item of note I see is the gene with the slash separating gene id's which > refer to the same gene. Richard On Mon, Feb 21, 2011 at 11:11 PM, > Richard Brous <rbr...@gm...> wrote: Understood. I'll check in with > Dr. D in the afternoon tomorrow and discuss. Richard > > On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <do...@lm...> > wrote: Hi Rich, Addressing the release business first, let's put it this > way: if the remaining loose ends can be addressed by tomorrow, we can > probably wait until then. If unexpected snags are encountered, then it > would be worthwhile to release whatever you have. With that in mind, > considering that you pretty much know the patterns of the IDs that are > needed, I think it will only take a little digital forensic work now to > figure out exactly which IDs are still needed. Once you know what those > are, you should: 1. Find where they are in the XML file. 2. Knowing the > XML location, find the corresponding table in the relational database (table > names are generally derived from tag/element names). 3. Knowing the table > in the database, write or extend the SpeciesProfile query to retrieve that > data. For the ID that must *not* be included, again it's a matter of > tracking down what this ID is. Knowing this straggler, you can then consult > with Dr. Dahlquist if this ID is truly a unique one-off, or is > representative of a pattern that we'll want to exclude. Either way, this ID > can be omitted by using "not" or "<>" or possibly "not like" or "not ~" > (check PostgreSQL where clause syntax to see where the negation can be > applied). John David N. Dionisio, PhD Associate Professor, Computer > Science Loyola Marymount University > On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: > actually i had a typo > (emailing from desktop system but testing on my laptop... typed correctly > here but wrong in pgadmin) but the results make much more sense now. > > > > select count (*) from genenametype where (type = 'ordered locus' or type = > 'ORF') and value like 'Rv%'; > returns 4058 > > select count (*) from > genenametype where (type = 'ordered locus' or type = 'ORF') and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 4058 > > > -------------------------------------------------------------------------------------------------------------------------- > > > > Continuing forward - > > The testing report says that 4066 unique > matches exist in XML but 6 of them were eliminated by Dr. D leaving the > desired number at 4060. > > So now we are only 2 genes short with the > query returning 4058... which is also (conveniently) the sum of the two > separate queries of 'ordered locus' and 'ORF' respectively. > > But > recall that Dr. D said that only 69 genes of the missing 75 were tagged > 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. Adding > that into missing genes puts us 3 short... > > Should I make the changes > to the code and export a gdb so that analysis can be done or wait until we > work this through further? > > Richard > > > > On Mon, Feb 21, 2011 at > 10:04 PM, John David N. Dionisio <do...@lm...> wrote: > Hi Rich, > > The > second form should have worked actually. What exactly was the error? > > > John David N. Dionisio, PhD > Associate Professor, Computer Science > > Loyola Marymount University > > > > On Feb 22, 2011, at 1:01 AM, Richard > Brous wrote: > > > hmm not taking parenthesis where I thought they should > go... syntax error > > > > select count (*) from genenametype where type = > ('ordered locus' or 'ORF') and value like 'Rv%'; > > also tried > > > select count (*) from genenametype where (type = 'ordered locus' or type = > 'ORF') and value like 'Rv%'; > > > > > > > > > > > > On Mon, Feb 21, 2011 > at 9:40 PM, Richard Brous <rbr...@gm...> wrote: > > ah yes... i see > it... > > > > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio > <do...@lm...> wrote: > > Watch your parentheses: "and" has greater > precedence than "or" :) > > > > > > John David N. Dionisio, PhD > > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> > wrote: > > > >> OK, so here are my query results from raw SQL: > >> > >> > 1. using: like 'Rv%' > >> > >> select count (*) from genenametype where > type = 'ordered locus' and value like 'Rv%'; > >> returns 3988 > >> > >> > select count (*) from genenametype where type = 'ORF' and value like 'Rv%'; > > >> returns 70 > >> > >> select count (*) from genenametype where type = > 'ordered locus' or type = 'ORF' and value like 'Rv%'; > >> returns 7011 > > >> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' > > >> > >> select count (*) from genenametype where type = 'ordered locus' > and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 3988 > >> > >> > select count (*) from genenametype where type = 'ordered locus' or type = > 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 7011 > >> > > >> select count (*) from genenametype where type = 'ORF' and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 70 > >> > >> Conclusions: > > >> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 > genes were were looking for plus one more (maybe the count for missing genes > is off by 1?). > >> > >> 2. Combining the two types in a single query did > not produce the results that I expected (7011? - how did that happen????) so > this is likely not our solution... unless of course the query syntax isn't > actually doing what I think it is... > >> > >> 3. I would think the best > course of action is to serialy run two separate queries to capture all the > required genes, then removing the one unneeded gene if its truly not wanted. > > >> > >> What do you think? > >> > >> Richard > >> > >> > >> On Mon, Feb > 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...> wrote: > >> I > don't recall the exact details of the missing 69, but if your query > successfully returns them in raw SQL, then this is worth a try. You can > integrate into the same query as long as the same columns are returned, > which is the case here AFAIK, so go ahead and extend the existing query. > > >> > >> > >> John David N. Dionisio, PhD > >> Associate Professor, > Computer Science > >> Loyola Marymount University > >> > >> On Feb 21, > 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: > >> > >>> So > here is the appropriate code snippet from > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> public > >>> > > >>> TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) throws > SQLException, InvalidParameterException { > >>> > >>> // Build the base > query; we only use "ordered locus" and we only want > >>> > >>> // IDs > that begin with "Rv." > >>> PreparedStatement ps = > ConnectionManager.getRelationalDBConnection().prepareStatement( > >>> > > >>> "SELECT value, type " + > >>> > >>> "FROM genenametype INNER JOIN > entrytype_genetype " + > >>> > >>> "ON (entrytype_genetype_name_hjid = > entrytype_genetype.hjid) " + > >>> > >>> "WHERE type = 'ordered locus' and > value like 'Rv%' and entrytype_gene_hjid = ?"); > >>> ResultSet result; > > >>> > >>> > >>> > >>> for (Row row : primarySystemTableManager.getRows()) > { > >>> ps.setInt(1, Integer.parseInt(row.getValue( > >>> > >>> "UID"))); > > >>> result = ps.executeQuery(); > >>> > >>> > >>> > >>> // We actually > want to keep the case where multiple ordered locus > >>> > >>> // names > appear. > >>> > >>> while (result.next()) { > >>> > >>> // We want this > name to appear in the OrderedLocusNames > >>> > >>> // system table. > >>> > > >>> for (String id : result.getString("value").split("/")) { > >>> > tableManager.submit( > >>> > >>> "OrderedLocusNames", QueryType.insert, > new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, > { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) }, { > "UID", row.getValue("UID") } }); > >>> } > >>> > >>> } > >>> > >>> } > >>> > > >>> > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > >>> So now we want to build the base query which uses "ordered locus" and > "orf" and we only want IDs that begin with "Rv". > >>> > >>> I know there > are more comprehensive ways to search for gene ID's by matching gene ID > prefix but "like Rv%" seemed to work thus far, we just need to tell it to > search for XML tag type orf in addition to ordered locus. > >>> > >>> > "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and > entrytype_gene_hjid = ? " > >>> > >>> Here is a stab at it.... This part > of our class was right as the server went down and my submission for week 6 > assignment I can't seem to find. > >>> > >>> Is it possible to have two > different types in the same query or should we rewrite a separate query for > the orf tag? > >>> > >>> Richard > >>> > >>> > >>> > >>> On Sun, Feb 20, > 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote: > >>> > >>> > thanks and will do as directed. > >>> > >>> My previous, last paragraph > comment - A way for programming code in email holding its format in a mail > message similarly to how you can post code on forum pages? > >>> > >>> > <code> > >>> blah > >>> blah > >>> blah > >>> </code> > >>> > >>> thanks! > > >>> > >>> Richard > >>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David > N. Dionisio <do...@lm...> wrote: > >>> > >>> Greetings, > >>> > >>> > Actually, gmbuilder.properties is for the TallyEngine only. When dealing > with .gdb exports, look *only* at the SpeciesProfile class. So, to find > those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile > code, that needs to be changed. > >>> > >>> Your take on how > gmbuilder.properties is used, however, is understandable. It makes sense to > assume that the TallyEngine code *and* the ID export code are based on the > same characterization of the needed IDs. This replication is originally a > historical artifact: SpeciesProfile was done first, and then TallyEngine was > done later by another student. > >>> > >>> However, there are other > factors beyond history that sort of necessitate this duplication of desired > IDs: (skip the two bullets below if you'd rather cut to the chase of the > work to be done, and discuss design issues later) > >>> > >>> - The actual > XML import code is a black box: this is the "canned" JAXB library actually > in action, and not our code at all. Plus, the XML import code really does > not filter (nor should it), since the goal of the XML->relational database > step is to fully capture the XML data in the relational database. So, XML > count is necessarily separated from XML import. > >>> > >>> - The notion > of a declarative mechanism for extracting IDs from the relational database > (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at > the same time there is value in the arbitrary computation that can be done > with Java (case in point: export two versions of an ID, with and without > periods). This is not to say that it is impossible to do this > declaratively, but let's just say that the procedural approach exists here > and now, and a declarative approach will need more thought. > >>> > >>> > These, and other factors, are good thoughts to hold onto and would be worthy > of a good meeting discussion sometime, but bottom line for now: modifying > the export behavior is a matter of editing the *SpeciesProfile* Java code, > and not the gmbuilder.properties file. Turn your attention to that code. > > >>> > >>> Now, as to annotating your code...I'd just put in code comments > :) Or did you mean something else by tagging code in e-mail? > >>> > >>> > John David N. Dionisio, PhD > >>> Associate Professor, Computer Science > > >>> Loyola Marymount University > >>> > >>> > >>> > >>> > >>> On Feb 21, > 2011, at 12:38 AM, Richard Brous wrote: > >>> > >>> > also, how do I tag > code in email so it holds its formatting? I tried a few suggestions I found > on the web but they aren't holding formatting or i'm just doing it wrong ;-D > > >>> > > >>> > Richard > >>> > > >>> > On Sun, Feb 20, 2011 at 9:35 PM, > Richard Brous <rbr...@gm...> wrote: > >>> > OK, have some updates > and some suggestions: > >>> > > >>> > On Friday Dr. Dahlquist and I sat > down and reviewed the gene testing report. We verified that XML match does > indeed find 4066 unique matches - 75 of which are not in the gdb and need to > be. > >>> > > >>> > Dr. Dahlquist informed me that she was the one who > completed the gene db testing report, not a previous student of BIO367 and > had already verified which genes were missing and where they were to be > found. I had (mistakenly) assumed that since a student had performed the > gene database testing I had to redo all of the verification. > >>> > > >>> > > So that said, of the 75 genes missing - 69 need to be included and 6 > excluded. > >>> > Per the gene db testing report: "69 of them have an "a", > "b", or "d" suffix. They are all found in the ORF tag and need to be > included in the gdb." > >>> > > >>> > To solve this we need to add > additional search criteria into the M. tuberculosis section in > gmbuilder.properties below: > >>> > # Mycobacterium tuberculosis > >>> > > > >>> > mycobacteriumtuberculosis_level_amount= > >>> > > >>> > 1 > >>> > > > >>> > mycobacteriumtuberculosis_element_level0= > >>> > > >>> > > uniprot/entry/gene/name&type&ordered locus > >>> > > >>> > > mycobacteriumtuberculosis_query_level0= > >>> > > >>> > select count(*) > from genenametype where type = 'ordered locus' and value like 'Rv%'; > >>> > > > >>> > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > > Ordered Locus > >>> > SOLUTIONS: > >>> > > >>> > 1. So am i correct in my > understanding that the second line is the query used by TallyEngine to read > the XML file? If so then this is the issue we need to table for the moment > until we get the gbd verified and re-released. We will revisit this to > discover why it is not only reporting incorrectly but also why its added a > second row of Ordered Locus on the TallyEngine results page. > >>> > > >>> > > 2. The third line is the SQL query used by postgres during the export from > XML to gdb. To find and get the ORF tagged genes could we not add the > following lines and change the count in the first line: > >>> > > >>> > > > >>> > # Mycobacterium tuberculosis > >>> > > >>> > > mycobacteriumtuberculosis_level_amount=2 > >>> > > >>> > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > > >>> > > >>> > > >>> > mycobacteriumtuberculosis_query_level0= > >>> > > > >>> > select count(*) from genenametype where type = 'ordered locus'; > > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from > genenametype where type = 'orf'; > >>> > > >>> > > >>> > > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > Ordered Locus > > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > >>> > > > >>> > > ---------------------------------------------------------------------------------------------------------------------------- > > >>> > > >>> > Of course these queries would have be manually verified > prior to making these changes but this seems like we are moving in the right > direction. > >>> > > >>> > Richard > >>> > > >>> > > >>> > On Thu, Feb 17, > 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: > >>> > Just > got done reading previous email and understand the change in priority. > > >>> > > >>> > Will work on the missing ID's for now and shelve the the > TalleyEngine issue for the moment. > >>> > > >>> > Also great about a more > formalized weekly meeting. I was going to suggest it myself as it has been > slow going so far as maybe i'm a bit too independent in this independent > study class =D > >>> > > >>> > Will dig further into the missing ID's > later tonight and during day tomorrow and report back. > >>> > > >>> > > Richard > >>> > > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. > Dionisio <do...@lm...> wrote: > >>> > Hi Rich, > >>> > > >>> > No > problem. The pertinent line you're referring to, for XML, is this, right > above the line you copied: > >>> > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > >>> > The slash-separated section is the "path" of XML > tags leading to the element of interest; then, after the ampersand, is a > name/value pair for the desired attribute to count. Note that there is no > hint of a *content*-based filter (nor is there the capability for one, as > far as I can tell in the code). By "content," I mean that we can't specify > filters based on what's *between* the tags. We can only go as far as filter > by attribute value, e.g., type="ordered locus". > >>> > > >>> > But > anyway, as mentioned in the earlier e-mail, let's have the missing IDs in > the .gdb take precedence for now. Please take a look at the tuberculosis, > A. thaliana, and P. falciparum profiles to get an idea for how the ID output > can be customized, then let me know if you have any questions or need to > confirm anything. > >>> > > >>> > John David N. Dionisio, PhD > >>> > > Associate Professor, Computer Science > >>> > Loyola Marymount University > > >>> > > >>> > > >>> > > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous > wrote: > >>> > > >>> > > Sorry been slammed with a programming assignment > that kept needing continued iteration and it has been all consuming until > last night. But I did get a chance to work with your comments and review the > code again with a different mind set. > >>> > > > >>> > > Yes, I examined > the gmbuilder.properties file ( the query is also in the > MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a > previous email ) but I don't think I see what you mean regarding the XML > count. > >>> > > > >>> > > I understood that: > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype > where type = 'ordered locus' and value like 'Rv%'; was the db query but > don't see which is the XML count... or do they share the same query and you > are saying that XML count doesn't recognize and therefore cannot use the > 'Rv%' parameter? > >>> > > > >>> > > Richard > >>> > > > >>> > > > >>> > > > > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < > do...@lm...> wrote: > >>> > > Hi Rich, > >>> > > > >>> > > Sorry for the > delay. Had some distractions coming into the weekend. > >>> > > > >>> > > > You've looked at the code; have you looked at gmbuilder.properties? (I may > have mentioned it a few e-mails ago, just as you were starting to dig into > this) > >>> > > > >>> > > On the copy I have, the M. tuberculosis block > looks like this (indentation is mine to set it apart): > >>> > > > >>> > > > # Mycobacterium tuberculosis > >>> > > > mycobacteriumtuberculosis_level_amount=1 > >>> > > > >>> > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > > >>> > > > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype > where type = 'ordered locus' and value like 'Rv%'; > >>> > > > >>> > > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > >>> > > > > >>> > > There, I think, is the rub. Notice that the XML count does > not filter on RV%. The SQL query does. > >>> > > > >>> > > Unfortunately, > I don't think the TallyEngine can include selective filtering in the XML > counts. If the need to do selective filtering on XML is necessary, then I > think we're looking at a new functionality for you to implement (or, if this > throws things off too much, this may have to be noted somewhere, that the > XML vs. database counts may be off because the database count is doing some > text-based filtering but the XML count does not). > >>> > > > >>> > > What > does xmlpipedb-match say? That will at least tell you whether the 'RV%' > count is indeed correct. > >>> > > > >>> > > John David N. Dionisio, PhD > > >>> > > Associate Professor, Computer Science > >>> > > Loyola Marymount > University > >>> > > > >>> > > > >>> > > > >>> > > On Feb 11, 2011, at > 4:52 PM, Richard Brous wrote: > >>> > > > >>> > > > OK here is what I was > able to put together from the past few hours of code review: > >>> > > > > > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> > > > > -reveals that after the 2 System table modifications are made adding species > name and link, a PreparedStatement is instantiated which builds and calls > the base query. > >>> > > > > >>> > > > -The base query called is: > ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " > + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE > type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") > > >>> > > > > >>> > > > -So its looking in 'ordered locus' table/column for > any tuple that starts with Rv (followed by any substring) and > entrytype_gene_hjid = ? . > >>> > > > The 'like' comparator and % usage > are clear with the 'type' entrytype_gene_hjid = ? > >>> > > > > >>> > > > > -To me it seems the query makes sense so the problem is likely elsewhere. > > >>> > > > > >>> > > > GenMappBuilder.java: > >>> > > > -I found method > doTallies() at code line 895 which: > >>> > > > Instantiates a > Configuration called hibernateConfiguration and assigns to it the current > hibernate configuration > >>> > > > Validates database settings by > analyzing hibernateConfiguration > >>> > > > Instantiates a CriterionList > for uniprot and assigns to it TallyType.UNIPROT > >>> > > > Instantiates a > CriterionList for go and assigns to it TallyType.GO > >>> > > > Determines > if both xml files exist > >>> > > > Then getTallyResultsXML and > getTallyResultsDatabase are run on both xml files and their respective > CriterionList > >>> > > > Results are then formatted for display in a > table. > >>> > > > > >>> > > > -So enum TallyType which means that they > are the only valid datatypes which TallyEngine accepts... go to know ... > > >>> > > > > >>> > > > -Based on the screen shot of Tally Engine it would > seem that both getTallyResultsXML() and getTallyResultsDatabase() are > incorrectly returning. Likely due to both using an incorrect query (as we > previously supposed). But where are the queries?... the more I dig the more > I think they are in the criterial all the work is done against. > >>> > > > > > >>> > > > continuing the review: > >>> > > > getTallyResultsXML() > calls Tally Engine instance method getXmlFileCounts(xmlFile) > >>> > > > > getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new > QueryEngine(hibernateConfiguration) > >>> > > > Both of these instanced > methods originate from TallyEngine.java... > >>> > > > > >>> > > > > TallyEngine.java: > >>> > > > > >>> > > > getXmlFileCounts() calls > digestXmlFile() which instantiates a digester then processes against > criteria... but this quickly becomes confusing and is hard to follow > >>> > > > > > >>> > > > getDbcounts() then starts a db session and executes a > query but then I also get a bit lost with my limited db knowledge. > >>> > > > > > >>> > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > >>> > > > > >>> > > > OVERALL I think I'm getting closer to the issues > but I still feel as if I'm missing some understanding to proceed further. > Can you pass along some of that Dondi insight and steer me in the right > direction? =D > >>> > > > > >>> > > > -DB Tally - Not having taken > databases yet certainly is limiting my ability determine where the > "criteria" are being set and how they are followed during session > activities. Also is the query we have been looking for this whole time in > the criteria or someplace else? > >>> > > > > >>> > > > -XML Tally - again > is the query contained within the criteria that digestXmlFile() uses to > parse? > >>> > > > > >>> > > > Richard > >>> > > > > >>> > > > > >>> > > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> > wrote: > >>> > > > Right, schema issues are unlikely. Most count > discrepancies like this that I've seen have boiled down to forming the right > query. Then, knowing the right query (in both XML and SQL), it's a matter > of making sure that TallyEngine asks that same query. > >>> > > > > >>> > > > > John David N. Dionisio, PhD > >>> > > > Associate Professor, Computer > Science > >>> > > > Loyola Marymount University > >>> > > > > >>> > > > > > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > >>> > > > > > >>> > > > > OK, so based on your approach: > >>> > > > > > >>> > > > > 1. > I'll start with reviewing the queries for xmlpipedb-match and sql queries > needed for the respective results as you requested. > >>> > > > > > >>> > > > > > I was also thinking I may need to review the schema from xml into > postgres but the issue isn't likely a schema error. The error most likely > lies in how xmlpipedbutils queries the data from xml source and writes to > the tables what it returns? > >>> > > > > > >>> > > > > 2. I'll review the > code: trace the entrance of tally engine in the gmbuilder code then follow > it through the xmlpipedbutils. > >>> > > > > > >>> > > > > Richard > >>> > > > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. > Dionisio <do...@lm...> wrote: > >>> > > > > Just wanted to confirm > (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source > code is in trunk/xmlpipedbutils in SourceForge's Subversion repo. > >>> > > > > > > >>> > > > > John David N. Dionisio, PhD > >>> > > > > Associate > Professor, Computer Science > >>> > > > > Loyola Marymount University > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > On Feb 5, 2011, at > 10:02 AM, Richard Brous wrote: > >>> > > > > > >>> > > > > > Hi Dondi, > > >>> > > > > > > >>> > > > > > So I'm at the point in working with M > tuberculosis that I was able to exactly reproduce Dr. Dahlquist's > problematic TallyEngine results. > >>> > > > > > > >>> > > > > > gmb2b60 > Results > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > Now the proverbial question - What next to solve the Ordered Locus > import/count issue? > >>> > > > > > > >>> > > > > > > ********************************************** > >>> > > > > > Here is my > thought process: > >>> > > > > > > >>> > > > > > Step 1: How does the > import process work at the high level? (obviously correct me if I'm wrong) > > >>> > > > > > > >>> > > > > > I believe that basically as each XML tag is > read, it is placed in the proper Postgres table(s) based on some criteria. > There is also likely some sort of check that each individual tag is in valid > XML format unless we don't care at this stage (care at export) or maybe the > parser just skips over and goes on to the next . > >>> > > > > > > >>> > > > > > > Step 2: What could be the problem? > >>> > > > > > > >>> > > > > > > Either - > >>> > > > > > a. XML tags are being parsed incorrectly > (ignored/skipped)? > >>> > > > > > b. Decision criteria of which table > they should be added to? > >>> > > > > > > >>> > > > > > > ********************************************** > >>> > > > > > > >>> > > > > > > I read on the sourceforge wiki: > >>> > > > > > > >>> > > > > > > XMLPipeDB has a modular architecture with three components that may be used > separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and > automatically generates an SQL schema, Java classes, and Hibernate mappings. > XMLPipeDB Utilities provides functionality for configuring the database, > importing data, and performing queries. GenMAPP Builder is based on the > XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on > data from UniProt and Gene Ontology (GO). > >>> > > > > > > >>> > > > > > > So I should probably start with the XMLPipeDB Utilities which are where? I > don't see any in the basic distribution or are they not standalone and > called from the command line? > >>> > > > > > > >>> > > > > > Thanks! > > >>> > > > > > > >>> > > > > > Richard > > > > > > ------------------------------------------------------------------------------ > Free Software Download: Index, Search & Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT > data > generated by your applications, servers and devices whether physical, > virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > ------------------------------------------------------------------------------ > Free Software Download: Index, Search & Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT > data > generated by your applications, servers and devices whether physical, > virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > ------------------------------------------------------------------------------ > Free Software Download: Index, Search & Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT > data > generated by your applications, servers and devices whether physical, > virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: Kam D. <kda...@lm...> - 2011-02-23 19:37:42
|
Hi, That sounds good to me. Best, Dr. D At 11:16 AM 2/23/2011, you wrote: >Hmm... possible that the method to split up records has been >globally implemented for export from Postgres? > >Since the records are automatically splitting when exported from >postgres to gdb file, all we need to do is exclude Rv3346/55c 'ORF' >from "select count (*) from genenametype where (type = 'ordered >locus' or type = 'ORF') and value ~ >'[Rr][Vv][0-9][0-9][0-9][0-9]*';" and we should be set. > >That seem correct Dondi? > >Richard > >On Wed, Feb 23, 2011 at 11:04 AM, Kam Dahlquist ><<mailto:kda...@lm...>kda...@lm...> wrote: >Hi, > >I checked these out. They all should be separated and kept as >separate OrderedLocusNames records. In fact, the gdb already has >them separated into individual records, so it is only the tally for >Postgres that is off, they were correctly exported into the gdb. > >Best, >Dr. D > > >At 07:31 PM 2/22/2011, Richard Brous wrote: >>OK I found the culprits: >> >>There are 3 ordered locus gene ID's that have slashes. As was >>previously discussed it seems XML Match is counting each one twice >>which would inflate the count by 3. >> >>Rv2561/Rv2562 'ordered locus' >>Rv2880c/Rv2879c 'ordered locus' >>Rv3021c/Rv3022c 'ordered locus' >> >>And as was previously noted, Rv3346/55c 'ORF' should be excluded. >> >>This will bring the both XML and sql db queries counts in sync at 4057. >> >>By my math: >>------------------------------------------------------------------------------------------------------------ >>XML tags read by Match: >>4066 unique matches - 6 genes excluded by Dr. D - 3 duplicates >>caused by slashes = total 4057 genes >>------------------------------------------------------------------------------------------------------------ >>sql query of the db tables: >>3988 ordered locus + 69 ORF = total 4057 genes >>------------------------------------------------------------------------------------------------------------ >> >> >>TO MOVE FORWARD: >> >>1. We need to decide how to address the slash ID's (should we keep >>and split into separate tuples or should they be excluded) >> >>2. Adjust the queries to reflect what is needed and test with raw sql. >> >>3. Update the queries on MycobacteriumTuberculosisUniProtSpeciesProfile.java >> >>4. Export a new Mtb gene database for testing. >> >> >>Richard >> >> >> >>On Tue, Feb 22, 2011 at 5:12 PM, Kam Dahlquist >><<mailto:kda...@lm...>kda...@lm...> wrote: >>Hi, >>I don't know why the numbers are off. I think the only way to find >>out is to get the results of the Postgres query and line it up next >>to the match results and see what is different. If you can send me >>the lists of IDs from the Postgres query and the match results, I >>can check them. >>Best, >>Dr. D >> >>At 04:57 PM 2/22/2011, you wrote: >>>Here is where I am on the numbers: >>>------------------------------------------------------------------------------------------------------------ >>> >>>XML tags read by Match: >>>4066 unique matches - 6 genes excluded by Dr. D = total 4060 genes >>>------------------------------------------------------------------------------------------------------------ >>> >>>sql query of the db tables: >>>3988 ordered locus + 69 ORF = total 4057 genes >>>------------------------------------------------------------------------------------------------------------ >>> >>>So by my count we are off by 3 genes >>> >>>Maybe XML Match is counting slashed genes as 2 separate genes? >>> >>>IE. so if it encountered 3 slashed genes it would in effect be >>>counting those 3 as 6? >>> >>>Richard >>>On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist >>><<mailto:kda...@lm...>kda...@lm...> wrote: >>>Hi, >>>I compared this list of gene IDs with the list I had on the >>>Testing Report on the wiki and found the following. >>>Your list has 70 IDs. 69 of them are identical to what is in my >>>list of IDs in the Testing report. >>>Indeed, the only difference is the one with the >>>slash. "Rv3346/55c" I looked this ID up at UniProt.org as >>>Rv3346. It appears in the record for UniProt ID O50384. For that >>>UniProt record, the ID referred to in the OrderedLocus tag is >>>Rv3355c which does appear in the gdb already. Looking up Rv3346 >>>on Tuberculist and the Stanford TB database, Rv3346 is not a real >>>gene ID. So this entire record with the slash in it needs to be >>>excluded and all the rest of them need to be captured. >>>I think then, that the numbers should add up correctly, is this true? >>>Best, >>>Dr. D >>>At 01:09 PM 2/22/2011, Richard Brous wrote: >>>>attachement now included... >>>>On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous >>>><<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>>Here is an export of the genes found using: select * from >>>>genenametype where type = 'ORF' and value ~ >>>>'[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also attached as a csv file. >>>> >>>>647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 >>>>5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 >>>>647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 >>>>647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 >>>>647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 >>>>647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 >>>>211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 >>>>648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 >>>>648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 >>>>648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 >>>>648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 >>>>648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 >>>>649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 >>>>649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 >>>>649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 >>>>649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 >>>>650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 >>>>174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 >>>>174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 >>>>312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 >>>>331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 >>>>445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 >>>>621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 >>>>622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 >>>>622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 >>>>622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 >>>>622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 >>>>623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 >>>>623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 >>>>623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 >>>>623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 >>>>624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 >>>>624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 >>>>625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 >>>>625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 >>>>626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 >>>>626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 >>>>626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 >>>>626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 >>>>627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 >>>>627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 >>>>627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 >>>>627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 >>>>627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 >>>>627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 >>>>627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 >>>>628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 >>>>629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 >>>>629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 >>>>629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 >>>>630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 >>>>630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 >>>>631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 >>>>632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 >>>>632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 >>>>633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 >>>>633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 >>>>645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 >>>>645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 >>>>645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 >>>>645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 >>>>645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 >>>>646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 >>>>646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 >>>>646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 >>>>646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 >>>>646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 >>>>646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 >>>>647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 >>>>647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 >>>>*****The item of note I see is the gene with the slash separating >>>>gene id's which refer to the same gene. >>>> >>>>Richard >>>> >>>>On Mon, Feb 21, 2011 at 11:11 PM, Richard Brous >>>><<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>>Understood. >>>> >>>>I'll check in with Dr. D in the afternoon tomorrow and discuss. >>>> >>>>Richard >>>> >>>>On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio >>>><<mailto:do...@lm...>do...@lm...> wrote: >>>>Hi Rich, >>>>Addressing the release business first, let's put it this way: if >>>>the remaining loose ends can be addressed by tomorrow, we can >>>>probably wait until then. If unexpected snags are encountered, >>>>then it would be worthwhile to release whatever you have. >>>>With that in mind, considering that you pretty much know the >>>>patterns of the IDs that are needed, I think it will only take a >>>>little digital forensic work now to figure out exactly which IDs >>>>are still needed. Once you know what those are, you should: >>>>1. Find where they are in the XML file. >>>>2. Knowing the XML location, find the corresponding table in the >>>>relational database (table names are generally derived from >>>>tag/element names). >>>>3. Knowing the table in the database, write or extend the >>>>SpeciesProfile query to retrieve that data. >>>>For the ID that must *not* be included, again it's a matter of >>>>tracking down what this ID is. Knowing this straggler, you can >>>>then consult with Dr. Dahlquist if this ID is truly a unique >>>>one-off, or is representative of a pattern that we'll want to >>>>exclude. Either way, this ID can be omitted by using "not" or >>>>"<>" or possibly "not like" or "not ~" (check PostgreSQL where >>>>clause syntax to see where the negation can be applied). >>>>John David N. Dionisio, PhD >>>>Associate Professor, Computer Science >>>>Loyola Marymount University >>>>On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: >>>> > actually i had a typo (emailing from desktop system but >>>> testing on my laptop... typed correctly here but wrong in >>>> pgadmin) but the results make much more sense now. >>>> > >>>> > >>>> > select count (*) from genenametype where (type = 'ordered >>>> locus' or type = 'ORF') and value like 'Rv%'; >>>> > returns 4058 >>>> > >>>> > select count (*) from genenametype where (type = 'ordered >>>> locus' or type = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>>> > returns 4058 >>>> > >>>> > >>>> -------------------------------------------------------------------------------------------------------------------------- >>>> >>>> > >>>> > >>>> > Continuing forward - >>>> > >>>> > The testing report says that 4066 unique matches exist in XML >>>> but 6 of them were eliminated by Dr. D leaving the desired number at 4060. >>>> > >>>> > So now we are only 2 genes short with the query returning >>>> 4058... which is also (conveniently) the sum of the two separate >>>> queries of 'ordered locus' and 'ORF' respectively. >>>> > >>>> > But recall that Dr. D said that only 69 genes of the missing >>>> 75 were tagged 'ORF' but we seem to have 1 extra gene tagged >>>> 'ORF' than we expected. Adding that into missing genes puts us 3 short... >>>> > >>>> > Should I make the changes to the code and export a gdb so that >>>> analysis can be done or wait until we work this through further? >>>> > >>>> > Richard >>>> > >>>> > >>>> > >>>> > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio >>>> <<mailto:do...@lm...>do...@lm...> wrote: >>>> > Hi Rich, >>>> > >>>> > The second form should have worked actually. What exactly was >>>> the error? >>>> > >>>> > John David N. Dionisio, PhD >>>> > Associate Professor, Computer Science >>>> > Loyola Marymount University >>>> > >>>> > >>>> > >>>> > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: >>>> > >>>> > > hmm not taking parenthesis where I thought they should go... >>>> syntax error >>>> > > >>>> > > select count (*) from genenametype where type = ('ordered >>>> locus' or 'ORF') and value like 'Rv%'; >>>> > > also tried >>>> > > select count (*) from genenametype where (type = 'ordered >>>> locus' or type = 'ORF') and value like 'Rv%'; >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > > ah yes... i see it... >>>> > > >>>> > > >>>> > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio >>>> <<mailto:do...@lm...>do...@lm...> wrote: >>>> > > Watch your parentheses: "and" has greater precedence than "or" :) >>>> > > >>>> > > >>>> > > John David N. Dionisio, PhD >>>> > > Associate Professor, Computer Science >>>> > > Loyola Marymount University >>>> > > >>>> > > >>>> > > On Feb 21, 2011, at 7:59 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > > >>>> > >> OK, so here are my query results from raw SQL: >>>> > >> >>>> > >> 1. using: like 'Rv%' >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ordered >>>> locus' and value like 'Rv%'; >>>> > >> returns 3988 >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ORF' and >>>> value like 'Rv%'; >>>> > >> returns 70 >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ordered >>>> locus' or type = 'ORF' and value like 'Rv%'; >>>> > >> returns 7011 >>>> > >> >>>> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ordered >>>> locus' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>>> > >> returns 3988 >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ordered >>>> locus' or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>>> > >> returns 7011 >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ORF' and >>>> value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>>> > >> returns 70 >>>> > >> >>>> > >> Conclusions: >>>> > >> >>>> > >> 1. It seems that querying for type = 'ORF' alone surfaces >>>> the 69 genes were were looking for plus one more (maybe the >>>> count for missing genes is off by 1?). >>>> > >> >>>> > >> 2. Combining the two types in a single query did not >>>> produce the results that I expected (7011? - how did that >>>> happen????) so this is likely not our solution... unless of >>>> course the query syntax isn't actually doing what I think it is... >>>> > >> >>>> > >> 3. I would think the best course of action is to serialy >>>> run two separate queries to capture all the required genes, then >>>> removing the one unneeded gene if its truly not wanted. >>>> > >> >>>> > >> What do you think? >>>> > >> >>>> > >> Richard >>>> > >> >>>> > >> >>>> > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio >>>> <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >> I don't recall the exact details of the missing 69, but if >>>> your query successfully returns them in raw SQL, then this is >>>> worth a try. You can integrate into the same query as long as >>>> the same columns are returned, which is the case here AFAIK, so >>>> go ahead and extend the existing query. >>>> > >> >>>> > >> >>>> > >> John David N. Dionisio, PhD >>>> > >> Associate Professor, Computer Science >>>> > >> Loyola Marymount University >>>> > >> >>>> > >> On Feb 21, 2011, at 6:56 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > >> >>>> > >>> So here is the appropriate code snippet from >>>> MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>>> > >>> public >>>> > >>> >>>> > >>> TableManager >>>> getSystemTableManagerCustomizations(TableManager tableManager, >>>> TableManager primarySystemTableManager, Date version) throws >>>> SQLException, InvalidParameterException { >>>> > >>> >>>> > >>> // Build the base query; we only use "ordered locus" and >>>> we only want >>>> > >>> >>>> > >>> // IDs that begin with "Rv." >>>> > >>> PreparedStatement ps = >>>> ConnectionManager.getRelationalDBConnection().prepareStatement( >>>> > >>> >>>> > >>> "SELECT value, type " + >>>> > >>> >>>> > >>> "FROM genenametype INNER JOIN entrytype_genetype " + >>>> > >>> >>>> > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >>>> > >>> >>>> > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and >>>> entrytype_gene_hjid = ?"); >>>> > >>> ResultSet result; >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> for (Row row : primarySystemTableManager.getRows()) { >>>> > >>> ps.setInt(1, Integer.parseInt(row.getValue( >>>> > >>> >>>> > >>> "UID"))); >>>> > >>> result = ps.executeQuery(); >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> // We actually want to keep the case where multiple ordered locus >>>> > >>> >>>> > >>> // names appear. >>>> > >>> >>>> > >>> while (result.next()) { >>>> > >>> >>>> > >>> // We want this name to appear in the OrderedLocusNames >>>> > >>> >>>> > >>> // system table. >>>> > >>> >>>> > >>> for (String id : result.getString("value").split("/")) { >>>> > >>> tableManager.submit( >>>> > >>> >>>> > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { >>>> "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { >>>> "\"Date\"", >>>> GenMAPPBuilderUtilities.getSystemsDateString(version) }, { >>>> "UID", row.getValue("UID") } }); >>>> > >>> } >>>> > >>> >>>> > >>> } >>>> > >>> >>>> > >>> } >>>> > >>> >>>> > >>> >>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >>>> >>>> > >>> So now we want to build the base query which uses "ordered >>>> locus" and "orf" and we only want IDs that begin with "Rv". >>>> > >>> >>>> > >>> I know there are more comprehensive ways to search for >>>> gene ID's by matching gene ID prefix but "like Rv%" seemed to >>>> work thus far, we just need to tell it to search for XML tag >>>> type orf in addition to ordered locus. >>>> > >>> >>>> > >>> "WHERE type = 'ordered locus' and type = 'orf' and value >>>> like 'Rv%' and entrytype_gene_hjid = ? " >>>> > >>> >>>> > >>> Here is a stab at it.... This part of our class was right >>>> as the server went down and my submission for week 6 assignment >>>> I can't seem to find. >>>> > >>> >>>> > >>> Is it possible to have two different types in the same >>>> query or should we rewrite a separate query for the orf tag? >>>> > >>> >>>> > >>> Richard >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > >>> >>>> > >>> thanks and will do as directed. >>>> > >>> >>>> > >>> My previous, last paragraph comment - A way for >>>> programming code in email holding its format in a mail message >>>> similarly to how you can post code on forum pages? >>>> > >>> >>>> > >>> <code> >>>> > >>> blah >>>> > >>> blah >>>> > >>> blah >>>> > >>> </code> >>>> > >>> >>>> > >>> thanks! >>>> > >>> >>>> > >>> Richard >>>> > >>> >>>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio >>>> <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >>> >>>> > >>> Greetings, >>>> > >>> >>>> > >>> Actually, gmbuilder.properties is for the TallyEngine >>>> only. When dealing with .gdb exports, look *only* at the >>>> SpeciesProfile class. So, to find those 69 IDs, it is the >>>> SpeciesProfile code, and *only* the SpeciesProfile code, that >>>> needs to be changed. >>>> > >>> >>>> > >>> Your take on how gmbuilder.properties is used, however, is >>>> understandable. It makes sense to assume that the TallyEngine >>>> code *and* the ID export code are based on the same >>>> characterization of the needed IDs. This replication is >>>> originally a historical artifact: SpeciesProfile was done first, >>>> and then TallyEngine was done later by another student. >>>> > >>> >>>> > >>> However, there are other factors beyond history that sort >>>> of necessitate this duplication of desired IDs: (skip the two >>>> bullets below if you'd rather cut to the chase of the work to be >>>> done, and discuss design issues later) >>>> > >>> >>>> > >>> - The actual XML import code is a black box: this is the >>>> "canned" JAXB library actually in action, and not our code at >>>> all. Plus, the XML import code really does not filter (nor >>>> should it), since the goal of the XML->relational database step >>>> is to fully capture the XML data in the relational >>>> database. So, XML count is necessarily separated from XML import. >>>> > >>> >>>> > >>> - The notion of a declarative mechanism for extracting IDs >>>> from the relational database (which is what >>>> gmbuilder.properties/TallyEngine uses) is interesting, but at >>>> the same time there is value in the arbitrary computation that >>>> can be done with Java (case in point: export two versions of an >>>> ID, with and without periods). This is not to say that it is >>>> impossible to do this declaratively, but let's just say that the >>>> procedural approach exists here and now, and a declarative >>>> approach will need more thought. >>>> > >>> >>>> > >>> These, and other factors, are good thoughts to hold onto >>>> and would be worthy of a good meeting discussion sometime, but >>>> bottom line for now: modifying the export behavior is a matter >>>> of editing the *SpeciesProfile* Java code, and not the >>>> gmbuilder.properties file. Turn your attention to that code. >>>> > >>> >>>> > >>> Now, as to annotating your code...I'd just put in code >>>> comments :) Or did you mean something else by tagging code in e-mail? >>>> > >>> >>>> > >>> John David N. Dionisio, PhD >>>> > >>> Associate Professor, Computer Science >>>> > >>> Loyola Marymount University >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >>>> > >>> >>>> > >>> > also, how do I tag code in email so it holds its >>>> formatting? I tried a few suggestions I found on the web but >>>> they aren't holding formatting or i'm just doing it wrong ;-D >>>> > >>> > >>>> > >>> > Richard >>>> > >>> > >>>> > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > >>> > OK, have some updates and some suggestions: >>>> > >>> > >>>> > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the >>>> gene testing report. We verified that XML match does indeed find >>>> 4066 unique matches - 75 of which are not in the gdb and need to be. >>>> > >>> > >>>> > >>> > Dr. Dahlquist informed me that she was the one who >>>> completed the gene db testing report, not a previous student of >>>> BIO367 and had already verified which genes were missing and >>>> where they were to be found. I had (mistakenly) assumed that >>>> since a student had performed the gene database testing I had to >>>> redo all of the verification. >>>> > >>> > >>>> > >>> > So that said, of the 75 genes missing - 69 need to be >>>> included and 6 excluded. >>>> > >>> > Per the gene db testing report: "69 of them have an "a", >>>> "b", or "d" suffix. They are all found in the ORF tag and need >>>> to be included in the gdb." >>>> > >>> > >>>> > >>> > To solve this we need to add additional search criteria >>>> into the M. tuberculosis section in gmbuilder.properties below: >>>> > >>> > # Mycobacterium tuberculosis >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_level_amount= >>>> > >>> > >>>> > >>> > 1 >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_element_level0= >>>> > >>> > >>>> > >>> > uniprot/entry/gene/name&type&ordered locus >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_query_level0= >>>> > >>> > >>>> > >>> > select count(*) from genenametype where type = 'ordered >>>> locus' and value like 'Rv%'; >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>>> > >>> > >>>> > >>> > Ordered Locus >>>> > >>> > SOLUTIONS: >>>> > >>> > >>>> > >>> > 1. So am i correct in my understanding that the second >>>> line is the query used by TallyEngine to read the XML file? If >>>> so then this is the issue we need to table for the moment until >>>> we get the gbd verified and re-released. We will revisit this to >>>> discover why it is not only reporting incorrectly but also why >>>> its added a second row of Ordered Locus on the TallyEngine results page. >>>> > >>> > >>>> > >>> > 2. The third line is the SQL query used by postgres >>>> during the export from XML to gdb. To find and get the ORF >>>> tagged genes could we not add the following lines and change the >>>> count in the first line: >>>> > >>> > >>>> > >>> > >>>> > >>> > # Mycobacterium tuberculosis >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_level_amount=2 >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>> locus >>>> > >>> > >>>> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >>>> > >>> > >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_query_level0= >>>> > >>> > >>>> > >>> > select count(*) from genenametype where type = 'ordered locus'; >>>> > >>> > mycobacteriumtuberculosis_query_level1=select count(*) >>>> from genenametype where type = 'orf'; >>>> > >>> > >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>>> > >>> > >>>> > >>> > Ordered Locus >>>> > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >>>> > >>> > >>>> > >>> > >>>> ---------------------------------------------------------------------------------------------------------------------------- >>>> >>>> > >>> > >>>> > >>> > Of course these queries would have be manually verified >>>> prior to making these changes but this seems like we are moving >>>> in the right direction. >>>> > >>> > >>>> > >>> > Richard >>>> > >>> > >>>> > >>> > >>>> > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > >>> > Just got done reading previous email and understand the >>>> change in priority. >>>> > >>> > >>>> > >>> > Will work on the missing ID's for now and shelve the the >>>> TalleyEngine issue for the moment. >>>> > >>> > >>>> > >>> > Also great about a more formalized weekly meeting. I was >>>> going to suggest it myself as it has been slow going so far as >>>> maybe i'm a bit too independent in this independent study class =D >>>> > >>> > >>>> > >>> > Will dig further into the missing ID's later tonight and >>>> during day tomorrow and report back. >>>> > >>> > >>>> > >>> > Richard >>>> > >>> > >>>> > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio >>>> <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >>> > Hi Rich, >>>> > >>> > >>>> > >>> > No problem. The pertinent line you're referring to, for >>>> XML, is this, right above the line you copied: >>>> > >>> > >>>> > >>> > >>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>> locus >>>> > >>> > >>>> > >>> > The slash-separated section is the "path" of XML tags >>>> leading to the element of interest; then, after the ampersand, >>>> is a name/value pair for the desired attribute to count. Note >>>> that there is no hint of a *content*-based filter (nor is there >>>> the capability for one, as far as I can tell in the code). By >>>> "content," I mean that we can't specify filters based on what's >>>> *between* the tags. We can only go as far as filter by >>>> attribute value, e.g., type="ordered locus". >>>> > >>> > >>>> > >>> > But anyway, as mentioned in the earlier e-mail, let's >>>> have the missing IDs in the .gdb take precedence for >>>> now. Please take a look at the tuberculosis, A. thaliana, and >>>> P. falciparum profiles to get an idea for how the ID output can >>>> be customized, then let me know if you have any questions or >>>> need to confirm anything. >>>> > >>> > >>>> > >>> > John David N. Dionisio, PhD >>>> > >>> > Associate Professor, Computer Science >>>> > >>> > Loyola Marymount University >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >>>> > >>> > >>>> > >>> > > Sorry been slammed with a programming assignment that >>>> kept needing continued iteration and it has been all consuming >>>> until last night. But I did get a chance to work with your >>>> comments and review the code again with a different mind set. >>>> > >>> > > >>>> > >>> > > Yes, I examined the gmbuilder.properties file ( the >>>> query is also in the >>>> MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned >>>> in a previous email ) but I don't think I see what you mean >>>> regarding the XML count. >>>> > >>> > > >>>> > >>> > > I understood that: >>>> mycobacteriumtuberculosis_query_level0=select count(*) from >>>> genenametype where type = 'ordered locus' and value like >>>> 'Rv%'; was the db query but don't see which is the XML count... >>>> or do they share the same query and you are saying that XML >>>> count doesn't recognize and therefore cannot use the 'Rv%' parameter? >>>> > >>> > > >>>> > >>> > > Richard >>>> > >>> > > >>>> > >>> > > >>>> > >>> > > >>>> > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. >>>> Dionisio <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >>> > > Hi Rich, >>>> > >>> > > >>>> > >>> > > Sorry for the delay. Had some distractions coming >>>> into the weekend. >>>> > >>> > > >>>> > >>> > > You've looked at the code; have you looked at >>>> gmbuilder.properties? (I may have mentioned it a few e-mails >>>> ago, just as you were starting to dig into this) >>>> > >>> > > >>>> > >>> > > On the copy I have, the M. tuberculosis block looks >>>> like this (indentation is mine to set it apart): >>>> > >>> > > >>>> > >>> > > # Mycobacterium tuberculosis >>>> > >>> > > mycobacteriumtuberculosis_level_amount=1 >>>> > >>> > > >>>> > >>> > > >>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>> locus >>>> > >>> > > >>>> > >>> > > mycobacteriumtuberculosis_query_level0=select >>>> count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >>>> > >>> > > >>>> > >>> > > >>>> mycobacteriumtuberculosis_table_name_level0=Ordered Locus >>>> > >>> > > >>>> > >>> > > There, I think, is the rub. Notice that the XML count >>>> does not filter on RV%. The SQL query does. >>>> > >>> > > >>>> > >>> > > Unfortunately, I don't think the TallyEngine can >>>> include selective filtering in the XML counts. If the need to >>>> do selective filtering on XML is necessary, then I think we're >>>> looking at a new functionality for you to implement (or, if this >>>> throws things off too much, this may have to be noted somewhere, >>>> that the XML vs. database counts may be off because the database >>>> count is doing some text-based filtering but the XML count does not). >>>> > >>> > > >>>> > >>> > > What does xmlpipedb-match say? That will at least >>>> tell you whether the 'RV%' count is indeed correct. >>>> > >>> > > >>>> > >>> > > John David N. Dionisio, PhD >>>> > >>> > > Associate Professor, Computer Science >>>> > >>> > > Loyola Marymount University >>>> > >>> > > >>>> > >>> > > >>>> > >>> > > >>>> > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >>>> > >>> > > >>>> > >>> > > > OK here is what I was able to put together from the >>>> past few hours of code review: >>>> > >>> > > > >>>> > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>>> > >>> > > > -reveals that after the 2 System table modifications >>>> are made adding species name and link, a PreparedStatement is >>>> instantiated which builds and calls the base query. >>>> > >>> > > > >>>> > >>> > > > -The base query called is: ("SELECT value, type " + >>>> "FROM genenametype INNER JOIN entrytype_genetype " + >>>> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >>>> "WHERE type = 'ordered locus' and value like 'Rv%' and >>>> entrytype_gene_hjid = ?") >>>> > >>> > > > >>>> > >>> > > > -So its looking in 'ordered locus' table/column for >>>> any tuple that starts with Rv (followed by any substring) and >>>> entrytype_gene_hjid = ? . >>>> > >>> > > > The 'like' comparator and % usage are clear with the >>>> 'type' entrytype_gene_hjid = ? >>>> > >>> > > > >>>> > >>> > > > -To me it seems the query makes sense so the problem >>>> is likely elsewhere. >>>> > >>> > > > >>>> > >>> > > > GenMappBuilder.java: >>>> > >>> > > > -I found method doTallies() at code line 895 which: >>>> > >>> > > > Instantiates a Configuration called >>>> hibernateConfiguration and assigns to it the current hibernate configuration >>>> > >>> > > > Validates database settings by analyzing >>>> hibernateConfiguration >>>> > >>> > > > Instantiates a CriterionList for uniprot and assigns >>>> to it TallyType.UNIPROT >>>> > >>> > > > Instantiates a CriterionList for go and assigns to >>>> it TallyType.GO >>>> > >>> > > > Determines if both xml files exist >>>> > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase >>>> are run on both xml files and their respective CriterionList >>>> > >>> > > > Results are then formatted for display in a table. >>>> > >>> > > > >>>> > >>> > > > -So enum TallyType which means that they are the >>>> only valid datatypes which TallyEngine accepts... go to know ... >>>> > >>> > > > >>>> > >>> > > > -Based on the screen shot of Tally Engine it would >>>> seem that both getTallyResultsXML() and >>>> getTallyResultsDatabase() are incorrectly returning. Likely due >>>> to both using an incorrect query (as we previously supposed). >>>> But where are the queries?... the more I dig the more I think >>>> they are in the criterial all the work is done against. >>>> > >>> > > > >>>> > >>> > > > continuing the review: >>>> > >>> > > > getTallyResultsXML() calls Tally Engine instance >>>> method getXmlFileCounts(xmlFile) >>>> > >>> > > > getTallyResultsDatabase() calls Tally Engine >>>> instance method getDbcounts(new QueryEngine(hibernateConfiguration) >>>> > >>> > > > Both of these instanced methods originate from >>>> TallyEngine.java... >>>> > >>> > > > >>>> > >>> > > > TallyEngine.java: >>>> > >>> > > > >>>> > >>> > > > getXmlFileCounts() calls digestXmlFile() which >>>> instantiates a digester then processes against criteria... but >>>> this quickly becomes confusing and is hard to follow >>>> > >>> > > > >>>> > >>> > > > getDbcounts() then starts a db session and executes >>>> a query but then I also get a bit lost with my limited db knowledge. >>>> > >>> > > > >>>> > >>> > > > >>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >>>> >>>> > >>> > > > >>>> > >>> > > > OVERALL I think I'm getting closer to the issues but >>>> I still feel as if I'm missing some understanding to proceed >>>> further. Can you pass along some of that Dondi insight and steer >>>> me in the right direction? =D >>>> > >>> > > > >>>> > >>> > > > -DB Tally - Not having taken databases yet certainly >>>> is limiting my ability determine where the "criteria" are being >>>> set and how they are followed during session activities. Also is >>>> the query we have been looking for this whole time in the >>>> criteria or someplace else? >>>> > >>> > > > >>>> > >>> > > > -XML Tally - again is the query contained within the >>>> criteria that digestXmlFile() uses to parse? >>>> > >>> > > > >>>> > >>> > > > Richard >>>> > >>> > > > >>>> > >>> > > > >>>> > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. >>>> Dionisio <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >>> > > > Right, schema issues are unlikely. Most count >>>> discrepancies like this that I've seen have boiled down to >>>> forming the right query. Then, knowing the right query (in both >>>> XML and SQL), it's a matter of making sure that TallyEngine asks >>>> that same query. >>>> > >>> > > > >>>> > >>> > > > John David N. Dionisio, PhD >>>> > >>> > > > Associate Professor, Computer Science >>>> > >>> > > > Loyola Marymount University >>>> > >>> > > > >>>> > >>> > > > >>>> > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >>>> > >>> > > > >>>> > >>> > > > > OK, so based on your approach: >>>> > >>> > > > > >>>> > >>> > > > > 1. I'll start with reviewing the queries for >>>> xmlpipedb-match and sql queries needed for the respective >>>> results as you requested. >>>> > >>> > > > > >>>> > >>> > > > > I was also thinking I may need to review the >>>> schema from xml into postgres but the issue isn't likely a >>>> schema error. The error most likely lies in how xmlpipedbutils >>>> queries the data from xml source and writes to the tables what it returns? >>>> > >>> > > > > >>>> > >>> > > > > 2. I'll review the code: trace the entrance of >>>> tally engine in the gmbuilder code then follow it through the xmlpipedbutils. >>>> > >>> > > > > >>>> > >>> > > > > Richard >>>> > >>> > > > > >>>> > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. >>>> Dionisio <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >>> > > > > Just wanted to confirm (since I wasn't sure in the >>>> first e-mail) --- the XMLPipeDB Utilities source code is in >>>> trunk/xmlpipedbutils in SourceForge's Subversion repo. >>>> > >>> > > > > >>>> > >>> > > > > John David N. Dionisio, PhD >>>> > >>> > > > > Associate Professor, Computer Science >>>> > >>> > > > > Loyola Marymount University >>>> > >>> > > > > >>>> > >>> > > > > >>>> > >>> > > > > >>>> > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >>>> > >>> > > > > >>>> > >>> > > > > > Hi Dondi, >>>> > >>> > > > > > >>>> > >>> > > > > > So I'm at the point in working with M >>>> tuberculosis that I was able to exactly reproduce Dr. >>>> Dahlquist's problematic TallyEngine results. >>>> > >>> > > > > > >>>> > >>> > > > > > gmb2b60 Results >>>> > >>> > > > > > >>>> > >>> > > > > > >>>> > >>> > > > > > >>>> > >>> > > > > > Now the proverbial question - What next to solve >>>> the Ordered Locus import/count issue? >>>> > >>> > > > > > >>>> > >>> > > > > > ********************************************** >>>> > >>> > > > > > Here is my thought process: >>>> > >>> > > > > > >>>> > >>> > > > > > Step 1: How does the import process work at the >>>> high level? (obviously correct me if I'm wrong) >>>> > >>> > > > > > >>>> > >>> > > > > > I believe that basically as each XML tag is >>>> read, it is placed in the proper Postgres table(s) based on some >>>> criteria. There is also likely some sort of check that each >>>> individual tag is in valid XML format unless we don't care at >>>> this stage (care at export) or maybe the parser just skips over >>>> and goes on to the next . >>>> > >>> > > > > > >>>> > >>> > > > > > Step 2: What could be the problem? >>>> > >>> > > > > > >>>> > >>> > > > > > Either - >>>> > >>> > > > > > a. XML tags are being parsed incorrectly >>>> (ignored/skipped)? >>>> > >>> > > > > > b. Decision criteria of which table they should >>>> be added to? >>>> > >>> > > > > > >>>> > >>> > > > > > ********************************************** >>>> > >>> > > > > > >>>> > >>> > > > > > I read on the sourceforge wiki: >>>> > >>> > > > > > >>>> > >>> > > > > > XMLPipeDB has a modular architecture with three >>>> components that may be used separately or together. XSD-to-DB >>>> reads an XSD (XML Schema Definition) and automatically generates >>>> an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB >>>> Utilities provides functionality for configuring the database, >>>> importing data, and performing queries. GenMAPP Builder is based >>>> on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene >>>> Databases based on data from UniProt and Gene Ontology (GO). >>>> > >>> > > > > > >>>> > >>> > > > > > So I should probably start with the XMLPipeDB >>>> Utilities which are where? I don't see any in the basic >>>> distribution or are they not standalone and called from the command line? >>>> > >>> > > > > > >>>> > >>> > > > > > Thanks! >>>> > >>> > > > > > >>>> > >>> > > > > > Richard > > > >------------------------------------------------------------------------------ >Free Software Download: Index, Search & Analyze Logs and other IT data in >Real-Time with Splunk. Collect, index and harness all the fast moving IT data >generated by your applications, servers and devices whether physical, virtual >or in the cloud. Deliver compliance at lower cost and gain new business >insights. ><http://p.sf.net/sfu/splunk-dev2dev>http://p.sf.net/sfu/splunk-dev2dev >_______________________________________________ >xmlpipedb-developer mailing list ><mailto:xml...@li...>xml...@li... >https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > >------------------------------------------------------------------------------ >Free Software Download: Index, Search & Analyze Logs and other IT data in >Real-Time with Splunk. Collect, index and harness all the fast moving IT data >generated by your applications, servers and devices whether physical, virtual >or in the cloud. Deliver compliance at lower cost and gain new business >insights. ><http://p.sf.net/sfu/splunk-dev2dev>http://p.sf.net/sfu/splunk-dev2dev >_______________________________________________ >xmlpipedb-developer mailing list ><mailto:xml...@li...>xml...@li... >https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > |
From: Richard B. <rbr...@gm...> - 2011-02-23 19:16:20
|
Hmm... possible that the method to split up records has been globally implemented for export from Postgres? Since the records are automatically splitting when exported from postgres to gdb file, all we need to do is exclude Rv3346/55c 'ORF' from "select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';" and we should be set. That seem correct Dondi? Richard On Wed, Feb 23, 2011 at 11:04 AM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I checked these out. They all should be separated and kept as separate > OrderedLocusNames records. In fact, the gdb already has them separated into > individual records, so it is only the tally for Postgres that is off, they > were correctly exported into the gdb. > > Best, > Dr. D > > > At 07:31 PM 2/22/2011, Richard Brous wrote: > > OK I found the culprits: > > There are 3 ordered locus gene ID's that have slashes. As was previously > discussed it seems XML Match is counting each one twice which would inflate > the count by 3. > > Rv2561/Rv2562 'ordered locus' > Rv2880c/Rv2879c 'ordered locus' > Rv3021c/Rv3022c 'ordered locus' > > And as was previously noted, Rv3346/55c 'ORF' should be excluded. > > This will bring the both XML and sql db queries counts in sync at 4057. > > By my math: > > ------------------------------------------------------------------------------------------------------------ > XML tags read by Match: > 4066 unique matches - 6 genes excluded by Dr. D - 3 duplicates caused by > slashes = total 4057 genes > > ------------------------------------------------------------------------------------------------------------ > sql query of the db tables: > 3988 ordered locus + 69 ORF = total 4057 genes > ------------------------------------------------------------------------------------------------------------ > > > TO MOVE FORWARD: > > 1. We need to decide how to address the slash ID's (should we keep and > split into separate tuples or should they be excluded) > > 2. Adjust the queries to reflect what is needed and test with raw sql. > > 3. Update the queries on > MycobacteriumTuberculosisUniProtSpeciesProfile.java > > 4. Export a new Mtb gene database for testing. > > > Richard > > > > On Tue, Feb 22, 2011 at 5:12 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I don't know why the numbers are off. I think the only way to find out is > to get the results of the Postgres query and line it up next to the match > results and see what is different. If you can send me the lists of IDs from > the Postgres query and the match results, I can check them. > > Best, > Dr. D > > > At 04:57 PM 2/22/2011, you wrote: > > Here is where I am on the numbers: > > ------------------------------------------------------------------------------------------------------------ > XML tags read by Match: > 4066 unique matches - 6 genes excluded by Dr. D = total 4060 genes > > ------------------------------------------------------------------------------------------------------------ > sql query of the db tables: > 3988 ordered locus + 69 ORF = total 4057 genes > > ------------------------------------------------------------------------------------------------------------ > So by my count we are off by 3 genes > > Maybe XML Match is counting slashed genes as 2 separate genes? > > IE. so if it encountered 3 slashed genes it would in effect be counting > those 3 as 6? > > Richard > > On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > I compared this list of gene IDs with the list I had on the Testing Report > on the wiki and found the following. > Your list has 70 IDs. 69 of them are identical to what is in my list of > IDs in the Testing report. > Indeed, the only difference is the one with the slash. "Rv3346/55c" I > looked this ID up at UniProt.org as Rv3346. It appears in the record for > UniProt ID O50384. For that UniProt record, the ID referred to in the > OrderedLocus tag is Rv3355c which does appear in the gdb already. Looking > up Rv3346 on Tuberculist and the Stanford TB database, Rv3346 is not a real > gene ID. So this entire record with the slash in it needs to be excluded > and all the rest of them need to be captured. > > I think then, that the numbers should add up correctly, is this true? > Best, Dr. D > > At 01:09 PM 2/22/2011, Richard Brous wrote: > > attachement now included... > On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous <rbr...@gm...> wrote: > Here is an export of the genes found using: select * from genenametype > where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also > attached as a csv file. > 647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 > 5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 > 647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 > 647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 > 647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 > 647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 > 211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 > 648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 > 648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 > 648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 > 648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 > 648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 > 649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 > 649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 > 649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 > 649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 > 650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 > 174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 > 174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 > 312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 > 331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 > 445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 > 621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 > 622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 > 622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 > 622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 > 622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 > 623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 > 623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 > 623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 > 623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 > 624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 > 624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 > 625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 > 625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 > 626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 > 626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 > 626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 > 626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 > 627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 > 627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 > 627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 > 627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 > 627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 > 627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 > 627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 > 628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 > 629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 > 629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 > 629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 > 630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 > 630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 > 631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 > 632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 > 632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 > 633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 > 633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 > 645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 > 645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 > 645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 > 645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 > 645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 > 646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 > 646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 > 646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 > 646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 > 646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 > 646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 > 647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 > 647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 *****The > item of note I see is the gene with the slash separating gene id's which > refer to the same gene. Richard On Mon, Feb 21, 2011 at 11:11 PM, > Richard Brous <rbr...@gm...> wrote: Understood. I'll check in with > Dr. D in the afternoon tomorrow and discuss. Richard > > On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <do...@lm...> > wrote: Hi Rich, Addressing the release business first, let's put it this > way: if the remaining loose ends can be addressed by tomorrow, we can > probably wait until then. If unexpected snags are encountered, then it > would be worthwhile to release whatever you have. With that in mind, > considering that you pretty much know the patterns of the IDs that are > needed, I think it will only take a little digital forensic work now to > figure out exactly which IDs are still needed. Once you know what those > are, you should: 1. Find where they are in the XML file. 2. Knowing the > XML location, find the corresponding table in the relational database (table > names are generally derived from tag/element names). 3. Knowing the table > in the database, write or extend the SpeciesProfile query to retrieve that > data. For the ID that must *not* be included, again it's a matter of > tracking down what this ID is. Knowing this straggler, you can then consult > with Dr. Dahlquist if this ID is truly a unique one-off, or is > representative of a pattern that we'll want to exclude. Either way, this ID > can be omitted by using "not" or "<>" or possibly "not like" or "not ~" > (check PostgreSQL where clause syntax to see where the negation can be > applied). John David N. Dionisio, PhD Associate Professor, Computer > Science Loyola Marymount University > > On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: > actually i had a typo > (emailing from desktop system but testing on my laptop... typed correctly > here but wrong in pgadmin) but the results make much more sense now. > > > > select count (*) from genenametype where (type = 'ordered locus' or type = > 'ORF') and value like 'Rv%'; > returns 4058 > > select count (*) from > genenametype where (type = 'ordered locus' or type = 'ORF') and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 4058 > > > -------------------------------------------------------------------------------------------------------------------------- > > > > Continuing forward - > > The testing report says that 4066 unique > matches exist in XML but 6 of them were eliminated by Dr. D leaving the > desired number at 4060. > > So now we are only 2 genes short with the > query returning 4058... which is also (conveniently) the sum of the two > separate queries of 'ordered locus' and 'ORF' respectively. > > But > recall that Dr. D said that only 69 genes of the missing 75 were tagged > 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. Adding > that into missing genes puts us 3 short... > > Should I make the changes > to the code and export a gdb so that analysis can be done or wait until we > work this through further? > > Richard > > > > On Mon, Feb 21, 2011 at > 10:04 PM, John David N. Dionisio <do...@lm...> wrote: > Hi Rich, > > The > second form should have worked actually. What exactly was the error? > > > John David N. Dionisio, PhD > Associate Professor, Computer Science > > Loyola Marymount University > > > > On Feb 22, 2011, at 1:01 AM, Richard > Brous wrote: > > > hmm not taking parenthesis where I thought they should > go... syntax error > > > > select count (*) from genenametype where type = > ('ordered locus' or 'ORF') and value like 'Rv%'; > > also tried > > > select count (*) from genenametype where (type = 'ordered locus' or type = > 'ORF') and value like 'Rv%'; > > > > > > > > > > > > On Mon, Feb 21, 2011 > at 9:40 PM, Richard Brous <rbr...@gm...> wrote: > > ah yes... i see > it... > > > > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio > <do...@lm...> wrote: > > Watch your parentheses: "and" has greater > precedence than "or" :) > > > > > > John David N. Dionisio, PhD > > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> > wrote: > > > >> OK, so here are my query results from raw SQL: > >> > >> > 1. using: like 'Rv%' > >> > >> select count (*) from genenametype where > type = 'ordered locus' and value like 'Rv%'; > >> returns 3988 > >> > >> > select count (*) from genenametype where type = 'ORF' and value like 'Rv%'; > > >> returns 70 > >> > >> select count (*) from genenametype where type = > 'ordered locus' or type = 'ORF' and value like 'Rv%'; > >> returns 7011 > > >> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' > > >> > >> select count (*) from genenametype where type = 'ordered locus' > and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 3988 > >> > >> > select count (*) from genenametype where type = 'ordered locus' or type = > 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 7011 > >> > > >> select count (*) from genenametype where type = 'ORF' and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 70 > >> > >> Conclusions: > > >> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 > genes were were looking for plus one more (maybe the count for missing genes > is off by 1?). > >> > >> 2. Combining the two types in a single query did > not produce the results that I expected (7011? - how did that happen????) so > this is likely not our solution... unless of course the query syntax isn't > actually doing what I think it is... > >> > >> 3. I would think the best > course of action is to serialy run two separate queries to capture all the > required genes, then removing the one unneeded gene if its truly not wanted. > > >> > >> What do you think? > >> > >> Richard > >> > >> > >> On Mon, Feb > 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...> wrote: > >> I > don't recall the exact details of the missing 69, but if your query > successfully returns them in raw SQL, then this is worth a try. You can > integrate into the same query as long as the same columns are returned, > which is the case here AFAIK, so go ahead and extend the existing query. > > >> > >> > >> John David N. Dionisio, PhD > >> Associate Professor, > Computer Science > >> Loyola Marymount University > >> > >> On Feb 21, > 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: > >> > >>> So > here is the appropriate code snippet from > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> public > >>> > > >>> TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) throws > SQLException, InvalidParameterException { > >>> > >>> // Build the base > query; we only use "ordered locus" and we only want > >>> > >>> // IDs > that begin with "Rv." > >>> PreparedStatement ps = > ConnectionManager.getRelationalDBConnection().prepareStatement( > >>> > > >>> "SELECT value, type " + > >>> > >>> "FROM genenametype INNER JOIN > entrytype_genetype " + > >>> > >>> "ON (entrytype_genetype_name_hjid = > entrytype_genetype.hjid) " + > >>> > >>> "WHERE type = 'ordered locus' and > value like 'Rv%' and entrytype_gene_hjid = ?"); > >>> ResultSet result; > > >>> > >>> > >>> > >>> for (Row row : primarySystemTableManager.getRows()) > { > >>> ps.setInt(1, Integer.parseInt(row.getValue( > >>> > >>> "UID"))); > > >>> result = ps.executeQuery(); > >>> > >>> > >>> > >>> // We actually > want to keep the case where multiple ordered locus > >>> > >>> // names > appear. > >>> > >>> while (result.next()) { > >>> > >>> // We want this > name to appear in the OrderedLocusNames > >>> > >>> // system table. > >>> > > >>> for (String id : result.getString("value").split("/")) { > >>> > tableManager.submit( > >>> > >>> "OrderedLocusNames", QueryType.insert, > new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, > { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) }, { > "UID", row.getValue("UID") } }); > >>> } > >>> > >>> } > >>> > >>> } > >>> > > >>> > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > >>> So now we want to build the base query which uses "ordered locus" and > "orf" and we only want IDs that begin with "Rv". > >>> > >>> I know there > are more comprehensive ways to search for gene ID's by matching gene ID > prefix but "like Rv%" seemed to work thus far, we just need to tell it to > search for XML tag type orf in addition to ordered locus. > >>> > >>> > "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and > entrytype_gene_hjid = ? " > >>> > >>> Here is a stab at it.... This part > of our class was right as the server went down and my submission for week 6 > assignment I can't seem to find. > >>> > >>> Is it possible to have two > different types in the same query or should we rewrite a separate query for > the orf tag? > >>> > >>> Richard > >>> > >>> > >>> > >>> On Sun, Feb 20, > 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote: > >>> > >>> > thanks and will do as directed. > >>> > >>> My previous, last paragraph > comment - A way for programming code in email holding its format in a mail > message similarly to how you can post code on forum pages? > >>> > >>> > <code> > >>> blah > >>> blah > >>> blah > >>> </code> > >>> > >>> thanks! > > >>> > >>> Richard > >>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David > N. Dionisio <do...@lm...> wrote: > >>> > >>> Greetings, > >>> > >>> > Actually, gmbuilder.properties is for the TallyEngine only. When dealing > with .gdb exports, look *only* at the SpeciesProfile class. So, to find > those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile > code, that needs to be changed. > >>> > >>> Your take on how > gmbuilder.properties is used, however, is understandable. It makes sense to > assume that the TallyEngine code *and* the ID export code are based on the > same characterization of the needed IDs. This replication is originally a > historical artifact: SpeciesProfile was done first, and then TallyEngine was > done later by another student. > >>> > >>> However, there are other > factors beyond history that sort of necessitate this duplication of desired > IDs: (skip the two bullets below if you'd rather cut to the chase of the > work to be done, and discuss design issues later) > >>> > >>> - The actual > XML import code is a black box: this is the "canned" JAXB library actually > in action, and not our code at all. Plus, the XML import code really does > not filter (nor should it), since the goal of the XML->relational database > step is to fully capture the XML data in the relational database. So, XML > count is necessarily separated from XML import. > >>> > >>> - The notion > of a declarative mechanism for extracting IDs from the relational database > (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at > the same time there is value in the arbitrary computation that can be done > with Java (case in point: export two versions of an ID, with and without > periods). This is not to say that it is impossible to do this > declaratively, but let's just say that the procedural approach exists here > and now, and a declarative approach will need more thought. > >>> > >>> > These, and other factors, are good thoughts to hold onto and would be worthy > of a good meeting discussion sometime, but bottom line for now: modifying > the export behavior is a matter of editing the *SpeciesProfile* Java code, > and not the gmbuilder.properties file. Turn your attention to that code. > > >>> > >>> Now, as to annotating your code...I'd just put in code comments > :) Or did you mean something else by tagging code in e-mail? > >>> > >>> > John David N. Dionisio, PhD > >>> Associate Professor, Computer Science > > >>> Loyola Marymount University > >>> > >>> > >>> > >>> > >>> On Feb 21, > 2011, at 12:38 AM, Richard Brous wrote: > >>> > >>> > also, how do I tag > code in email so it holds its formatting? I tried a few suggestions I found > on the web but they aren't holding formatting or i'm just doing it wrong ;-D > > >>> > > >>> > Richard > >>> > > >>> > On Sun, Feb 20, 2011 at 9:35 PM, > Richard Brous <rbr...@gm...> wrote: > >>> > OK, have some updates > and some suggestions: > >>> > > >>> > On Friday Dr. Dahlquist and I sat > down and reviewed the gene testing report. We verified that XML match does > indeed find 4066 unique matches - 75 of which are not in the gdb and need to > be. > >>> > > >>> > Dr. Dahlquist informed me that she was the one who > completed the gene db testing report, not a previous student of BIO367 and > had already verified which genes were missing and where they were to be > found. I had (mistakenly) assumed that since a student had performed the > gene database testing I had to redo all of the verification. > >>> > > >>> > > So that said, of the 75 genes missing - 69 need to be included and 6 > excluded. > >>> > Per the gene db testing report: "69 of them have an "a", > "b", or "d" suffix. They are all found in the ORF tag and need to be > included in the gdb." > >>> > > >>> > To solve this we need to add > additional search criteria into the M. tuberculosis section in > gmbuilder.properties below: > >>> > # Mycobacterium tuberculosis > >>> > > > >>> > mycobacteriumtuberculosis_level_amount= > >>> > > >>> > 1 > >>> > > > >>> > mycobacteriumtuberculosis_element_level0= > >>> > > >>> > > uniprot/entry/gene/name&type&ordered locus > >>> > > >>> > > mycobacteriumtuberculosis_query_level0= > >>> > > >>> > select count(*) > from genenametype where type = 'ordered locus' and value like 'Rv%'; > >>> > > > >>> > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > > Ordered Locus > >>> > SOLUTIONS: > >>> > > >>> > 1. So am i correct in my > understanding that the second line is the query used by TallyEngine to read > the XML file? If so then this is the issue we need to table for the moment > until we get the gbd verified and re-released. We will revisit this to > discover why it is not only reporting incorrectly but also why its added a > second row of Ordered Locus on the TallyEngine results page. > >>> > > >>> > > 2. The third line is the SQL query used by postgres during the export from > XML to gdb. To find and get the ORF tagged genes could we not add the > following lines and change the count in the first line: > >>> > > >>> > > > >>> > # Mycobacterium tuberculosis > >>> > > >>> > > mycobacteriumtuberculosis_level_amount=2 > >>> > > >>> > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > > >>> > > >>> > > >>> > mycobacteriumtuberculosis_query_level0= > >>> > > > >>> > select count(*) from genenametype where type = 'ordered locus'; > > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from > genenametype where type = 'orf'; > >>> > > >>> > > >>> > > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > Ordered Locus > > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > >>> > > > >>> > > ---------------------------------------------------------------------------------------------------------------------------- > > >>> > > >>> > Of course these queries would have be manually verified > prior to making these changes but this seems like we are moving in the right > direction. > >>> > > >>> > Richard > >>> > > >>> > > >>> > On Thu, Feb 17, > 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: > >>> > Just > got done reading previous email and understand the change in priority. > > >>> > > >>> > Will work on the missing ID's for now and shelve the the > TalleyEngine issue for the moment. > >>> > > >>> > Also great about a more > formalized weekly meeting. I was going to suggest it myself as it has been > slow going so far as maybe i'm a bit too independent in this independent > study class =D > >>> > > >>> > Will dig further into the missing ID's > later tonight and during day tomorrow and report back. > >>> > > >>> > > Richard > >>> > > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. > Dionisio <do...@lm...> wrote: > >>> > Hi Rich, > >>> > > >>> > No > problem. The pertinent line you're referring to, for XML, is this, right > above the line you copied: > >>> > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > >>> > The slash-separated section is the "path" of XML > tags leading to the element of interest; then, after the ampersand, is a > name/value pair for the desired attribute to count. Note that there is no > hint of a *content*-based filter (nor is there the capability for one, as > far as I can tell in the code). By "content," I mean that we can't specify > filters based on what's *between* the tags. We can only go as far as filter > by attribute value, e.g., type="ordered locus". > >>> > > >>> > But > anyway, as mentioned in the earlier e-mail, let's have the missing IDs in > the .gdb take precedence for now. Please take a look at the tuberculosis, > A. thaliana, and P. falciparum profiles to get an idea for how the ID output > can be customized, then let me know if you have any questions or need to > confirm anything. > >>> > > >>> > John David N. Dionisio, PhD > >>> > > Associate Professor, Computer Science > >>> > Loyola Marymount University > > >>> > > >>> > > >>> > > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous > wrote: > >>> > > >>> > > Sorry been slammed with a programming assignment > that kept needing continued iteration and it has been all consuming until > last night. But I did get a chance to work with your comments and review the > code again with a different mind set. > >>> > > > >>> > > Yes, I examined > the gmbuilder.properties file ( the query is also in the > MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a > previous email ) but I don't think I see what you mean regarding the XML > count. > >>> > > > >>> > > I understood that: > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype > where type = 'ordered locus' and value like 'Rv%'; was the db query but > don't see which is the XML count... or do they share the same query and you > are saying that XML count doesn't recognize and therefore cannot use the > 'Rv%' parameter? > >>> > > > >>> > > Richard > >>> > > > >>> > > > >>> > > > > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < > do...@lm...> wrote: > >>> > > Hi Rich, > >>> > > > >>> > > Sorry for the > delay. Had some distractions coming into the weekend. > >>> > > > >>> > > > You've looked at the code; have you looked at gmbuilder.properties? (I may > have mentioned it a few e-mails ago, just as you were starting to dig into > this) > >>> > > > >>> > > On the copy I have, the M. tuberculosis block > looks like this (indentation is mine to set it apart): > >>> > > > >>> > > > # Mycobacterium tuberculosis > >>> > > > mycobacteriumtuberculosis_level_amount=1 > >>> > > > >>> > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > > >>> > > > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype > where type = 'ordered locus' and value like 'Rv%'; > >>> > > > >>> > > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > >>> > > > > >>> > > There, I think, is the rub. Notice that the XML count does > not filter on RV%. The SQL query does. > >>> > > > >>> > > Unfortunately, > I don't think the TallyEngine can include selective filtering in the XML > counts. If the need to do selective filtering on XML is necessary, then I > think we're looking at a new functionality for you to implement (or, if this > throws things off too much, this may have to be noted somewhere, that the > XML vs. database counts may be off because the database count is doing some > text-based filtering but the XML count does not). > >>> > > > >>> > > What > does xmlpipedb-match say? That will at least tell you whether the 'RV%' > count is indeed correct. > >>> > > > >>> > > John David N. Dionisio, PhD > > >>> > > Associate Professor, Computer Science > >>> > > Loyola Marymount > University > >>> > > > >>> > > > >>> > > > >>> > > On Feb 11, 2011, at > 4:52 PM, Richard Brous wrote: > >>> > > > >>> > > > OK here is what I was > able to put together from the past few hours of code review: > >>> > > > > > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> > > > > -reveals that after the 2 System table modifications are made adding species > name and link, a PreparedStatement is instantiated which builds and calls > the base query. > >>> > > > > >>> > > > -The base query called is: > ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " > + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE > type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") > > >>> > > > > >>> > > > -So its looking in 'ordered locus' table/column for > any tuple that starts with Rv (followed by any substring) and > entrytype_gene_hjid = ? . > >>> > > > The 'like' comparator and % usage > are clear with the 'type' entrytype_gene_hjid = ? > >>> > > > > >>> > > > > -To me it seems the query makes sense so the problem is likely elsewhere. > > >>> > > > > >>> > > > GenMappBuilder.java: > >>> > > > -I found method > doTallies() at code line 895 which: > >>> > > > Instantiates a > Configuration called hibernateConfiguration and assigns to it the current > hibernate configuration > >>> > > > Validates database settings by > analyzing hibernateConfiguration > >>> > > > Instantiates a CriterionList > for uniprot and assigns to it TallyType.UNIPROT > >>> > > > Instantiates a > CriterionList for go and assigns to it TallyType.GO > >>> > > > Determines > if both xml files exist > >>> > > > Then getTallyResultsXML and > getTallyResultsDatabase are run on both xml files and their respective > CriterionList > >>> > > > Results are then formatted for display in a > table. > >>> > > > > >>> > > > -So enum TallyType which means that they > are the only valid datatypes which TallyEngine accepts... go to know ... > > >>> > > > > >>> > > > -Based on the screen shot of Tally Engine it would > seem that both getTallyResultsXML() and getTallyResultsDatabase() are > incorrectly returning. Likely due to both using an incorrect query (as we > previously supposed). But where are the queries?... the more I dig the more > I think they are in the criterial all the work is done against. > >>> > > > > > >>> > > > continuing the review: > >>> > > > getTallyResultsXML() > calls Tally Engine instance method getXmlFileCounts(xmlFile) > >>> > > > > getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new > QueryEngine(hibernateConfiguration) > >>> > > > Both of these instanced > methods originate from TallyEngine.java... > >>> > > > > >>> > > > > TallyEngine.java: > >>> > > > > >>> > > > getXmlFileCounts() calls > digestXmlFile() which instantiates a digester then processes against > criteria... but this quickly becomes confusing and is hard to follow > >>> > > > > > >>> > > > getDbcounts() then starts a db session and executes a > query but then I also get a bit lost with my limited db knowledge. > >>> > > > > > >>> > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > >>> > > > > >>> > > > OVERALL I think I'm getting closer to the issues > but I still feel as if I'm missing some understanding to proceed further. > Can you pass along some of that Dondi insight and steer me in the right > direction? =D > >>> > > > > >>> > > > -DB Tally - Not having taken > databases yet certainly is limiting my ability determine where the > "criteria" are being set and how they are followed during session > activities. Also is the query we have been looking for this whole time in > the criteria or someplace else? > >>> > > > > >>> > > > -XML Tally - again > is the query contained within the criteria that digestXmlFile() uses to > parse? > >>> > > > > >>> > > > Richard > >>> > > > > >>> > > > > >>> > > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> > wrote: > >>> > > > Right, schema issues are unlikely. Most count > discrepancies like this that I've seen have boiled down to forming the right > query. Then, knowing the right query (in both XML and SQL), it's a matter > of making sure that TallyEngine asks that same query. > >>> > > > > >>> > > > > John David N. Dionisio, PhD > >>> > > > Associate Professor, Computer > Science > >>> > > > Loyola Marymount University > >>> > > > > >>> > > > > > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > >>> > > > > > >>> > > > > OK, so based on your approach: > >>> > > > > > >>> > > > > 1. > I'll start with reviewing the queries for xmlpipedb-match and sql queries > needed for the respective results as you requested. > >>> > > > > > >>> > > > > > I was also thinking I may need to review the schema from xml into > postgres but the issue isn't likely a schema error. The error most likely > lies in how xmlpipedbutils queries the data from xml source and writes to > the tables what it returns? > >>> > > > > > >>> > > > > 2. I'll review the > code: trace the entrance of tally engine in the gmbuilder code then follow > it through the xmlpipedbutils. > >>> > > > > > >>> > > > > Richard > >>> > > > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. > Dionisio <do...@lm...> wrote: > >>> > > > > Just wanted to confirm > (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source > code is in trunk/xmlpipedbutils in SourceForge's Subversion repo. > >>> > > > > > > >>> > > > > John David N. Dionisio, PhD > >>> > > > > Associate > Professor, Computer Science > >>> > > > > Loyola Marymount University > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > On Feb 5, 2011, at > 10:02 AM, Richard Brous wrote: > >>> > > > > > >>> > > > > > Hi Dondi, > > >>> > > > > > > >>> > > > > > So I'm at the point in working with M > tuberculosis that I was able to exactly reproduce Dr. Dahlquist's > problematic TallyEngine results. > >>> > > > > > > >>> > > > > > gmb2b60 > Results > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > Now the proverbial question - What next to solve the Ordered Locus > import/count issue? > >>> > > > > > > >>> > > > > > > ********************************************** > >>> > > > > > Here is my > thought process: > >>> > > > > > > >>> > > > > > Step 1: How does the > import process work at the high level? (obviously correct me if I'm wrong) > > >>> > > > > > > >>> > > > > > I believe that basically as each XML tag is > read, it is placed in the proper Postgres table(s) based on some criteria. > There is also likely some sort of check that each individual tag is in valid > XML format unless we don't care at this stage (care at export) or maybe the > parser just skips over and goes on to the next . > >>> > > > > > > >>> > > > > > > Step 2: What could be the problem? > >>> > > > > > > >>> > > > > > > Either - > >>> > > > > > a. XML tags are being parsed incorrectly > (ignored/skipped)? > >>> > > > > > b. Decision criteria of which table > they should be added to? > >>> > > > > > > >>> > > > > > > ********************************************** > >>> > > > > > > >>> > > > > > > I read on the sourceforge wiki: > >>> > > > > > > >>> > > > > > > XMLPipeDB has a modular architecture with three components that may be used > separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and > automatically generates an SQL schema, Java classes, and Hibernate mappings. > XMLPipeDB Utilities provides functionality for configuring the database, > importing data, and performing queries. GenMAPP Builder is based on the > XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on > data from UniProt and Gene Ontology (GO). > >>> > > > > > > >>> > > > > > > So I should probably start with the XMLPipeDB Utilities which are where? I > don't see any in the basic distribution or are they not standalone and > called from the command line? > >>> > > > > > > >>> > > > > > Thanks! > > >>> > > > > > > >>> > > > > > Richard > > > > > ------------------------------------------------------------------------------ > Free Software Download: Index, Search & Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT > data > generated by your applications, servers and devices whether physical, > virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > ------------------------------------------------------------------------------ > Free Software Download: Index, Search & Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT > data > generated by your applications, servers and devices whether physical, > virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: Kam D. <kda...@lm...> - 2011-02-23 19:04:16
|
Hi, I checked these out. They all should be separated and kept as separate OrderedLocusNames records. In fact, the gdb already has them separated into individual records, so it is only the tally for Postgres that is off, they were correctly exported into the gdb. Best, Dr. D At 07:31 PM 2/22/2011, Richard Brous wrote: >OK I found the culprits: > >There are 3 ordered locus gene ID's that have slashes. As was >previously discussed it seems XML Match is counting each one twice >which would inflate the count by 3. > >Rv2561/Rv2562 'ordered locus' >Rv2880c/Rv2879c 'ordered locus' >Rv3021c/Rv3022c 'ordered locus' > >And as was previously noted, Rv3346/55c 'ORF' should be excluded. > >This will bring the both XML and sql db queries counts in sync at 4057. > >By my math: >------------------------------------------------------------------------------------------------------------ >XML tags read by Match: >4066 unique matches - 6 genes excluded by Dr. D - 3 duplicates >caused by slashes = total 4057 genes >------------------------------------------------------------------------------------------------------------ >sql query of the db tables: >3988 ordered locus + 69 ORF = total 4057 genes >------------------------------------------------------------------------------------------------------------ > > >TO MOVE FORWARD: > >1. We need to decide how to address the slash ID's (should we keep >and split into separate tuples or should they be excluded) > >2. Adjust the queries to reflect what is needed and test with raw sql. > >3. Update the queries on MycobacteriumTuberculosisUniProtSpeciesProfile.java > >4. Export a new Mtb gene database for testing. > > >Richard > > > >On Tue, Feb 22, 2011 at 5:12 PM, Kam Dahlquist ><<mailto:kda...@lm...>kda...@lm...> wrote: >Hi, > >I don't know why the numbers are off. I think the only way to find >out is to get the results of the Postgres query and line it up next >to the match results and see what is different. If you can send me >the lists of IDs from the Postgres query and the match results, I >can check them. > >Best, >Dr. D > > >At 04:57 PM 2/22/2011, you wrote: >>Here is where I am on the numbers: >>------------------------------------------------------------------------------------------------------------ >>XML tags read by Match: >>4066 unique matches - 6 genes excluded by Dr. D = total 4060 genes >>------------------------------------------------------------------------------------------------------------ >>sql query of the db tables: >>3988 ordered locus + 69 ORF = total 4057 genes >>------------------------------------------------------------------------------------------------------------ >>So by my count we are off by 3 genes >> >>Maybe XML Match is counting slashed genes as 2 separate genes? >> >>IE. so if it encountered 3 slashed genes it would in effect be >>counting those 3 as 6? >> >>Richard >> >>On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist >><<mailto:kda...@lm...>kda...@lm...> wrote: >>Hi, >>I compared this list of gene IDs with the list I had on the Testing >>Report on the wiki and found the following. >>Your list has 70 IDs. 69 of them are identical to what is in my >>list of IDs in the Testing report. >>Indeed, the only difference is the one with the >>slash. "Rv3346/55c" I looked this ID up at UniProt.org as >>Rv3346. It appears in the record for UniProt ID O50384. For that >>UniProt record, the ID referred to in the OrderedLocus tag is >>Rv3355c which does appear in the gdb already. Looking up Rv3346 on >>Tuberculist and the Stanford TB database, Rv3346 is not a real gene >>ID. So this entire record with the slash in it needs to be >>excluded and all the rest of them need to be captured. >> >>I think then, that the numbers should add up correctly, is this true? >>Best, >>Dr. D >> >>At 01:09 PM 2/22/2011, Richard Brous wrote: >>>attachement now included... >>>On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous >>><<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>Here is an export of the genes found using: select * from >>>genenametype where type = 'ORF' and value ~ >>>'[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also attached as a csv file. >>> >>>647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 >>>5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 >>>647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 >>>647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 >>>647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 >>>647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 >>>211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 >>>648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 >>>648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 >>>648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 >>>648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 >>>648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 >>>649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 >>>649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 >>>649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 >>>649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 >>>650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 >>>174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 >>>174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 >>>312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 >>>331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 >>>445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 >>>621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 >>>622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 >>>622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 >>>622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 >>>622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 >>>623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 >>>623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 >>>623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 >>>623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 >>>624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 >>>624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 >>>625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 >>>625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 >>>626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 >>>626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 >>>626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 >>>626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 >>>627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 >>>627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 >>>627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 >>>627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 >>>627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 >>>627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 >>>627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 >>>628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 >>>629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 >>>629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 >>>629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 >>>630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 >>>630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 >>>631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 >>>632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 >>>632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 >>>633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 >>>633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 >>>645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 >>>645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 >>>645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 >>>645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 >>>645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 >>>646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 >>>646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 >>>646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 >>>646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 >>>646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 >>>646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 >>>647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 >>>647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 >>>*****The item of note I see is the gene with the slash separating >>>gene id's which refer to the same gene. >>> >>>Richard >>> >>>On Mon, Feb 21, 2011 at 11:11 PM, Richard Brous >>><<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>Understood. >>> >>>I'll check in with Dr. D in the afternoon tomorrow and discuss. >>> >>>Richard >>> >>>On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio >>><<mailto:do...@lm...>do...@lm...> wrote: >>>Hi Rich, >>>Addressing the release business first, let's put it this way: if >>>the remaining loose ends can be addressed by tomorrow, we can >>>probably wait until then. If unexpected snags are encountered, >>>then it would be worthwhile to release whatever you have. >>>With that in mind, considering that you pretty much know the >>>patterns of the IDs that are needed, I think it will only take a >>>little digital forensic work now to figure out exactly which IDs >>>are still needed. Once you know what those are, you should: >>>1. Find where they are in the XML file. >>>2. Knowing the XML location, find the corresponding table in the >>>relational database (table names are generally derived from >>>tag/element names). >>>3. Knowing the table in the database, write or extend the >>>SpeciesProfile query to retrieve that data. >>>For the ID that must *not* be included, again it's a matter of >>>tracking down what this ID is. Knowing this straggler, you can >>>then consult with Dr. Dahlquist if this ID is truly a unique >>>one-off, or is representative of a pattern that we'll want to >>>exclude. Either way, this ID can be omitted by using "not" or >>>"<>" or possibly "not like" or "not ~" (check PostgreSQL where >>>clause syntax to see where the negation can be applied). >>>John David N. Dionisio, PhD >>>Associate Professor, Computer Science >>>Loyola Marymount University >>> >>>On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: >>> > actually i had a typo (emailing from desktop system but testing >>> on my laptop... typed correctly here but wrong in pgadmin) but >>> the results make much more sense now. >>> > >>> > >>> > select count (*) from genenametype where (type = 'ordered >>> locus' or type = 'ORF') and value like 'Rv%'; >>> > returns 4058 >>> > >>> > select count (*) from genenametype where (type = 'ordered >>> locus' or type = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>> > returns 4058 >>> > >>> > >>> -------------------------------------------------------------------------------------------------------------------------- >>> >>> > >>> > >>> > Continuing forward - >>> > >>> > The testing report says that 4066 unique matches exist in XML >>> but 6 of them were eliminated by Dr. D leaving the desired number at 4060. >>> > >>> > So now we are only 2 genes short with the query returning >>> 4058... which is also (conveniently) the sum of the two separate >>> queries of 'ordered locus' and 'ORF' respectively. >>> > >>> > But recall that Dr. D said that only 69 genes of the missing >>> 75 were tagged 'ORF' but we seem to have 1 extra gene tagged >>> 'ORF' than we expected. Adding that into missing genes puts us 3 short... >>> > >>> > Should I make the changes to the code and export a gdb so that >>> analysis can be done or wait until we work this through further? >>> > >>> > Richard >>> > >>> > >>> > >>> > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio >>> <<mailto:do...@lm...>do...@lm...> wrote: >>> > Hi Rich, >>> > >>> > The second form should have worked actually. What exactly was >>> the error? >>> > >>> > John David N. Dionisio, PhD >>> > Associate Professor, Computer Science >>> > Loyola Marymount University >>> > >>> > >>> > >>> > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: >>> > >>> > > hmm not taking parenthesis where I thought they should go... >>> syntax error >>> > > >>> > > select count (*) from genenametype where type = ('ordered >>> locus' or 'ORF') and value like 'Rv%'; >>> > > also tried >>> > > select count (*) from genenametype where (type = 'ordered >>> locus' or type = 'ORF') and value like 'Rv%'; >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous >>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>> > > ah yes... i see it... >>> > > >>> > > >>> > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio >>> <<mailto:do...@lm...>do...@lm...> wrote: >>> > > Watch your parentheses: "and" has greater precedence than "or" :) >>> > > >>> > > >>> > > John David N. Dionisio, PhD >>> > > Associate Professor, Computer Science >>> > > Loyola Marymount University >>> > > >>> > > >>> > > On Feb 21, 2011, at 7:59 PM, Richard Brous >>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>> > > >>> > >> OK, so here are my query results from raw SQL: >>> > >> >>> > >> 1. using: like 'Rv%' >>> > >> >>> > >> select count (*) from genenametype where type = 'ordered >>> locus' and value like 'Rv%'; >>> > >> returns 3988 >>> > >> >>> > >> select count (*) from genenametype where type = 'ORF' and >>> value like 'Rv%'; >>> > >> returns 70 >>> > >> >>> > >> select count (*) from genenametype where type = 'ordered >>> locus' or type = 'ORF' and value like 'Rv%'; >>> > >> returns 7011 >>> > >> >>> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' >>> > >> >>> > >> select count (*) from genenametype where type = 'ordered >>> locus' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>> > >> returns 3988 >>> > >> >>> > >> select count (*) from genenametype where type = 'ordered >>> locus' or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>> > >> returns 7011 >>> > >> >>> > >> select count (*) from genenametype where type = 'ORF' and >>> value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>> > >> returns 70 >>> > >> >>> > >> Conclusions: >>> > >> >>> > >> 1. It seems that querying for type = 'ORF' alone surfaces >>> the 69 genes were were looking for plus one more (maybe the count >>> for missing genes is off by 1?). >>> > >> >>> > >> 2. Combining the two types in a single query did not produce >>> the results that I expected (7011? - how did that happen????) so >>> this is likely not our solution... unless of course the query >>> syntax isn't actually doing what I think it is... >>> > >> >>> > >> 3. I would think the best course of action is to serialy run >>> two separate queries to capture all the required genes, then >>> removing the one unneeded gene if its truly not wanted. >>> > >> >>> > >> What do you think? >>> > >> >>> > >> Richard >>> > >> >>> > >> >>> > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio >>> <<mailto:do...@lm...>do...@lm...> wrote: >>> > >> I don't recall the exact details of the missing 69, but if >>> your query successfully returns them in raw SQL, then this is >>> worth a try. You can integrate into the same query as long as >>> the same columns are returned, which is the case here AFAIK, so >>> go ahead and extend the existing query. >>> > >> >>> > >> >>> > >> John David N. Dionisio, PhD >>> > >> Associate Professor, Computer Science >>> > >> Loyola Marymount University >>> > >> >>> > >> On Feb 21, 2011, at 6:56 PM, Richard Brous >>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>> > >> >>> > >>> So here is the appropriate code snippet from >>> MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>> > >>> public >>> > >>> >>> > >>> TableManager >>> getSystemTableManagerCustomizations(TableManager tableManager, >>> TableManager primarySystemTableManager, Date version) throws >>> SQLException, InvalidParameterException { >>> > >>> >>> > >>> // Build the base query; we only use "ordered locus" and we >>> only want >>> > >>> >>> > >>> // IDs that begin with "Rv." >>> > >>> PreparedStatement ps = >>> ConnectionManager.getRelationalDBConnection().prepareStatement( >>> > >>> >>> > >>> "SELECT value, type " + >>> > >>> >>> > >>> "FROM genenametype INNER JOIN entrytype_genetype " + >>> > >>> >>> > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >>> > >>> >>> > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and >>> entrytype_gene_hjid = ?"); >>> > >>> ResultSet result; >>> > >>> >>> > >>> >>> > >>> >>> > >>> for (Row row : primarySystemTableManager.getRows()) { >>> > >>> ps.setInt(1, Integer.parseInt(row.getValue( >>> > >>> >>> > >>> "UID"))); >>> > >>> result = ps.executeQuery(); >>> > >>> >>> > >>> >>> > >>> >>> > >>> // We actually want to keep the case where multiple ordered locus >>> > >>> >>> > >>> // names appear. >>> > >>> >>> > >>> while (result.next()) { >>> > >>> >>> > >>> // We want this name to appear in the OrderedLocusNames >>> > >>> >>> > >>> // system table. >>> > >>> >>> > >>> for (String id : result.getString("value").split("/")) { >>> > >>> tableManager.submit( >>> > >>> >>> > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { >>> "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { >>> "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) >>> }, { "UID", row.getValue("UID") } }); >>> > >>> } >>> > >>> >>> > >>> } >>> > >>> >>> > >>> } >>> > >>> >>> > >>> >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >>> >>> > >>> So now we want to build the base query which uses "ordered >>> locus" and "orf" and we only want IDs that begin with "Rv". >>> > >>> >>> > >>> I know there are more comprehensive ways to search for gene >>> ID's by matching gene ID prefix but "like Rv%" seemed to work >>> thus far, we just need to tell it to search for XML tag type orf >>> in addition to ordered locus. >>> > >>> >>> > >>> "WHERE type = 'ordered locus' and type = 'orf' and value >>> like 'Rv%' and entrytype_gene_hjid = ? " >>> > >>> >>> > >>> Here is a stab at it.... This part of our class was right >>> as the server went down and my submission for week 6 assignment I >>> can't seem to find. >>> > >>> >>> > >>> Is it possible to have two different types in the same >>> query or should we rewrite a separate query for the orf tag? >>> > >>> >>> > >>> Richard >>> > >>> >>> > >>> >>> > >>> >>> > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous >>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>> > >>> >>> > >>> thanks and will do as directed. >>> > >>> >>> > >>> My previous, last paragraph comment - A way for programming >>> code in email holding its format in a mail message similarly to >>> how you can post code on forum pages? >>> > >>> >>> > >>> <code> >>> > >>> blah >>> > >>> blah >>> > >>> blah >>> > >>> </code> >>> > >>> >>> > >>> thanks! >>> > >>> >>> > >>> Richard >>> > >>> >>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio >>> <<mailto:do...@lm...>do...@lm...> wrote: >>> > >>> >>> > >>> Greetings, >>> > >>> >>> > >>> Actually, gmbuilder.properties is for the TallyEngine >>> only. When dealing with .gdb exports, look *only* at the >>> SpeciesProfile class. So, to find those 69 IDs, it is the >>> SpeciesProfile code, and *only* the SpeciesProfile code, that >>> needs to be changed. >>> > >>> >>> > >>> Your take on how gmbuilder.properties is used, however, is >>> understandable. It makes sense to assume that the TallyEngine >>> code *and* the ID export code are based on the same >>> characterization of the needed IDs. This replication is >>> originally a historical artifact: SpeciesProfile was done first, >>> and then TallyEngine was done later by another student. >>> > >>> >>> > >>> However, there are other factors beyond history that sort >>> of necessitate this duplication of desired IDs: (skip the two >>> bullets below if you'd rather cut to the chase of the work to be >>> done, and discuss design issues later) >>> > >>> >>> > >>> - The actual XML import code is a black box: this is the >>> "canned" JAXB library actually in action, and not our code at >>> all. Plus, the XML import code really does not filter (nor >>> should it), since the goal of the XML->relational database step >>> is to fully capture the XML data in the relational database. So, >>> XML count is necessarily separated from XML import. >>> > >>> >>> > >>> - The notion of a declarative mechanism for extracting IDs >>> from the relational database (which is what >>> gmbuilder.properties/TallyEngine uses) is interesting, but at the >>> same time there is value in the arbitrary computation that can be >>> done with Java (case in point: export two versions of an ID, with >>> and without periods). This is not to say that it is impossible >>> to do this declaratively, but let's just say that the procedural >>> approach exists here and now, and a declarative approach will >>> need more thought. >>> > >>> >>> > >>> These, and other factors, are good thoughts to hold onto >>> and would be worthy of a good meeting discussion sometime, but >>> bottom line for now: modifying the export behavior is a matter of >>> editing the *SpeciesProfile* Java code, and not the >>> gmbuilder.properties file. Turn your attention to that code. >>> > >>> >>> > >>> Now, as to annotating your code...I'd just put in code >>> comments :) Or did you mean something else by tagging code in e-mail? >>> > >>> >>> > >>> John David N. Dionisio, PhD >>> > >>> Associate Professor, Computer Science >>> > >>> Loyola Marymount University >>> > >>> >>> > >>> >>> > >>> >>> > >>> >>> > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >>> > >>> >>> > >>> > also, how do I tag code in email so it holds its >>> formatting? I tried a few suggestions I found on the web but they >>> aren't holding formatting or i'm just doing it wrong ;-D >>> > >>> > >>> > >>> > Richard >>> > >>> > >>> > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous >>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>> > >>> > OK, have some updates and some suggestions: >>> > >>> > >>> > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the >>> gene testing report. We verified that XML match does indeed find >>> 4066 unique matches - 75 of which are not in the gdb and need to be. >>> > >>> > >>> > >>> > Dr. Dahlquist informed me that she was the one who >>> completed the gene db testing report, not a previous student of >>> BIO367 and had already verified which genes were missing and >>> where they were to be found. I had (mistakenly) assumed that >>> since a student had performed the gene database testing I had to >>> redo all of the verification. >>> > >>> > >>> > >>> > So that said, of the 75 genes missing - 69 need to be >>> included and 6 excluded. >>> > >>> > Per the gene db testing report: "69 of them have an "a", >>> "b", or "d" suffix. They are all found in the ORF tag and need to >>> be included in the gdb." >>> > >>> > >>> > >>> > To solve this we need to add additional search criteria >>> into the M. tuberculosis section in gmbuilder.properties below: >>> > >>> > # Mycobacterium tuberculosis >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_level_amount= >>> > >>> > >>> > >>> > 1 >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_element_level0= >>> > >>> > >>> > >>> > uniprot/entry/gene/name&type&ordered locus >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_query_level0= >>> > >>> > >>> > >>> > select count(*) from genenametype where type = 'ordered >>> locus' and value like 'Rv%'; >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>> > >>> > >>> > >>> > Ordered Locus >>> > >>> > SOLUTIONS: >>> > >>> > >>> > >>> > 1. So am i correct in my understanding that the second >>> line is the query used by TallyEngine to read the XML file? If so >>> then this is the issue we need to table for the moment until we >>> get the gbd verified and re-released. We will revisit this to >>> discover why it is not only reporting incorrectly but also why >>> its added a second row of Ordered Locus on the TallyEngine results page. >>> > >>> > >>> > >>> > 2. The third line is the SQL query used by postgres >>> during the export from XML to gdb. To find and get the ORF tagged >>> genes could we not add the following lines and change the count >>> in the first line: >>> > >>> > >>> > >>> > >>> > >>> > # Mycobacterium tuberculosis >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_level_amount=2 >>> > >>> > >>> > >>> > >>> > >>> > >>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>> locus >>> > >>> > >>> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >>> > >>> > >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_query_level0= >>> > >>> > >>> > >>> > select count(*) from genenametype where type = 'ordered locus'; >>> > >>> > mycobacteriumtuberculosis_query_level1=select count(*) >>> from genenametype where type = 'orf'; >>> > >>> > >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>> > >>> > >>> > >>> > Ordered Locus >>> > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >>> > >>> > >>> > >>> > >>> ---------------------------------------------------------------------------------------------------------------------------- >>> >>> > >>> > >>> > >>> > Of course these queries would have be manually verified >>> prior to making these changes but this seems like we are moving >>> in the right direction. >>> > >>> > >>> > >>> > Richard >>> > >>> > >>> > >>> > >>> > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous >>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>> > >>> > Just got done reading previous email and understand the >>> change in priority. >>> > >>> > >>> > >>> > Will work on the missing ID's for now and shelve the the >>> TalleyEngine issue for the moment. >>> > >>> > >>> > >>> > Also great about a more formalized weekly meeting. I was >>> going to suggest it myself as it has been slow going so far as >>> maybe i'm a bit too independent in this independent study class =D >>> > >>> > >>> > >>> > Will dig further into the missing ID's later tonight and >>> during day tomorrow and report back. >>> > >>> > >>> > >>> > Richard >>> > >>> > >>> > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio >>> <<mailto:do...@lm...>do...@lm...> wrote: >>> > >>> > Hi Rich, >>> > >>> > >>> > >>> > No problem. The pertinent line you're referring to, for >>> XML, is this, right above the line you copied: >>> > >>> > >>> > >>> > >>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>> locus >>> > >>> > >>> > >>> > The slash-separated section is the "path" of XML tags >>> leading to the element of interest; then, after the ampersand, is >>> a name/value pair for the desired attribute to count. Note that >>> there is no hint of a *content*-based filter (nor is there the >>> capability for one, as far as I can tell in the code). By >>> "content," I mean that we can't specify filters based on what's >>> *between* the tags. We can only go as far as filter by attribute >>> value, e.g., type="ordered locus". >>> > >>> > >>> > >>> > But anyway, as mentioned in the earlier e-mail, let's >>> have the missing IDs in the .gdb take precedence for now. Please >>> take a look at the tuberculosis, A. thaliana, and P. falciparum >>> profiles to get an idea for how the ID output can be customized, >>> then let me know if you have any questions or need to confirm anything. >>> > >>> > >>> > >>> > John David N. Dionisio, PhD >>> > >>> > Associate Professor, Computer Science >>> > >>> > Loyola Marymount University >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >>> > >>> > >>> > >>> > > Sorry been slammed with a programming assignment that >>> kept needing continued iteration and it has been all consuming >>> until last night. But I did get a chance to work with your >>> comments and review the code again with a different mind set. >>> > >>> > > >>> > >>> > > Yes, I examined the gmbuilder.properties file ( the >>> query is also in the >>> MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned >>> in a previous email ) but I don't think I see what you mean >>> regarding the XML count. >>> > >>> > > >>> > >>> > > I understood that: >>> mycobacteriumtuberculosis_query_level0=select count(*) from >>> genenametype where type = 'ordered locus' and value like >>> 'Rv%'; was the db query but don't see which is the XML count... >>> or do they share the same query and you are saying that XML count >>> doesn't recognize and therefore cannot use the 'Rv%' parameter? >>> > >>> > > >>> > >>> > > Richard >>> > >>> > > >>> > >>> > > >>> > >>> > > >>> > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. >>> Dionisio <<mailto:do...@lm...>do...@lm...> wrote: >>> > >>> > > Hi Rich, >>> > >>> > > >>> > >>> > > Sorry for the delay. Had some distractions coming into >>> the weekend. >>> > >>> > > >>> > >>> > > You've looked at the code; have you looked at >>> gmbuilder.properties? (I may have mentioned it a few e-mails >>> ago, just as you were starting to dig into this) >>> > >>> > > >>> > >>> > > On the copy I have, the M. tuberculosis block looks >>> like this (indentation is mine to set it apart): >>> > >>> > > >>> > >>> > > # Mycobacterium tuberculosis >>> > >>> > > mycobacteriumtuberculosis_level_amount=1 >>> > >>> > > >>> > >>> > > >>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>> locus >>> > >>> > > >>> > >>> > > mycobacteriumtuberculosis_query_level0=select >>> count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >>> > >>> > > >>> > >>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >>> > >>> > > >>> > >>> > > There, I think, is the rub. Notice that the XML count >>> does not filter on RV%. The SQL query does. >>> > >>> > > >>> > >>> > > Unfortunately, I don't think the TallyEngine can >>> include selective filtering in the XML counts. If the need to do >>> selective filtering on XML is necessary, then I think we're >>> looking at a new functionality for you to implement (or, if this >>> throws things off too much, this may have to be noted somewhere, >>> that the XML vs. database counts may be off because the database >>> count is doing some text-based filtering but the XML count does not). >>> > >>> > > >>> > >>> > > What does xmlpipedb-match say? That will at least tell >>> you whether the 'RV%' count is indeed correct. >>> > >>> > > >>> > >>> > > John David N. Dionisio, PhD >>> > >>> > > Associate Professor, Computer Science >>> > >>> > > Loyola Marymount University >>> > >>> > > >>> > >>> > > >>> > >>> > > >>> > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >>> > >>> > > >>> > >>> > > > OK here is what I was able to put together from the >>> past few hours of code review: >>> > >>> > > > >>> > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>> > >>> > > > -reveals that after the 2 System table modifications >>> are made adding species name and link, a PreparedStatement is >>> instantiated which builds and calls the base query. >>> > >>> > > > >>> > >>> > > > -The base query called is: ("SELECT value, type " + >>> "FROM genenametype INNER JOIN entrytype_genetype " + >>> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >>> "WHERE type = 'ordered locus' and value like 'Rv%' and >>> entrytype_gene_hjid = ?") >>> > >>> > > > >>> > >>> > > > -So its looking in 'ordered locus' table/column for >>> any tuple that starts with Rv (followed by any substring) and >>> entrytype_gene_hjid = ? . >>> > >>> > > > The 'like' comparator and % usage are clear with the >>> 'type' entrytype_gene_hjid = ? >>> > >>> > > > >>> > >>> > > > -To me it seems the query makes sense so the problem >>> is likely elsewhere. >>> > >>> > > > >>> > >>> > > > GenMappBuilder.java: >>> > >>> > > > -I found method doTallies() at code line 895 which: >>> > >>> > > > Instantiates a Configuration called >>> hibernateConfiguration and assigns to it the current hibernate configuration >>> > >>> > > > Validates database settings by analyzing >>> hibernateConfiguration >>> > >>> > > > Instantiates a CriterionList for uniprot and assigns >>> to it TallyType.UNIPROT >>> > >>> > > > Instantiates a CriterionList for go and assigns to it >>> TallyType.GO >>> > >>> > > > Determines if both xml files exist >>> > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase >>> are run on both xml files and their respective CriterionList >>> > >>> > > > Results are then formatted for display in a table. >>> > >>> > > > >>> > >>> > > > -So enum TallyType which means that they are the only >>> valid datatypes which TallyEngine accepts... go to know ... >>> > >>> > > > >>> > >>> > > > -Based on the screen shot of Tally Engine it would >>> seem that both getTallyResultsXML() and getTallyResultsDatabase() >>> are incorrectly returning. Likely due to both using an incorrect >>> query (as we previously supposed). But where are the queries?... >>> the more I dig the more I think they are in the criterial all the >>> work is done against. >>> > >>> > > > >>> > >>> > > > continuing the review: >>> > >>> > > > getTallyResultsXML() calls Tally Engine instance >>> method getXmlFileCounts(xmlFile) >>> > >>> > > > getTallyResultsDatabase() calls Tally Engine instance >>> method getDbcounts(new QueryEngine(hibernateConfiguration) >>> > >>> > > > Both of these instanced methods originate from >>> TallyEngine.java... >>> > >>> > > > >>> > >>> > > > TallyEngine.java: >>> > >>> > > > >>> > >>> > > > getXmlFileCounts() calls digestXmlFile() which >>> instantiates a digester then processes against criteria... but >>> this quickly becomes confusing and is hard to follow >>> > >>> > > > >>> > >>> > > > getDbcounts() then starts a db session and executes a >>> query but then I also get a bit lost with my limited db knowledge. >>> > >>> > > > >>> > >>> > > > >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >>> >>> > >>> > > > >>> > >>> > > > OVERALL I think I'm getting closer to the issues but >>> I still feel as if I'm missing some understanding to proceed >>> further. Can you pass along some of that Dondi insight and steer >>> me in the right direction? =D >>> > >>> > > > >>> > >>> > > > -DB Tally - Not having taken databases yet certainly >>> is limiting my ability determine where the "criteria" are being >>> set and how they are followed during session activities. Also is >>> the query we have been looking for this whole time in the >>> criteria or someplace else? >>> > >>> > > > >>> > >>> > > > -XML Tally - again is the query contained within the >>> criteria that digestXmlFile() uses to parse? >>> > >>> > > > >>> > >>> > > > Richard >>> > >>> > > > >>> > >>> > > > >>> > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. >>> Dionisio <<mailto:do...@lm...>do...@lm...> wrote: >>> > >>> > > > Right, schema issues are unlikely. Most count >>> discrepancies like this that I've seen have boiled down to >>> forming the right query. Then, knowing the right query (in both >>> XML and SQL), it's a matter of making sure that TallyEngine asks >>> that same query. >>> > >>> > > > >>> > >>> > > > John David N. Dionisio, PhD >>> > >>> > > > Associate Professor, Computer Science >>> > >>> > > > Loyola Marymount University >>> > >>> > > > >>> > >>> > > > >>> > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >>> > >>> > > > >>> > >>> > > > > OK, so based on your approach: >>> > >>> > > > > >>> > >>> > > > > 1. I'll start with reviewing the queries for >>> xmlpipedb-match and sql queries needed for the respective results >>> as you requested. >>> > >>> > > > > >>> > >>> > > > > I was also thinking I may need to review the schema >>> from xml into postgres but the issue isn't likely a schema error. >>> The error most likely lies in how xmlpipedbutils queries the data >>> from xml source and writes to the tables what it returns? >>> > >>> > > > > >>> > >>> > > > > 2. I'll review the code: trace the entrance of >>> tally engine in the gmbuilder code then follow it through the xmlpipedbutils. >>> > >>> > > > > >>> > >>> > > > > Richard >>> > >>> > > > > >>> > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. >>> Dionisio <<mailto:do...@lm...>do...@lm...> wrote: >>> > >>> > > > > Just wanted to confirm (since I wasn't sure in the >>> first e-mail) --- the XMLPipeDB Utilities source code is in >>> trunk/xmlpipedbutils in SourceForge's Subversion repo. >>> > >>> > > > > >>> > >>> > > > > John David N. Dionisio, PhD >>> > >>> > > > > Associate Professor, Computer Science >>> > >>> > > > > Loyola Marymount University >>> > >>> > > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >>> > >>> > > > > >>> > >>> > > > > > Hi Dondi, >>> > >>> > > > > > >>> > >>> > > > > > So I'm at the point in working with M >>> tuberculosis that I was able to exactly reproduce Dr. Dahlquist's >>> problematic TallyEngine results. >>> > >>> > > > > > >>> > >>> > > > > > gmb2b60 Results >>> > >>> > > > > > >>> > >>> > > > > > >>> > >>> > > > > > >>> > >>> > > > > > Now the proverbial question - What next to solve >>> the Ordered Locus import/count issue? >>> > >>> > > > > > >>> > >>> > > > > > ********************************************** >>> > >>> > > > > > Here is my thought process: >>> > >>> > > > > > >>> > >>> > > > > > Step 1: How does the import process work at the >>> high level? (obviously correct me if I'm wrong) >>> > >>> > > > > > >>> > >>> > > > > > I believe that basically as each XML tag is read, >>> it is placed in the proper Postgres table(s) based on some >>> criteria. There is also likely some sort of check that each >>> individual tag is in valid XML format unless we don't care at >>> this stage (care at export) or maybe the parser just skips over >>> and goes on to the next . >>> > >>> > > > > > >>> > >>> > > > > > Step 2: What could be the problem? >>> > >>> > > > > > >>> > >>> > > > > > Either - >>> > >>> > > > > > a. XML tags are being parsed incorrectly >>> (ignored/skipped)? >>> > >>> > > > > > b. Decision criteria of which table they should >>> be added to? >>> > >>> > > > > > >>> > >>> > > > > > ********************************************** >>> > >>> > > > > > >>> > >>> > > > > > I read on the sourceforge wiki: >>> > >>> > > > > > >>> > >>> > > > > > XMLPipeDB has a modular architecture with three >>> components that may be used separately or together. XSD-to-DB >>> reads an XSD (XML Schema Definition) and automatically generates >>> an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB >>> Utilities provides functionality for configuring the database, >>> importing data, and performing queries. GenMAPP Builder is based >>> on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene >>> Databases based on data from UniProt and Gene Ontology (GO). >>> > >>> > > > > > >>> > >>> > > > > > So I should probably start with the XMLPipeDB >>> Utilities which are where? I don't see any in the basic >>> distribution or are they not standalone and called from the command line? >>> > >>> > > > > > >>> > >>> > > > > > Thanks! >>> > >>> > > > > > >>> > >>> > > > > > Richard > > >------------------------------------------------------------------------------ >Free Software Download: Index, Search & Analyze Logs and other IT data in >Real-Time with Splunk. Collect, index and harness all the fast moving IT data >generated by your applications, servers and devices whether physical, virtual >or in the cloud. Deliver compliance at lower cost and gain new business >insights. ><http://p.sf.net/sfu/splunk-dev2dev>http://p.sf.net/sfu/splunk-dev2dev >_______________________________________________ >xmlpipedb-developer mailing list ><mailto:xml...@li...>xml...@li... >https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > |
From: John D. N. D. <do...@lm...> - 2011-02-23 04:03:11
|
Hi Rich, Thanks for the updates. Yes, as long as the slashed IDs don't appear elsewhere in the XML file (e.g., there is no other ordered locus record whose ID is Rv2562, etc.), then those IDs will indeed add to the count of unique matches in XML. Those slashed IDs don't add to the SQL count because they will be in a single record (i.e., a row whose field value is 'Rv2561/Rv2562') and so the SQL query does not count it twice. So, I generally agree with your "move forward" steps. I think we are close enough to definitively resolving this that we can wait to release. The only other thing I'll add is that, in case the decision is to split the slashed IDs into two records in the .gdb, we have encountered this 1-slashed-record-becomes-two-unslashed-records situation before, and when we did, the SpeciesProfile would split the record into to insert/export records for the .gdb. In that way, for example, Rv2561/Rv2562 becomes two records in the .gdb, with IDs Rv2561 and RV2562, respectively. If you scan the code, there should be a routine in there that does precisely this. If it's in a reusable utility method, go ahead and use that. If it's a one-off in some SpeciesProfile subclass, then it is worth refactoring this block so that it *is* reusable. Hope that generally makes sense... John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Feb 22, 2011, at 10:31 PM, Richard Brous wrote: > OK I found the culprits: > > There are 3 ordered locus gene ID's that have slashes. As was previously discussed it seems XML Match is counting each one twice which would inflate the count by 3. > > Rv2561/Rv2562 'ordered locus' > Rv2880c/Rv2879c 'ordered locus' > Rv3021c/Rv3022c 'ordered locus' > > And as was previously noted, Rv3346/55c 'ORF' should be excluded. > > This will bring the both XML and sql db queries counts in sync at 4057. > > By my math: > ------------------------------------------------------------------------------------------------------------ > XML tags read by Match: > 4066 unique matches - 6 genes excluded by Dr. D - 3 duplicates caused by slashes = total 4057 genes > ------------------------------------------------------------------------------------------------------------ > sql query of the db tables: > 3988 ordered locus + 69 ORF = total 4057 genes > ------------------------------------------------------------------------------------------------------------ > > TO MOVE FORWARD: > > 1. We need to decide how to address the slash ID's (should we keep and split into separate tuples or should they be excluded) > > 2. Adjust the queries to reflect what is needed and test with raw sql. > > 3. Update the queries on MycobacteriumTuberculosisUniProtSpeciesProfile.java > > 4. Export a new Mtb gene database for testing. > > > Richard > > > > On Tue, Feb 22, 2011 at 5:12 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I don't know why the numbers are off. I think the only way to find out is to get the results of the Postgres query and line it up next to the match results and see what is different. If you can send me the lists of IDs from the Postgres query and the match results, I can check them. > > Best, > Dr. D > > > At 04:57 PM 2/22/2011, you wrote: >> Here is where I am on the numbers: >> ------------------------------------------------------------------------------------------------------------ >> XML tags read by Match: >> 4066 unique matches - 6 genes excluded by Dr. D = total 4060 genes >> ------------------------------------------------------------------------------------------------------------ >> sql query of the db tables: >> 3988 ordered locus + 69 ORF = total 4057 genes >> ------------------------------------------------------------------------------------------------------------ >> So by my count we are off by 3 genes >> >> Maybe XML Match is counting slashed genes as 2 separate genes? >> >> IE. so if it encountered 3 slashed genes it would in effect be counting those 3 as 6? >> >> Richard >> >> On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist <kda...@lm...> wrote: >> Hi, >> >> I compared this list of gene IDs with the list I had on the Testing Report on the wiki and found the following. >> >> Your list has 70 IDs. 69 of them are identical to what is in my list of IDs in the Testing report. >> >> Indeed, the only difference is the one with the slash. "Rv3346/55c" I looked this ID up at UniProt.org as Rv3346. It appears in the record for UniProt ID O50384. For that UniProt record, the ID referred to in the OrderedLocus tag is Rv3355c which does appear in the gdb already. Looking up Rv3346 on Tuberculist and the Stanford TB database, Rv3346 is not a real gene ID. So this entire record with the slash in it needs to be excluded and all the rest of them need to be captured. >> >> I think then, that the numbers should add up correctly, is this true? >> >> Best, >> Dr. D >> >> >> At 01:09 PM 2/22/2011, Richard Brous wrote: >>> attachement now included... >>> >>> On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous <rbr...@gm...> wrote: >>> Here is an export of the genes found using: select * from genenametype where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also attached as a csv file. >>> >>> 647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 >>> 5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 >>> 647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 >>> 647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 >>> 647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 >>> 647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 >>> 211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 >>> 648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 >>> 648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 >>> 648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 >>> 648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 >>> 648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 >>> 649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 >>> 649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 >>> 649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 >>> 649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 >>> 650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 >>> 174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 >>> 174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 >>> 312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 >>> 331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 >>> 445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 >>> 621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 >>> 622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 >>> 622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 >>> 622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 >>> 622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 >>> 623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 >>> 623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 >>> 623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 >>> 623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 >>> 624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 >>> 624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 >>> 625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 >>> 625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 >>> 626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 >>> 626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 >>> 626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 >>> 626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 >>> 627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 >>> 627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 >>> 627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 >>> 627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 >>> 627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 >>> 627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 >>> 627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 >>> 628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 >>> 629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 >>> 629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 >>> 629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 >>> 630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 >>> 630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 >>> 631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 >>> 632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 >>> 632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 >>> 633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 >>> 633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 >>> 645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 >>> 645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 >>> 645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 >>> 645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 >>> 645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 >>> 646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 >>> 646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 >>> 646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 >>> 646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 >>> 646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 >>> 646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 >>> 647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 >>> 647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 >>> >>> *****The item of note I see is the gene with the slash separating gene id's which refer to the same gene. >>> >>> Richard >>> >>> On Mon, Feb 21, 2011 at 11:11 PM, Richard Brous <rbr...@gm...> wrote: >>> Understood. >>> >>> I'll check in with Dr. D in the afternoon tomorrow and discuss. >>> >>> Richard >>> >>> On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <do...@lm...> wrote: >>> Hi Rich, >>> Addressing the release business first, let's put it this way: if the remaining loose ends can be addressed by tomorrow, we can probably wait until then. If unexpected snags are encountered, then it would be worthwhile to release whatever you have. >>> With that in mind, considering that you pretty much know the patterns of the IDs that are needed, I think it will only take a little digital forensic work now to figure out exactly which IDs are still needed. Once you know what those are, you should: >>> 1. Find where they are in the XML file. >>> 2. Knowing the XML location, find the corresponding table in the relational database (table names are generally derived from tag/element names). >>> 3. Knowing the table in the database, write or extend the SpeciesProfile query to retrieve that data. >>> For the ID that must *not* be included, again it's a matter of tracking down what this ID is. Knowing this straggler, you can then consult with Dr. Dahlquist if this ID is truly a unique one-off, or is representative of a pattern that we'll want to exclude. Either way, this ID can be omitted by using "not" or "<>" or possibly "not like" or "not ~" (check PostgreSQL where clause syntax to see where the negation can be applied). >>> John David N. Dionisio, PhD >>> Associate Professor, Computer Science >>> Loyola Marymount University >>> >>> >>> On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: >>> > actually i had a typo (emailing from desktop system but testing on my laptop... typed correctly here but wrong in pgadmin) but the results make much more sense now. >>> > >>> > >>> > select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value like 'Rv%'; >>> > returns 4058 >>> > >>> > select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>> > returns 4058 >>> > >>> > -------------------------------------------------------------------------------------------------------------------------- >>> > >>> > >>> > Continuing forward - >>> > >>> > The testing report says that 4066 unique matches exist in XML but 6 of them were eliminated by Dr. D leaving the desired number at 4060. >>> > >>> > So now we are only 2 genes short with the query returning 4058... which is also (conveniently) the sum of the two separate queries of 'ordered locus' and 'ORF' respectively. >>> > >>> > But recall that Dr. D said that only 69 genes of the missing 75 were tagged 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. Adding that into missing genes puts us 3 short... >>> > >>> > Should I make the changes to the code and export a gdb so that analysis can be done or wait until we work this through further? >>> > >>> > Richard >>> > >>> > >>> > >>> > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio <do...@lm...> wrote: >>> > Hi Rich, >>> > >>> > The second form should have worked actually. What exactly was the error? >>> > >>> > John David N. Dionisio, PhD >>> > Associate Professor, Computer Science >>> > Loyola Marymount University >>> > >>> > >>> > >>> > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: >>> > >>> > > hmm not taking parenthesis where I thought they should go... syntax error >>> > > >>> > > select count (*) from genenametype where type = ('ordered locus' or 'ORF') and value like 'Rv%'; >>> > > also tried >>> > > select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value like 'Rv%'; >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous <rbr...@gm...> wrote: >>> > > ah yes... i see it... >>> > > >>> > > >>> > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio <do...@lm...> wrote: >>> > > Watch your parentheses: "and" has greater precedence than "or" :) >>> > > >>> > > >>> > > John David N. Dionisio, PhD >>> > > Associate Professor, Computer Science >>> > > Loyola Marymount University >>> > > >>> > > >>> > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> wrote: >>> > > >>> > >> OK, so here are my query results from raw SQL: >>> > >> >>> > >> 1. using: like 'Rv%' >>> > >> >>> > >> select count (*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >>> > >> returns 3988 >>> > >> >>> > >> select count (*) from genenametype where type = 'ORF' and value like 'Rv%'; >>> > >> returns 70 >>> > >> >>> > >> select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value like 'Rv%'; >>> > >> returns 7011 >>> > >> >>> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' >>> > >> >>> > >> select count (*) from genenametype where type = 'ordered locus' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>> > >> returns 3988 >>> > >> >>> > >> select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>> > >> returns 7011 >>> > >> >>> > >> select count (*) from genenametype where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>> > >> returns 70 >>> > >> >>> > >> Conclusions: >>> > >> >>> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 genes were were looking for plus one more (maybe the count for missing genes is off by 1?). >>> > >> >>> > >> 2. Combining the two types in a single query did not produce the results that I expected (7011? - how did that happen????) so this is likely not our solution... unless of course the query syntax isn't actually doing what I think it is... >>> > >> >>> > >> 3. I would think the best course of action is to serialy run two separate queries to capture all the required genes, then removing the one unneeded gene if its truly not wanted. >>> > >> >>> > >> What do you think? >>> > >> >>> > >> Richard >>> > >> >>> > >> >>> > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...> wrote: >>> > >> I don't recall the exact details of the missing 69, but if your query successfully returns them in raw SQL, then this is worth a try. You can integrate into the same query as long as the same columns are returned, which is the case here AFAIK, so go ahead and extend the existing query. >>> > >> >>> > >> >>> > >> John David N. Dionisio, PhD >>> > >> Associate Professor, Computer Science >>> > >> Loyola Marymount University >>> > >> >>> > >> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: >>> > >> >>> > >>> So here is the appropriate code snippet from MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>> > >>> public >>> > >>> >>> > >>> TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) throws SQLException, InvalidParameterException { >>> > >>> >>> > >>> // Build the base query; we only use "ordered locus" and we only want >>> > >>> >>> > >>> // IDs that begin with "Rv." >>> > >>> PreparedStatement ps = ConnectionManager.getRelationalDBConnection().prepareStatement( >>> > >>> >>> > >>> "SELECT value, type " + >>> > >>> >>> > >>> "FROM genenametype INNER JOIN entrytype_genetype " + >>> > >>> >>> > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >>> > >>> >>> > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?"); >>> > >>> ResultSet result; >>> > >>> >>> > >>> >>> > >>> >>> > >>> for (Row row : primarySystemTableManager.getRows()) { >>> > >>> ps.setInt(1, Integer.parseInt(row.getValue( >>> > >>> >>> > >>> "UID"))); >>> > >>> result = ps.executeQuery(); >>> > >>> >>> > >>> >>> > >>> >>> > >>> // We actually want to keep the case where multiple ordered locus >>> > >>> >>> > >>> // names appear. >>> > >>> >>> > >>> while (result.next()) { >>> > >>> >>> > >>> // We want this name to appear in the OrderedLocusNames >>> > >>> >>> > >>> // system table. >>> > >>> >>> > >>> for (String id : result.getString("value").split("/")) { >>> > >>> tableManager.submit( >>> > >>> >>> > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) }, { "UID", row.getValue("UID") } }); >>> > >>> } >>> > >>> >>> > >>> } >>> > >>> >>> > >>> } >>> > >>> >>> > >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >>> > >>> So now we want to build the base query which uses "ordered locus" and "orf" and we only want IDs that begin with "Rv". >>> > >>> >>> > >>> I know there are more comprehensive ways to search for gene ID's by matching gene ID prefix but "like Rv%" seemed to work thus far, we just need to tell it to search for XML tag type orf in addition to ordered locus. >>> > >>> >>> > >>> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and entrytype_gene_hjid = ? " >>> > >>> >>> > >>> Here is a stab at it.... This part of our class was right as the server went down and my submission for week 6 assignment I can't seem to find. >>> > >>> >>> > >>> Is it possible to have two different types in the same query or should we rewrite a separate query for the orf tag? >>> > >>> >>> > >>> Richard >>> > >>> >>> > >>> >>> > >>> >>> > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote: >>> > >>> >>> > >>> thanks and will do as directed. >>> > >>> >>> > >>> My previous, last paragraph comment - A way for programming code in email holding its format in a mail message similarly to how you can post code on forum pages? >>> > >>> >>> > >>> <code> >>> > >>> blah >>> > >>> blah >>> > >>> blah >>> > >>> </code> >>> > >>> >>> > >>> thanks! >>> > >>> >>> > >>> Richard >>> > >>> >>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <do...@lm...> wrote: >>> > >>> >>> > >>> Greetings, >>> > >>> >>> > >>> Actually, gmbuilder.properties is for the TallyEngine only. When dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to find those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile code, that needs to be changed. >>> > >>> >>> > >>> Your take on how gmbuilder.properties is used, however, is understandable. It makes sense to assume that the TallyEngine code *and* the ID export code are based on the same characterization of the needed IDs. This replication is originally a historical artifact: SpeciesProfile was done first, and then TallyEngine was done later by another student. >>> > >>> >>> > >>> However, there are other factors beyond history that sort of necessitate this duplication of desired IDs: (skip the two bullets below if you'd rather cut to the chase of the work to be done, and discuss design issues later) >>> > >>> >>> > >>> - The actual XML import code is a black box: this is the "canned" JAXB library actually in action, and not our code at all. Plus, the XML import code really does not filter (nor should it), since the goal of the XML->relational database step is to fully capture the XML data in the relational database. So, XML count is necessarily separated from XML import. >>> > >>> >>> > >>> - The notion of a declarative mechanism for extracting IDs from the relational database (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at the same time there is value in the arbitrary computation that can be done with Java (case in point: export two versions of an ID, with and without periods). This is not to say that it is impossible to do this declaratively, but let's just say that the procedural approach exists here and now, and a declarative approach will need more thought. >>> > >>> >>> > >>> These, and other factors, are good thoughts to hold onto and would be worthy of a good meeting discussion sometime, but bottom line for now: modifying the export behavior is a matter of editing the *SpeciesProfile* Java code, and not the gmbuilder.properties file. Turn your attention to that code. >>> > >>> >>> > >>> Now, as to annotating your code...I'd just put in code comments :) Or did you mean something else by tagging code in e-mail? >>> > >>> >>> > >>> John David N. Dionisio, PhD >>> > >>> Associate Professor, Computer Science >>> > >>> Loyola Marymount University >>> > >>> >>> > >>> >>> > >>> >>> > >>> >>> > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >>> > >>> >>> > >>> > also, how do I tag code in email so it holds its formatting? I tried a few suggestions I found on the web but they aren't holding formatting or i'm just doing it wrong ;-D >>> > >>> > >>> > >>> > Richard >>> > >>> > >>> > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote: >>> > >>> > OK, have some updates and some suggestions: >>> > >>> > >>> > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing report. We verified that XML match does indeed find 4066 unique matches - 75 of which are not in the gdb and need to be. >>> > >>> > >>> > >>> > Dr. Dahlquist informed me that she was the one who completed the gene db testing report, not a previous student of BIO367 and had already verified which genes were missing and where they were to be found. I had (mistakenly) assumed that since a student had performed the gene database testing I had to redo all of the verification. >>> > >>> > >>> > >>> > So that said, of the 75 genes missing - 69 need to be included and 6 excluded. >>> > >>> > Per the gene db testing report: "69 of them have an "a", "b", or "d" suffix. They are all found in the ORF tag and need to be included in the gdb." >>> > >>> > >>> > >>> > To solve this we need to add additional search criteria into the M. tuberculosis section in gmbuilder.properties below: >>> > >>> > # Mycobacterium tuberculosis >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_level_amount= >>> > >>> > >>> > >>> > 1 >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_element_level0= >>> > >>> > >>> > >>> > uniprot/entry/gene/name&type&ordered locus >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_query_level0= >>> > >>> > >>> > >>> > select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>> > >>> > >>> > >>> > Ordered Locus >>> > >>> > SOLUTIONS: >>> > >>> > >>> > >>> > 1. So am i correct in my understanding that the second line is the query used by TallyEngine to read the XML file? If so then this is the issue we need to table for the moment until we get the gbd verified and re-released. We will revisit this to discover why it is not only reporting incorrectly but also why its added a second row of Ordered Locus on the TallyEngine results page. >>> > >>> > >>> > >>> > 2. The third line is the SQL query used by postgres during the export from XML to gdb. To find and get the ORF tagged genes could we not add the following lines and change the count in the first line: >>> > >>> > >>> > >>> > >>> > >>> > # Mycobacterium tuberculosis >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_level_amount=2 >>> > >>> > >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus >>> > >>> > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >>> > >>> > >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_query_level0= >>> > >>> > >>> > >>> > select count(*) from genenametype where type = 'ordered locus'; >>> > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf'; >>> > >>> > >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>> > >>> > >>> > >>> > Ordered Locus >>> > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >>> > >>> > >>> > >>> > ---------------------------------------------------------------------------------------------------------------------------- >>> > >>> > >>> > >>> > Of course these queries would have be manually verified prior to making these changes but this seems like we are moving in the right direction. >>> > >>> > >>> > >>> > Richard >>> > >>> > >>> > >>> > >>> > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: >>> > >>> > Just got done reading previous email and understand the change in priority. >>> > >>> > >>> > >>> > Will work on the missing ID's for now and shelve the the TalleyEngine issue for the moment. >>> > >>> > >>> > >>> > Also great about a more formalized weekly meeting. I was going to suggest it myself as it has been slow going so far as maybe i'm a bit too independent in this independent study class =D >>> > >>> > >>> > >>> > Will dig further into the missing ID's later tonight and during day tomorrow and report back. >>> > >>> > >>> > >>> > Richard >>> > >>> > >>> > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote: >>> > >>> > Hi Rich, >>> > >>> > >>> > >>> > No problem. The pertinent line you're referring to, for XML, is this, right above the line you copied: >>> > >>> > >>> > >>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus >>> > >>> > >>> > >>> > The slash-separated section is the "path" of XML tags leading to the element of interest; then, after the ampersand, is a name/value pair for the desired attribute to count. Note that there is no hint of a *content*-based filter (nor is there the capability for one, as far as I can tell in the code). By "content," I mean that we can't specify filters based on what's *between* the tags. We can only go as far as filter by attribute value, e.g., type="ordered locus". >>> > >>> > >>> > >>> > But anyway, as mentioned in the earlier e-mail, let's have the missing IDs in the .gdb take precedence for now. Please take a look at the tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how the ID output can be customized, then let me know if you have any questions or need to confirm anything. >>> > >>> > >>> > >>> > John David N. Dionisio, PhD >>> > >>> > Associate Professor, Computer Science >>> > >>> > Loyola Marymount University >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >>> > >>> > >>> > >>> > > Sorry been slammed with a programming assignment that kept needing continued iteration and it has been all consuming until last night. But I did get a chance to work with your comments and review the code again with a different mind set. >>> > >>> > > >>> > >>> > > Yes, I examined the gmbuilder.properties file ( the query is also in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a previous email ) but I don't think I see what you mean regarding the XML count. >>> > >>> > > >>> > >>> > > I understood that: mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; was the db query but don't see which is the XML count... or do they share the same query and you are saying that XML count doesn't recognize and therefore cannot use the 'Rv%' parameter? >>> > >>> > > >>> > >>> > > Richard >>> > >>> > > >>> > >>> > > >>> > >>> > > >>> > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> wrote: >>> > >>> > > Hi Rich, >>> > >>> > > >>> > >>> > > Sorry for the delay. Had some distractions coming into the weekend. >>> > >>> > > >>> > >>> > > You've looked at the code; have you looked at gmbuilder.properties? (I may have mentioned it a few e-mails ago, just as you were starting to dig into this) >>> > >>> > > >>> > >>> > > On the copy I have, the M. tuberculosis block looks like this (indentation is mine to set it apart): >>> > >>> > > >>> > >>> > > # Mycobacterium tuberculosis >>> > >>> > > mycobacteriumtuberculosis_level_amount=1 >>> > >>> > > >>> > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus >>> > >>> > > >>> > >>> > > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >>> > >>> > > >>> > >>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >>> > >>> > > >>> > >>> > > There, I think, is the rub. Notice that the XML count does not filter on RV%. The SQL query does. >>> > >>> > > >>> > >>> > > Unfortunately, I don't think the TallyEngine can include selective filtering in the XML counts. If the need to do selective filtering on XML is necessary, then I think we're looking at a new functionality for you to implement (or, if this throws things off too much, this may have to be noted somewhere, that the XML vs. database counts may be off because the database count is doing some text-based filtering but the XML count does not). >>> > >>> > > >>> > >>> > > What does xmlpipedb-match say? That will at least tell you whether the 'RV%' count is indeed correct. >>> > >>> > > >>> > >>> > > John David N. Dionisio, PhD >>> > >>> > > Associate Professor, Computer Science >>> > >>> > > Loyola Marymount University >>> > >>> > > >>> > >>> > > >>> > >>> > > >>> > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >>> > >>> > > >>> > >>> > > > OK here is what I was able to put together from the past few hours of code review: >>> > >>> > > > >>> > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>> > >>> > > > -reveals that after the 2 System table modifications are made adding species name and link, a PreparedStatement is instantiated which builds and calls the base query. >>> > >>> > > > >>> > >>> > > > -The base query called is: ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") >>> > >>> > > > >>> > >>> > > > -So its looking in 'ordered locus' table/column for any tuple that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . >>> > >>> > > > The 'like' comparator and % usage are clear with the 'type' entrytype_gene_hjid = ? >>> > >>> > > > >>> > >>> > > > -To me it seems the query makes sense so the problem is likely elsewhere. >>> > >>> > > > >>> > >>> > > > GenMappBuilder.java: >>> > >>> > > > -I found method doTallies() at code line 895 which: >>> > >>> > > > Instantiates a Configuration called hibernateConfiguration and assigns to it the current hibernate configuration >>> > >>> > > > Validates database settings by analyzing hibernateConfiguration >>> > >>> > > > Instantiates a CriterionList for uniprot and assigns to it TallyType.UNIPROT >>> > >>> > > > Instantiates a CriterionList for go and assigns to it TallyType.GO >>> > >>> > > > Determines if both xml files exist >>> > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on both xml files and their respective CriterionList >>> > >>> > > > Results are then formatted for display in a table. >>> > >>> > > > >>> > >>> > > > -So enum TallyType which means that they are the only valid datatypes which TallyEngine accepts... go to know ... >>> > >>> > > > >>> > >>> > > > -Based on the screen shot of Tally Engine it would seem that both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly returning. Likely due to both using an incorrect query (as we previously supposed). But where are the queries?... the more I dig the more I think they are in the criterial all the work is done against. >>> > >>> > > > >>> > >>> > > > continuing the review: >>> > >>> > > > getTallyResultsXML() calls Tally Engine instance method getXmlFileCounts(xmlFile) >>> > >>> > > > getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new QueryEngine(hibernateConfiguration) >>> > >>> > > > Both of these instanced methods originate from TallyEngine.java... >>> > >>> > > > >>> > >>> > > > TallyEngine.java: >>> > >>> > > > >>> > >>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a digester then processes against criteria... but this quickly becomes confusing and is hard to follow >>> > >>> > > > >>> > >>> > > > getDbcounts() then starts a db session and executes a query but then I also get a bit lost with my limited db knowledge. >>> > >>> > > > >>> > >>> > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >>> > >>> > > > >>> > >>> > > > OVERALL I think I'm getting closer to the issues but I still feel as if I'm missing some understanding to proceed further. Can you pass along some of that Dondi insight and steer me in the right direction? =D >>> > >>> > > > >>> > >>> > > > -DB Tally - Not having taken databases yet certainly is limiting my ability determine where the "criteria" are being set and how they are followed during session activities. Also is the query we have been looking for this whole time in the criteria or someplace else? >>> > >>> > > > >>> > >>> > > > -XML Tally - again is the query contained within the criteria that digestXmlFile() uses to parse? >>> > >>> > > > >>> > >>> > > > Richard >>> > >>> > > > >>> > >>> > > > >>> > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> wrote: >>> > >>> > > > Right, schema issues are unlikely. Most count discrepancies like this that I've seen have boiled down to forming the right query. Then, knowing the right query (in both XML and SQL), it's a matter of making sure that TallyEngine asks that same query. >>> > >>> > > > >>> > >>> > > > John David N. Dionisio, PhD >>> > >>> > > > Associate Professor, Computer Science >>> > >>> > > > Loyola Marymount University >>> > >>> > > > >>> > >>> > > > >>> > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >>> > >>> > > > >>> > >>> > > > > OK, so based on your approach: >>> > >>> > > > > >>> > >>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and sql queries needed for the respective results as you requested. >>> > >>> > > > > >>> > >>> > > > > I was also thinking I may need to review the schema from xml into postgres but the issue isn't likely a schema error. The error most likely lies in how xmlpipedbutils queries the data from xml source and writes to the tables what it returns? >>> > >>> > > > > >>> > >>> > > > > 2. I'll review the code: trace the entrance of tally engine in the gmbuilder code then follow it through the xmlpipedbutils. >>> > >>> > > > > >>> > >>> > > > > Richard >>> > >>> > > > > >>> > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <do...@lm...> wrote: >>> > >>> > > > > Just wanted to confirm (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in SourceForge's Subversion repo. >>> > >>> > > > > >>> > >>> > > > > John David N. Dionisio, PhD >>> > >>> > > > > Associate Professor, Computer Science >>> > >>> > > > > Loyola Marymount University >>> > >>> > > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >>> > >>> > > > > >>> > >>> > > > > > Hi Dondi, >>> > >>> > > > > > >>> > >>> > > > > > So I'm at the point in working with M tuberculosis that I was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. >>> > >>> > > > > > >>> > >>> > > > > > gmb2b60 Results >>> > >>> > > > > > >>> > >>> > > > > > >>> > >>> > > > > > >>> > >>> > > > > > Now the proverbial question - What next to solve the Ordered Locus import/count issue? >>> > >>> > > > > > >>> > >>> > > > > > ********************************************** >>> > >>> > > > > > Here is my thought process: >>> > >>> > > > > > >>> > >>> > > > > > Step 1: How does the import process work at the high level? (obviously correct me if I'm wrong) >>> > >>> > > > > > >>> > >>> > > > > > I believe that basically as each XML tag is read, it is placed in the proper Postgres table(s) based on some criteria. There is also likely some sort of check that each individual tag is in valid XML format unless we don't care at this stage (care at export) or maybe the parser just skips over and goes on to the next . >>> > >>> > > > > > >>> > >>> > > > > > Step 2: What could be the problem? >>> > >>> > > > > > >>> > >>> > > > > > Either - >>> > >>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? >>> > >>> > > > > > b. Decision criteria of which table they should be added to? >>> > >>> > > > > > >>> > >>> > > > > > ********************************************** >>> > >>> > > > > > >>> > >>> > > > > > I read on the sourceforge wiki: >>> > >>> > > > > > >>> > >>> > > > > > XMLPipeDB has a modular architecture with three components that may be used separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and automatically generates an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB Utilities provides functionality for configuring the database, importing data, and performing queries. GenMAPP Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on data from UniProt and Gene Ontology (GO). >>> > >>> > > > > > >>> > >>> > > > > > So I should probably start with the XMLPipeDB Utilities which are where? I don't see any in the basic distribution or are they not standalone and called from the command line? >>> > >>> > > > > > >>> > >>> > > > > > Thanks! >>> > >>> > > > > > >>> > >>> > > > > > Richard > > ------------------------------------------------------------------------------ > Free Software Download: Index, Search & Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT data > generated by your applications, servers and devices whether physical, virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > <ATT00001..txt><ATT00002..txt> |
From: Richard B. <rbr...@gm...> - 2011-02-23 03:32:02
|
OK I found the culprits: There are 3 ordered locus gene ID's that have slashes. As was previously discussed it seems XML Match is counting each one twice which would inflate the count by 3. Rv2561/Rv2562 'ordered locus' Rv2880c/Rv2879c 'ordered locus' Rv3021c/Rv3022c 'ordered locus' And as was previously noted, Rv3346/55c 'ORF' should be excluded. This will bring the both XML and sql db queries counts in sync at 4057. By my math: ------------------------------------------------------------------------------------------------------------ XML tags read by Match: 4066 unique matches - 6 genes excluded by Dr. D - 3 duplicates caused by slashes = total 4057 genes ------------------------------------------------------------------------------------------------------------ sql query of the db tables: 3988 ordered locus + 69 ORF = total 4057 genes ------------------------------------------------------------------------------------------------------------ TO MOVE FORWARD: 1. We need to decide how to address the slash ID's (should we keep and split into separate tuples or should they be excluded) 2. Adjust the queries to reflect what is needed and test with raw sql. 3. Update the queries on MycobacteriumTuberculosisUniProtSpeciesProfile.java 4. Export a new Mtb gene database for testing. Richard On Tue, Feb 22, 2011 at 5:12 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I don't know why the numbers are off. I think the only way to find out is > to get the results of the Postgres query and line it up next to the match > results and see what is different. If you can send me the lists of IDs from > the Postgres query and the match results, I can check them. > > Best, > Dr. D > > > At 04:57 PM 2/22/2011, you wrote: > > Here is where I am on the numbers: > > ------------------------------------------------------------------------------------------------------------ > XML tags read by Match: > 4066 unique matches - 6 genes excluded by Dr. D = total 4060 genes > > ------------------------------------------------------------------------------------------------------------ > sql query of the db tables: > 3988 ordered locus + 69 ORF = total 4057 genes > > ------------------------------------------------------------------------------------------------------------ > So by my count we are off by 3 genes > > Maybe XML Match is counting slashed genes as 2 separate genes? > > IE. so if it encountered 3 slashed genes it would in effect be counting > those 3 as 6? > > Richard > > On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I compared this list of gene IDs with the list I had on the Testing Report > on the wiki and found the following. > > Your list has 70 IDs. 69 of them are identical to what is in my list of > IDs in the Testing report. > > Indeed, the only difference is the one with the slash. "Rv3346/55c" I > looked this ID up at UniProt.org as Rv3346. It appears in the record for > UniProt ID O50384. For that UniProt record, the ID referred to in the > OrderedLocus tag is Rv3355c which does appear in the gdb already. Looking > up Rv3346 on Tuberculist and the Stanford TB database, Rv3346 is not a real > gene ID. So this entire record with the slash in it needs to be excluded > and all the rest of them need to be captured. > > I think then, that the numbers should add up correctly, is this true? > > Best, > Dr. D > > > At 01:09 PM 2/22/2011, Richard Brous wrote: > > attachement now included... > > On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous <rbr...@gm...> wrote: > Here is an export of the genes found using: select * from genenametype > where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also > attached as a csv file. > 647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 > 5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 > 647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 > 647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 > 647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 > 647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 > 211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 > 648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 > 648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 > 648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 > 648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 > 648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 > 649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 > 649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 > 649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 > 649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 > 650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 > 174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 > 174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 > 312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 > 331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 > 445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 > 621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 > 622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 > 622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 > 622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 > 622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 > 623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 > 623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 > 623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 > 623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 > 624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 > 624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 > 625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 > 625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 > 626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 > 626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 > 626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 > 626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 > 627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 > 627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 > 627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 > 627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 > 627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 > 627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 > 627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 > 628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 > 629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 > 629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 > 629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 > 630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 > 630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 > 631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 > 632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 > 632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 > 633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 > 633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 > 645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 > 645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 > 645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 > 645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 > 645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 > 646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 > 646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 > 646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 > 646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 > 646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 > 646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 > 647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 > 647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 > > *****The item of note I see is the gene with the slash separating gene id's > which refer to the same gene. Richard On Mon, Feb 21, 2011 at 11:11 > PM, Richard Brous <rbr...@gm...> wrote: Understood. I'll check in > with Dr. D in the afternoon tomorrow and discuss. Richard > > On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <do...@lm...> > wrote: Hi Rich, > Addressing the release business first, let's put it this way: if the > remaining loose ends can be addressed by tomorrow, we can probably wait > until then. If unexpected snags are encountered, then it would be > worthwhile to release whatever you have. > With that in mind, considering that you pretty much know the patterns of > the IDs that are needed, I think it will only take a little digital forensic > work now to figure out exactly which IDs are still needed. Once you know > what those are, you should: > 1. Find where they are in the XML file. 2. Knowing the XML location, find > the corresponding table in the relational database (table names are > generally derived from tag/element names). 3. Knowing the table in the > database, write or extend the SpeciesProfile query to retrieve that data. > For the ID that must *not* be included, again it's a matter of tracking > down what this ID is. Knowing this straggler, you can then consult with Dr. > Dahlquist if this ID is truly a unique one-off, or is representative of a > pattern that we'll want to exclude. Either way, this ID can be omitted by > using "not" or "<>" or possibly "not like" or "not ~" (check PostgreSQL > where clause syntax to see where the negation can be applied). > John David N. Dionisio, PhD Associate Professor, Computer Science Loyola > Marymount University > > > On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: > > actually i had a typo (emailing from desktop system but testing on my > laptop... typed correctly here but wrong in pgadmin) but the results make > much more sense now. > > > select count (*) from genenametype where (type > = 'ordered locus' or type = 'ORF') and value like 'Rv%'; > returns 4058 > > > select count (*) from genenametype where (type = 'ordered locus' or type = > 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 4058 > > > -------------------------------------------------------------------------------------------------------------------------- > > > > Continuing forward - > > The testing report says that 4066 unique > matches exist in XML but 6 of them were eliminated by Dr. D leaving the > desired number at 4060. > > So now we are only 2 genes short with the > query returning 4058... which is also (conveniently) the sum of the two > separate queries of 'ordered locus' and 'ORF' respectively. > > But > recall that Dr. D said that only 69 genes of the missing 75 were tagged > 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. Adding > that into missing genes puts us 3 short... > > Should I make the changes > to the code and export a gdb so that analysis can be done or wait until we > work this through further? > > Richard > > > > On Mon, Feb 21, 2011 at > 10:04 PM, John David N. Dionisio <do...@lm...> wrote: > Hi Rich, > > The > second form should have worked actually. What exactly was the error? > > > John David N. Dionisio, PhD > Associate Professor, Computer Science > > Loyola Marymount University > > > > On Feb 22, 2011, at 1:01 AM, Richard > Brous wrote: > > > hmm not taking parenthesis where I thought they should > go... syntax error > > > > select count (*) from genenametype where type = > ('ordered locus' or 'ORF') and value like 'Rv%'; > > also tried > > > select count (*) from genenametype where (type = 'ordered locus' or type = > 'ORF') and value like 'Rv%'; > > > > > > > > > > > > On Mon, Feb 21, 2011 > at 9:40 PM, Richard Brous <rbr...@gm...> wrote: > > ah yes... i see > it... > > > > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio > <do...@lm...> wrote: > > Watch your parentheses: "and" has greater > precedence than "or" :) > > > > > > John David N. Dionisio, PhD > > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> > wrote: > > > >> OK, so here are my query results from raw SQL: > >> > >> > 1. using: like 'Rv%' > >> > >> select count (*) from genenametype where > type = 'ordered locus' and value like 'Rv%'; > >> returns 3988 > >> > >> > select count (*) from genenametype where type = 'ORF' and value like 'Rv%'; > > >> returns 70 > >> > >> select count (*) from genenametype where type = > 'ordered locus' or type = 'ORF' and value like 'Rv%'; > >> returns 7011 > > >> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' > > >> > >> select count (*) from genenametype where type = 'ordered locus' > and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 3988 > >> > >> > select count (*) from genenametype where type = 'ordered locus' or type = > 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 7011 > >> > > >> select count (*) from genenametype where type = 'ORF' and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 70 > >> > >> Conclusions: > > >> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 > genes were were looking for plus one more (maybe the count for missing genes > is off by 1?). > >> > >> 2. Combining the two types in a single query did > not produce the results that I expected (7011? - how did that happen????) so > this is likely not our solution... unless of course the query syntax isn't > actually doing what I think it is... > >> > >> 3. I would think the best > course of action is to serialy run two separate queries to capture all the > required genes, then removing the one unneeded gene if its truly not wanted. > > >> > >> What do you think? > >> > >> Richard > >> > >> > >> On Mon, Feb > 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...> wrote: > >> I > don't recall the exact details of the missing 69, but if your query > successfully returns them in raw SQL, then this is worth a try. You can > integrate into the same query as long as the same columns are returned, > which is the case here AFAIK, so go ahead and extend the existing query. > > >> > >> > >> John David N. Dionisio, PhD > >> Associate Professor, > Computer Science > >> Loyola Marymount University > >> > >> On Feb 21, > 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: > >> > >>> So > here is the appropriate code snippet from > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> public > >>> > > >>> TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) throws > SQLException, InvalidParameterException { > >>> > >>> // Build the base > query; we only use "ordered locus" and we only want > >>> > >>> // IDs > that begin with "Rv." > >>> PreparedStatement ps = > ConnectionManager.getRelationalDBConnection().prepareStatement( > >>> > > >>> "SELECT value, type " + > >>> > >>> "FROM genenametype INNER JOIN > entrytype_genetype " + > >>> > >>> "ON (entrytype_genetype_name_hjid = > entrytype_genetype.hjid) " + > >>> > >>> "WHERE type = 'ordered locus' and > value like 'Rv%' and entrytype_gene_hjid = ?"); > >>> ResultSet result; > > >>> > >>> > >>> > >>> for (Row row : primarySystemTableManager.getRows()) > { > >>> ps.setInt(1, Integer.parseInt(row.getValue( > >>> > >>> "UID"))); > > >>> result = ps.executeQuery(); > >>> > >>> > >>> > >>> // We actually > want to keep the case where multiple ordered locus > >>> > >>> // names > appear. > >>> > >>> while (result.next()) { > >>> > >>> // We want this > name to appear in the OrderedLocusNames > >>> > >>> // system table. > >>> > > >>> for (String id : result.getString("value").split("/")) { > >>> > tableManager.submit( > >>> > >>> "OrderedLocusNames", QueryType.insert, > new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, > { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) }, { > "UID", row.getValue("UID") } }); > >>> } > >>> > >>> } > >>> > >>> } > >>> > > >>> > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > >>> So now we want to build the base query which uses "ordered locus" and > "orf" and we only want IDs that begin with "Rv". > >>> > >>> I know there > are more comprehensive ways to search for gene ID's by matching gene ID > prefix but "like Rv%" seemed to work thus far, we just need to tell it to > search for XML tag type orf in addition to ordered locus. > >>> > >>> > "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and > entrytype_gene_hjid = ? " > >>> > >>> Here is a stab at it.... This part > of our class was right as the server went down and my submission for week 6 > assignment I can't seem to find. > >>> > >>> Is it possible to have two > different types in the same query or should we rewrite a separate query for > the orf tag? > >>> > >>> Richard > >>> > >>> > >>> > >>> On Sun, Feb 20, > 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote: > >>> > >>> > thanks and will do as directed. > >>> > >>> My previous, last paragraph > comment - A way for programming code in email holding its format in a mail > message similarly to how you can post code on forum pages? > >>> > >>> > <code> > >>> blah > >>> blah > >>> blah > >>> </code> > >>> > >>> thanks! > > >>> > >>> Richard > >>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David > N. Dionisio <do...@lm...> wrote: > >>> > >>> Greetings, > >>> > >>> > Actually, gmbuilder.properties is for the TallyEngine only. When dealing > with .gdb exports, look *only* at the SpeciesProfile class. So, to find > those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile > code, that needs to be changed. > >>> > >>> Your take on how > gmbuilder.properties is used, however, is understandable. It makes sense to > assume that the TallyEngine code *and* the ID export code are based on the > same characterization of the needed IDs. This replication is originally a > historical artifact: SpeciesProfile was done first, and then TallyEngine was > done later by another student. > >>> > >>> However, there are other > factors beyond history that sort of necessitate this duplication of desired > IDs: (skip the two bullets below if you'd rather cut to the chase of the > work to be done, and discuss design issues later) > >>> > >>> - The actual > XML import code is a black box: this is the "canned" JAXB library actually > in action, and not our code at all. Plus, the XML import code really does > not filter (nor should it), since the goal of the XML->relational database > step is to fully capture the XML data in the relational database. So, XML > count is necessarily separated from XML import. > >>> > >>> - The notion > of a declarative mechanism for extracting IDs from the relational database > (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at > the same time there is value in the arbitrary computation that can be done > with Java (case in point: export two versions of an ID, with and without > periods). This is not to say that it is impossible to do this > declaratively, but let's just say that the procedural approach exists here > and now, and a declarative approach will need more thought. > >>> > >>> > These, and other factors, are good thoughts to hold onto and would be worthy > of a good meeting discussion sometime, but bottom line for now: modifying > the export behavior is a matter of editing the *SpeciesProfile* Java code, > and not the gmbuilder.properties file. Turn your attention to that code. > > >>> > >>> Now, as to annotating your code...I'd just put in code comments > :) Or did you mean something else by tagging code in e-mail? > >>> > >>> > John David N. Dionisio, PhD > >>> Associate Professor, Computer Science > > >>> Loyola Marymount University > >>> > >>> > >>> > >>> > >>> On Feb 21, > 2011, at 12:38 AM, Richard Brous wrote: > >>> > >>> > also, how do I tag > code in email so it holds its formatting? I tried a few suggestions I found > on the web but they aren't holding formatting or i'm just doing it wrong ;-D > > >>> > > >>> > Richard > >>> > > >>> > On Sun, Feb 20, 2011 at 9:35 PM, > Richard Brous <rbr...@gm...> wrote: > >>> > OK, have some updates > and some suggestions: > >>> > > >>> > On Friday Dr. Dahlquist and I sat > down and reviewed the gene testing report. We verified that XML match does > indeed find 4066 unique matches - 75 of which are not in the gdb and need to > be. > >>> > > >>> > Dr. Dahlquist informed me that she was the one who > completed the gene db testing report, not a previous student of BIO367 and > had already verified which genes were missing and where they were to be > found. I had (mistakenly) assumed that since a student had performed the > gene database testing I had to redo all of the verification. > >>> > > >>> > > So that said, of the 75 genes missing - 69 need to be included and 6 > excluded. > >>> > Per the gene db testing report: "69 of them have an "a", > "b", or "d" suffix. They are all found in the ORF tag and need to be > included in the gdb." > >>> > > >>> > To solve this we need to add > additional search criteria into the M. tuberculosis section in > gmbuilder.properties below: > >>> > # Mycobacterium tuberculosis > >>> > > > >>> > mycobacteriumtuberculosis_level_amount= > >>> > > >>> > 1 > >>> > > > >>> > mycobacteriumtuberculosis_element_level0= > >>> > > >>> > > uniprot/entry/gene/name&type&ordered locus > >>> > > >>> > > mycobacteriumtuberculosis_query_level0= > >>> > > >>> > select count(*) > from genenametype where type = 'ordered locus' and value like 'Rv%'; > >>> > > > >>> > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > > Ordered Locus > >>> > SOLUTIONS: > >>> > > >>> > 1. So am i correct in my > understanding that the second line is the query used by TallyEngine to read > the XML file? If so then this is the issue we need to table for the moment > until we get the gbd verified and re-released. We will revisit this to > discover why it is not only reporting incorrectly but also why its added a > second row of Ordered Locus on the TallyEngine results page. > >>> > > >>> > > 2. The third line is the SQL query used by postgres during the export from > XML to gdb. To find and get the ORF tagged genes could we not add the > following lines and change the count in the first line: > >>> > > >>> > > > >>> > # Mycobacterium tuberculosis > >>> > > >>> > > mycobacteriumtuberculosis_level_amount=2 > >>> > > >>> > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > > >>> > > >>> > > >>> > mycobacteriumtuberculosis_query_level0= > >>> > > > >>> > select count(*) from genenametype where type = 'ordered locus'; > > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from > genenametype where type = 'orf'; > >>> > > >>> > > >>> > > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > Ordered Locus > > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > >>> > > > >>> > > ---------------------------------------------------------------------------------------------------------------------------- > > >>> > > >>> > Of course these queries would have be manually verified > prior to making these changes but this seems like we are moving in the right > direction. > >>> > > >>> > Richard > >>> > > >>> > > >>> > On Thu, Feb 17, > 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: > >>> > Just > got done reading previous email and understand the change in priority. > > >>> > > >>> > Will work on the missing ID's for now and shelve the the > TalleyEngine issue for the moment. > >>> > > >>> > Also great about a more > formalized weekly meeting. I was going to suggest it myself as it has been > slow going so far as maybe i'm a bit too independent in this independent > study class =D > >>> > > >>> > Will dig further into the missing ID's > later tonight and during day tomorrow and report back. > >>> > > >>> > > Richard > >>> > > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. > Dionisio <do...@lm...> wrote: > >>> > Hi Rich, > >>> > > >>> > No > problem. The pertinent line you're referring to, for XML, is this, right > above the line you copied: > >>> > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > >>> > The slash-separated section is the "path" of XML > tags leading to the element of interest; then, after the ampersand, is a > name/value pair for the desired attribute to count. Note that there is no > hint of a *content*-based filter (nor is there the capability for one, as > far as I can tell in the code). By "content," I mean that we can't specify > filters based on what's *between* the tags. We can only go as far as filter > by attribute value, e.g., type="ordered locus". > >>> > > >>> > But > anyway, as mentioned in the earlier e-mail, let's have the missing IDs in > the .gdb take precedence for now. Please take a look at the tuberculosis, > A. thaliana, and P. falciparum profiles to get an idea for how the ID output > can be customized, then let me know if you have any questions or need to > confirm anything. > >>> > > >>> > John David N. Dionisio, PhD > >>> > > Associate Professor, Computer Science > >>> > Loyola Marymount University > > >>> > > >>> > > >>> > > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous > wrote: > >>> > > >>> > > Sorry been slammed with a programming assignment > that kept needing continued iteration and it has been all consuming until > last night. But I did get a chance to work with your comments and review the > code again with a different mind set. > >>> > > > >>> > > Yes, I examined > the gmbuilder.properties file ( the query is also in the > MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a > previous email ) but I don't think I see what you mean regarding the XML > count. > >>> > > > >>> > > I understood that: > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype > where type = 'ordered locus' and value like 'Rv%'; was the db query but > don't see which is the XML count... or do they share the same query and you > are saying that XML count doesn't recognize and therefore cannot use the > 'Rv%' parameter? > >>> > > > >>> > > Richard > >>> > > > >>> > > > >>> > > > > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < > do...@lm...> wrote: > >>> > > Hi Rich, > >>> > > > >>> > > Sorry for the > delay. Had some distractions coming into the weekend. > >>> > > > >>> > > > You've looked at the code; have you looked at gmbuilder.properties? (I may > have mentioned it a few e-mails ago, just as you were starting to dig into > this) > >>> > > > >>> > > On the copy I have, the M. tuberculosis block > looks like this (indentation is mine to set it apart): > >>> > > > >>> > > > # Mycobacterium tuberculosis > >>> > > > mycobacteriumtuberculosis_level_amount=1 > >>> > > > >>> > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > > >>> > > > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype > where type = 'ordered locus' and value like 'Rv%'; > >>> > > > >>> > > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > >>> > > > > >>> > > There, I think, is the rub. Notice that the XML count does > not filter on RV%. The SQL query does. > >>> > > > >>> > > Unfortunately, > I don't think the TallyEngine can include selective filtering in the XML > counts. If the need to do selective filtering on XML is necessary, then I > think we're looking at a new functionality for you to implement (or, if this > throws things off too much, this may have to be noted somewhere, that the > XML vs. database counts may be off because the database count is doing some > text-based filtering but the XML count does not). > >>> > > > >>> > > What > does xmlpipedb-match say? That will at least tell you whether the 'RV%' > count is indeed correct. > >>> > > > >>> > > John David N. Dionisio, PhD > > >>> > > Associate Professor, Computer Science > >>> > > Loyola Marymount > University > >>> > > > >>> > > > >>> > > > >>> > > On Feb 11, 2011, at > 4:52 PM, Richard Brous wrote: > >>> > > > >>> > > > OK here is what I was > able to put together from the past few hours of code review: > >>> > > > > > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> > > > > -reveals that after the 2 System table modifications are made adding species > name and link, a PreparedStatement is instantiated which builds and calls > the base query. > >>> > > > > >>> > > > -The base query called is: > ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " > + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE > type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") > > >>> > > > > >>> > > > -So its looking in 'ordered locus' table/column for > any tuple that starts with Rv (followed by any substring) and > entrytype_gene_hjid = ? . > >>> > > > The 'like' comparator and % usage > are clear with the 'type' entrytype_gene_hjid = ? > >>> > > > > >>> > > > > -To me it seems the query makes sense so the problem is likely elsewhere. > > >>> > > > > >>> > > > GenMappBuilder.java: > >>> > > > -I found method > doTallies() at code line 895 which: > >>> > > > Instantiates a > Configuration called hibernateConfiguration and assigns to it the current > hibernate configuration > >>> > > > Validates database settings by > analyzing hibernateConfiguration > >>> > > > Instantiates a CriterionList > for uniprot and assigns to it TallyType.UNIPROT > >>> > > > Instantiates a > CriterionList for go and assigns to it TallyType.GO > >>> > > > Determines > if both xml files exist > >>> > > > Then getTallyResultsXML and > getTallyResultsDatabase are run on both xml files and their respective > CriterionList > >>> > > > Results are then formatted for display in a > table. > >>> > > > > >>> > > > -So enum TallyType which means that they > are the only valid datatypes which TallyEngine accepts... go to know ... > > >>> > > > > >>> > > > -Based on the screen shot of Tally Engine it would > seem that both getTallyResultsXML() and getTallyResultsDatabase() are > incorrectly returning. Likely due to both using an incorrect query (as we > previously supposed). But where are the queries?... the more I dig the more > I think they are in the criterial all the work is done against. > >>> > > > > > >>> > > > continuing the review: > >>> > > > getTallyResultsXML() > calls Tally Engine instance method getXmlFileCounts(xmlFile) > >>> > > > > getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new > QueryEngine(hibernateConfiguration) > >>> > > > Both of these instanced > methods originate from TallyEngine.java... > >>> > > > > >>> > > > > TallyEngine.java: > >>> > > > > >>> > > > getXmlFileCounts() calls > digestXmlFile() which instantiates a digester then processes against > criteria... but this quickly becomes confusing and is hard to follow > >>> > > > > > >>> > > > getDbcounts() then starts a db session and executes a > query but then I also get a bit lost with my limited db knowledge. > >>> > > > > > >>> > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > >>> > > > > >>> > > > OVERALL I think I'm getting closer to the issues > but I still feel as if I'm missing some understanding to proceed further. > Can you pass along some of that Dondi insight and steer me in the right > direction? =D > >>> > > > > >>> > > > -DB Tally - Not having taken > databases yet certainly is limiting my ability determine where the > "criteria" are being set and how they are followed during session > activities. Also is the query we have been looking for this whole time in > the criteria or someplace else? > >>> > > > > >>> > > > -XML Tally - again > is the query contained within the criteria that digestXmlFile() uses to > parse? > >>> > > > > >>> > > > Richard > >>> > > > > >>> > > > > >>> > > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> > wrote: > >>> > > > Right, schema issues are unlikely. Most count > discrepancies like this that I've seen have boiled down to forming the right > query. Then, knowing the right query (in both XML and SQL), it's a matter > of making sure that TallyEngine asks that same query. > >>> > > > > >>> > > > > John David N. Dionisio, PhD > >>> > > > Associate Professor, Computer > Science > >>> > > > Loyola Marymount University > >>> > > > > >>> > > > > > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > >>> > > > > > >>> > > > > OK, so based on your approach: > >>> > > > > > >>> > > > > 1. > I'll start with reviewing the queries for xmlpipedb-match and sql queries > needed for the respective results as you requested. > >>> > > > > > >>> > > > > > I was also thinking I may need to review the schema from xml into > postgres but the issue isn't likely a schema error. The error most likely > lies in how xmlpipedbutils queries the data from xml source and writes to > the tables what it returns? > >>> > > > > > >>> > > > > 2. I'll review the > code: trace the entrance of tally engine in the gmbuilder code then follow > it through the xmlpipedbutils. > >>> > > > > > >>> > > > > Richard > >>> > > > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. > Dionisio <do...@lm...> wrote: > >>> > > > > Just wanted to confirm > (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source > code is in trunk/xmlpipedbutils in SourceForge's Subversion repo. > >>> > > > > > > >>> > > > > John David N. Dionisio, PhD > >>> > > > > Associate > Professor, Computer Science > >>> > > > > Loyola Marymount University > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > On Feb 5, 2011, at > 10:02 AM, Richard Brous wrote: > >>> > > > > > >>> > > > > > Hi Dondi, > > >>> > > > > > > >>> > > > > > So I'm at the point in working with M > tuberculosis that I was able to exactly reproduce Dr. Dahlquist's > problematic TallyEngine results. > >>> > > > > > > >>> > > > > > gmb2b60 > Results > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > Now the proverbial question - What next to solve the Ordered Locus > import/count issue? > >>> > > > > > > >>> > > > > > > ********************************************** > >>> > > > > > Here is my > thought process: > >>> > > > > > > >>> > > > > > Step 1: How does the > import process work at the high level? (obviously correct me if I'm wrong) > > >>> > > > > > > >>> > > > > > I believe that basically as each XML tag is > read, it is placed in the proper Postgres table(s) based on some criteria. > There is also likely some sort of check that each individual tag is in valid > XML format unless we don't care at this stage (care at export) or maybe the > parser just skips over and goes on to the next . > >>> > > > > > > >>> > > > > > > Step 2: What could be the problem? > >>> > > > > > > >>> > > > > > > Either - > >>> > > > > > a. XML tags are being parsed incorrectly > (ignored/skipped)? > >>> > > > > > b. Decision criteria of which table > they should be added to? > >>> > > > > > > >>> > > > > > > ********************************************** > >>> > > > > > > >>> > > > > > > I read on the sourceforge wiki: > >>> > > > > > > >>> > > > > > > XMLPipeDB has a modular architecture with three components that may be used > separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and > automatically generates an SQL schema, Java classes, and Hibernate mappings. > XMLPipeDB Utilities provides functionality for configuring the database, > importing data, and performing queries. GenMAPP Builder is based on the > XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on > data from UniProt and Gene Ontology (GO). > >>> > > > > > > >>> > > > > > > So I should probably start with the XMLPipeDB Utilities which are where? I > don't see any in the basic distribution or are they not standalone and > called from the command line? > >>> > > > > > > >>> > > > > > Thanks! > > >>> > > > > > > >>> > > > > > Richard > > > > ------------------------------------------------------------------------------ > Free Software Download: Index, Search & Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT > data > generated by your applications, servers and devices whether physical, > virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: Kam D. <kda...@lm...> - 2011-02-23 01:12:49
|
Hi, I don't know why the numbers are off. I think the only way to find out is to get the results of the Postgres query and line it up next to the match results and see what is different. If you can send me the lists of IDs from the Postgres query and the match results, I can check them. Best, Dr. D At 04:57 PM 2/22/2011, you wrote: >Here is where I am on the numbers: >------------------------------------------------------------------------------------------------------------ >XML tags read by Match: >4066 unique matches - 6 genes excluded by Dr. D = total 4060 genes >------------------------------------------------------------------------------------------------------------ >sql query of the db tables: >3988 ordered locus + 69 ORF = total 4057 genes >------------------------------------------------------------------------------------------------------------ >So by my count we are off by 3 genes > >Maybe XML Match is counting slashed genes as 2 separate genes? > >IE. so if it encountered 3 slashed genes it would in effect be >counting those 3 as 6? > >Richard > >On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist ><<mailto:kda...@lm...>kda...@lm...> wrote: >Hi, > >I compared this list of gene IDs with the list I had on the Testing >Report on the wiki and found the following. > >Your list has 70 IDs. 69 of them are identical to what is in my >list of IDs in the Testing report. > >Indeed, the only difference is the one with the >slash. "Rv3346/55c" I looked this ID up at UniProt.org as >Rv3346. It appears in the record for UniProt ID O50384. For that >UniProt record, the ID referred to in the OrderedLocus tag is >Rv3355c which does appear in the gdb already. Looking up Rv3346 on >Tuberculist and the Stanford TB database, Rv3346 is not a real gene >ID. So this entire record with the slash in it needs to be excluded >and all the rest of them need to be captured. > >I think then, that the numbers should add up correctly, is this true? > >Best, >Dr. D > > >At 01:09 PM 2/22/2011, Richard Brous wrote: >>attachement now included... >> >>On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous >><<mailto:rbr...@gm...>rbr...@gm...> wrote: >>Here is an export of the genes found using: select * from >>genenametype where type = 'ORF' and value ~ >>'[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also attached as a csv file. >> >>647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 >>5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 >>647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 >>647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 >>647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 >>647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 >>211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 >>648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 >>648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 >>648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 >>648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 >>648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 >>649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 >>649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 >>649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 >>649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 >>650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 >>174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 >>174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 >>312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 >>331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 >>445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 >>621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 >>622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 >>622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 >>622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 >>622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 >>623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 >>623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 >>623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 >>623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 >>624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 >>624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 >>625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 >>625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 >>626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 >>626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 >>626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 >>626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 >>627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 >>627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 >>627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 >>627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 >>627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 >>627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 >>627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 >>628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 >>629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 >>629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 >>629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 >>630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 >>630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 >>631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 >>632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 >>632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 >>633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 >>633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 >>645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 >>645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 >>645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 >>645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 >>645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 >>646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 >>646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 >>646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 >>646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 >>646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 >>646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 >>647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 >>647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 >> >>*****The item of note I see is the gene with the slash separating >>gene id's which refer to the same gene. >> >>Richard >> >>On Mon, Feb 21, 2011 at 11:11 PM, Richard Brous >><<mailto:rbr...@gm...>rbr...@gm...> wrote: >>Understood. >> >>I'll check in with Dr. D in the afternoon tomorrow and discuss. >> >>Richard >> >>On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio >><<mailto:do...@lm...>do...@lm...> wrote: >>Hi Rich, >>Addressing the release business first, let's put it this way: if >>the remaining loose ends can be addressed by tomorrow, we can >>probably wait until then. If unexpected snags are encountered, >>then it would be worthwhile to release whatever you have. >>With that in mind, considering that you pretty much know the >>patterns of the IDs that are needed, I think it will only take a >>little digital forensic work now to figure out exactly which IDs >>are still needed. Once you know what those are, you should: >>1. Find where they are in the XML file. >>2. Knowing the XML location, find the corresponding table in the >>relational database (table names are generally derived from >>tag/element names). >>3. Knowing the table in the database, write or extend the >>SpeciesProfile query to retrieve that data. >>For the ID that must *not* be included, again it's a matter of >>tracking down what this ID is. Knowing this straggler, you can >>then consult with Dr. Dahlquist if this ID is truly a unique >>one-off, or is representative of a pattern that we'll want to >>exclude. Either way, this ID can be omitted by using "not" or "<>" >>or possibly "not like" or "not ~" (check PostgreSQL where clause >>syntax to see where the negation can be applied). >>John David N. Dionisio, PhD >>Associate Professor, Computer Science >>Loyola Marymount University >> >> >>On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: >> > actually i had a typo (emailing from desktop system but testing >> on my laptop... typed correctly here but wrong in pgadmin) but the >> results make much more sense now. >> > >> > >> > select count (*) from genenametype where (type = 'ordered locus' >> or type = 'ORF') and value like 'Rv%'; >> > returns 4058 >> > >> > select count (*) from genenametype where (type = 'ordered locus' >> or type = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> > returns 4058 >> > >> > >> -------------------------------------------------------------------------------------------------------------------------- >> >> > >> > >> > Continuing forward - >> > >> > The testing report says that 4066 unique matches exist in XML >> but 6 of them were eliminated by Dr. D leaving the desired number at 4060. >> > >> > So now we are only 2 genes short with the query returning >> 4058... which is also (conveniently) the sum of the two separate >> queries of 'ordered locus' and 'ORF' respectively. >> > >> > But recall that Dr. D said that only 69 genes of the missing 75 >> were tagged 'ORF' but we seem to have 1 extra gene tagged 'ORF' >> than we expected. Adding that into missing genes puts us 3 short... >> > >> > Should I make the changes to the code and export a gdb so that >> analysis can be done or wait until we work this through further? >> > >> > Richard >> > >> > >> > >> > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio >> <<mailto:do...@lm...>do...@lm...> wrote: >> > Hi Rich, >> > >> > The second form should have worked actually. What exactly was the error? >> > >> > John David N. Dionisio, PhD >> > Associate Professor, Computer Science >> > Loyola Marymount University >> > >> > >> > >> > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: >> > >> > > hmm not taking parenthesis where I thought they should go... >> syntax error >> > > >> > > select count (*) from genenametype where type = ('ordered >> locus' or 'ORF') and value like 'Rv%'; >> > > also tried >> > > select count (*) from genenametype where (type = 'ordered >> locus' or type = 'ORF') and value like 'Rv%'; >> > > >> > > >> > > >> > > >> > > >> > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous >> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >> > > ah yes... i see it... >> > > >> > > >> > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio >> <<mailto:do...@lm...>do...@lm...> wrote: >> > > Watch your parentheses: "and" has greater precedence than "or" :) >> > > >> > > >> > > John David N. Dionisio, PhD >> > > Associate Professor, Computer Science >> > > Loyola Marymount University >> > > >> > > >> > > On Feb 21, 2011, at 7:59 PM, Richard Brous >> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >> > > >> > >> OK, so here are my query results from raw SQL: >> > >> >> > >> 1. using: like 'Rv%' >> > >> >> > >> select count (*) from genenametype where type = 'ordered >> locus' and value like 'Rv%'; >> > >> returns 3988 >> > >> >> > >> select count (*) from genenametype where type = 'ORF' and >> value like 'Rv%'; >> > >> returns 70 >> > >> >> > >> select count (*) from genenametype where type = 'ordered >> locus' or type = 'ORF' and value like 'Rv%'; >> > >> returns 7011 >> > >> >> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' >> > >> >> > >> select count (*) from genenametype where type = 'ordered >> locus' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> > >> returns 3988 >> > >> >> > >> select count (*) from genenametype where type = 'ordered >> locus' or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> > >> returns 7011 >> > >> >> > >> select count (*) from genenametype where type = 'ORF' and >> value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> > >> returns 70 >> > >> >> > >> Conclusions: >> > >> >> > >> 1. It seems that querying for type = 'ORF' alone surfaces the >> 69 genes were were looking for plus one more (maybe the count for >> missing genes is off by 1?). >> > >> >> > >> 2. Combining the two types in a single query did not produce >> the results that I expected (7011? - how did that happen????) so >> this is likely not our solution... unless of course the query >> syntax isn't actually doing what I think it is... >> > >> >> > >> 3. I would think the best course of action is to serialy run >> two separate queries to capture all the required genes, then >> removing the one unneeded gene if its truly not wanted. >> > >> >> > >> What do you think? >> > >> >> > >> Richard >> > >> >> > >> >> > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio >> <<mailto:do...@lm...>do...@lm...> wrote: >> > >> I don't recall the exact details of the missing 69, but if >> your query successfully returns them in raw SQL, then this is >> worth a try. You can integrate into the same query as long as the >> same columns are returned, which is the case here AFAIK, so go >> ahead and extend the existing query. >> > >> >> > >> >> > >> John David N. Dionisio, PhD >> > >> Associate Professor, Computer Science >> > >> Loyola Marymount University >> > >> >> > >> On Feb 21, 2011, at 6:56 PM, Richard Brous >> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >> > >> >> > >>> So here is the appropriate code snippet from >> MycobacteriumTuberculosisUniProtSpeciesProfile.java: >> > >>> public >> > >>> >> > >>> TableManager >> getSystemTableManagerCustomizations(TableManager tableManager, >> TableManager primarySystemTableManager, Date version) throws >> SQLException, InvalidParameterException { >> > >>> >> > >>> // Build the base query; we only use "ordered locus" and we only want >> > >>> >> > >>> // IDs that begin with "Rv." >> > >>> PreparedStatement ps = >> ConnectionManager.getRelationalDBConnection().prepareStatement( >> > >>> >> > >>> "SELECT value, type " + >> > >>> >> > >>> "FROM genenametype INNER JOIN entrytype_genetype " + >> > >>> >> > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >> > >>> >> > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and >> entrytype_gene_hjid = ?"); >> > >>> ResultSet result; >> > >>> >> > >>> >> > >>> >> > >>> for (Row row : primarySystemTableManager.getRows()) { >> > >>> ps.setInt(1, Integer.parseInt(row.getValue( >> > >>> >> > >>> "UID"))); >> > >>> result = ps.executeQuery(); >> > >>> >> > >>> >> > >>> >> > >>> // We actually want to keep the case where multiple ordered locus >> > >>> >> > >>> // names appear. >> > >>> >> > >>> while (result.next()) { >> > >>> >> > >>> // We want this name to appear in the OrderedLocusNames >> > >>> >> > >>> // system table. >> > >>> >> > >>> for (String id : result.getString("value").split("/")) { >> > >>> tableManager.submit( >> > >>> >> > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { >> "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { >> "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) >> }, { "UID", row.getValue("UID") } }); >> > >>> } >> > >>> >> > >>> } >> > >>> >> > >>> } >> > >>> >> > >>> >> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >> >> > >>> So now we want to build the base query which uses "ordered >> locus" and "orf" and we only want IDs that begin with "Rv". >> > >>> >> > >>> I know there are more comprehensive ways to search for gene >> ID's by matching gene ID prefix but "like Rv%" seemed to work thus >> far, we just need to tell it to search for XML tag type orf in >> addition to ordered locus. >> > >>> >> > >>> "WHERE type = 'ordered locus' and type = 'orf' and value >> like 'Rv%' and entrytype_gene_hjid = ? " >> > >>> >> > >>> Here is a stab at it.... This part of our class was right as >> the server went down and my submission for week 6 assignment I >> can't seem to find. >> > >>> >> > >>> Is it possible to have two different types in the same query >> or should we rewrite a separate query for the orf tag? >> > >>> >> > >>> Richard >> > >>> >> > >>> >> > >>> >> > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous >> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >> > >>> >> > >>> thanks and will do as directed. >> > >>> >> > >>> My previous, last paragraph comment - A way for programming >> code in email holding its format in a mail message similarly to >> how you can post code on forum pages? >> > >>> >> > >>> <code> >> > >>> blah >> > >>> blah >> > >>> blah >> > >>> </code> >> > >>> >> > >>> thanks! >> > >>> >> > >>> Richard >> > >>> >> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio >> <<mailto:do...@lm...>do...@lm...> wrote: >> > >>> >> > >>> Greetings, >> > >>> >> > >>> Actually, gmbuilder.properties is for the TallyEngine >> only. When dealing with .gdb exports, look *only* at the >> SpeciesProfile class. So, to find those 69 IDs, it is the >> SpeciesProfile code, and *only* the SpeciesProfile code, that >> needs to be changed. >> > >>> >> > >>> Your take on how gmbuilder.properties is used, however, is >> understandable. It makes sense to assume that the TallyEngine >> code *and* the ID export code are based on the same >> characterization of the needed IDs. This replication is >> originally a historical artifact: SpeciesProfile was done first, >> and then TallyEngine was done later by another student. >> > >>> >> > >>> However, there are other factors beyond history that sort of >> necessitate this duplication of desired IDs: (skip the two bullets >> below if you'd rather cut to the chase of the work to be done, and >> discuss design issues later) >> > >>> >> > >>> - The actual XML import code is a black box: this is the >> "canned" JAXB library actually in action, and not our code at >> all. Plus, the XML import code really does not filter (nor should >> it), since the goal of the XML->relational database step is to >> fully capture the XML data in the relational database. So, XML >> count is necessarily separated from XML import. >> > >>> >> > >>> - The notion of a declarative mechanism for extracting IDs >> from the relational database (which is what >> gmbuilder.properties/TallyEngine uses) is interesting, but at the >> same time there is value in the arbitrary computation that can be >> done with Java (case in point: export two versions of an ID, with >> and without periods). This is not to say that it is impossible to >> do this declaratively, but let's just say that the procedural >> approach exists here and now, and a declarative approach will need >> more thought. >> > >>> >> > >>> These, and other factors, are good thoughts to hold onto and >> would be worthy of a good meeting discussion sometime, but bottom >> line for now: modifying the export behavior is a matter of editing >> the *SpeciesProfile* Java code, and not the gmbuilder.properties >> file. Turn your attention to that code. >> > >>> >> > >>> Now, as to annotating your code...I'd just put in code >> comments :) Or did you mean something else by tagging code in e-mail? >> > >>> >> > >>> John David N. Dionisio, PhD >> > >>> Associate Professor, Computer Science >> > >>> Loyola Marymount University >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >> > >>> >> > >>> > also, how do I tag code in email so it holds its >> formatting? I tried a few suggestions I found on the web but they >> aren't holding formatting or i'm just doing it wrong ;-D >> > >>> > >> > >>> > Richard >> > >>> > >> > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous >> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >> > >>> > OK, have some updates and some suggestions: >> > >>> > >> > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the >> gene testing report. We verified that XML match does indeed find >> 4066 unique matches - 75 of which are not in the gdb and need to be. >> > >>> > >> > >>> > Dr. Dahlquist informed me that she was the one who >> completed the gene db testing report, not a previous student of >> BIO367 and had already verified which genes were missing and where >> they were to be found. I had (mistakenly) assumed that since a >> student had performed the gene database testing I had to redo all >> of the verification. >> > >>> > >> > >>> > So that said, of the 75 genes missing - 69 need to be >> included and 6 excluded. >> > >>> > Per the gene db testing report: "69 of them have an "a", >> "b", or "d" suffix. They are all found in the ORF tag and need to >> be included in the gdb." >> > >>> > >> > >>> > To solve this we need to add additional search criteria >> into the M. tuberculosis section in gmbuilder.properties below: >> > >>> > # Mycobacterium tuberculosis >> > >>> > >> > >>> > mycobacteriumtuberculosis_level_amount= >> > >>> > >> > >>> > 1 >> > >>> > >> > >>> > mycobacteriumtuberculosis_element_level0= >> > >>> > >> > >>> > uniprot/entry/gene/name&type&ordered locus >> > >>> > >> > >>> > mycobacteriumtuberculosis_query_level0= >> > >>> > >> > >>> > select count(*) from genenametype where type = 'ordered >> locus' and value like 'Rv%'; >> > >>> > >> > >>> > mycobacteriumtuberculosis_table_name_level0= >> > >>> > >> > >>> > Ordered Locus >> > >>> > SOLUTIONS: >> > >>> > >> > >>> > 1. So am i correct in my understanding that the second >> line is the query used by TallyEngine to read the XML file? If so >> then this is the issue we need to table for the moment until we >> get the gbd verified and re-released. We will revisit this to >> discover why it is not only reporting incorrectly but also why its >> added a second row of Ordered Locus on the TallyEngine results page. >> > >>> > >> > >>> > 2. The third line is the SQL query used by postgres during >> the export from XML to gdb. To find and get the ORF tagged genes >> could we not add the following lines and change the count in the first line: >> > >>> > >> > >>> > >> > >>> > # Mycobacterium tuberculosis >> > >>> > >> > >>> > mycobacteriumtuberculosis_level_amount=2 >> > >>> > >> > >>> > >> > >>> > >> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >> locus >> > >>> > >> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >> > >>> > >> > >>> > >> > >>> > mycobacteriumtuberculosis_query_level0= >> > >>> > >> > >>> > select count(*) from genenametype where type = 'ordered locus'; >> > >>> > mycobacteriumtuberculosis_query_level1=select count(*) >> from genenametype where type = 'orf'; >> > >>> > >> > >>> > >> > >>> > mycobacteriumtuberculosis_table_name_level0= >> > >>> > >> > >>> > Ordered Locus >> > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >> > >>> > >> > >>> > >> ---------------------------------------------------------------------------------------------------------------------------- >> >> > >>> > >> > >>> > Of course these queries would have be manually verified >> prior to making these changes but this seems like we are moving in >> the right direction. >> > >>> > >> > >>> > Richard >> > >>> > >> > >>> > >> > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous >> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >> > >>> > Just got done reading previous email and understand the >> change in priority. >> > >>> > >> > >>> > Will work on the missing ID's for now and shelve the the >> TalleyEngine issue for the moment. >> > >>> > >> > >>> > Also great about a more formalized weekly meeting. I was >> going to suggest it myself as it has been slow going so far as >> maybe i'm a bit too independent in this independent study class =D >> > >>> > >> > >>> > Will dig further into the missing ID's later tonight and >> during day tomorrow and report back. >> > >>> > >> > >>> > Richard >> > >>> > >> > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio >> <<mailto:do...@lm...>do...@lm...> wrote: >> > >>> > Hi Rich, >> > >>> > >> > >>> > No problem. The pertinent line you're referring to, for >> XML, is this, right above the line you copied: >> > >>> > >> > >>> > >> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >> locus >> > >>> > >> > >>> > The slash-separated section is the "path" of XML tags >> leading to the element of interest; then, after the ampersand, is >> a name/value pair for the desired attribute to count. Note that >> there is no hint of a *content*-based filter (nor is there the >> capability for one, as far as I can tell in the code). By >> "content," I mean that we can't specify filters based on what's >> *between* the tags. We can only go as far as filter by attribute >> value, e.g., type="ordered locus". >> > >>> > >> > >>> > But anyway, as mentioned in the earlier e-mail, let's have >> the missing IDs in the .gdb take precedence for now. Please take >> a look at the tuberculosis, A. thaliana, and P. falciparum >> profiles to get an idea for how the ID output can be customized, >> then let me know if you have any questions or need to confirm anything. >> > >>> > >> > >>> > John David N. Dionisio, PhD >> > >>> > Associate Professor, Computer Science >> > >>> > Loyola Marymount University >> > >>> > >> > >>> > >> > >>> > >> > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >> > >>> > >> > >>> > > Sorry been slammed with a programming assignment that >> kept needing continued iteration and it has been all consuming >> until last night. But I did get a chance to work with your >> comments and review the code again with a different mind set. >> > >>> > > >> > >>> > > Yes, I examined the gmbuilder.properties file ( the >> query is also in the >> MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned >> in a previous email ) but I don't think I see what you mean >> regarding the XML count. >> > >>> > > >> > >>> > > I understood that: >> mycobacteriumtuberculosis_query_level0=select count(*) from >> genenametype where type = 'ordered locus' and value like >> 'Rv%'; was the db query but don't see which is the XML count... >> or do they share the same query and you are saying that XML count >> doesn't recognize and therefore cannot use the 'Rv%' parameter? >> > >>> > > >> > >>> > > Richard >> > >>> > > >> > >>> > > >> > >>> > > >> > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio >> <<mailto:do...@lm...>do...@lm...> wrote: >> > >>> > > Hi Rich, >> > >>> > > >> > >>> > > Sorry for the delay. Had some distractions coming into >> the weekend. >> > >>> > > >> > >>> > > You've looked at the code; have you looked at >> gmbuilder.properties? (I may have mentioned it a few e-mails ago, >> just as you were starting to dig into this) >> > >>> > > >> > >>> > > On the copy I have, the M. tuberculosis block looks like >> this (indentation is mine to set it apart): >> > >>> > > >> > >>> > > # Mycobacterium tuberculosis >> > >>> > > mycobacteriumtuberculosis_level_amount=1 >> > >>> > > >> > >>> > > >> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >> locus >> > >>> > > >> > >>> > > mycobacteriumtuberculosis_query_level0=select >> count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >> > >>> > > >> > >>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >> > >>> > > >> > >>> > > There, I think, is the rub. Notice that the XML count >> does not filter on RV%. The SQL query does. >> > >>> > > >> > >>> > > Unfortunately, I don't think the TallyEngine can include >> selective filtering in the XML counts. If the need to do >> selective filtering on XML is necessary, then I think we're >> looking at a new functionality for you to implement (or, if this >> throws things off too much, this may have to be noted somewhere, >> that the XML vs. database counts may be off because the database >> count is doing some text-based filtering but the XML count does not). >> > >>> > > >> > >>> > > What does xmlpipedb-match say? That will at least tell >> you whether the 'RV%' count is indeed correct. >> > >>> > > >> > >>> > > John David N. Dionisio, PhD >> > >>> > > Associate Professor, Computer Science >> > >>> > > Loyola Marymount University >> > >>> > > >> > >>> > > >> > >>> > > >> > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >> > >>> > > >> > >>> > > > OK here is what I was able to put together from the >> past few hours of code review: >> > >>> > > > >> > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >> > >>> > > > -reveals that after the 2 System table modifications >> are made adding species name and link, a PreparedStatement is >> instantiated which builds and calls the base query. >> > >>> > > > >> > >>> > > > -The base query called is: ("SELECT value, type " + >> "FROM genenametype INNER JOIN entrytype_genetype " + >> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >> "WHERE type = 'ordered locus' and value like 'Rv%' and >> entrytype_gene_hjid = ?") >> > >>> > > > >> > >>> > > > -So its looking in 'ordered locus' table/column for >> any tuple that starts with Rv (followed by any substring) and >> entrytype_gene_hjid = ? . >> > >>> > > > The 'like' comparator and % usage are clear with the >> 'type' entrytype_gene_hjid = ? >> > >>> > > > >> > >>> > > > -To me it seems the query makes sense so the problem >> is likely elsewhere. >> > >>> > > > >> > >>> > > > GenMappBuilder.java: >> > >>> > > > -I found method doTallies() at code line 895 which: >> > >>> > > > Instantiates a Configuration called >> hibernateConfiguration and assigns to it the current hibernate configuration >> > >>> > > > Validates database settings by analyzing hibernateConfiguration >> > >>> > > > Instantiates a CriterionList for uniprot and assigns >> to it TallyType.UNIPROT >> > >>> > > > Instantiates a CriterionList for go and assigns to it >> TallyType.GO >> > >>> > > > Determines if both xml files exist >> > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase >> are run on both xml files and their respective CriterionList >> > >>> > > > Results are then formatted for display in a table. >> > >>> > > > >> > >>> > > > -So enum TallyType which means that they are the only >> valid datatypes which TallyEngine accepts... go to know ... >> > >>> > > > >> > >>> > > > -Based on the screen shot of Tally Engine it would >> seem that both getTallyResultsXML() and getTallyResultsDatabase() >> are incorrectly returning. Likely due to both using an incorrect >> query (as we previously supposed). But where are the queries?... >> the more I dig the more I think they are in the criterial all the >> work is done against. >> > >>> > > > >> > >>> > > > continuing the review: >> > >>> > > > getTallyResultsXML() calls Tally Engine instance >> method getXmlFileCounts(xmlFile) >> > >>> > > > getTallyResultsDatabase() calls Tally Engine instance >> method getDbcounts(new QueryEngine(hibernateConfiguration) >> > >>> > > > Both of these instanced methods originate from >> TallyEngine.java... >> > >>> > > > >> > >>> > > > TallyEngine.java: >> > >>> > > > >> > >>> > > > getXmlFileCounts() calls digestXmlFile() which >> instantiates a digester then processes against criteria... but >> this quickly becomes confusing and is hard to follow >> > >>> > > > >> > >>> > > > getDbcounts() then starts a db session and executes a >> query but then I also get a bit lost with my limited db knowledge. >> > >>> > > > >> > >>> > > > >> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >> >> > >>> > > > >> > >>> > > > OVERALL I think I'm getting closer to the issues but I >> still feel as if I'm missing some understanding to proceed >> further. Can you pass along some of that Dondi insight and steer >> me in the right direction? =D >> > >>> > > > >> > >>> > > > -DB Tally - Not having taken databases yet certainly >> is limiting my ability determine where the "criteria" are being >> set and how they are followed during session activities. Also is >> the query we have been looking for this whole time in the criteria >> or someplace else? >> > >>> > > > >> > >>> > > > -XML Tally - again is the query contained within the >> criteria that digestXmlFile() uses to parse? >> > >>> > > > >> > >>> > > > Richard >> > >>> > > > >> > >>> > > > >> > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio >> <<mailto:do...@lm...>do...@lm...> wrote: >> > >>> > > > Right, schema issues are unlikely. Most count >> discrepancies like this that I've seen have boiled down to forming >> the right query. Then, knowing the right query (in both XML and >> SQL), it's a matter of making sure that TallyEngine asks that same query. >> > >>> > > > >> > >>> > > > John David N. Dionisio, PhD >> > >>> > > > Associate Professor, Computer Science >> > >>> > > > Loyola Marymount University >> > >>> > > > >> > >>> > > > >> > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >> > >>> > > > >> > >>> > > > > OK, so based on your approach: >> > >>> > > > > >> > >>> > > > > 1. I'll start with reviewing the queries for >> xmlpipedb-match and sql queries needed for the respective results >> as you requested. >> > >>> > > > > >> > >>> > > > > I was also thinking I may need to review the schema >> from xml into postgres but the issue isn't likely a schema error. >> The error most likely lies in how xmlpipedbutils queries the data >> from xml source and writes to the tables what it returns? >> > >>> > > > > >> > >>> > > > > 2. I'll review the code: trace the entrance of tally >> engine in the gmbuilder code then follow it through the xmlpipedbutils. >> > >>> > > > > >> > >>> > > > > Richard >> > >>> > > > > >> > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. >> Dionisio <<mailto:do...@lm...>do...@lm...> wrote: >> > >>> > > > > Just wanted to confirm (since I wasn't sure in the >> first e-mail) --- the XMLPipeDB Utilities source code is in >> trunk/xmlpipedbutils in SourceForge's Subversion repo. >> > >>> > > > > >> > >>> > > > > John David N. Dionisio, PhD >> > >>> > > > > Associate Professor, Computer Science >> > >>> > > > > Loyola Marymount University >> > >>> > > > > >> > >>> > > > > >> > >>> > > > > >> > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >> > >>> > > > > >> > >>> > > > > > Hi Dondi, >> > >>> > > > > > >> > >>> > > > > > So I'm at the point in working with M tuberculosis >> that I was able to exactly reproduce Dr. Dahlquist's problematic >> TallyEngine results. >> > >>> > > > > > >> > >>> > > > > > gmb2b60 Results >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > Now the proverbial question - What next to solve >> the Ordered Locus import/count issue? >> > >>> > > > > > >> > >>> > > > > > ********************************************** >> > >>> > > > > > Here is my thought process: >> > >>> > > > > > >> > >>> > > > > > Step 1: How does the import process work at the >> high level? (obviously correct me if I'm wrong) >> > >>> > > > > > >> > >>> > > > > > I believe that basically as each XML tag is read, >> it is placed in the proper Postgres table(s) based on some >> criteria. There is also likely some sort of check that each >> individual tag is in valid XML format unless we don't care at this >> stage (care at export) or maybe the parser just skips over and >> goes on to the next . >> > >>> > > > > > >> > >>> > > > > > Step 2: What could be the problem? >> > >>> > > > > > >> > >>> > > > > > Either - >> > >>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? >> > >>> > > > > > b. Decision criteria of which table they should be >> added to? >> > >>> > > > > > >> > >>> > > > > > ********************************************** >> > >>> > > > > > >> > >>> > > > > > I read on the sourceforge wiki: >> > >>> > > > > > >> > >>> > > > > > XMLPipeDB has a modular architecture with three >> components that may be used separately or together. XSD-to-DB >> reads an XSD (XML Schema Definition) and automatically generates >> an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB >> Utilities provides functionality for configuring the database, >> importing data, and performing queries. GenMAPP Builder is based >> on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene >> Databases based on data from UniProt and Gene Ontology (GO). >> > >>> > > > > > >> > >>> > > > > > So I should probably start with the XMLPipeDB >> Utilities which are where? I don't see any in the basic >> distribution or are they not standalone and called from the command line? >> > >>> > > > > > >> > >>> > > > > > Thanks! >> > >>> > > > > > >> > >>> > > > > > Richard |
From: Richard B. <rbr...@gm...> - 2011-02-23 00:57:43
|
Here is where I am on the numbers: ------------------------------------------------------------------------------------------------------------ XML tags read by Match: 4066 unique matches - 6 genes excluded by Dr. D = total 4060 genes ------------------------------------------------------------------------------------------------------------ sql query of the db tables: 3988 ordered locus + 69 ORF = total 4057 genes ------------------------------------------------------------------------------------------------------------ So by my count we are off by 3 genes Maybe XML Match is counting slashed genes as 2 separate genes? IE. so if it encountered 3 slashed genes it would in effect be counting those 3 as 6? Richard On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I compared this list of gene IDs with the list I had on the Testing Report > on the wiki and found the following. > > Your list has 70 IDs. 69 of them are identical to what is in my list of > IDs in the Testing report. > > Indeed, the only difference is the one with the slash. "Rv3346/55c" I > looked this ID up at UniProt.org as Rv3346. It appears in the record for > UniProt ID O50384. For that UniProt record, the ID referred to in the > OrderedLocus tag is Rv3355c which does appear in the gdb already. Looking > up Rv3346 on Tuberculist and the Stanford TB database, Rv3346 is not a real > gene ID. So this entire record with the slash in it needs to be excluded > and all the rest of them need to be captured. > > I think then, that the numbers should add up correctly, is this true? > > Best, > Dr. D > > > At 01:09 PM 2/22/2011, Richard Brous wrote: > > attachement now included... > > On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous <rbr...@gm...> wrote: > Here is an export of the genes found using: select * from genenametype > where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also > attached as a csv file. > > 647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 > 5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 > 647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 > 647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 > 647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 > 647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 > 211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 > 648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 > 648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 > 648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 > 648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 > 648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 > 649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 > 649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 > 649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 > 649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 > 650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 > 174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 > 174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 > 312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 > 331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 > 445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 > 621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 > 622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 > 622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 > 622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 > 622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 > 623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 > 623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 > 623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 > 623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 > 624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 > 624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 > 625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 > 625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 > 626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 > 626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 > 626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 > 626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 > 627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 > 627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 > 627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 > 627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 > 627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 > 627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 > 627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 > 628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 > 629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 > 629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 > 629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 > 630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 > 630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 > 631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 > 632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 > 632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 > 633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 > 633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 > 645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 > 645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 > 645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 > 645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 > 645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 > 646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 > 646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 > 646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 > 646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 > 646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 > 646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 > 647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 > 647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 > > > *****The item of note I see is the gene with the slash separating gene id's > which refer to the same gene. > > Richard > > On Mon, Feb 21, 2011 at 11:11 PM, Richard Brous <rbr...@gm...> > wrote: > Understood. > > I'll check in with Dr. D in the afternoon tomorrow and discuss. > > Richard > > On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <do...@lm...> > wrote: > Hi Rich, > > Addressing the release business first, let's put it this way: if the > remaining loose ends can be addressed by tomorrow, we can probably wait > until then. If unexpected snags are encountered, then it would be > worthwhile to release whatever you have. > > With that in mind, considering that you pretty much know the patterns of > the IDs that are needed, I think it will only take a little digital forensic > work now to figure out exactly which IDs are still needed. Once you know > what those are, you should: > > 1. Find where they are in the XML file. > 2. Knowing the XML location, find the corresponding table in the relational > database (table names are generally derived from tag/element names). > 3. Knowing the table in the database, write or extend the SpeciesProfile > query to retrieve that data. > > For the ID that must *not* be included, again it's a matter of tracking > down what this ID is. Knowing this straggler, you can then consult with Dr. > Dahlquist if this ID is truly a unique one-off, or is representative of a > pattern that we'll want to exclude. Either way, this ID can be omitted by > using "not" or "<>" or possibly "not like" or "not ~" (check PostgreSQL > where clause syntax to see where the negation can be applied). > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > > On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: > > > actually i had a typo (emailing from desktop system but testing on my > laptop... typed correctly here but wrong in pgadmin) but the results make > much more sense now. > > > > > > select count (*) from genenametype where (type = 'ordered locus' or type > = 'ORF') and value like 'Rv%'; > > returns 4058 > > > > select count (*) from genenametype where (type = 'ordered locus' or type > = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > returns 4058 > > > > > -------------------------------------------------------------------------------------------------------------------------- > > > > > > Continuing forward - > > > > The testing report says that 4066 unique matches exist in XML but 6 of > them were eliminated by Dr. D leaving the desired number at 4060. > > > > So now we are only 2 genes short with the query returning 4058... which > is also (conveniently) the sum of the two separate queries of 'ordered > locus' and 'ORF' respectively. > > > > But recall that Dr. D said that only 69 genes of the missing 75 were > tagged 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. > Adding that into missing genes puts us 3 short... > > > > Should I make the changes to the code and export a gdb so that analysis > can be done or wait until we work this through further? > > > > Richard > > > > > > > > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio <do...@lm...> > wrote: > > Hi Rich, > > > > The second form should have worked actually. What exactly was the error? > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > > > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: > > > > > hmm not taking parenthesis where I thought they should go... syntax > error > > > > > > select count (*) from genenametype where type = ('ordered locus' or > 'ORF') and value like 'Rv%'; > > > also tried > > > select count (*) from genenametype where (type = 'ordered locus' or > type = 'ORF') and value like 'Rv%'; > > > > > > > > > > > > > > > > > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous <rbr...@gm...> > wrote: > > > ah yes... i see it... > > > > > > > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio <do...@lm...> > wrote: > > > Watch your parentheses: "and" has greater precedence than "or" :) > > > > > > > > > John David N. Dionisio, PhD > > > Associate Professor, Computer Science > > > Loyola Marymount University > > > > > > > > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> wrote: > > > > > >> OK, so here are my query results from raw SQL: > > >> > > >> 1. using: like 'Rv%' > > >> > > >> select count (*) from genenametype where type = 'ordered locus' and > value like 'Rv%'; > > >> returns 3988 > > >> > > >> select count (*) from genenametype where type = 'ORF' and value like > 'Rv%'; > > >> returns 70 > > >> > > >> select count (*) from genenametype where type = 'ordered locus' or > type = 'ORF' and value like 'Rv%'; > > >> returns 7011 > > >> > > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' > > >> > > >> select count (*) from genenametype where type = 'ordered locus' and > value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > >> returns 3988 > > >> > > >> select count (*) from genenametype where type = 'ordered locus' or > type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > >> returns 7011 > > >> > > >> select count (*) from genenametype where type = 'ORF' and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > >> returns 70 > > >> > > >> Conclusions: > > >> > > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 genes > were were looking for plus one more (maybe the count for missing genes is > off by 1?). > > >> > > >> 2. Combining the two types in a single query did not produce the > results that I expected (7011? - how did that happen????) so this is likely > not our solution... unless of course the query syntax isn't actually doing > what I think it is... > > >> > > >> 3. I would think the best course of action is to serialy run two > separate queries to capture all the required genes, then removing the one > unneeded gene if its truly not wanted. > > >> > > >> What do you think? > > >> > > >> Richard > > >> > > >> > > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio < > do...@lm...> wrote: > > >> I don't recall the exact details of the missing 69, but if your query > successfully returns them in raw SQL, then this is worth a try. You can > integrate into the same query as long as the same columns are returned, > which is the case here AFAIK, so go ahead and extend the existing query. > > >> > > >> > > >> John David N. Dionisio, PhD > > >> Associate Professor, Computer Science > > >> Loyola Marymount University > > >> > > >> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> > wrote: > > >> > > >>> So here is the appropriate code snippet from > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > > >>> public > > >>> > > >>> TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) throws > SQLException, InvalidParameterException { > > >>> > > >>> // Build the base query; we only use "ordered locus" and we only want > > >>> > > >>> // IDs that begin with "Rv." > > >>> PreparedStatement ps = > ConnectionManager.getRelationalDBConnection().prepareStatement( > > >>> > > >>> "SELECT value, type " + > > >>> > > >>> "FROM genenametype INNER JOIN entrytype_genetype " + > > >>> > > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + > > >>> > > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and > entrytype_gene_hjid = ?"); > > >>> ResultSet result; > > >>> > > >>> > > >>> > > >>> for (Row row : primarySystemTableManager.getRows()) { > > >>> ps.setInt(1, Integer.parseInt(row.getValue( > > >>> > > >>> "UID"))); > > >>> result = ps.executeQuery(); > > >>> > > >>> > > >>> > > >>> // We actually want to keep the case where multiple ordered locus > > >>> > > >>> // names appear. > > >>> > > >>> while (result.next()) { > > >>> > > >>> // We want this name to appear in the OrderedLocusNames > > >>> > > >>> // system table. > > >>> > > >>> for (String id : result.getString("value").split("/")) { > > >>> tableManager.submit( > > >>> > > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, > { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", > GenMAPPBuilderUtilities.getSystemsDateString(version) }, { "UID", > row.getValue("UID") } }); > > >>> } > > >>> > > >>> } > > >>> > > >>> } > > >>> > > >>> > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > >>> So now we want to build the base query which uses "ordered locus" and > "orf" and we only want IDs that begin with "Rv". > > >>> > > >>> I know there are more comprehensive ways to search for gene ID's by > matching gene ID prefix but "like Rv%" seemed to work thus far, we just need > to tell it to search for XML tag type orf in addition to ordered locus. > > >>> > > >>> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' > and entrytype_gene_hjid = ? " > > >>> > > >>> Here is a stab at it.... This part of our class was right as the > server went down and my submission for week 6 assignment I can't seem to > find. > > >>> > > >>> Is it possible to have two different types in the same query or > should we rewrite a separate query for the orf tag? > > >>> > > >>> Richard > > >>> > > >>> > > >>> > > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> > wrote: > > >>> > > >>> thanks and will do as directed. > > >>> > > >>> My previous, last paragraph comment - A way for programming code in > email holding its format in a mail message similarly to how you can post > code on forum pages? > > >>> > > >>> <code> > > >>> blah > > >>> blah > > >>> blah > > >>> </code> > > >>> > > >>> thanks! > > >>> > > >>> Richard > > >>> > > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio < > do...@lm...> wrote: > > >>> > > >>> Greetings, > > >>> > > >>> Actually, gmbuilder.properties is for the TallyEngine only. When > dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to > find those 69 IDs, it is the SpeciesProfile code, and *only* the > SpeciesProfile code, that needs to be changed. > > >>> > > >>> Your take on how gmbuilder.properties is used, however, is > understandable. It makes sense to assume that the TallyEngine code *and* > the ID export code are based on the same characterization of the needed > IDs. This replication is originally a historical artifact: SpeciesProfile > was done first, and then TallyEngine was done later by another student. > > >>> > > >>> However, there are other factors beyond history that sort of > necessitate this duplication of desired IDs: (skip the two bullets below if > you'd rather cut to the chase of the work to be done, and discuss design > issues later) > > >>> > > >>> - The actual XML import code is a black box: this is the "canned" > JAXB library actually in action, and not our code at all. Plus, the XML > import code really does not filter (nor should it), since the goal of the > XML->relational database step is to fully capture the XML data in the > relational database. So, XML count is necessarily separated from XML > import. > > >>> > > >>> - The notion of a declarative mechanism for extracting IDs from the > relational database (which is what gmbuilder.properties/TallyEngine uses) is > interesting, but at the same time there is value in the arbitrary > computation that can be done with Java (case in point: export two versions > of an ID, with and without periods). This is not to say that it is > impossible to do this declaratively, but let's just say that the procedural > approach exists here and now, and a declarative approach will need more > thought. > > >>> > > >>> These, and other factors, are good thoughts to hold onto and would be > worthy of a good meeting discussion sometime, but bottom line for now: > modifying the export behavior is a matter of editing the *SpeciesProfile* > Java code, and not the gmbuilder.properties file. Turn your attention to > that code. > > >>> > > >>> Now, as to annotating your code...I'd just put in code comments :) > Or did you mean something else by tagging code in e-mail? > > >>> > > >>> John David N. Dionisio, PhD > > >>> Associate Professor, Computer Science > > >>> Loyola Marymount University > > >>> > > >>> > > >>> > > >>> > > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: > > >>> > > >>> > also, how do I tag code in email so it holds its formatting? I > tried a few suggestions I found on the web but they aren't holding > formatting or i'm just doing it wrong ;-D > > >>> > > > >>> > Richard > > >>> > > > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> > wrote: > > >>> > OK, have some updates and some suggestions: > > >>> > > > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene > testing report. We verified that XML match does indeed find 4066 unique > matches - 75 of which are not in the gdb and need to be. > > >>> > > > >>> > Dr. Dahlquist informed me that she was the one who completed the > gene db testing report, not a previous student of BIO367 and had already > verified which genes were missing and where they were to be found. I had > (mistakenly) assumed that since a student had performed the gene database > testing I had to redo all of the verification. > > >>> > > > >>> > So that said, of the 75 genes missing - 69 need to be included and > 6 excluded. > > >>> > Per the gene db testing report: "69 of them have an "a", "b", or > "d" suffix. They are all found in the ORF tag and need to be included in the > gdb." > > >>> > > > >>> > To solve this we need to add additional search criteria into the M. > tuberculosis section in gmbuilder.properties below: > > >>> > # Mycobacterium tuberculosis > > >>> > > > >>> > mycobacteriumtuberculosis_level_amount= > > >>> > > > >>> > 1 > > >>> > > > >>> > mycobacteriumtuberculosis_element_level0= > > >>> > > > >>> > uniprot/entry/gene/name&type&ordered locus > > >>> > > > >>> > mycobacteriumtuberculosis_query_level0= > > >>> > > > >>> > select count(*) from genenametype where type = 'ordered locus' and > value like 'Rv%'; > > >>> > > > >>> > mycobacteriumtuberculosis_table_name_level0= > > >>> > > > >>> > Ordered Locus > > >>> > SOLUTIONS: > > >>> > > > >>> > 1. So am i correct in my understanding that the second line is the > query used by TallyEngine to read the XML file? If so then this is the issue > we need to table for the moment until we get the gbd verified and > re-released. We will revisit this to discover why it is not only reporting > incorrectly but also why its added a second row of Ordered Locus on the > TallyEngine results page. > > >>> > > > >>> > 2. The third line is the SQL query used by postgres during the > export from XML to gdb. To find and get the ORF tagged genes could we not > add the following lines and change the count in the first line: > > >>> > > > >>> > > > >>> > # Mycobacterium tuberculosis > > >>> > > > >>> > mycobacteriumtuberculosis_level_amount=2 > > >>> > > > >>> > > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > >>> > > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > > >>> > > > >>> > > > >>> > mycobacteriumtuberculosis_query_level0= > > >>> > > > >>> > select count(*) from genenametype where type = 'ordered locus'; > > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from > genenametype where type = 'orf'; > > >>> > > > >>> > > > >>> > mycobacteriumtuberculosis_table_name_level0= > > >>> > > > >>> > Ordered Locus > > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > > >>> > > > >>> > > ---------------------------------------------------------------------------------------------------------------------------- > > >>> > > > >>> > Of course these queries would have be manually verified prior to > making these changes but this seems like we are moving in the right > direction. > > >>> > > > >>> > Richard > > >>> > > > >>> > > > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> > wrote: > > >>> > Just got done reading previous email and understand the change in > priority. > > >>> > > > >>> > Will work on the missing ID's for now and shelve the the > TalleyEngine issue for the moment. > > >>> > > > >>> > Also great about a more formalized weekly meeting. I was going to > suggest it myself as it has been slow going so far as maybe i'm a bit too > independent in this independent study class =D > > >>> > > > >>> > Will dig further into the missing ID's later tonight and during day > tomorrow and report back. > > >>> > > > >>> > Richard > > >>> > > > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio < > do...@lm...> wrote: > > >>> > Hi Rich, > > >>> > > > >>> > No problem. The pertinent line you're referring to, for XML, is > this, right above the line you copied: > > >>> > > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > >>> > > > >>> > The slash-separated section is the "path" of XML tags leading to > the element of interest; then, after the ampersand, is a name/value pair for > the desired attribute to count. Note that there is no hint of a > *content*-based filter (nor is there the capability for one, as far as I can > tell in the code). By "content," I mean that we can't specify filters based > on what's *between* the tags. We can only go as far as filter by attribute > value, e.g., type="ordered locus". > > >>> > > > >>> > But anyway, as mentioned in the earlier e-mail, let's have the > missing IDs in the .gdb take precedence for now. Please take a look at the > tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how > the ID output can be customized, then let me know if you have any questions > or need to confirm anything. > > >>> > > > >>> > John David N. Dionisio, PhD > > >>> > Associate Professor, Computer Science > > >>> > Loyola Marymount University > > >>> > > > >>> > > > >>> > > > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: > > >>> > > > >>> > > Sorry been slammed with a programming assignment that kept > needing continued iteration and it has been all consuming until last night. > But I did get a chance to work with your comments and review the code again > with a different mind set. > > >>> > > > > >>> > > Yes, I examined the gmbuilder.properties file ( the query is also > in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a > previous email ) but I don't think I see what you mean regarding the XML > count. > > >>> > > > > >>> > > I understood that: mycobacteriumtuberculosis_query_level0=select > count(*) from genenametype where type = 'ordered locus' and value like > 'Rv%'; was the db query but don't see which is the XML count... or do they > share the same query and you are saying that XML count doesn't recognize and > therefore cannot use the 'Rv%' parameter? > > >>> > > > > >>> > > Richard > > >>> > > > > >>> > > > > >>> > > > > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < > do...@lm...> wrote: > > >>> > > Hi Rich, > > >>> > > > > >>> > > Sorry for the delay. Had some distractions coming into the > weekend. > > >>> > > > > >>> > > You've looked at the code; have you looked at > gmbuilder.properties? (I may have mentioned it a few e-mails ago, just as > you were starting to dig into this) > > >>> > > > > >>> > > On the copy I have, the M. tuberculosis block looks like this > (indentation is mine to set it apart): > > >>> > > > > >>> > > # Mycobacterium tuberculosis > > >>> > > mycobacteriumtuberculosis_level_amount=1 > > >>> > > > > >>> > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > >>> > > > > >>> > > mycobacteriumtuberculosis_query_level0=select count(*) > from genenametype where type = 'ordered locus' and value like 'Rv%'; > > >>> > > > > >>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > > >>> > > > > >>> > > There, I think, is the rub. Notice that the XML count does not > filter on RV%. The SQL query does. > > >>> > > > > >>> > > Unfortunately, I don't think the TallyEngine can include > selective filtering in the XML counts. If the need to do selective > filtering on XML is necessary, then I think we're looking at a new > functionality for you to implement (or, if this throws things off too much, > this may have to be noted somewhere, that the XML vs. database counts may be > off because the database count is doing some text-based filtering but the > XML count does not). > > >>> > > > > >>> > > What does xmlpipedb-match say? That will at least tell you > whether the 'RV%' count is indeed correct. > > >>> > > > > >>> > > John David N. Dionisio, PhD > > >>> > > Associate Professor, Computer Science > > >>> > > Loyola Marymount University > > >>> > > > > >>> > > > > >>> > > > > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: > > >>> > > > > >>> > > > OK here is what I was able to put together from the past few > hours of code review: > > >>> > > > > > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > > >>> > > > -reveals that after the 2 System table modifications are made > adding species name and link, a PreparedStatement is instantiated which > builds and calls the base query. > > >>> > > > > > >>> > > > -The base query called is: ("SELECT value, type " + "FROM > genenametype INNER JOIN entrytype_genetype " + > "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type > = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") > > >>> > > > > > >>> > > > -So its looking in 'ordered locus' table/column for any tuple > that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? > . > > >>> > > > The 'like' comparator and % usage are clear with the 'type' > entrytype_gene_hjid = ? > > >>> > > > > > >>> > > > -To me it seems the query makes sense so the problem is likely > elsewhere. > > >>> > > > > > >>> > > > GenMappBuilder.java: > > >>> > > > -I found method doTallies() at code line 895 which: > > >>> > > > Instantiates a Configuration called hibernateConfiguration and > assigns to it the current hibernate configuration > > >>> > > > Validates database settings by analyzing hibernateConfiguration > > >>> > > > Instantiates a CriterionList for uniprot and assigns to it > TallyType.UNIPROT > > >>> > > > Instantiates a CriterionList for go and assigns to it > TallyType.GO > > >>> > > > Determines if both xml files exist > > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on > both xml files and their respective CriterionList > > >>> > > > Results are then formatted for display in a table. > > >>> > > > > > >>> > > > -So enum TallyType which means that they are the only valid > datatypes which TallyEngine accepts... go to know ... > > >>> > > > > > >>> > > > -Based on the screen shot of Tally Engine it would seem that > both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly > returning. Likely due to both using an incorrect query (as we previously > supposed). But where are the queries?... the more I dig the more I think > they are in the criterial all the work is done against. > > >>> > > > > > >>> > > > continuing the review: > > >>> > > > getTallyResultsXML() calls Tally Engine instance method > getXmlFileCounts(xmlFile) > > >>> > > > getTallyResultsDatabase() calls Tally Engine instance method > getDbcounts(new QueryEngine(hibernateConfiguration) > > >>> > > > Both of these instanced methods originate from > TallyEngine.java... > > >>> > > > > > >>> > > > TallyEngine.java: > > >>> > > > > > >>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a > digester then processes against criteria... but this quickly becomes > confusing and is hard to follow > > >>> > > > > > >>> > > > getDbcounts() then starts a db session and executes a query but > then I also get a bit lost with my limited db knowledge. > > >>> > > > > > >>> > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > >>> > > > > > >>> > > > OVERALL I think I'm getting closer to the issues but I still > feel as if I'm missing some understanding to proceed further. Can you pass > along some of that Dondi insight and steer me in the right direction? =D > > >>> > > > > > >>> > > > -DB Tally - Not having taken databases yet certainly is > limiting my ability determine where the "criteria" are being set and how > they are followed during session activities. Also is the query we have been > looking for this whole time in the criteria or someplace else? > > >>> > > > > > >>> > > > -XML Tally - again is the query contained within the criteria > that digestXmlFile() uses to parse? > > >>> > > > > > >>> > > > Richard > > >>> > > > > > >>> > > > > > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio < > do...@lm...> wrote: > > >>> > > > Right, schema issues are unlikely. Most count discrepancies > like this that I've seen have boiled down to forming the right query. Then, > knowing the right query (in both XML and SQL), it's a matter of making sure > that TallyEngine asks that same query. > > >>> > > > > > >>> > > > John David N. Dionisio, PhD > > >>> > > > Associate Professor, Computer Science > > >>> > > > Loyola Marymount University > > >>> > > > > > >>> > > > > > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > > >>> > > > > > >>> > > > > OK, so based on your approach: > > >>> > > > > > > >>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match > and sql queries needed for the respective results as you requested. > > >>> > > > > > > >>> > > > > I was also thinking I may need to review the schema from xml > into postgres but the issue isn't likely a schema error. The error most > likely lies in how xmlpipedbutils queries the data from xml source and > writes to the tables what it returns? > > >>> > > > > > > >>> > > > > 2. I'll review the code: trace the entrance of tally engine > in the gmbuilder code then follow it through the xmlpipedbutils. > > >>> > > > > > > >>> > > > > Richard > > >>> > > > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio < > do...@lm...> wrote: > > >>> > > > > Just wanted to confirm (since I wasn't sure in the first > e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils > in SourceForge's Subversion repo. > > >>> > > > > > > >>> > > > > John David N. Dionisio, PhD > > >>> > > > > Associate Professor, Computer Science > > >>> > > > > Loyola Marymount University > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: > > >>> > > > > > > >>> > > > > > Hi Dondi, > > >>> > > > > > > > >>> > > > > > So I'm at the point in working with M tuberculosis that I > was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine > results. > > >>> > > > > > > > >>> > > > > > gmb2b60 Results > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > Now the proverbial question - What next to solve the > Ordered Locus import/count issue? > > >>> > > > > > > > >>> > > > > > ********************************************** > > >>> > > > > > Here is my thought process: > > >>> > > > > > > > >>> > > > > > Step 1: How does the import process work at the high level? > (obviously correct me if I'm wrong) > > >>> > > > > > > > >>> > > > > > I believe that basically as each XML tag is read, it is > placed in the proper Postgres table(s) based on some criteria. There is also > likely some sort of check that each individual tag is in valid XML format > unless we don't care at this stage (care at export) or maybe the parser just > skips over and goes on to the next . > > >>> > > > > > > > >>> > > > > > Step 2: What could be the problem? > > >>> > > > > > > > >>> > > > > > Either - > > >>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? > > >>> > > > > > b. Decision criteria of which table they should be added > to? > > >>> > > > > > > > >>> > > > > > ********************************************** > > >>> > > > > > > > >>> > > > > > I read on the sourceforge wiki: > > >>> > > > > > > > >>> > > > > > XMLPipeDB has a modular architecture with three components > that may be used separately or together. XSD-to-DB reads an XSD (XML Schema > Definition) and automatically generates an SQL schema, Java classes, and > Hibernate mappings. XMLPipeDB Utilities provides functionality for > configuring the database, importing data, and performing queries. GenMAPP > Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible > Gene Databases based on data from UniProt and Gene Ontology (GO). > > >>> > > > > > > > >>> > > > > > So I should probably start with the XMLPipeDB Utilities > which are where? I don't see any in the basic distribution or are they not > standalone and called from the command line? > > >>> > > > > > > > >>> > > > > > Thanks! > > >>> > > > > > > > >>> > > > > > Richard > > >>> > > > > > > >>> > > > > > > >>> > > > > <ATT00001..txt><ATT00002..txt> > > >>> > > > > > >>> > > > > > >>> > > > > ------------------------------------------------------------------------------ > > >>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel > Studio XE: > > >>> > > > Pinpoint memory and threading errors before they happen. > > >>> > > > Find and fix more than 250 security defects in the development > cycle. > > >>> > > > Locate bottlenecks in serial and parallel code that limit > performance. > > >>> > > > http://p.sf.net/sfu/intel-dev2devfeb > > >>> > > > _______________________________________________ > > >>> > > > xmlpipedb-developer mailing list > > >>> > > > xml...@li... > > >>> > > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > > > > >>> > > > <ATT00001..txt><ATT00002..txt> > > >>> > > > > >>> > > > > >>> > > > ------------------------------------------------------------------------------ > > >>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel > Studio XE: > > >>> > > Pinpoint memory and threading errors before they happen. > > >>> > > Find and fix more than 250 security defects in the development > cycle. > > >>> > > Locate bottlenecks in serial and parallel code that limit > performance. > > >>> > > http://p.sf.net/sfu/intel-dev2devfeb > > >>> > > _______________________________________________ > > >>> > > xmlpipedb-developer mailing list > > >>> > > xml...@li... > > >>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > > > >>> > > <ATT00001..txt><ATT00002..txt> > > >>> > > > >>> > > > >>> > > ------------------------------------------------------------------------------ > > >>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel > Studio XE: > > >>> > Pinpoint memory and threading errors before they happen. > > >>> > Find and fix more than 250 security defects in the development > cycle. > > >>> > Locate bottlenecks in serial and parallel code that limit > performance. > > >>> > http://p.sf.net/sfu/intel-dev2devfeb > > >>> > _______________________________________________ > > >>> > xmlpipedb-developer mailing list > > >>> > xml...@li... > > >>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > > >>> > > > >>> > > > >>> > <ATT00001..txt><ATT00002..txt> > > >>> > > >>> > > >>> > ------------------------------------------------------------------------------ > > >>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio > XE: > > >>> Pinpoint memory and threading errors before they happen. > > >>> Find and fix more than 250 security defects in the development cycle. > > >>> Locate bottlenecks in serial and parallel code that limit > performance. > > >>> http://p.sf.net/sfu/intel-dev2devfeb > > >>> _______________________________________________ > > >>> xmlpipedb-developer mailing list > > >>> xml...@li... > > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > >>> > > >>> > > >>> > ------------------------------------------------------------------------------ > > >>> Index, Search & Analyze Logs and other IT data in Real-Time with > Splunk > > >>> Collect, index and harness all the fast moving IT data generated by > your > > >>> applications, servers and devices whether physical, virtual or in the > cloud. > > >>> Deliver compliance at lower cost and gain new business insights. > > >>> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > >>> _______________________________________________ > > >>> xmlpipedb-developer mailing list > > >>> xml...@li... > > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > ------------------------------------------------------------------------------ > > >> Index, Search & Analyze Logs and other IT data in Real-Time with > Splunk > > >> Collect, index and harness all the fast moving IT data generated by > your > > >> applications, servers and devices whether physical, virtual or in the > cloud. > > >> Deliver compliance at lower cost and gain new business insights. > > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > > >> > ------------------------------------------------------------------------------ > > >> Index, Search & Analyze Logs and other IT data in Real-Time with > Splunk > > >> Collect, index and harness all the fast moving IT data generated by > your > > >> applications, servers and devices whether physical, virtual or in the > cloud. > > >> Deliver compliance at lower cost and gain new business insights. > > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > ------------------------------------------------------------------------------ > > > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > > > Collect, index and harness all the fast moving IT data generated by > your > > > applications, servers and devices whether physical, virtual or in the > cloud. > > > Deliver compliance at lower cost and gain new business insights. > > > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > > _______________________________________________ > > > xmlpipedb-developer mailing list > > > xml...@li... > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > > ------------------------------------------------------------------------------ > > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > > Collect, index and harness all the fast moving IT data generated by your > > applications, servers and devices whether physical, virtual or in the > cloud. > > Deliver compliance at lower cost and gain new business insights. > > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > <ATT00001..txt><ATT00002..txt> > > > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the > cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > > ------------------------------------------------------------------------------ > Free Software Download: Index, Search & Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT > data > generated by your applications, servers and devices whether physical, > virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: Kam D. <kda...@lm...> - 2011-02-22 23:06:17
|
Hi, I compared this list of gene IDs with the list I had on the Testing Report on the wiki and found the following. Your list has 70 IDs. 69 of them are identical to what is in my list of IDs in the Testing report. Indeed, the only difference is the one with the slash. "Rv3346/55c" I looked this ID up at UniProt.org as Rv3346. It appears in the record for UniProt ID O50384. For that UniProt record, the ID referred to in the OrderedLocus tag is Rv3355c which does appear in the gdb already. Looking up Rv3346 on Tuberculist and the Stanford TB database, Rv3346 is not a real gene ID. So this entire record with the slash in it needs to be excluded and all the rest of them need to be captured. I think then, that the numbers should add up correctly, is this true? Best, Dr. D At 01:09 PM 2/22/2011, Richard Brous wrote: >attachement now included... > >On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous ><<mailto:rbr...@gm...>rbr...@gm...> wrote: >Here is an export of the genes found using: select * from >genenametype where type = 'ORF' and value ~ >'[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also attached as a csv file. > >647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 >5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 >647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 >647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 >647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 >647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 >211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 >648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 >648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 >648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 >648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 >648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 >649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 >649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 >649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 >649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 >650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 >174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 >174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 >312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 >331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 >445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 >621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 >622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 >622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 >622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 >622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 >623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 >623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 >623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 >623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 >624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 >624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 >625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 >625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 >626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 >626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 >626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 >626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 >627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 >627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 >627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 >627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 >627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 >627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 >627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 >628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 >629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 >629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 >629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 >630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 >630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 >631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 >632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 >632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 >633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 >633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 >645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 >645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 >645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 >645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 >645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 >646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 >646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 >646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 >646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 >646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 >646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 >647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 >647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 > > >*****The item of note I see is the gene with the slash separating >gene id's which refer to the same gene. > >Richard > >On Mon, Feb 21, 2011 at 11:11 PM, Richard Brous ><<mailto:rbr...@gm...>rbr...@gm...> wrote: >Understood. > >I'll check in with Dr. D in the afternoon tomorrow and discuss. > >Richard > >On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio ><<mailto:do...@lm...>do...@lm...> wrote: >Hi Rich, > >Addressing the release business first, let's put it this way: if the >remaining loose ends can be addressed by tomorrow, we can probably >wait until then. If unexpected snags are encountered, then it would >be worthwhile to release whatever you have. > >With that in mind, considering that you pretty much know the >patterns of the IDs that are needed, I think it will only take a >little digital forensic work now to figure out exactly which IDs are >still needed. Once you know what those are, you should: > >1. Find where they are in the XML file. >2. Knowing the XML location, find the corresponding table in the >relational database (table names are generally derived from tag/element names). >3. Knowing the table in the database, write or extend the >SpeciesProfile query to retrieve that data. > >For the ID that must *not* be included, again it's a matter of >tracking down what this ID is. Knowing this straggler, you can then >consult with Dr. Dahlquist if this ID is truly a unique one-off, or >is representative of a pattern that we'll want to exclude. Either >way, this ID can be omitted by using "not" or "<>" or possibly "not >like" or "not ~" (check PostgreSQL where clause syntax to see where >the negation can be applied). > >John David N. Dionisio, PhD >Associate Professor, Computer Science >Loyola Marymount University > > > >On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: > > > actually i had a typo (emailing from desktop system but testing > on my laptop... typed correctly here but wrong in pgadmin) but the > results make much more sense now. > > > > > > select count (*) from genenametype where (type = 'ordered locus' > or type = 'ORF') and value like 'Rv%'; > > returns 4058 > > > > select count (*) from genenametype where (type = 'ordered locus' > or type = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > returns 4058 > > > > > -------------------------------------------------------------------------------------------------------------------------- > > > > > > Continuing forward - > > > > The testing report says that 4066 unique matches exist in XML but > 6 of them were eliminated by Dr. D leaving the desired number at 4060. > > > > So now we are only 2 genes short with the query returning 4058... > which is also (conveniently) the sum of the two separate queries of > 'ordered locus' and 'ORF' respectively. > > > > But recall that Dr. D said that only 69 genes of the missing 75 > were tagged 'ORF' but we seem to have 1 extra gene tagged 'ORF' > than we expected. Adding that into missing genes puts us 3 short... > > > > Should I make the changes to the code and export a gdb so that > analysis can be done or wait until we work this through further? > > > > Richard > > > > > > > > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio > <<mailto:do...@lm...>do...@lm...> wrote: > > Hi Rich, > > > > The second form should have worked actually. What exactly was the error? > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > > > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: > > > > > hmm not taking parenthesis where I thought they should go... syntax error > > > > > > select count (*) from genenametype where type = ('ordered > locus' or 'ORF') and value like 'Rv%'; > > > also tried > > > select count (*) from genenametype where (type = 'ordered > locus' or type = 'ORF') and value like 'Rv%'; > > > > > > > > > > > > > > > > > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous > <<mailto:rbr...@gm...>rbr...@gm...> wrote: > > > ah yes... i see it... > > > > > > > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio > <<mailto:do...@lm...>do...@lm...> wrote: > > > Watch your parentheses: "and" has greater precedence than "or" :) > > > > > > > > > John David N. Dionisio, PhD > > > Associate Professor, Computer Science > > > Loyola Marymount University > > > > > > > > > On Feb 21, 2011, at 7:59 PM, Richard Brous > <<mailto:rbr...@gm...>rbr...@gm...> wrote: > > > > > >> OK, so here are my query results from raw SQL: > > >> > > >> 1. using: like 'Rv%' > > >> > > >> select count (*) from genenametype where type = 'ordered > locus' and value like 'Rv%'; > > >> returns 3988 > > >> > > >> select count (*) from genenametype where type = 'ORF' and > value like 'Rv%'; > > >> returns 70 > > >> > > >> select count (*) from genenametype where type = 'ordered > locus' or type = 'ORF' and value like 'Rv%'; > > >> returns 7011 > > >> > > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' > > >> > > >> select count (*) from genenametype where type = 'ordered > locus' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > >> returns 3988 > > >> > > >> select count (*) from genenametype where type = 'ordered > locus' or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > >> returns 7011 > > >> > > >> select count (*) from genenametype where type = 'ORF' and > value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > >> returns 70 > > >> > > >> Conclusions: > > >> > > >> 1. It seems that querying for type = 'ORF' alone surfaces the > 69 genes were were looking for plus one more (maybe the count for > missing genes is off by 1?). > > >> > > >> 2. Combining the two types in a single query did not produce > the results that I expected (7011? - how did that happen????) so > this is likely not our solution... unless of course the query > syntax isn't actually doing what I think it is... > > >> > > >> 3. I would think the best course of action is to serialy run > two separate queries to capture all the required genes, then > removing the one unneeded gene if its truly not wanted. > > >> > > >> What do you think? > > >> > > >> Richard > > >> > > >> > > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio > <<mailto:do...@lm...>do...@lm...> wrote: > > >> I don't recall the exact details of the missing 69, but if > your query successfully returns them in raw SQL, then this is worth > a try. You can integrate into the same query as long as the same > columns are returned, which is the case here AFAIK, so go ahead and > extend the existing query. > > >> > > >> > > >> John David N. Dionisio, PhD > > >> Associate Professor, Computer Science > > >> Loyola Marymount University > > >> > > >> On Feb 21, 2011, at 6:56 PM, Richard Brous > <<mailto:rbr...@gm...>rbr...@gm...> wrote: > > >> > > >>> So here is the appropriate code snippet from > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > > >>> public > > >>> > > >>> TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) > throws SQLException, InvalidParameterException { > > >>> > > >>> // Build the base query; we only use "ordered locus" and we only want > > >>> > > >>> // IDs that begin with "Rv." > > >>> PreparedStatement ps = > ConnectionManager.getRelationalDBConnection().prepareStatement( > > >>> > > >>> "SELECT value, type " + > > >>> > > >>> "FROM genenametype INNER JOIN entrytype_genetype " + > > >>> > > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + > > >>> > > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and > entrytype_gene_hjid = ?"); > > >>> ResultSet result; > > >>> > > >>> > > >>> > > >>> for (Row row : primarySystemTableManager.getRows()) { > > >>> ps.setInt(1, Integer.parseInt(row.getValue( > > >>> > > >>> "UID"))); > > >>> result = ps.executeQuery(); > > >>> > > >>> > > >>> > > >>> // We actually want to keep the case where multiple ordered locus > > >>> > > >>> // names appear. > > >>> > > >>> while (result.next()) { > > >>> > > >>> // We want this name to appear in the OrderedLocusNames > > >>> > > >>> // system table. > > >>> > > >>> for (String id : result.getString("value").split("/")) { > > >>> tableManager.submit( > > >>> > > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { > "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { > "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) > }, { "UID", row.getValue("UID") } }); > > >>> } > > >>> > > >>> } > > >>> > > >>> } > > >>> > > >>> > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > >>> So now we want to build the base query which uses "ordered > locus" and "orf" and we only want IDs that begin with "Rv". > > >>> > > >>> I know there are more comprehensive ways to search for gene > ID's by matching gene ID prefix but "like Rv%" seemed to work thus > far, we just need to tell it to search for XML tag type orf in > addition to ordered locus. > > >>> > > >>> "WHERE type = 'ordered locus' and type = 'orf' and value like > 'Rv%' and entrytype_gene_hjid = ? " > > >>> > > >>> Here is a stab at it.... This part of our class was right as > the server went down and my submission for week 6 assignment I > can't seem to find. > > >>> > > >>> Is it possible to have two different types in the same query > or should we rewrite a separate query for the orf tag? > > >>> > > >>> Richard > > >>> > > >>> > > >>> > > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous > <<mailto:rbr...@gm...>rbr...@gm...> wrote: > > >>> > > >>> thanks and will do as directed. > > >>> > > >>> My previous, last paragraph comment - A way for programming > code in email holding its format in a mail message similarly to how > you can post code on forum pages? > > >>> > > >>> <code> > > >>> blah > > >>> blah > > >>> blah > > >>> </code> > > >>> > > >>> thanks! > > >>> > > >>> Richard > > >>> > > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio > <<mailto:do...@lm...>do...@lm...> wrote: > > >>> > > >>> Greetings, > > >>> > > >>> Actually, gmbuilder.properties is for the TallyEngine > only. When dealing with .gdb exports, look *only* at the > SpeciesProfile class. So, to find those 69 IDs, it is the > SpeciesProfile code, and *only* the SpeciesProfile code, that needs > to be changed. > > >>> > > >>> Your take on how gmbuilder.properties is used, however, is > understandable. It makes sense to assume that the TallyEngine code > *and* the ID export code are based on the same characterization of > the needed IDs. This replication is originally a historical > artifact: SpeciesProfile was done first, and then TallyEngine was > done later by another student. > > >>> > > >>> However, there are other factors beyond history that sort of > necessitate this duplication of desired IDs: (skip the two bullets > below if you'd rather cut to the chase of the work to be done, and > discuss design issues later) > > >>> > > >>> - The actual XML import code is a black box: this is the > "canned" JAXB library actually in action, and not our code at > all. Plus, the XML import code really does not filter (nor should > it), since the goal of the XML->relational database step is to > fully capture the XML data in the relational database. So, XML > count is necessarily separated from XML import. > > >>> > > >>> - The notion of a declarative mechanism for extracting IDs > from the relational database (which is what > gmbuilder.properties/TallyEngine uses) is interesting, but at the > same time there is value in the arbitrary computation that can be > done with Java (case in point: export two versions of an ID, with > and without periods). This is not to say that it is impossible to > do this declaratively, but let's just say that the procedural > approach exists here and now, and a declarative approach will need > more thought. > > >>> > > >>> These, and other factors, are good thoughts to hold onto and > would be worthy of a good meeting discussion sometime, but bottom > line for now: modifying the export behavior is a matter of editing > the *SpeciesProfile* Java code, and not the gmbuilder.properties > file. Turn your attention to that code. > > >>> > > >>> Now, as to annotating your code...I'd just put in code > comments :) Or did you mean something else by tagging code in e-mail? > > >>> > > >>> John David N. Dionisio, PhD > > >>> Associate Professor, Computer Science > > >>> Loyola Marymount University > > >>> > > >>> > > >>> > > >>> > > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: > > >>> > > >>> > also, how do I tag code in email so it holds its > formatting? I tried a few suggestions I found on the web but they > aren't holding formatting or i'm just doing it wrong ;-D > > >>> > > > >>> > Richard > > >>> > > > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous > <<mailto:rbr...@gm...>rbr...@gm...> wrote: > > >>> > OK, have some updates and some suggestions: > > >>> > > > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the > gene testing report. We verified that XML match does indeed find > 4066 unique matches - 75 of which are not in the gdb and need to be. > > >>> > > > >>> > Dr. Dahlquist informed me that she was the one who > completed the gene db testing report, not a previous student of > BIO367 and had already verified which genes were missing and where > they were to be found. I had (mistakenly) assumed that since a > student had performed the gene database testing I had to redo all > of the verification. > > >>> > > > >>> > So that said, of the 75 genes missing - 69 need to be > included and 6 excluded. > > >>> > Per the gene db testing report: "69 of them have an "a", > "b", or "d" suffix. They are all found in the ORF tag and need to > be included in the gdb." > > >>> > > > >>> > To solve this we need to add additional search criteria > into the M. tuberculosis section in gmbuilder.properties below: > > >>> > # Mycobacterium tuberculosis > > >>> > > > >>> > mycobacteriumtuberculosis_level_amount= > > >>> > > > >>> > 1 > > >>> > > > >>> > mycobacteriumtuberculosis_element_level0= > > >>> > > > >>> > uniprot/entry/gene/name&type&ordered locus > > >>> > > > >>> > mycobacteriumtuberculosis_query_level0= > > >>> > > > >>> > select count(*) from genenametype where type = 'ordered > locus' and value like 'Rv%'; > > >>> > > > >>> > mycobacteriumtuberculosis_table_name_level0= > > >>> > > > >>> > Ordered Locus > > >>> > SOLUTIONS: > > >>> > > > >>> > 1. So am i correct in my understanding that the second line > is the query used by TallyEngine to read the XML file? If so then > this is the issue we need to table for the moment until we get the > gbd verified and re-released. We will revisit this to discover why > it is not only reporting incorrectly but also why its added a > second row of Ordered Locus on the TallyEngine results page. > > >>> > > > >>> > 2. The third line is the SQL query used by postgres during > the export from XML to gdb. To find and get the ORF tagged genes > could we not add the following lines and change the count in the first line: > > >>> > > > >>> > > > >>> > # Mycobacterium tuberculosis > > >>> > > > >>> > mycobacteriumtuberculosis_level_amount=2 > > >>> > > > >>> > > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > >>> > > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > > >>> > > > >>> > > > >>> > mycobacteriumtuberculosis_query_level0= > > >>> > > > >>> > select count(*) from genenametype where type = 'ordered locus'; > > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from > genenametype where type = 'orf'; > > >>> > > > >>> > > > >>> > mycobacteriumtuberculosis_table_name_level0= > > >>> > > > >>> > Ordered Locus > > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > > >>> > > > >>> > > ---------------------------------------------------------------------------------------------------------------------------- > > >>> > > > >>> > Of course these queries would have be manually verified > prior to making these changes but this seems like we are moving in > the right direction. > > >>> > > > >>> > Richard > > >>> > > > >>> > > > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous > <<mailto:rbr...@gm...>rbr...@gm...> wrote: > > >>> > Just got done reading previous email and understand the > change in priority. > > >>> > > > >>> > Will work on the missing ID's for now and shelve the the > TalleyEngine issue for the moment. > > >>> > > > >>> > Also great about a more formalized weekly meeting. I was > going to suggest it myself as it has been slow going so far as > maybe i'm a bit too independent in this independent study class =D > > >>> > > > >>> > Will dig further into the missing ID's later tonight and > during day tomorrow and report back. > > >>> > > > >>> > Richard > > >>> > > > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio > <<mailto:do...@lm...>do...@lm...> wrote: > > >>> > Hi Rich, > > >>> > > > >>> > No problem. The pertinent line you're referring to, for > XML, is this, right above the line you copied: > > >>> > > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > >>> > > > >>> > The slash-separated section is the "path" of XML tags > leading to the element of interest; then, after the ampersand, is a > name/value pair for the desired attribute to count. Note that > there is no hint of a *content*-based filter (nor is there the > capability for one, as far as I can tell in the code). By > "content," I mean that we can't specify filters based on what's > *between* the tags. We can only go as far as filter by attribute > value, e.g., type="ordered locus". > > >>> > > > >>> > But anyway, as mentioned in the earlier e-mail, let's have > the missing IDs in the .gdb take precedence for now. Please take a > look at the tuberculosis, A. thaliana, and P. falciparum profiles > to get an idea for how the ID output can be customized, then let me > know if you have any questions or need to confirm anything. > > >>> > > > >>> > John David N. Dionisio, PhD > > >>> > Associate Professor, Computer Science > > >>> > Loyola Marymount University > > >>> > > > >>> > > > >>> > > > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: > > >>> > > > >>> > > Sorry been slammed with a programming assignment that > kept needing continued iteration and it has been all consuming > until last night. But I did get a chance to work with your comments > and review the code again with a different mind set. > > >>> > > > > >>> > > Yes, I examined the gmbuilder.properties file ( the query > is also in the MycobacteriumTuberculosisUniProtSpeciesProfile which > I mentioned in a previous email ) but I don't think I see what you > mean regarding the XML count. > > >>> > > > > >>> > > I understood that: > mycobacteriumtuberculosis_query_level0=select count(*) from > genenametype where type = 'ordered locus' and value like > 'Rv%'; was the db query but don't see which is the XML count... or > do they share the same query and you are saying that XML count > doesn't recognize and therefore cannot use the 'Rv%' parameter? > > >>> > > > > >>> > > Richard > > >>> > > > > >>> > > > > >>> > > > > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio > <<mailto:do...@lm...>do...@lm...> wrote: > > >>> > > Hi Rich, > > >>> > > > > >>> > > Sorry for the delay. Had some distractions coming into > the weekend. > > >>> > > > > >>> > > You've looked at the code; have you looked at > gmbuilder.properties? (I may have mentioned it a few e-mails ago, > just as you were starting to dig into this) > > >>> > > > > >>> > > On the copy I have, the M. tuberculosis block looks like > this (indentation is mine to set it apart): > > >>> > > > > >>> > > # Mycobacterium tuberculosis > > >>> > > mycobacteriumtuberculosis_level_amount=1 > > >>> > > > > >>> > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > >>> > > > > >>> > > mycobacteriumtuberculosis_query_level0=select > count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; > > >>> > > > > >>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > > >>> > > > > >>> > > There, I think, is the rub. Notice that the XML count > does not filter on RV%. The SQL query does. > > >>> > > > > >>> > > Unfortunately, I don't think the TallyEngine can include > selective filtering in the XML counts. If the need to do selective > filtering on XML is necessary, then I think we're looking at a new > functionality for you to implement (or, if this throws things off > too much, this may have to be noted somewhere, that the XML vs. > database counts may be off because the database count is doing some > text-based filtering but the XML count does not). > > >>> > > > > >>> > > What does xmlpipedb-match say? That will at least tell > you whether the 'RV%' count is indeed correct. > > >>> > > > > >>> > > John David N. Dionisio, PhD > > >>> > > Associate Professor, Computer Science > > >>> > > Loyola Marymount University > > >>> > > > > >>> > > > > >>> > > > > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: > > >>> > > > > >>> > > > OK here is what I was able to put together from the > past few hours of code review: > > >>> > > > > > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > > >>> > > > -reveals that after the 2 System table modifications > are made adding species name and link, a PreparedStatement is > instantiated which builds and calls the base query. > > >>> > > > > > >>> > > > -The base query called is: ("SELECT value, type " + > "FROM genenametype INNER JOIN entrytype_genetype " + > "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + > "WHERE type = 'ordered locus' and value like 'Rv%' and > entrytype_gene_hjid = ?") > > >>> > > > > > >>> > > > -So its looking in 'ordered locus' table/column for any > tuple that starts with Rv (followed by any substring) and > entrytype_gene_hjid = ? . > > >>> > > > The 'like' comparator and % usage are clear with the > 'type' entrytype_gene_hjid = ? > > >>> > > > > > >>> > > > -To me it seems the query makes sense so the problem is > likely elsewhere. > > >>> > > > > > >>> > > > GenMappBuilder.java: > > >>> > > > -I found method doTallies() at code line 895 which: > > >>> > > > Instantiates a Configuration called > hibernateConfiguration and assigns to it the current hibernate configuration > > >>> > > > Validates database settings by analyzing hibernateConfiguration > > >>> > > > Instantiates a CriterionList for uniprot and assigns to > it TallyType.UNIPROT > > >>> > > > Instantiates a CriterionList for go and assigns to it > TallyType.GO > > >>> > > > Determines if both xml files exist > > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are > run on both xml files and their respective CriterionList > > >>> > > > Results are then formatted for display in a table. > > >>> > > > > > >>> > > > -So enum TallyType which means that they are the only > valid datatypes which TallyEngine accepts... go to know ... > > >>> > > > > > >>> > > > -Based on the screen shot of Tally Engine it would seem > that both getTallyResultsXML() and getTallyResultsDatabase() are > incorrectly returning. Likely due to both using an incorrect query > (as we previously supposed). But where are the queries?... the more > I dig the more I think they are in the criterial all the work is done against. > > >>> > > > > > >>> > > > continuing the review: > > >>> > > > getTallyResultsXML() calls Tally Engine instance method > getXmlFileCounts(xmlFile) > > >>> > > > getTallyResultsDatabase() calls Tally Engine instance > method getDbcounts(new QueryEngine(hibernateConfiguration) > > >>> > > > Both of these instanced methods originate from > TallyEngine.java... > > >>> > > > > > >>> > > > TallyEngine.java: > > >>> > > > > > >>> > > > getXmlFileCounts() calls digestXmlFile() which > instantiates a digester then processes against criteria... but this > quickly becomes confusing and is hard to follow > > >>> > > > > > >>> > > > getDbcounts() then starts a db session and executes a > query but then I also get a bit lost with my limited db knowledge. > > >>> > > > > > >>> > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > >>> > > > > > >>> > > > OVERALL I think I'm getting closer to the issues but I > still feel as if I'm missing some understanding to proceed further. > Can you pass along some of that Dondi insight and steer me in the > right direction? =D > > >>> > > > > > >>> > > > -DB Tally - Not having taken databases yet certainly is > limiting my ability determine where the "criteria" are being set > and how they are followed during session activities. Also is the > query we have been looking for this whole time in the criteria or > someplace else? > > >>> > > > > > >>> > > > -XML Tally - again is the query contained within the > criteria that digestXmlFile() uses to parse? > > >>> > > > > > >>> > > > Richard > > >>> > > > > > >>> > > > > > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio > <<mailto:do...@lm...>do...@lm...> wrote: > > >>> > > > Right, schema issues are unlikely. Most count > discrepancies like this that I've seen have boiled down to forming > the right query. Then, knowing the right query (in both XML and > SQL), it's a matter of making sure that TallyEngine asks that same query. > > >>> > > > > > >>> > > > John David N. Dionisio, PhD > > >>> > > > Associate Professor, Computer Science > > >>> > > > Loyola Marymount University > > >>> > > > > > >>> > > > > > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > > >>> > > > > > >>> > > > > OK, so based on your approach: > > >>> > > > > > > >>> > > > > 1. I'll start with reviewing the queries for > xmlpipedb-match and sql queries needed for the respective results > as you requested. > > >>> > > > > > > >>> > > > > I was also thinking I may need to review the schema > from xml into postgres but the issue isn't likely a schema error. > The error most likely lies in how xmlpipedbutils queries the data > from xml source and writes to the tables what it returns? > > >>> > > > > > > >>> > > > > 2. I'll review the code: trace the entrance of tally > engine in the gmbuilder code then follow it through the xmlpipedbutils. > > >>> > > > > > > >>> > > > > Richard > > >>> > > > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. > Dionisio <<mailto:do...@lm...>do...@lm...> wrote: > > >>> > > > > Just wanted to confirm (since I wasn't sure in the > first e-mail) --- the XMLPipeDB Utilities source code is in > trunk/xmlpipedbutils in SourceForge's Subversion repo. > > >>> > > > > > > >>> > > > > John David N. Dionisio, PhD > > >>> > > > > Associate Professor, Computer Science > > >>> > > > > Loyola Marymount University > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: > > >>> > > > > > > >>> > > > > > Hi Dondi, > > >>> > > > > > > > >>> > > > > > So I'm at the point in working with M tuberculosis > that I was able to exactly reproduce Dr. Dahlquist's problematic > TallyEngine results. > > >>> > > > > > > > >>> > > > > > gmb2b60 Results > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > Now the proverbial question - What next to solve > the Ordered Locus import/count issue? > > >>> > > > > > > > >>> > > > > > ********************************************** > > >>> > > > > > Here is my thought process: > > >>> > > > > > > > >>> > > > > > Step 1: How does the import process work at the > high level? (obviously correct me if I'm wrong) > > >>> > > > > > > > >>> > > > > > I believe that basically as each XML tag is read, > it is placed in the proper Postgres table(s) based on some > criteria. There is also likely some sort of check that each > individual tag is in valid XML format unless we don't care at this > stage (care at export) or maybe the parser just skips over and goes > on to the next . > > >>> > > > > > > > >>> > > > > > Step 2: What could be the problem? > > >>> > > > > > > > >>> > > > > > Either - > > >>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? > > >>> > > > > > b. Decision criteria of which table they should be added to? > > >>> > > > > > > > >>> > > > > > ********************************************** > > >>> > > > > > > > >>> > > > > > I read on the sourceforge wiki: > > >>> > > > > > > > >>> > > > > > XMLPipeDB has a modular architecture with three > components that may be used separately or together. XSD-to-DB reads > an XSD (XML Schema Definition) and automatically generates an SQL > schema, Java classes, and Hibernate mappings. XMLPipeDB Utilities > provides functionality for configuring the database, importing > data, and performing queries. GenMAPP Builder is based on the > XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases > based on data from UniProt and Gene Ontology (GO). > > >>> > > > > > > > >>> > > > > > So I should probably start with the XMLPipeDB > Utilities which are where? I don't see any in the basic > distribution or are they not standalone and called from the command line? > > >>> > > > > > > > >>> > > > > > Thanks! > > >>> > > > > > > > >>> > > > > > Richard > > >>> > > > > > > >>> > > > > > > >>> > > > > <ATT00001..txt><ATT00002..txt> > > >>> > > > > > >>> > > > > > >>> > > > > ------------------------------------------------------------------------------ > > >>> > > > The ultimate all-in-one performance toolkit: Intel(R) > Parallel Studio XE: > > >>> > > > Pinpoint memory and threading errors before they happen. > > >>> > > > Find and fix more than 250 security defects in the > development cycle. > > >>> > > > Locate bottlenecks in serial and parallel code that > limit performance. > > >>> > > > > <http://p.sf.net/sfu/intel-dev2devfeb>http://p.sf.net/sfu/intel-dev2devfeb > > >>> > > > _______________________________________________ > > >>> > > > xmlpipedb-developer mailing list > > >>> > > > > <mailto:xml...@li...>xml...@li... > > >>> > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > > > > >>> > > > <ATT00001..txt><ATT00002..txt> > > >>> > > > > >>> > > > > >>> > > > ------------------------------------------------------------------------------ > > >>> > > The ultimate all-in-one performance toolkit: Intel(R) > Parallel Studio XE: > > >>> > > Pinpoint memory and threading errors before they happen. > > >>> > > Find and fix more than 250 security defects in the > development cycle. > > >>> > > Locate bottlenecks in serial and parallel code that limit > performance. > > >>> > > > <http://p.sf.net/sfu/intel-dev2devfeb>http://p.sf.net/sfu/intel-dev2devfeb > > >>> > > _______________________________________________ > > >>> > > xmlpipedb-developer mailing list > > >>> > > > <mailto:xml...@li...>xml...@li... > > >>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > > > >>> > > <ATT00001..txt><ATT00002..txt> > > >>> > > > >>> > > > >>> > > ------------------------------------------------------------------------------ > > >>> > The ultimate all-in-one performance toolkit: Intel(R) > Parallel Studio XE: > > >>> > Pinpoint memory and threading errors before they happen. > > >>> > Find and fix more than 250 security defects in the development cycle. > > >>> > Locate bottlenecks in serial and parallel code that limit > performance. > > >>> > > <http://p.sf.net/sfu/intel-dev2devfeb>http://p.sf.net/sfu/intel-dev2devfeb > > >>> > _______________________________________________ > > >>> > xmlpipedb-developer mailing list > > >>> > > <mailto:xml...@li...>xml...@li... > > >>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > > >>> > > > >>> > > > >>> > <ATT00001..txt><ATT00002..txt> > > >>> > > >>> > > >>> > ------------------------------------------------------------------------------ > > >>> The ultimate all-in-one performance toolkit: Intel(R) > Parallel Studio XE: > > >>> Pinpoint memory and threading errors before they happen. > > >>> Find and fix more than 250 security defects in the development cycle. > > >>> Locate bottlenecks in serial and parallel code that limit performance. > > >>> > <http://p.sf.net/sfu/intel-dev2devfeb>http://p.sf.net/sfu/intel-dev2devfeb > > >>> _______________________________________________ > > >>> xmlpipedb-developer mailing list > > >>> > <mailto:xml...@li...>xml...@li... > > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > >>> > > >>> > > >>> > ------------------------------------------------------------------------------ > > >>> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > > >>> Collect, index and harness all the fast moving IT data > generated by your > > >>> applications, servers and devices whether physical, virtual > or in the cloud. > > >>> Deliver compliance at lower cost and gain new business insights. > > >>> Free Software Download: > <http://p.sf.net/sfu/splunk-dev2dev>http://p.sf.net/sfu/splunk-dev2dev > > >>> _______________________________________________ > > >>> xmlpipedb-developer mailing list > > >>> > <mailto:xml...@li...>xml...@li... > > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > ------------------------------------------------------------------------------ > > >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > > >> Collect, index and harness all the fast moving IT data generated by your > > >> applications, servers and devices whether physical, virtual or > in the cloud. > > >> Deliver compliance at lower cost and gain new business insights. > > >> Free Software Download: > <http://p.sf.net/sfu/splunk-dev2dev>http://p.sf.net/sfu/splunk-dev2dev > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> > <mailto:xml...@li...>xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > > >> > ------------------------------------------------------------------------------ > > >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > > >> Collect, index and harness all the fast moving IT data generated by your > > >> applications, servers and devices whether physical, virtual or > in the cloud. > > >> Deliver compliance at lower cost and gain new business insights. > > >> Free Software Download: > <http://p.sf.net/sfu/splunk-dev2dev>http://p.sf.net/sfu/splunk-dev2dev > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> > <mailto:xml...@li...>xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > ------------------------------------------------------------------------------ > > > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > > > Collect, index and harness all the fast moving IT data generated by your > > > applications, servers and devices whether physical, virtual or > in the cloud. > > > Deliver compliance at lower cost and gain new business insights. > > > Free Software Download: > <http://p.sf.net/sfu/splunk-dev2dev>http://p.sf.net/sfu/splunk-dev2dev > > > _______________________________________________ > > > xmlpipedb-developer mailing list > > > > <mailto:xml...@li...>xml...@li... > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > > ------------------------------------------------------------------------------ > > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > > Collect, index and harness all the fast moving IT data generated by your > > applications, servers and devices whether physical, virtual or in > the cloud. > > Deliver compliance at lower cost and gain new business insights. > > Free Software Download: > <http://p.sf.net/sfu/splunk-dev2dev>http://p.sf.net/sfu/splunk-dev2dev > > _______________________________________________ > > xmlpipedb-developer mailing list > > > <mailto:xml...@li...>xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > <ATT00001..txt><ATT00002..txt> > > >------------------------------------------------------------------------------ >Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >Collect, index and harness all the fast moving IT data generated by your >applications, servers and devices whether physical, virtual or in the cloud. >Deliver compliance at lower cost and gain new business insights. >Free Software Download: ><http://p.sf.net/sfu/splunk-dev2dev>http://p.sf.net/sfu/splunk-dev2dev >_______________________________________________ >xmlpipedb-developer mailing list ><mailto:xml...@li...>xml...@li... >https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > |
From: Richard B. <rbr...@gm...> - 2011-02-22 21:09:31
|
Here is an export of the genes found using: select * from genenametype where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also attached as a csv file. 647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 *****The item of note I see is the gene with the slash separating gene id's which refer to the same gene. Richard On Mon, Feb 21, 2011 at 11:11 PM, Richard Brous <rbr...@gm...> wrote: > Understood. > > I'll check in with Dr. D in the afternoon tomorrow and discuss. > > Richard > > On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <do...@lm...>wrote: > >> Hi Rich, >> >> Addressing the release business first, let's put it this way: if the >> remaining loose ends can be addressed by tomorrow, we can probably wait >> until then. If unexpected snags are encountered, then it would be >> worthwhile to release whatever you have. >> >> With that in mind, considering that you pretty much know the patterns of >> the IDs that are needed, I think it will only take a little digital forensic >> work now to figure out exactly which IDs are still needed. Once you know >> what those are, you should: >> >> 1. Find where they are in the XML file. >> 2. Knowing the XML location, find the corresponding table in the >> relational database (table names are generally derived from tag/element >> names). >> 3. Knowing the table in the database, write or extend the SpeciesProfile >> query to retrieve that data. >> >> For the ID that must *not* be included, again it's a matter of tracking >> down what this ID is. Knowing this straggler, you can then consult with Dr. >> Dahlquist if this ID is truly a unique one-off, or is representative of a >> pattern that we'll want to exclude. Either way, this ID can be omitted by >> using "not" or "<>" or possibly "not like" or "not ~" (check PostgreSQL >> where clause syntax to see where the negation can be applied). >> >> John David N. Dionisio, PhD >> Associate Professor, Computer Science >> Loyola Marymount University >> >> >> >> On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: >> >> > actually i had a typo (emailing from desktop system but testing on my >> laptop... typed correctly here but wrong in pgadmin) but the results make >> much more sense now. >> > >> > >> > select count (*) from genenametype where (type = 'ordered locus' or type >> = 'ORF') and value like 'Rv%'; >> > returns 4058 >> > >> > select count (*) from genenametype where (type = 'ordered locus' or type >> = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> > returns 4058 >> > >> > >> -------------------------------------------------------------------------------------------------------------------------- >> > >> > >> > Continuing forward - >> > >> > The testing report says that 4066 unique matches exist in XML but 6 of >> them were eliminated by Dr. D leaving the desired number at 4060. >> > >> > So now we are only 2 genes short with the query returning 4058... which >> is also (conveniently) the sum of the two separate queries of 'ordered >> locus' and 'ORF' respectively. >> > >> > But recall that Dr. D said that only 69 genes of the missing 75 were >> tagged 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. >> Adding that into missing genes puts us 3 short... >> > >> > Should I make the changes to the code and export a gdb so that analysis >> can be done or wait until we work this through further? >> > >> > Richard >> > >> > >> > >> > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio <do...@lm...> >> wrote: >> > Hi Rich, >> > >> > The second form should have worked actually. What exactly was the >> error? >> > >> > John David N. Dionisio, PhD >> > Associate Professor, Computer Science >> > Loyola Marymount University >> > >> > >> > >> > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: >> > >> > > hmm not taking parenthesis where I thought they should go... syntax >> error >> > > >> > > select count (*) from genenametype where type = ('ordered locus' or >> 'ORF') and value like 'Rv%'; >> > > also tried >> > > select count (*) from genenametype where (type = 'ordered locus' or >> type = 'ORF') and value like 'Rv%'; >> > > >> > > >> > > >> > > >> > > >> > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous <rbr...@gm...> >> wrote: >> > > ah yes... i see it... >> > > >> > > >> > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio < >> do...@lm...> wrote: >> > > Watch your parentheses: "and" has greater precedence than "or" :) >> > > >> > > >> > > John David N. Dionisio, PhD >> > > Associate Professor, Computer Science >> > > Loyola Marymount University >> > > >> > > >> > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> >> wrote: >> > > >> > >> OK, so here are my query results from raw SQL: >> > >> >> > >> 1. using: like 'Rv%' >> > >> >> > >> select count (*) from genenametype where type = 'ordered locus' and >> value like 'Rv%'; >> > >> returns 3988 >> > >> >> > >> select count (*) from genenametype where type = 'ORF' and value like >> 'Rv%'; >> > >> returns 70 >> > >> >> > >> select count (*) from genenametype where type = 'ordered locus' or >> type = 'ORF' and value like 'Rv%'; >> > >> returns 7011 >> > >> >> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' >> > >> >> > >> select count (*) from genenametype where type = 'ordered locus' and >> value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> > >> returns 3988 >> > >> >> > >> select count (*) from genenametype where type = 'ordered locus' or >> type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> > >> returns 7011 >> > >> >> > >> select count (*) from genenametype where type = 'ORF' and value ~ >> '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> > >> returns 70 >> > >> >> > >> Conclusions: >> > >> >> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 >> genes were were looking for plus one more (maybe the count for missing genes >> is off by 1?). >> > >> >> > >> 2. Combining the two types in a single query did not produce the >> results that I expected (7011? - how did that happen????) so this is likely >> not our solution... unless of course the query syntax isn't actually doing >> what I think it is... >> > >> >> > >> 3. I would think the best course of action is to serialy run two >> separate queries to capture all the required genes, then removing the one >> unneeded gene if its truly not wanted. >> > >> >> > >> What do you think? >> > >> >> > >> Richard >> > >> >> > >> >> > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio < >> do...@lm...> wrote: >> > >> I don't recall the exact details of the missing 69, but if your query >> successfully returns them in raw SQL, then this is worth a try. You can >> integrate into the same query as long as the same columns are returned, >> which is the case here AFAIK, so go ahead and extend the existing query. >> > >> >> > >> >> > >> John David N. Dionisio, PhD >> > >> Associate Professor, Computer Science >> > >> Loyola Marymount University >> > >> >> > >> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> >> wrote: >> > >> >> > >>> So here is the appropriate code snippet from >> MycobacteriumTuberculosisUniProtSpeciesProfile.java: >> > >>> public >> > >>> >> > >>> TableManager getSystemTableManagerCustomizations(TableManager >> tableManager, TableManager primarySystemTableManager, Date version) throws >> SQLException, InvalidParameterException { >> > >>> >> > >>> // Build the base query; we only use "ordered locus" and we only >> want >> > >>> >> > >>> // IDs that begin with "Rv." >> > >>> PreparedStatement ps = >> ConnectionManager.getRelationalDBConnection().prepareStatement( >> > >>> >> > >>> "SELECT value, type " + >> > >>> >> > >>> "FROM genenametype INNER JOIN entrytype_genetype " + >> > >>> >> > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >> > >>> >> > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and >> entrytype_gene_hjid = ?"); >> > >>> ResultSet result; >> > >>> >> > >>> >> > >>> >> > >>> for (Row row : primarySystemTableManager.getRows()) { >> > >>> ps.setInt(1, Integer.parseInt(row.getValue( >> > >>> >> > >>> "UID"))); >> > >>> result = ps.executeQuery(); >> > >>> >> > >>> >> > >>> >> > >>> // We actually want to keep the case where multiple ordered locus >> > >>> >> > >>> // names appear. >> > >>> >> > >>> while (result.next()) { >> > >>> >> > >>> // We want this name to appear in the OrderedLocusNames >> > >>> >> > >>> // system table. >> > >>> >> > >>> for (String id : result.getString("value").split("/")) { >> > >>> tableManager.submit( >> > >>> >> > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id >> }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", >> GenMAPPBuilderUtilities.getSystemsDateString(version) }, { "UID", >> row.getValue("UID") } }); >> > >>> } >> > >>> >> > >>> } >> > >>> >> > >>> } >> > >>> >> > >>> >> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >> > >>> So now we want to build the base query which uses "ordered locus" >> and "orf" and we only want IDs that begin with "Rv". >> > >>> >> > >>> I know there are more comprehensive ways to search for gene ID's by >> matching gene ID prefix but "like Rv%" seemed to work thus far, we just need >> to tell it to search for XML tag type orf in addition to ordered locus. >> > >>> >> > >>> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' >> and entrytype_gene_hjid = ? " >> > >>> >> > >>> Here is a stab at it.... This part of our class was right as the >> server went down and my submission for week 6 assignment I can't seem to >> find. >> > >>> >> > >>> Is it possible to have two different types in the same query or >> should we rewrite a separate query for the orf tag? >> > >>> >> > >>> Richard >> > >>> >> > >>> >> > >>> >> > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> >> wrote: >> > >>> >> > >>> thanks and will do as directed. >> > >>> >> > >>> My previous, last paragraph comment - A way for programming code in >> email holding its format in a mail message similarly to how you can post >> code on forum pages? >> > >>> >> > >>> <code> >> > >>> blah >> > >>> blah >> > >>> blah >> > >>> </code> >> > >>> >> > >>> thanks! >> > >>> >> > >>> Richard >> > >>> >> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio < >> do...@lm...> wrote: >> > >>> >> > >>> Greetings, >> > >>> >> > >>> Actually, gmbuilder.properties is for the TallyEngine only. When >> dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to >> find those 69 IDs, it is the SpeciesProfile code, and *only* the >> SpeciesProfile code, that needs to be changed. >> > >>> >> > >>> Your take on how gmbuilder.properties is used, however, is >> understandable. It makes sense to assume that the TallyEngine code *and* >> the ID export code are based on the same characterization of the needed IDs. >> This replication is originally a historical artifact: SpeciesProfile was >> done first, and then TallyEngine was done later by another student. >> > >>> >> > >>> However, there are other factors beyond history that sort of >> necessitate this duplication of desired IDs: (skip the two bullets below if >> you'd rather cut to the chase of the work to be done, and discuss design >> issues later) >> > >>> >> > >>> - The actual XML import code is a black box: this is the "canned" >> JAXB library actually in action, and not our code at all. Plus, the XML >> import code really does not filter (nor should it), since the goal of the >> XML->relational database step is to fully capture the XML data in the >> relational database. So, XML count is necessarily separated from XML >> import. >> > >>> >> > >>> - The notion of a declarative mechanism for extracting IDs from the >> relational database (which is what gmbuilder.properties/TallyEngine uses) is >> interesting, but at the same time there is value in the arbitrary >> computation that can be done with Java (case in point: export two versions >> of an ID, with and without periods). This is not to say that it is >> impossible to do this declaratively, but let's just say that the procedural >> approach exists here and now, and a declarative approach will need more >> thought. >> > >>> >> > >>> These, and other factors, are good thoughts to hold onto and would >> be worthy of a good meeting discussion sometime, but bottom line for now: >> modifying the export behavior is a matter of editing the *SpeciesProfile* >> Java code, and not the gmbuilder.properties file. Turn your attention to >> that code. >> > >>> >> > >>> Now, as to annotating your code...I'd just put in code comments :) >> Or did you mean something else by tagging code in e-mail? >> > >>> >> > >>> John David N. Dionisio, PhD >> > >>> Associate Professor, Computer Science >> > >>> Loyola Marymount University >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >> > >>> >> > >>> > also, how do I tag code in email so it holds its formatting? I >> tried a few suggestions I found on the web but they aren't holding >> formatting or i'm just doing it wrong ;-D >> > >>> > >> > >>> > Richard >> > >>> > >> > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous < >> rbr...@gm...> wrote: >> > >>> > OK, have some updates and some suggestions: >> > >>> > >> > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene >> testing report. We verified that XML match does indeed find 4066 unique >> matches - 75 of which are not in the gdb and need to be. >> > >>> > >> > >>> > Dr. Dahlquist informed me that she was the one who completed the >> gene db testing report, not a previous student of BIO367 and had already >> verified which genes were missing and where they were to be found. I had >> (mistakenly) assumed that since a student had performed the gene database >> testing I had to redo all of the verification. >> > >>> > >> > >>> > So that said, of the 75 genes missing - 69 need to be included and >> 6 excluded. >> > >>> > Per the gene db testing report: "69 of them have an "a", "b", or >> "d" suffix. They are all found in the ORF tag and need to be included in the >> gdb." >> > >>> > >> > >>> > To solve this we need to add additional search criteria into the >> M. tuberculosis section in gmbuilder.properties below: >> > >>> > # Mycobacterium tuberculosis >> > >>> > >> > >>> > mycobacteriumtuberculosis_level_amount= >> > >>> > >> > >>> > 1 >> > >>> > >> > >>> > mycobacteriumtuberculosis_element_level0= >> > >>> > >> > >>> > uniprot/entry/gene/name&type&ordered locus >> > >>> > >> > >>> > mycobacteriumtuberculosis_query_level0= >> > >>> > >> > >>> > select count(*) from genenametype where type = 'ordered locus' and >> value like 'Rv%'; >> > >>> > >> > >>> > mycobacteriumtuberculosis_table_name_level0= >> > >>> > >> > >>> > Ordered Locus >> > >>> > SOLUTIONS: >> > >>> > >> > >>> > 1. So am i correct in my understanding that the second line is the >> query used by TallyEngine to read the XML file? If so then this is the issue >> we need to table for the moment until we get the gbd verified and >> re-released. We will revisit this to discover why it is not only reporting >> incorrectly but also why its added a second row of Ordered Locus on the >> TallyEngine results page. >> > >>> > >> > >>> > 2. The third line is the SQL query used by postgres during the >> export from XML to gdb. To find and get the ORF tagged genes could we not >> add the following lines and change the count in the first line: >> > >>> > >> > >>> > >> > >>> > # Mycobacterium tuberculosis >> > >>> > >> > >>> > mycobacteriumtuberculosis_level_amount=2 >> > >>> > >> > >>> > >> > >>> > >> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >> locus >> > >>> > >> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >> > >>> > >> > >>> > >> > >>> > mycobacteriumtuberculosis_query_level0= >> > >>> > >> > >>> > select count(*) from genenametype where type = 'ordered locus'; >> > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from >> genenametype where type = 'orf'; >> > >>> > >> > >>> > >> > >>> > mycobacteriumtuberculosis_table_name_level0= >> > >>> > >> > >>> > Ordered Locus >> > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >> > >>> > >> > >>> > >> ---------------------------------------------------------------------------------------------------------------------------- >> > >>> > >> > >>> > Of course these queries would have be manually verified prior to >> making these changes but this seems like we are moving in the right >> direction. >> > >>> > >> > >>> > Richard >> > >>> > >> > >>> > >> > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous < >> rbr...@gm...> wrote: >> > >>> > Just got done reading previous email and understand the change in >> priority. >> > >>> > >> > >>> > Will work on the missing ID's for now and shelve the the >> TalleyEngine issue for the moment. >> > >>> > >> > >>> > Also great about a more formalized weekly meeting. I was going to >> suggest it myself as it has been slow going so far as maybe i'm a bit too >> independent in this independent study class =D >> > >>> > >> > >>> > Will dig further into the missing ID's later tonight and during >> day tomorrow and report back. >> > >>> > >> > >>> > Richard >> > >>> > >> > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio < >> do...@lm...> wrote: >> > >>> > Hi Rich, >> > >>> > >> > >>> > No problem. The pertinent line you're referring to, for XML, is >> this, right above the line you copied: >> > >>> > >> > >>> > >> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >> locus >> > >>> > >> > >>> > The slash-separated section is the "path" of XML tags leading to >> the element of interest; then, after the ampersand, is a name/value pair for >> the desired attribute to count. Note that there is no hint of a >> *content*-based filter (nor is there the capability for one, as far as I can >> tell in the code). By "content," I mean that we can't specify filters based >> on what's *between* the tags. We can only go as far as filter by attribute >> value, e.g., type="ordered locus". >> > >>> > >> > >>> > But anyway, as mentioned in the earlier e-mail, let's have the >> missing IDs in the .gdb take precedence for now. Please take a look at the >> tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how >> the ID output can be customized, then let me know if you have any questions >> or need to confirm anything. >> > >>> > >> > >>> > John David N. Dionisio, PhD >> > >>> > Associate Professor, Computer Science >> > >>> > Loyola Marymount University >> > >>> > >> > >>> > >> > >>> > >> > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >> > >>> > >> > >>> > > Sorry been slammed with a programming assignment that kept >> needing continued iteration and it has been all consuming until last night. >> But I did get a chance to work with your comments and review the code again >> with a different mind set. >> > >>> > > >> > >>> > > Yes, I examined the gmbuilder.properties file ( the query is >> also in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned >> in a previous email ) but I don't think I see what you mean regarding the >> XML count. >> > >>> > > >> > >>> > > I understood that: mycobacteriumtuberculosis_query_level0=select >> count(*) from genenametype where type = 'ordered locus' and value like >> 'Rv%'; was the db query but don't see which is the XML count... or do they >> share the same query and you are saying that XML count doesn't recognize and >> therefore cannot use the 'Rv%' parameter? >> > >>> > > >> > >>> > > Richard >> > >>> > > >> > >>> > > >> > >>> > > >> > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < >> do...@lm...> wrote: >> > >>> > > Hi Rich, >> > >>> > > >> > >>> > > Sorry for the delay. Had some distractions coming into the >> weekend. >> > >>> > > >> > >>> > > You've looked at the code; have you looked at >> gmbuilder.properties? (I may have mentioned it a few e-mails ago, just as >> you were starting to dig into this) >> > >>> > > >> > >>> > > On the copy I have, the M. tuberculosis block looks like this >> (indentation is mine to set it apart): >> > >>> > > >> > >>> > > # Mycobacterium tuberculosis >> > >>> > > mycobacteriumtuberculosis_level_amount=1 >> > >>> > > >> > >>> > > >> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >> locus >> > >>> > > >> > >>> > > mycobacteriumtuberculosis_query_level0=select count(*) >> from genenametype where type = 'ordered locus' and value like 'Rv%'; >> > >>> > > >> > >>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >> > >>> > > >> > >>> > > There, I think, is the rub. Notice that the XML count does not >> filter on RV%. The SQL query does. >> > >>> > > >> > >>> > > Unfortunately, I don't think the TallyEngine can include >> selective filtering in the XML counts. If the need to do selective >> filtering on XML is necessary, then I think we're looking at a new >> functionality for you to implement (or, if this throws things off too much, >> this may have to be noted somewhere, that the XML vs. database counts may be >> off because the database count is doing some text-based filtering but the >> XML count does not). >> > >>> > > >> > >>> > > What does xmlpipedb-match say? That will at least tell you >> whether the 'RV%' count is indeed correct. >> > >>> > > >> > >>> > > John David N. Dionisio, PhD >> > >>> > > Associate Professor, Computer Science >> > >>> > > Loyola Marymount University >> > >>> > > >> > >>> > > >> > >>> > > >> > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >> > >>> > > >> > >>> > > > OK here is what I was able to put together from the past few >> hours of code review: >> > >>> > > > >> > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >> > >>> > > > -reveals that after the 2 System table modifications are made >> adding species name and link, a PreparedStatement is instantiated which >> builds and calls the base query. >> > >>> > > > >> > >>> > > > -The base query called is: ("SELECT value, type " + "FROM >> genenametype INNER JOIN entrytype_genetype " + >> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type >> = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") >> > >>> > > > >> > >>> > > > -So its looking in 'ordered locus' table/column for any tuple >> that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? >> . >> > >>> > > > The 'like' comparator and % usage are clear with the 'type' >> entrytype_gene_hjid = ? >> > >>> > > > >> > >>> > > > -To me it seems the query makes sense so the problem is likely >> elsewhere. >> > >>> > > > >> > >>> > > > GenMappBuilder.java: >> > >>> > > > -I found method doTallies() at code line 895 which: >> > >>> > > > Instantiates a Configuration called hibernateConfiguration and >> assigns to it the current hibernate configuration >> > >>> > > > Validates database settings by analyzing >> hibernateConfiguration >> > >>> > > > Instantiates a CriterionList for uniprot and assigns to it >> TallyType.UNIPROT >> > >>> > > > Instantiates a CriterionList for go and assigns to it >> TallyType.GO >> > >>> > > > Determines if both xml files exist >> > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on >> both xml files and their respective CriterionList >> > >>> > > > Results are then formatted for display in a table. >> > >>> > > > >> > >>> > > > -So enum TallyType which means that they are the only valid >> datatypes which TallyEngine accepts... go to know ... >> > >>> > > > >> > >>> > > > -Based on the screen shot of Tally Engine it would seem that >> both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly >> returning. Likely due to both using an incorrect query (as we previously >> supposed). But where are the queries?... the more I dig the more I think >> they are in the criterial all the work is done against. >> > >>> > > > >> > >>> > > > continuing the review: >> > >>> > > > getTallyResultsXML() calls Tally Engine instance method >> getXmlFileCounts(xmlFile) >> > >>> > > > getTallyResultsDatabase() calls Tally Engine instance method >> getDbcounts(new QueryEngine(hibernateConfiguration) >> > >>> > > > Both of these instanced methods originate from >> TallyEngine.java... >> > >>> > > > >> > >>> > > > TallyEngine.java: >> > >>> > > > >> > >>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a >> digester then processes against criteria... but this quickly becomes >> confusing and is hard to follow >> > >>> > > > >> > >>> > > > getDbcounts() then starts a db session and executes a query >> but then I also get a bit lost with my limited db knowledge. >> > >>> > > > >> > >>> > > > >> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >> > >>> > > > >> > >>> > > > OVERALL I think I'm getting closer to the issues but I still >> feel as if I'm missing some understanding to proceed further. Can you pass >> along some of that Dondi insight and steer me in the right direction? =D >> > >>> > > > >> > >>> > > > -DB Tally - Not having taken databases yet certainly is >> limiting my ability determine where the "criteria" are being set and how >> they are followed during session activities. Also is the query we have been >> looking for this whole time in the criteria or someplace else? >> > >>> > > > >> > >>> > > > -XML Tally - again is the query contained within the criteria >> that digestXmlFile() uses to parse? >> > >>> > > > >> > >>> > > > Richard >> > >>> > > > >> > >>> > > > >> > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio < >> do...@lm...> wrote: >> > >>> > > > Right, schema issues are unlikely. Most count discrepancies >> like this that I've seen have boiled down to forming the right query. Then, >> knowing the right query (in both XML and SQL), it's a matter of making sure >> that TallyEngine asks that same query. >> > >>> > > > >> > >>> > > > John David N. Dionisio, PhD >> > >>> > > > Associate Professor, Computer Science >> > >>> > > > Loyola Marymount University >> > >>> > > > >> > >>> > > > >> > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >> > >>> > > > >> > >>> > > > > OK, so based on your approach: >> > >>> > > > > >> > >>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match >> and sql queries needed for the respective results as you requested. >> > >>> > > > > >> > >>> > > > > I was also thinking I may need to review the schema from xml >> into postgres but the issue isn't likely a schema error. The error most >> likely lies in how xmlpipedbutils queries the data from xml source and >> writes to the tables what it returns? >> > >>> > > > > >> > >>> > > > > 2. I'll review the code: trace the entrance of tally engine >> in the gmbuilder code then follow it through the xmlpipedbutils. >> > >>> > > > > >> > >>> > > > > Richard >> > >>> > > > > >> > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio < >> do...@lm...> wrote: >> > >>> > > > > Just wanted to confirm (since I wasn't sure in the first >> e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils >> in SourceForge's Subversion repo. >> > >>> > > > > >> > >>> > > > > John David N. Dionisio, PhD >> > >>> > > > > Associate Professor, Computer Science >> > >>> > > > > Loyola Marymount University >> > >>> > > > > >> > >>> > > > > >> > >>> > > > > >> > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >> > >>> > > > > >> > >>> > > > > > Hi Dondi, >> > >>> > > > > > >> > >>> > > > > > So I'm at the point in working with M tuberculosis that I >> was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine >> results. >> > >>> > > > > > >> > >>> > > > > > gmb2b60 Results >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > Now the proverbial question - What next to solve the >> Ordered Locus import/count issue? >> > >>> > > > > > >> > >>> > > > > > ********************************************** >> > >>> > > > > > Here is my thought process: >> > >>> > > > > > >> > >>> > > > > > Step 1: How does the import process work at the high >> level? (obviously correct me if I'm wrong) >> > >>> > > > > > >> > >>> > > > > > I believe that basically as each XML tag is read, it is >> placed in the proper Postgres table(s) based on some criteria. There is also >> likely some sort of check that each individual tag is in valid XML format >> unless we don't care at this stage (care at export) or maybe the parser just >> skips over and goes on to the next . >> > >>> > > > > > >> > >>> > > > > > Step 2: What could be the problem? >> > >>> > > > > > >> > >>> > > > > > Either - >> > >>> > > > > > a. XML tags are being parsed incorrectly >> (ignored/skipped)? >> > >>> > > > > > b. Decision criteria of which table they should be added >> to? >> > >>> > > > > > >> > >>> > > > > > ********************************************** >> > >>> > > > > > >> > >>> > > > > > I read on the sourceforge wiki: >> > >>> > > > > > >> > >>> > > > > > XMLPipeDB has a modular architecture with three components >> that may be used separately or together. XSD-to-DB reads an XSD (XML Schema >> Definition) and automatically generates an SQL schema, Java classes, and >> Hibernate mappings. XMLPipeDB Utilities provides functionality for >> configuring the database, importing data, and performing queries. GenMAPP >> Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible >> Gene Databases based on data from UniProt and Gene Ontology (GO). >> > >>> > > > > > >> > >>> > > > > > So I should probably start with the XMLPipeDB Utilities >> which are where? I don't see any in the basic distribution or are they not >> standalone and called from the command line? >> > >>> > > > > > >> > >>> > > > > > Thanks! >> > >>> > > > > > >> > >>> > > > > > Richard >> > >>> > > > > >> > >>> > > > > >> > >>> > > > > <ATT00001..txt><ATT00002..txt> >> > >>> > > > >> > >>> > > > >> > >>> > > > >> ------------------------------------------------------------------------------ >> > >>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel >> Studio XE: >> > >>> > > > Pinpoint memory and threading errors before they happen. >> > >>> > > > Find and fix more than 250 security defects in the development >> cycle. >> > >>> > > > Locate bottlenecks in serial and parallel code that limit >> performance. >> > >>> > > > http://p.sf.net/sfu/intel-dev2devfeb >> > >>> > > > _______________________________________________ >> > >>> > > > xmlpipedb-developer mailing list >> > >>> > > > xml...@li... >> > >>> > > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > >>> > > > >> > >>> > > > <ATT00001..txt><ATT00002..txt> >> > >>> > > >> > >>> > > >> > >>> > > >> ------------------------------------------------------------------------------ >> > >>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel >> Studio XE: >> > >>> > > Pinpoint memory and threading errors before they happen. >> > >>> > > Find and fix more than 250 security defects in the development >> cycle. >> > >>> > > Locate bottlenecks in serial and parallel code that limit >> performance. >> > >>> > > http://p.sf.net/sfu/intel-dev2devfeb >> > >>> > > _______________________________________________ >> > >>> > > xmlpipedb-developer mailing list >> > >>> > > xml...@li... >> > >>> > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > >>> > > >> > >>> > > <ATT00001..txt><ATT00002..txt> >> > >>> > >> > >>> > >> > >>> > >> ------------------------------------------------------------------------------ >> > >>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel >> Studio XE: >> > >>> > Pinpoint memory and threading errors before they happen. >> > >>> > Find and fix more than 250 security defects in the development >> cycle. >> > >>> > Locate bottlenecks in serial and parallel code that limit >> performance. >> > >>> > http://p.sf.net/sfu/intel-dev2devfeb >> > >>> > _______________________________________________ >> > >>> > xmlpipedb-developer mailing list >> > >>> > xml...@li... >> > >>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > >>> > >> > >>> > >> > >>> > >> > >>> > <ATT00001..txt><ATT00002..txt> >> > >>> >> > >>> >> > >>> >> ------------------------------------------------------------------------------ >> > >>> The ultimate all-in-one performance toolkit: Intel(R) Parallel >> Studio XE: >> > >>> Pinpoint memory and threading errors before they happen. >> > >>> Find and fix more than 250 security defects in the development >> cycle. >> > >>> Locate bottlenecks in serial and parallel code that limit >> performance. >> > >>> http://p.sf.net/sfu/intel-dev2devfeb >> > >>> _______________________________________________ >> > >>> xmlpipedb-developer mailing list >> > >>> xml...@li... >> > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > >>> >> > >>> >> > >>> >> > >>> >> ------------------------------------------------------------------------------ >> > >>> Index, Search & Analyze Logs and other IT data in Real-Time with >> Splunk >> > >>> Collect, index and harness all the fast moving IT data generated by >> your >> > >>> applications, servers and devices whether physical, virtual or in >> the cloud. >> > >>> Deliver compliance at lower cost and gain new business insights. >> > >>> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >> > >>> _______________________________________________ >> > >>> xmlpipedb-developer mailing list >> > >>> xml...@li... >> > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > >> >> > >> >> ------------------------------------------------------------------------------ >> > >> Index, Search & Analyze Logs and other IT data in Real-Time with >> Splunk >> > >> Collect, index and harness all the fast moving IT data generated by >> your >> > >> applications, servers and devices whether physical, virtual or in the >> cloud. >> > >> Deliver compliance at lower cost and gain new business insights. >> > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >> > >> _______________________________________________ >> > >> xmlpipedb-developer mailing list >> > >> xml...@li... >> > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > >> >> > >> >> > >> >> ------------------------------------------------------------------------------ >> > >> Index, Search & Analyze Logs and other IT data in Real-Time with >> Splunk >> > >> Collect, index and harness all the fast moving IT data generated by >> your >> > >> applications, servers and devices whether physical, virtual or in the >> cloud. >> > >> Deliver compliance at lower cost and gain new business insights. >> > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >> > >> _______________________________________________ >> > >> xmlpipedb-developer mailing list >> > >> xml...@li... >> > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > > >> > > >> ------------------------------------------------------------------------------ >> > > Index, Search & Analyze Logs and other IT data in Real-Time with >> Splunk >> > > Collect, index and harness all the fast moving IT data generated by >> your >> > > applications, servers and devices whether physical, virtual or in the >> cloud. >> > > Deliver compliance at lower cost and gain new business insights. >> > > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >> > > _______________________________________________ >> > > xmlpipedb-developer mailing list >> > > xml...@li... >> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > > >> > > >> > > >> > > <ATT00001..txt><ATT00002..txt> >> > >> > >> > >> ------------------------------------------------------------------------------ >> > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >> > Collect, index and harness all the fast moving IT data generated by your >> > applications, servers and devices whether physical, virtual or in the >> cloud. >> > Deliver compliance at lower cost and gain new business insights. >> > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >> > _______________________________________________ >> > xmlpipedb-developer mailing list >> > xml...@li... >> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > >> > <ATT00001..txt><ATT00002..txt> >> >> >> >> ------------------------------------------------------------------------------ >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >> Collect, index and harness all the fast moving IT data generated by your >> applications, servers and devices whether physical, virtual or in the >> cloud. >> Deliver compliance at lower cost and gain new business insights. >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > > |
From: Richard B. <rbr...@gm...> - 2011-02-22 07:11:18
|
Understood. I'll check in with Dr. D in the afternoon tomorrow and discuss. Richard On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <do...@lm...>wrote: > Hi Rich, > > Addressing the release business first, let's put it this way: if the > remaining loose ends can be addressed by tomorrow, we can probably wait > until then. If unexpected snags are encountered, then it would be > worthwhile to release whatever you have. > > With that in mind, considering that you pretty much know the patterns of > the IDs that are needed, I think it will only take a little digital forensic > work now to figure out exactly which IDs are still needed. Once you know > what those are, you should: > > 1. Find where they are in the XML file. > 2. Knowing the XML location, find the corresponding table in the relational > database (table names are generally derived from tag/element names). > 3. Knowing the table in the database, write or extend the SpeciesProfile > query to retrieve that data. > > For the ID that must *not* be included, again it's a matter of tracking > down what this ID is. Knowing this straggler, you can then consult with Dr. > Dahlquist if this ID is truly a unique one-off, or is representative of a > pattern that we'll want to exclude. Either way, this ID can be omitted by > using "not" or "<>" or possibly "not like" or "not ~" (check PostgreSQL > where clause syntax to see where the negation can be applied). > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > > On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: > > > actually i had a typo (emailing from desktop system but testing on my > laptop... typed correctly here but wrong in pgadmin) but the results make > much more sense now. > > > > > > select count (*) from genenametype where (type = 'ordered locus' or type > = 'ORF') and value like 'Rv%'; > > returns 4058 > > > > select count (*) from genenametype where (type = 'ordered locus' or type > = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > returns 4058 > > > > > -------------------------------------------------------------------------------------------------------------------------- > > > > > > Continuing forward - > > > > The testing report says that 4066 unique matches exist in XML but 6 of > them were eliminated by Dr. D leaving the desired number at 4060. > > > > So now we are only 2 genes short with the query returning 4058... which > is also (conveniently) the sum of the two separate queries of 'ordered > locus' and 'ORF' respectively. > > > > But recall that Dr. D said that only 69 genes of the missing 75 were > tagged 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. > Adding that into missing genes puts us 3 short... > > > > Should I make the changes to the code and export a gdb so that analysis > can be done or wait until we work this through further? > > > > Richard > > > > > > > > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio <do...@lm...> > wrote: > > Hi Rich, > > > > The second form should have worked actually. What exactly was the error? > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > > > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: > > > > > hmm not taking parenthesis where I thought they should go... syntax > error > > > > > > select count (*) from genenametype where type = ('ordered locus' or > 'ORF') and value like 'Rv%'; > > > also tried > > > select count (*) from genenametype where (type = 'ordered locus' or > type = 'ORF') and value like 'Rv%'; > > > > > > > > > > > > > > > > > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous <rbr...@gm...> > wrote: > > > ah yes... i see it... > > > > > > > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio <do...@lm...> > wrote: > > > Watch your parentheses: "and" has greater precedence than "or" :) > > > > > > > > > John David N. Dionisio, PhD > > > Associate Professor, Computer Science > > > Loyola Marymount University > > > > > > > > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> wrote: > > > > > >> OK, so here are my query results from raw SQL: > > >> > > >> 1. using: like 'Rv%' > > >> > > >> select count (*) from genenametype where type = 'ordered locus' and > value like 'Rv%'; > > >> returns 3988 > > >> > > >> select count (*) from genenametype where type = 'ORF' and value like > 'Rv%'; > > >> returns 70 > > >> > > >> select count (*) from genenametype where type = 'ordered locus' or > type = 'ORF' and value like 'Rv%'; > > >> returns 7011 > > >> > > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' > > >> > > >> select count (*) from genenametype where type = 'ordered locus' and > value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > >> returns 3988 > > >> > > >> select count (*) from genenametype where type = 'ordered locus' or > type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > >> returns 7011 > > >> > > >> select count (*) from genenametype where type = 'ORF' and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > > >> returns 70 > > >> > > >> Conclusions: > > >> > > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 genes > were were looking for plus one more (maybe the count for missing genes is > off by 1?). > > >> > > >> 2. Combining the two types in a single query did not produce the > results that I expected (7011? - how did that happen????) so this is likely > not our solution... unless of course the query syntax isn't actually doing > what I think it is... > > >> > > >> 3. I would think the best course of action is to serialy run two > separate queries to capture all the required genes, then removing the one > unneeded gene if its truly not wanted. > > >> > > >> What do you think? > > >> > > >> Richard > > >> > > >> > > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio < > do...@lm...> wrote: > > >> I don't recall the exact details of the missing 69, but if your query > successfully returns them in raw SQL, then this is worth a try. You can > integrate into the same query as long as the same columns are returned, > which is the case here AFAIK, so go ahead and extend the existing query. > > >> > > >> > > >> John David N. Dionisio, PhD > > >> Associate Professor, Computer Science > > >> Loyola Marymount University > > >> > > >> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> > wrote: > > >> > > >>> So here is the appropriate code snippet from > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > > >>> public > > >>> > > >>> TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) throws > SQLException, InvalidParameterException { > > >>> > > >>> // Build the base query; we only use "ordered locus" and we only want > > >>> > > >>> // IDs that begin with "Rv." > > >>> PreparedStatement ps = > ConnectionManager.getRelationalDBConnection().prepareStatement( > > >>> > > >>> "SELECT value, type " + > > >>> > > >>> "FROM genenametype INNER JOIN entrytype_genetype " + > > >>> > > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + > > >>> > > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and > entrytype_gene_hjid = ?"); > > >>> ResultSet result; > > >>> > > >>> > > >>> > > >>> for (Row row : primarySystemTableManager.getRows()) { > > >>> ps.setInt(1, Integer.parseInt(row.getValue( > > >>> > > >>> "UID"))); > > >>> result = ps.executeQuery(); > > >>> > > >>> > > >>> > > >>> // We actually want to keep the case where multiple ordered locus > > >>> > > >>> // names appear. > > >>> > > >>> while (result.next()) { > > >>> > > >>> // We want this name to appear in the OrderedLocusNames > > >>> > > >>> // system table. > > >>> > > >>> for (String id : result.getString("value").split("/")) { > > >>> tableManager.submit( > > >>> > > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, > { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", > GenMAPPBuilderUtilities.getSystemsDateString(version) }, { "UID", > row.getValue("UID") } }); > > >>> } > > >>> > > >>> } > > >>> > > >>> } > > >>> > > >>> > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > >>> So now we want to build the base query which uses "ordered locus" and > "orf" and we only want IDs that begin with "Rv". > > >>> > > >>> I know there are more comprehensive ways to search for gene ID's by > matching gene ID prefix but "like Rv%" seemed to work thus far, we just need > to tell it to search for XML tag type orf in addition to ordered locus. > > >>> > > >>> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' > and entrytype_gene_hjid = ? " > > >>> > > >>> Here is a stab at it.... This part of our class was right as the > server went down and my submission for week 6 assignment I can't seem to > find. > > >>> > > >>> Is it possible to have two different types in the same query or > should we rewrite a separate query for the orf tag? > > >>> > > >>> Richard > > >>> > > >>> > > >>> > > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> > wrote: > > >>> > > >>> thanks and will do as directed. > > >>> > > >>> My previous, last paragraph comment - A way for programming code in > email holding its format in a mail message similarly to how you can post > code on forum pages? > > >>> > > >>> <code> > > >>> blah > > >>> blah > > >>> blah > > >>> </code> > > >>> > > >>> thanks! > > >>> > > >>> Richard > > >>> > > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio < > do...@lm...> wrote: > > >>> > > >>> Greetings, > > >>> > > >>> Actually, gmbuilder.properties is for the TallyEngine only. When > dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to > find those 69 IDs, it is the SpeciesProfile code, and *only* the > SpeciesProfile code, that needs to be changed. > > >>> > > >>> Your take on how gmbuilder.properties is used, however, is > understandable. It makes sense to assume that the TallyEngine code *and* > the ID export code are based on the same characterization of the needed IDs. > This replication is originally a historical artifact: SpeciesProfile was > done first, and then TallyEngine was done later by another student. > > >>> > > >>> However, there are other factors beyond history that sort of > necessitate this duplication of desired IDs: (skip the two bullets below if > you'd rather cut to the chase of the work to be done, and discuss design > issues later) > > >>> > > >>> - The actual XML import code is a black box: this is the "canned" > JAXB library actually in action, and not our code at all. Plus, the XML > import code really does not filter (nor should it), since the goal of the > XML->relational database step is to fully capture the XML data in the > relational database. So, XML count is necessarily separated from XML > import. > > >>> > > >>> - The notion of a declarative mechanism for extracting IDs from the > relational database (which is what gmbuilder.properties/TallyEngine uses) is > interesting, but at the same time there is value in the arbitrary > computation that can be done with Java (case in point: export two versions > of an ID, with and without periods). This is not to say that it is > impossible to do this declaratively, but let's just say that the procedural > approach exists here and now, and a declarative approach will need more > thought. > > >>> > > >>> These, and other factors, are good thoughts to hold onto and would be > worthy of a good meeting discussion sometime, but bottom line for now: > modifying the export behavior is a matter of editing the *SpeciesProfile* > Java code, and not the gmbuilder.properties file. Turn your attention to > that code. > > >>> > > >>> Now, as to annotating your code...I'd just put in code comments :) > Or did you mean something else by tagging code in e-mail? > > >>> > > >>> John David N. Dionisio, PhD > > >>> Associate Professor, Computer Science > > >>> Loyola Marymount University > > >>> > > >>> > > >>> > > >>> > > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: > > >>> > > >>> > also, how do I tag code in email so it holds its formatting? I > tried a few suggestions I found on the web but they aren't holding > formatting or i'm just doing it wrong ;-D > > >>> > > > >>> > Richard > > >>> > > > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> > wrote: > > >>> > OK, have some updates and some suggestions: > > >>> > > > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene > testing report. We verified that XML match does indeed find 4066 unique > matches - 75 of which are not in the gdb and need to be. > > >>> > > > >>> > Dr. Dahlquist informed me that she was the one who completed the > gene db testing report, not a previous student of BIO367 and had already > verified which genes were missing and where they were to be found. I had > (mistakenly) assumed that since a student had performed the gene database > testing I had to redo all of the verification. > > >>> > > > >>> > So that said, of the 75 genes missing - 69 need to be included and > 6 excluded. > > >>> > Per the gene db testing report: "69 of them have an "a", "b", or > "d" suffix. They are all found in the ORF tag and need to be included in the > gdb." > > >>> > > > >>> > To solve this we need to add additional search criteria into the M. > tuberculosis section in gmbuilder.properties below: > > >>> > # Mycobacterium tuberculosis > > >>> > > > >>> > mycobacteriumtuberculosis_level_amount= > > >>> > > > >>> > 1 > > >>> > > > >>> > mycobacteriumtuberculosis_element_level0= > > >>> > > > >>> > uniprot/entry/gene/name&type&ordered locus > > >>> > > > >>> > mycobacteriumtuberculosis_query_level0= > > >>> > > > >>> > select count(*) from genenametype where type = 'ordered locus' and > value like 'Rv%'; > > >>> > > > >>> > mycobacteriumtuberculosis_table_name_level0= > > >>> > > > >>> > Ordered Locus > > >>> > SOLUTIONS: > > >>> > > > >>> > 1. So am i correct in my understanding that the second line is the > query used by TallyEngine to read the XML file? If so then this is the issue > we need to table for the moment until we get the gbd verified and > re-released. We will revisit this to discover why it is not only reporting > incorrectly but also why its added a second row of Ordered Locus on the > TallyEngine results page. > > >>> > > > >>> > 2. The third line is the SQL query used by postgres during the > export from XML to gdb. To find and get the ORF tagged genes could we not > add the following lines and change the count in the first line: > > >>> > > > >>> > > > >>> > # Mycobacterium tuberculosis > > >>> > > > >>> > mycobacteriumtuberculosis_level_amount=2 > > >>> > > > >>> > > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > >>> > > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > > >>> > > > >>> > > > >>> > mycobacteriumtuberculosis_query_level0= > > >>> > > > >>> > select count(*) from genenametype where type = 'ordered locus'; > > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from > genenametype where type = 'orf'; > > >>> > > > >>> > > > >>> > mycobacteriumtuberculosis_table_name_level0= > > >>> > > > >>> > Ordered Locus > > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > > >>> > > > >>> > > ---------------------------------------------------------------------------------------------------------------------------- > > >>> > > > >>> > Of course these queries would have be manually verified prior to > making these changes but this seems like we are moving in the right > direction. > > >>> > > > >>> > Richard > > >>> > > > >>> > > > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> > wrote: > > >>> > Just got done reading previous email and understand the change in > priority. > > >>> > > > >>> > Will work on the missing ID's for now and shelve the the > TalleyEngine issue for the moment. > > >>> > > > >>> > Also great about a more formalized weekly meeting. I was going to > suggest it myself as it has been slow going so far as maybe i'm a bit too > independent in this independent study class =D > > >>> > > > >>> > Will dig further into the missing ID's later tonight and during day > tomorrow and report back. > > >>> > > > >>> > Richard > > >>> > > > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio < > do...@lm...> wrote: > > >>> > Hi Rich, > > >>> > > > >>> > No problem. The pertinent line you're referring to, for XML, is > this, right above the line you copied: > > >>> > > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > >>> > > > >>> > The slash-separated section is the "path" of XML tags leading to > the element of interest; then, after the ampersand, is a name/value pair for > the desired attribute to count. Note that there is no hint of a > *content*-based filter (nor is there the capability for one, as far as I can > tell in the code). By "content," I mean that we can't specify filters based > on what's *between* the tags. We can only go as far as filter by attribute > value, e.g., type="ordered locus". > > >>> > > > >>> > But anyway, as mentioned in the earlier e-mail, let's have the > missing IDs in the .gdb take precedence for now. Please take a look at the > tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how > the ID output can be customized, then let me know if you have any questions > or need to confirm anything. > > >>> > > > >>> > John David N. Dionisio, PhD > > >>> > Associate Professor, Computer Science > > >>> > Loyola Marymount University > > >>> > > > >>> > > > >>> > > > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: > > >>> > > > >>> > > Sorry been slammed with a programming assignment that kept > needing continued iteration and it has been all consuming until last night. > But I did get a chance to work with your comments and review the code again > with a different mind set. > > >>> > > > > >>> > > Yes, I examined the gmbuilder.properties file ( the query is also > in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a > previous email ) but I don't think I see what you mean regarding the XML > count. > > >>> > > > > >>> > > I understood that: mycobacteriumtuberculosis_query_level0=select > count(*) from genenametype where type = 'ordered locus' and value like > 'Rv%'; was the db query but don't see which is the XML count... or do they > share the same query and you are saying that XML count doesn't recognize and > therefore cannot use the 'Rv%' parameter? > > >>> > > > > >>> > > Richard > > >>> > > > > >>> > > > > >>> > > > > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < > do...@lm...> wrote: > > >>> > > Hi Rich, > > >>> > > > > >>> > > Sorry for the delay. Had some distractions coming into the > weekend. > > >>> > > > > >>> > > You've looked at the code; have you looked at > gmbuilder.properties? (I may have mentioned it a few e-mails ago, just as > you were starting to dig into this) > > >>> > > > > >>> > > On the copy I have, the M. tuberculosis block looks like this > (indentation is mine to set it apart): > > >>> > > > > >>> > > # Mycobacterium tuberculosis > > >>> > > mycobacteriumtuberculosis_level_amount=1 > > >>> > > > > >>> > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > >>> > > > > >>> > > mycobacteriumtuberculosis_query_level0=select count(*) > from genenametype where type = 'ordered locus' and value like 'Rv%'; > > >>> > > > > >>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > > >>> > > > > >>> > > There, I think, is the rub. Notice that the XML count does not > filter on RV%. The SQL query does. > > >>> > > > > >>> > > Unfortunately, I don't think the TallyEngine can include > selective filtering in the XML counts. If the need to do selective > filtering on XML is necessary, then I think we're looking at a new > functionality for you to implement (or, if this throws things off too much, > this may have to be noted somewhere, that the XML vs. database counts may be > off because the database count is doing some text-based filtering but the > XML count does not). > > >>> > > > > >>> > > What does xmlpipedb-match say? That will at least tell you > whether the 'RV%' count is indeed correct. > > >>> > > > > >>> > > John David N. Dionisio, PhD > > >>> > > Associate Professor, Computer Science > > >>> > > Loyola Marymount University > > >>> > > > > >>> > > > > >>> > > > > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: > > >>> > > > > >>> > > > OK here is what I was able to put together from the past few > hours of code review: > > >>> > > > > > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > > >>> > > > -reveals that after the 2 System table modifications are made > adding species name and link, a PreparedStatement is instantiated which > builds and calls the base query. > > >>> > > > > > >>> > > > -The base query called is: ("SELECT value, type " + "FROM > genenametype INNER JOIN entrytype_genetype " + > "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type > = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") > > >>> > > > > > >>> > > > -So its looking in 'ordered locus' table/column for any tuple > that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? > . > > >>> > > > The 'like' comparator and % usage are clear with the 'type' > entrytype_gene_hjid = ? > > >>> > > > > > >>> > > > -To me it seems the query makes sense so the problem is likely > elsewhere. > > >>> > > > > > >>> > > > GenMappBuilder.java: > > >>> > > > -I found method doTallies() at code line 895 which: > > >>> > > > Instantiates a Configuration called hibernateConfiguration and > assigns to it the current hibernate configuration > > >>> > > > Validates database settings by analyzing hibernateConfiguration > > >>> > > > Instantiates a CriterionList for uniprot and assigns to it > TallyType.UNIPROT > > >>> > > > Instantiates a CriterionList for go and assigns to it > TallyType.GO > > >>> > > > Determines if both xml files exist > > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on > both xml files and their respective CriterionList > > >>> > > > Results are then formatted for display in a table. > > >>> > > > > > >>> > > > -So enum TallyType which means that they are the only valid > datatypes which TallyEngine accepts... go to know ... > > >>> > > > > > >>> > > > -Based on the screen shot of Tally Engine it would seem that > both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly > returning. Likely due to both using an incorrect query (as we previously > supposed). But where are the queries?... the more I dig the more I think > they are in the criterial all the work is done against. > > >>> > > > > > >>> > > > continuing the review: > > >>> > > > getTallyResultsXML() calls Tally Engine instance method > getXmlFileCounts(xmlFile) > > >>> > > > getTallyResultsDatabase() calls Tally Engine instance method > getDbcounts(new QueryEngine(hibernateConfiguration) > > >>> > > > Both of these instanced methods originate from > TallyEngine.java... > > >>> > > > > > >>> > > > TallyEngine.java: > > >>> > > > > > >>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a > digester then processes against criteria... but this quickly becomes > confusing and is hard to follow > > >>> > > > > > >>> > > > getDbcounts() then starts a db session and executes a query but > then I also get a bit lost with my limited db knowledge. > > >>> > > > > > >>> > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > >>> > > > > > >>> > > > OVERALL I think I'm getting closer to the issues but I still > feel as if I'm missing some understanding to proceed further. Can you pass > along some of that Dondi insight and steer me in the right direction? =D > > >>> > > > > > >>> > > > -DB Tally - Not having taken databases yet certainly is > limiting my ability determine where the "criteria" are being set and how > they are followed during session activities. Also is the query we have been > looking for this whole time in the criteria or someplace else? > > >>> > > > > > >>> > > > -XML Tally - again is the query contained within the criteria > that digestXmlFile() uses to parse? > > >>> > > > > > >>> > > > Richard > > >>> > > > > > >>> > > > > > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio < > do...@lm...> wrote: > > >>> > > > Right, schema issues are unlikely. Most count discrepancies > like this that I've seen have boiled down to forming the right query. Then, > knowing the right query (in both XML and SQL), it's a matter of making sure > that TallyEngine asks that same query. > > >>> > > > > > >>> > > > John David N. Dionisio, PhD > > >>> > > > Associate Professor, Computer Science > > >>> > > > Loyola Marymount University > > >>> > > > > > >>> > > > > > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > > >>> > > > > > >>> > > > > OK, so based on your approach: > > >>> > > > > > > >>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match > and sql queries needed for the respective results as you requested. > > >>> > > > > > > >>> > > > > I was also thinking I may need to review the schema from xml > into postgres but the issue isn't likely a schema error. The error most > likely lies in how xmlpipedbutils queries the data from xml source and > writes to the tables what it returns? > > >>> > > > > > > >>> > > > > 2. I'll review the code: trace the entrance of tally engine > in the gmbuilder code then follow it through the xmlpipedbutils. > > >>> > > > > > > >>> > > > > Richard > > >>> > > > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio < > do...@lm...> wrote: > > >>> > > > > Just wanted to confirm (since I wasn't sure in the first > e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils > in SourceForge's Subversion repo. > > >>> > > > > > > >>> > > > > John David N. Dionisio, PhD > > >>> > > > > Associate Professor, Computer Science > > >>> > > > > Loyola Marymount University > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: > > >>> > > > > > > >>> > > > > > Hi Dondi, > > >>> > > > > > > > >>> > > > > > So I'm at the point in working with M tuberculosis that I > was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine > results. > > >>> > > > > > > > >>> > > > > > gmb2b60 Results > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > Now the proverbial question - What next to solve the > Ordered Locus import/count issue? > > >>> > > > > > > > >>> > > > > > ********************************************** > > >>> > > > > > Here is my thought process: > > >>> > > > > > > > >>> > > > > > Step 1: How does the import process work at the high level? > (obviously correct me if I'm wrong) > > >>> > > > > > > > >>> > > > > > I believe that basically as each XML tag is read, it is > placed in the proper Postgres table(s) based on some criteria. There is also > likely some sort of check that each individual tag is in valid XML format > unless we don't care at this stage (care at export) or maybe the parser just > skips over and goes on to the next . > > >>> > > > > > > > >>> > > > > > Step 2: What could be the problem? > > >>> > > > > > > > >>> > > > > > Either - > > >>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? > > >>> > > > > > b. Decision criteria of which table they should be added > to? > > >>> > > > > > > > >>> > > > > > ********************************************** > > >>> > > > > > > > >>> > > > > > I read on the sourceforge wiki: > > >>> > > > > > > > >>> > > > > > XMLPipeDB has a modular architecture with three components > that may be used separately or together. XSD-to-DB reads an XSD (XML Schema > Definition) and automatically generates an SQL schema, Java classes, and > Hibernate mappings. XMLPipeDB Utilities provides functionality for > configuring the database, importing data, and performing queries. GenMAPP > Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible > Gene Databases based on data from UniProt and Gene Ontology (GO). > > >>> > > > > > > > >>> > > > > > So I should probably start with the XMLPipeDB Utilities > which are where? I don't see any in the basic distribution or are they not > standalone and called from the command line? > > >>> > > > > > > > >>> > > > > > Thanks! > > >>> > > > > > > > >>> > > > > > Richard > > >>> > > > > > > >>> > > > > > > >>> > > > > <ATT00001..txt><ATT00002..txt> > > >>> > > > > > >>> > > > > > >>> > > > > ------------------------------------------------------------------------------ > > >>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel > Studio XE: > > >>> > > > Pinpoint memory and threading errors before they happen. > > >>> > > > Find and fix more than 250 security defects in the development > cycle. > > >>> > > > Locate bottlenecks in serial and parallel code that limit > performance. > > >>> > > > http://p.sf.net/sfu/intel-dev2devfeb > > >>> > > > _______________________________________________ > > >>> > > > xmlpipedb-developer mailing list > > >>> > > > xml...@li... > > >>> > > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > > > > >>> > > > <ATT00001..txt><ATT00002..txt> > > >>> > > > > >>> > > > > >>> > > > ------------------------------------------------------------------------------ > > >>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel > Studio XE: > > >>> > > Pinpoint memory and threading errors before they happen. > > >>> > > Find and fix more than 250 security defects in the development > cycle. > > >>> > > Locate bottlenecks in serial and parallel code that limit > performance. > > >>> > > http://p.sf.net/sfu/intel-dev2devfeb > > >>> > > _______________________________________________ > > >>> > > xmlpipedb-developer mailing list > > >>> > > xml...@li... > > >>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > > > >>> > > <ATT00001..txt><ATT00002..txt> > > >>> > > > >>> > > > >>> > > ------------------------------------------------------------------------------ > > >>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel > Studio XE: > > >>> > Pinpoint memory and threading errors before they happen. > > >>> > Find and fix more than 250 security defects in the development > cycle. > > >>> > Locate bottlenecks in serial and parallel code that limit > performance. > > >>> > http://p.sf.net/sfu/intel-dev2devfeb > > >>> > _______________________________________________ > > >>> > xmlpipedb-developer mailing list > > >>> > xml...@li... > > >>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > > >>> > > > >>> > > > >>> > <ATT00001..txt><ATT00002..txt> > > >>> > > >>> > > >>> > ------------------------------------------------------------------------------ > > >>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio > XE: > > >>> Pinpoint memory and threading errors before they happen. > > >>> Find and fix more than 250 security defects in the development cycle. > > >>> Locate bottlenecks in serial and parallel code that limit > performance. > > >>> http://p.sf.net/sfu/intel-dev2devfeb > > >>> _______________________________________________ > > >>> xmlpipedb-developer mailing list > > >>> xml...@li... > > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>> > > >>> > > >>> > > >>> > ------------------------------------------------------------------------------ > > >>> Index, Search & Analyze Logs and other IT data in Real-Time with > Splunk > > >>> Collect, index and harness all the fast moving IT data generated by > your > > >>> applications, servers and devices whether physical, virtual or in the > cloud. > > >>> Deliver compliance at lower cost and gain new business insights. > > >>> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > >>> _______________________________________________ > > >>> xmlpipedb-developer mailing list > > >>> xml...@li... > > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > ------------------------------------------------------------------------------ > > >> Index, Search & Analyze Logs and other IT data in Real-Time with > Splunk > > >> Collect, index and harness all the fast moving IT data generated by > your > > >> applications, servers and devices whether physical, virtual or in the > cloud. > > >> Deliver compliance at lower cost and gain new business insights. > > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > > >> > ------------------------------------------------------------------------------ > > >> Index, Search & Analyze Logs and other IT data in Real-Time with > Splunk > > >> Collect, index and harness all the fast moving IT data generated by > your > > >> applications, servers and devices whether physical, virtual or in the > cloud. > > >> Deliver compliance at lower cost and gain new business insights. > > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > ------------------------------------------------------------------------------ > > > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > > > Collect, index and harness all the fast moving IT data generated by > your > > > applications, servers and devices whether physical, virtual or in the > cloud. > > > Deliver compliance at lower cost and gain new business insights. > > > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > > _______________________________________________ > > > xmlpipedb-developer mailing list > > > xml...@li... > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > > ------------------------------------------------------------------------------ > > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > > Collect, index and harness all the fast moving IT data generated by your > > applications, servers and devices whether physical, virtual or in the > cloud. > > Deliver compliance at lower cost and gain new business insights. > > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > <ATT00001..txt><ATT00002..txt> > > > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the > cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > |
From: John D. N. D. <do...@lm...> - 2011-02-22 07:07:04
|
Hi Rich, Addressing the release business first, let's put it this way: if the remaining loose ends can be addressed by tomorrow, we can probably wait until then. If unexpected snags are encountered, then it would be worthwhile to release whatever you have. With that in mind, considering that you pretty much know the patterns of the IDs that are needed, I think it will only take a little digital forensic work now to figure out exactly which IDs are still needed. Once you know what those are, you should: 1. Find where they are in the XML file. 2. Knowing the XML location, find the corresponding table in the relational database (table names are generally derived from tag/element names). 3. Knowing the table in the database, write or extend the SpeciesProfile query to retrieve that data. For the ID that must *not* be included, again it's a matter of tracking down what this ID is. Knowing this straggler, you can then consult with Dr. Dahlquist if this ID is truly a unique one-off, or is representative of a pattern that we'll want to exclude. Either way, this ID can be omitted by using "not" or "<>" or possibly "not like" or "not ~" (check PostgreSQL where clause syntax to see where the negation can be applied). John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: > actually i had a typo (emailing from desktop system but testing on my laptop... typed correctly here but wrong in pgadmin) but the results make much more sense now. > > > select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value like 'Rv%'; > returns 4058 > > select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 4058 > > -------------------------------------------------------------------------------------------------------------------------- > > > Continuing forward - > > The testing report says that 4066 unique matches exist in XML but 6 of them were eliminated by Dr. D leaving the desired number at 4060. > > So now we are only 2 genes short with the query returning 4058... which is also (conveniently) the sum of the two separate queries of 'ordered locus' and 'ORF' respectively. > > But recall that Dr. D said that only 69 genes of the missing 75 were tagged 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. Adding that into missing genes puts us 3 short... > > Should I make the changes to the code and export a gdb so that analysis can be done or wait until we work this through further? > > Richard > > > > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio <do...@lm...> wrote: > Hi Rich, > > The second form should have worked actually. What exactly was the error? > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: > > > hmm not taking parenthesis where I thought they should go... syntax error > > > > select count (*) from genenametype where type = ('ordered locus' or 'ORF') and value like 'Rv%'; > > also tried > > select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value like 'Rv%'; > > > > > > > > > > > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous <rbr...@gm...> wrote: > > ah yes... i see it... > > > > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio <do...@lm...> wrote: > > Watch your parentheses: "and" has greater precedence than "or" :) > > > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> wrote: > > > >> OK, so here are my query results from raw SQL: > >> > >> 1. using: like 'Rv%' > >> > >> select count (*) from genenametype where type = 'ordered locus' and value like 'Rv%'; > >> returns 3988 > >> > >> select count (*) from genenametype where type = 'ORF' and value like 'Rv%'; > >> returns 70 > >> > >> select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value like 'Rv%'; > >> returns 7011 > >> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' > >> > >> select count (*) from genenametype where type = 'ordered locus' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 3988 > >> > >> select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 7011 > >> > >> select count (*) from genenametype where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 70 > >> > >> Conclusions: > >> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 genes were were looking for plus one more (maybe the count for missing genes is off by 1?). > >> > >> 2. Combining the two types in a single query did not produce the results that I expected (7011? - how did that happen????) so this is likely not our solution... unless of course the query syntax isn't actually doing what I think it is... > >> > >> 3. I would think the best course of action is to serialy run two separate queries to capture all the required genes, then removing the one unneeded gene if its truly not wanted. > >> > >> What do you think? > >> > >> Richard > >> > >> > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...> wrote: > >> I don't recall the exact details of the missing 69, but if your query successfully returns them in raw SQL, then this is worth a try. You can integrate into the same query as long as the same columns are returned, which is the case here AFAIK, so go ahead and extend the existing query. > >> > >> > >> John David N. Dionisio, PhD > >> Associate Professor, Computer Science > >> Loyola Marymount University > >> > >> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: > >> > >>> So here is the appropriate code snippet from MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> public > >>> > >>> TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) throws SQLException, InvalidParameterException { > >>> > >>> // Build the base query; we only use "ordered locus" and we only want > >>> > >>> // IDs that begin with "Rv." > >>> PreparedStatement ps = ConnectionManager.getRelationalDBConnection().prepareStatement( > >>> > >>> "SELECT value, type " + > >>> > >>> "FROM genenametype INNER JOIN entrytype_genetype " + > >>> > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + > >>> > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?"); > >>> ResultSet result; > >>> > >>> > >>> > >>> for (Row row : primarySystemTableManager.getRows()) { > >>> ps.setInt(1, Integer.parseInt(row.getValue( > >>> > >>> "UID"))); > >>> result = ps.executeQuery(); > >>> > >>> > >>> > >>> // We actually want to keep the case where multiple ordered locus > >>> > >>> // names appear. > >>> > >>> while (result.next()) { > >>> > >>> // We want this name to appear in the OrderedLocusNames > >>> > >>> // system table. > >>> > >>> for (String id : result.getString("value").split("/")) { > >>> tableManager.submit( > >>> > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) }, { "UID", row.getValue("UID") } }); > >>> } > >>> > >>> } > >>> > >>> } > >>> > >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > >>> So now we want to build the base query which uses "ordered locus" and "orf" and we only want IDs that begin with "Rv". > >>> > >>> I know there are more comprehensive ways to search for gene ID's by matching gene ID prefix but "like Rv%" seemed to work thus far, we just need to tell it to search for XML tag type orf in addition to ordered locus. > >>> > >>> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and entrytype_gene_hjid = ? " > >>> > >>> Here is a stab at it.... This part of our class was right as the server went down and my submission for week 6 assignment I can't seem to find. > >>> > >>> Is it possible to have two different types in the same query or should we rewrite a separate query for the orf tag? > >>> > >>> Richard > >>> > >>> > >>> > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote: > >>> > >>> thanks and will do as directed. > >>> > >>> My previous, last paragraph comment - A way for programming code in email holding its format in a mail message similarly to how you can post code on forum pages? > >>> > >>> <code> > >>> blah > >>> blah > >>> blah > >>> </code> > >>> > >>> thanks! > >>> > >>> Richard > >>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <do...@lm...> wrote: > >>> > >>> Greetings, > >>> > >>> Actually, gmbuilder.properties is for the TallyEngine only. When dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to find those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile code, that needs to be changed. > >>> > >>> Your take on how gmbuilder.properties is used, however, is understandable. It makes sense to assume that the TallyEngine code *and* the ID export code are based on the same characterization of the needed IDs. This replication is originally a historical artifact: SpeciesProfile was done first, and then TallyEngine was done later by another student. > >>> > >>> However, there are other factors beyond history that sort of necessitate this duplication of desired IDs: (skip the two bullets below if you'd rather cut to the chase of the work to be done, and discuss design issues later) > >>> > >>> - The actual XML import code is a black box: this is the "canned" JAXB library actually in action, and not our code at all. Plus, the XML import code really does not filter (nor should it), since the goal of the XML->relational database step is to fully capture the XML data in the relational database. So, XML count is necessarily separated from XML import. > >>> > >>> - The notion of a declarative mechanism for extracting IDs from the relational database (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at the same time there is value in the arbitrary computation that can be done with Java (case in point: export two versions of an ID, with and without periods). This is not to say that it is impossible to do this declaratively, but let's just say that the procedural approach exists here and now, and a declarative approach will need more thought. > >>> > >>> These, and other factors, are good thoughts to hold onto and would be worthy of a good meeting discussion sometime, but bottom line for now: modifying the export behavior is a matter of editing the *SpeciesProfile* Java code, and not the gmbuilder.properties file. Turn your attention to that code. > >>> > >>> Now, as to annotating your code...I'd just put in code comments :) Or did you mean something else by tagging code in e-mail? > >>> > >>> John David N. Dionisio, PhD > >>> Associate Professor, Computer Science > >>> Loyola Marymount University > >>> > >>> > >>> > >>> > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: > >>> > >>> > also, how do I tag code in email so it holds its formatting? I tried a few suggestions I found on the web but they aren't holding formatting or i'm just doing it wrong ;-D > >>> > > >>> > Richard > >>> > > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote: > >>> > OK, have some updates and some suggestions: > >>> > > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing report. We verified that XML match does indeed find 4066 unique matches - 75 of which are not in the gdb and need to be. > >>> > > >>> > Dr. Dahlquist informed me that she was the one who completed the gene db testing report, not a previous student of BIO367 and had already verified which genes were missing and where they were to be found. I had (mistakenly) assumed that since a student had performed the gene database testing I had to redo all of the verification. > >>> > > >>> > So that said, of the 75 genes missing - 69 need to be included and 6 excluded. > >>> > Per the gene db testing report: "69 of them have an "a", "b", or "d" suffix. They are all found in the ORF tag and need to be included in the gdb." > >>> > > >>> > To solve this we need to add additional search criteria into the M. tuberculosis section in gmbuilder.properties below: > >>> > # Mycobacterium tuberculosis > >>> > > >>> > mycobacteriumtuberculosis_level_amount= > >>> > > >>> > 1 > >>> > > >>> > mycobacteriumtuberculosis_element_level0= > >>> > > >>> > uniprot/entry/gene/name&type&ordered locus > >>> > > >>> > mycobacteriumtuberculosis_query_level0= > >>> > > >>> > select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; > >>> > > >>> > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > Ordered Locus > >>> > SOLUTIONS: > >>> > > >>> > 1. So am i correct in my understanding that the second line is the query used by TallyEngine to read the XML file? If so then this is the issue we need to table for the moment until we get the gbd verified and re-released. We will revisit this to discover why it is not only reporting incorrectly but also why its added a second row of Ordered Locus on the TallyEngine results page. > >>> > > >>> > 2. The third line is the SQL query used by postgres during the export from XML to gdb. To find and get the ORF tagged genes could we not add the following lines and change the count in the first line: > >>> > > >>> > > >>> > # Mycobacterium tuberculosis > >>> > > >>> > mycobacteriumtuberculosis_level_amount=2 > >>> > > >>> > > >>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus > >>> > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > >>> > > >>> > > >>> > mycobacteriumtuberculosis_query_level0= > >>> > > >>> > select count(*) from genenametype where type = 'ordered locus'; > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf'; > >>> > > >>> > > >>> > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > Ordered Locus > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > >>> > > >>> > ---------------------------------------------------------------------------------------------------------------------------- > >>> > > >>> > Of course these queries would have be manually verified prior to making these changes but this seems like we are moving in the right direction. > >>> > > >>> > Richard > >>> > > >>> > > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: > >>> > Just got done reading previous email and understand the change in priority. > >>> > > >>> > Will work on the missing ID's for now and shelve the the TalleyEngine issue for the moment. > >>> > > >>> > Also great about a more formalized weekly meeting. I was going to suggest it myself as it has been slow going so far as maybe i'm a bit too independent in this independent study class =D > >>> > > >>> > Will dig further into the missing ID's later tonight and during day tomorrow and report back. > >>> > > >>> > Richard > >>> > > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote: > >>> > Hi Rich, > >>> > > >>> > No problem. The pertinent line you're referring to, for XML, is this, right above the line you copied: > >>> > > >>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus > >>> > > >>> > The slash-separated section is the "path" of XML tags leading to the element of interest; then, after the ampersand, is a name/value pair for the desired attribute to count. Note that there is no hint of a *content*-based filter (nor is there the capability for one, as far as I can tell in the code). By "content," I mean that we can't specify filters based on what's *between* the tags. We can only go as far as filter by attribute value, e.g., type="ordered locus". > >>> > > >>> > But anyway, as mentioned in the earlier e-mail, let's have the missing IDs in the .gdb take precedence for now. Please take a look at the tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how the ID output can be customized, then let me know if you have any questions or need to confirm anything. > >>> > > >>> > John David N. Dionisio, PhD > >>> > Associate Professor, Computer Science > >>> > Loyola Marymount University > >>> > > >>> > > >>> > > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: > >>> > > >>> > > Sorry been slammed with a programming assignment that kept needing continued iteration and it has been all consuming until last night. But I did get a chance to work with your comments and review the code again with a different mind set. > >>> > > > >>> > > Yes, I examined the gmbuilder.properties file ( the query is also in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a previous email ) but I don't think I see what you mean regarding the XML count. > >>> > > > >>> > > I understood that: mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; was the db query but don't see which is the XML count... or do they share the same query and you are saying that XML count doesn't recognize and therefore cannot use the 'Rv%' parameter? > >>> > > > >>> > > Richard > >>> > > > >>> > > > >>> > > > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> wrote: > >>> > > Hi Rich, > >>> > > > >>> > > Sorry for the delay. Had some distractions coming into the weekend. > >>> > > > >>> > > You've looked at the code; have you looked at gmbuilder.properties? (I may have mentioned it a few e-mails ago, just as you were starting to dig into this) > >>> > > > >>> > > On the copy I have, the M. tuberculosis block looks like this (indentation is mine to set it apart): > >>> > > > >>> > > # Mycobacterium tuberculosis > >>> > > mycobacteriumtuberculosis_level_amount=1 > >>> > > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus > >>> > > > >>> > > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; > >>> > > > >>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > >>> > > > >>> > > There, I think, is the rub. Notice that the XML count does not filter on RV%. The SQL query does. > >>> > > > >>> > > Unfortunately, I don't think the TallyEngine can include selective filtering in the XML counts. If the need to do selective filtering on XML is necessary, then I think we're looking at a new functionality for you to implement (or, if this throws things off too much, this may have to be noted somewhere, that the XML vs. database counts may be off because the database count is doing some text-based filtering but the XML count does not). > >>> > > > >>> > > What does xmlpipedb-match say? That will at least tell you whether the 'RV%' count is indeed correct. > >>> > > > >>> > > John David N. Dionisio, PhD > >>> > > Associate Professor, Computer Science > >>> > > Loyola Marymount University > >>> > > > >>> > > > >>> > > > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: > >>> > > > >>> > > > OK here is what I was able to put together from the past few hours of code review: > >>> > > > > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> > > > -reveals that after the 2 System table modifications are made adding species name and link, a PreparedStatement is instantiated which builds and calls the base query. > >>> > > > > >>> > > > -The base query called is: ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") > >>> > > > > >>> > > > -So its looking in 'ordered locus' table/column for any tuple that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . > >>> > > > The 'like' comparator and % usage are clear with the 'type' entrytype_gene_hjid = ? > >>> > > > > >>> > > > -To me it seems the query makes sense so the problem is likely elsewhere. > >>> > > > > >>> > > > GenMappBuilder.java: > >>> > > > -I found method doTallies() at code line 895 which: > >>> > > > Instantiates a Configuration called hibernateConfiguration and assigns to it the current hibernate configuration > >>> > > > Validates database settings by analyzing hibernateConfiguration > >>> > > > Instantiates a CriterionList for uniprot and assigns to it TallyType.UNIPROT > >>> > > > Instantiates a CriterionList for go and assigns to it TallyType.GO > >>> > > > Determines if both xml files exist > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on both xml files and their respective CriterionList > >>> > > > Results are then formatted for display in a table. > >>> > > > > >>> > > > -So enum TallyType which means that they are the only valid datatypes which TallyEngine accepts... go to know ... > >>> > > > > >>> > > > -Based on the screen shot of Tally Engine it would seem that both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly returning. Likely due to both using an incorrect query (as we previously supposed). But where are the queries?... the more I dig the more I think they are in the criterial all the work is done against. > >>> > > > > >>> > > > continuing the review: > >>> > > > getTallyResultsXML() calls Tally Engine instance method getXmlFileCounts(xmlFile) > >>> > > > getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new QueryEngine(hibernateConfiguration) > >>> > > > Both of these instanced methods originate from TallyEngine.java... > >>> > > > > >>> > > > TallyEngine.java: > >>> > > > > >>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a digester then processes against criteria... but this quickly becomes confusing and is hard to follow > >>> > > > > >>> > > > getDbcounts() then starts a db session and executes a query but then I also get a bit lost with my limited db knowledge. > >>> > > > > >>> > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > >>> > > > > >>> > > > OVERALL I think I'm getting closer to the issues but I still feel as if I'm missing some understanding to proceed further. Can you pass along some of that Dondi insight and steer me in the right direction? =D > >>> > > > > >>> > > > -DB Tally - Not having taken databases yet certainly is limiting my ability determine where the "criteria" are being set and how they are followed during session activities. Also is the query we have been looking for this whole time in the criteria or someplace else? > >>> > > > > >>> > > > -XML Tally - again is the query contained within the criteria that digestXmlFile() uses to parse? > >>> > > > > >>> > > > Richard > >>> > > > > >>> > > > > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> wrote: > >>> > > > Right, schema issues are unlikely. Most count discrepancies like this that I've seen have boiled down to forming the right query. Then, knowing the right query (in both XML and SQL), it's a matter of making sure that TallyEngine asks that same query. > >>> > > > > >>> > > > John David N. Dionisio, PhD > >>> > > > Associate Professor, Computer Science > >>> > > > Loyola Marymount University > >>> > > > > >>> > > > > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > >>> > > > > >>> > > > > OK, so based on your approach: > >>> > > > > > >>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and sql queries needed for the respective results as you requested. > >>> > > > > > >>> > > > > I was also thinking I may need to review the schema from xml into postgres but the issue isn't likely a schema error. The error most likely lies in how xmlpipedbutils queries the data from xml source and writes to the tables what it returns? > >>> > > > > > >>> > > > > 2. I'll review the code: trace the entrance of tally engine in the gmbuilder code then follow it through the xmlpipedbutils. > >>> > > > > > >>> > > > > Richard > >>> > > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <do...@lm...> wrote: > >>> > > > > Just wanted to confirm (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in SourceForge's Subversion repo. > >>> > > > > > >>> > > > > John David N. Dionisio, PhD > >>> > > > > Associate Professor, Computer Science > >>> > > > > Loyola Marymount University > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: > >>> > > > > > >>> > > > > > Hi Dondi, > >>> > > > > > > >>> > > > > > So I'm at the point in working with M tuberculosis that I was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. > >>> > > > > > > >>> > > > > > gmb2b60 Results > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > Now the proverbial question - What next to solve the Ordered Locus import/count issue? > >>> > > > > > > >>> > > > > > ********************************************** > >>> > > > > > Here is my thought process: > >>> > > > > > > >>> > > > > > Step 1: How does the import process work at the high level? (obviously correct me if I'm wrong) > >>> > > > > > > >>> > > > > > I believe that basically as each XML tag is read, it is placed in the proper Postgres table(s) based on some criteria. There is also likely some sort of check that each individual tag is in valid XML format unless we don't care at this stage (care at export) or maybe the parser just skips over and goes on to the next . > >>> > > > > > > >>> > > > > > Step 2: What could be the problem? > >>> > > > > > > >>> > > > > > Either - > >>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? > >>> > > > > > b. Decision criteria of which table they should be added to? > >>> > > > > > > >>> > > > > > ********************************************** > >>> > > > > > > >>> > > > > > I read on the sourceforge wiki: > >>> > > > > > > >>> > > > > > XMLPipeDB has a modular architecture with three components that may be used separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and automatically generates an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB Utilities provides functionality for configuring the database, importing data, and performing queries. GenMAPP Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on data from UniProt and Gene Ontology (GO). > >>> > > > > > > >>> > > > > > So I should probably start with the XMLPipeDB Utilities which are where? I don't see any in the basic distribution or are they not standalone and called from the command line? > >>> > > > > > > >>> > > > > > Thanks! > >>> > > > > > > >>> > > > > > Richard > >>> > > > > > >>> > > > > > >>> > > > > <ATT00001..txt><ATT00002..txt> > >>> > > > > >>> > > > > >>> > > > ------------------------------------------------------------------------------ > >>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > >>> > > > Pinpoint memory and threading errors before they happen. > >>> > > > Find and fix more than 250 security defects in the development cycle. > >>> > > > Locate bottlenecks in serial and parallel code that limit performance. > >>> > > > http://p.sf.net/sfu/intel-dev2devfeb > >>> > > > _______________________________________________ > >>> > > > xmlpipedb-developer mailing list > >>> > > > xml...@li... > >>> > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>> > > > > >>> > > > <ATT00001..txt><ATT00002..txt> > >>> > > > >>> > > > >>> > > ------------------------------------------------------------------------------ > >>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > >>> > > Pinpoint memory and threading errors before they happen. > >>> > > Find and fix more than 250 security defects in the development cycle. > >>> > > Locate bottlenecks in serial and parallel code that limit performance. > >>> > > http://p.sf.net/sfu/intel-dev2devfeb > >>> > > _______________________________________________ > >>> > > xmlpipedb-developer mailing list > >>> > > xml...@li... > >>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>> > > > >>> > > <ATT00001..txt><ATT00002..txt> > >>> > > >>> > > >>> > ------------------------------------------------------------------------------ > >>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > >>> > Pinpoint memory and threading errors before they happen. > >>> > Find and fix more than 250 security defects in the development cycle. > >>> > Locate bottlenecks in serial and parallel code that limit performance. > >>> > http://p.sf.net/sfu/intel-dev2devfeb > >>> > _______________________________________________ > >>> > xmlpipedb-developer mailing list > >>> > xml...@li... > >>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>> > > >>> > > >>> > > >>> > <ATT00001..txt><ATT00002..txt> > >>> > >>> > >>> ------------------------------------------------------------------------------ > >>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > >>> Pinpoint memory and threading errors before they happen. > >>> Find and fix more than 250 security defects in the development cycle. > >>> Locate bottlenecks in serial and parallel code that limit performance. > >>> http://p.sf.net/sfu/intel-dev2devfeb > >>> _______________________________________________ > >>> xmlpipedb-developer mailing list > >>> xml...@li... > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>> > >>> > >>> > >>> ------------------------------------------------------------------------------ > >>> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > >>> Collect, index and harness all the fast moving IT data generated by your > >>> applications, servers and devices whether physical, virtual or in the cloud. > >>> Deliver compliance at lower cost and gain new business insights. > >>> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > >>> _______________________________________________ > >>> xmlpipedb-developer mailing list > >>> xml...@li... > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> ------------------------------------------------------------------------------ > >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > >> Collect, index and harness all the fast moving IT data generated by your > >> applications, servers and devices whether physical, virtual or in the cloud. > >> Deliver compliance at lower cost and gain new business insights. > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> > >> ------------------------------------------------------------------------------ > >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > >> Collect, index and harness all the fast moving IT data generated by your > >> applications, servers and devices whether physical, virtual or in the cloud. > >> Deliver compliance at lower cost and gain new business insights. > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > ------------------------------------------------------------------------------ > > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > > Collect, index and harness all the fast moving IT data generated by your > > applications, servers and devices whether physical, virtual or in the cloud. > > Deliver compliance at lower cost and gain new business insights. > > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > <ATT00001..txt><ATT00002..txt> |
From: Richard B. <rbr...@gm...> - 2011-02-22 06:37:29
|
actually i had a typo (emailing from desktop system but testing on my laptop... typed correctly here but wrong in pgadmin) but the results make much more sense now. select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value like 'Rv%'; returns 4058 select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; returns 4058 -------------------------------------------------------------------------------------------------------------------------- Continuing forward - The testing report says that 4066 unique matches exist in XML but 6 of them were eliminated by Dr. D leaving the desired number at 4060. So now we are only 2 genes short with the query returning 4058... which is also (conveniently) the sum of the two separate queries of 'ordered locus' and 'ORF' respectively. But recall that Dr. D said that only 69 genes of the missing 75 were tagged 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. Adding that into missing genes puts us 3 short... Should I make the changes to the code and export a gdb so that analysis can be done or wait until we work this through further? Richard On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio <do...@lm...>wrote: > Hi Rich, > > The second form should have worked actually. What exactly was the error? > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: > > > hmm not taking parenthesis where I thought they should go... syntax error > > > > select count (*) from genenametype where type = ('ordered locus' or > 'ORF') and value like 'Rv%'; > > also tried > > select count (*) from genenametype where (type = 'ordered locus' or type > = 'ORF') and value like 'Rv%'; > > > > > > > > > > > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous <rbr...@gm...> > wrote: > > ah yes... i see it... > > > > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio <do...@lm...> > wrote: > > Watch your parentheses: "and" has greater precedence than "or" :) > > > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> wrote: > > > >> OK, so here are my query results from raw SQL: > >> > >> 1. using: like 'Rv%' > >> > >> select count (*) from genenametype where type = 'ordered locus' and > value like 'Rv%'; > >> returns 3988 > >> > >> select count (*) from genenametype where type = 'ORF' and value like > 'Rv%'; > >> returns 70 > >> > >> select count (*) from genenametype where type = 'ordered locus' or type > = 'ORF' and value like 'Rv%'; > >> returns 7011 > >> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' > >> > >> select count (*) from genenametype where type = 'ordered locus' and > value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 3988 > >> > >> select count (*) from genenametype where type = 'ordered locus' or type > = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 7011 > >> > >> select count (*) from genenametype where type = 'ORF' and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 70 > >> > >> Conclusions: > >> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 genes > were were looking for plus one more (maybe the count for missing genes is > off by 1?). > >> > >> 2. Combining the two types in a single query did not produce the results > that I expected (7011? - how did that happen????) so this is likely not our > solution... unless of course the query syntax isn't actually doing what I > think it is... > >> > >> 3. I would think the best course of action is to serialy run two > separate queries to capture all the required genes, then removing the one > unneeded gene if its truly not wanted. > >> > >> What do you think? > >> > >> Richard > >> > >> > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...> > wrote: > >> I don't recall the exact details of the missing 69, but if your query > successfully returns them in raw SQL, then this is worth a try. You can > integrate into the same query as long as the same columns are returned, > which is the case here AFAIK, so go ahead and extend the existing query. > >> > >> > >> John David N. Dionisio, PhD > >> Associate Professor, Computer Science > >> Loyola Marymount University > >> > >> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: > >> > >>> So here is the appropriate code snippet from > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> public > >>> > >>> TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) throws > SQLException, InvalidParameterException { > >>> > >>> // Build the base query; we only use "ordered locus" and we only want > >>> > >>> // IDs that begin with "Rv." > >>> PreparedStatement ps = > ConnectionManager.getRelationalDBConnection().prepareStatement( > >>> > >>> "SELECT value, type " + > >>> > >>> "FROM genenametype INNER JOIN entrytype_genetype " + > >>> > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + > >>> > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and > entrytype_gene_hjid = ?"); > >>> ResultSet result; > >>> > >>> > >>> > >>> for (Row row : primarySystemTableManager.getRows()) { > >>> ps.setInt(1, Integer.parseInt(row.getValue( > >>> > >>> "UID"))); > >>> result = ps.executeQuery(); > >>> > >>> > >>> > >>> // We actually want to keep the case where multiple ordered locus > >>> > >>> // names appear. > >>> > >>> while (result.next()) { > >>> > >>> // We want this name to appear in the OrderedLocusNames > >>> > >>> // system table. > >>> > >>> for (String id : result.getString("value").split("/")) { > >>> tableManager.submit( > >>> > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, { > "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", > GenMAPPBuilderUtilities.getSystemsDateString(version) }, { "UID", > row.getValue("UID") } }); > >>> } > >>> > >>> } > >>> > >>> } > >>> > >>> > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > >>> So now we want to build the base query which uses "ordered locus" and > "orf" and we only want IDs that begin with "Rv". > >>> > >>> I know there are more comprehensive ways to search for gene ID's by > matching gene ID prefix but "like Rv%" seemed to work thus far, we just need > to tell it to search for XML tag type orf in addition to ordered locus. > >>> > >>> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and > entrytype_gene_hjid = ? " > >>> > >>> Here is a stab at it.... This part of our class was right as the server > went down and my submission for week 6 assignment I can't seem to find. > >>> > >>> Is it possible to have two different types in the same query or should > we rewrite a separate query for the orf tag? > >>> > >>> Richard > >>> > >>> > >>> > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> > wrote: > >>> > >>> thanks and will do as directed. > >>> > >>> My previous, last paragraph comment - A way for programming code in > email holding its format in a mail message similarly to how you can post > code on forum pages? > >>> > >>> <code> > >>> blah > >>> blah > >>> blah > >>> </code> > >>> > >>> thanks! > >>> > >>> Richard > >>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio < > do...@lm...> wrote: > >>> > >>> Greetings, > >>> > >>> Actually, gmbuilder.properties is for the TallyEngine only. When > dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to > find those 69 IDs, it is the SpeciesProfile code, and *only* the > SpeciesProfile code, that needs to be changed. > >>> > >>> Your take on how gmbuilder.properties is used, however, is > understandable. It makes sense to assume that the TallyEngine code *and* > the ID export code are based on the same characterization of the needed IDs. > This replication is originally a historical artifact: SpeciesProfile was > done first, and then TallyEngine was done later by another student. > >>> > >>> However, there are other factors beyond history that sort of > necessitate this duplication of desired IDs: (skip the two bullets below if > you'd rather cut to the chase of the work to be done, and discuss design > issues later) > >>> > >>> - The actual XML import code is a black box: this is the "canned" JAXB > library actually in action, and not our code at all. Plus, the XML import > code really does not filter (nor should it), since the goal of the > XML->relational database step is to fully capture the XML data in the > relational database. So, XML count is necessarily separated from XML > import. > >>> > >>> - The notion of a declarative mechanism for extracting IDs from the > relational database (which is what gmbuilder.properties/TallyEngine uses) is > interesting, but at the same time there is value in the arbitrary > computation that can be done with Java (case in point: export two versions > of an ID, with and without periods). This is not to say that it is > impossible to do this declaratively, but let's just say that the procedural > approach exists here and now, and a declarative approach will need more > thought. > >>> > >>> These, and other factors, are good thoughts to hold onto and would be > worthy of a good meeting discussion sometime, but bottom line for now: > modifying the export behavior is a matter of editing the *SpeciesProfile* > Java code, and not the gmbuilder.properties file. Turn your attention to > that code. > >>> > >>> Now, as to annotating your code...I'd just put in code comments :) Or > did you mean something else by tagging code in e-mail? > >>> > >>> John David N. Dionisio, PhD > >>> Associate Professor, Computer Science > >>> Loyola Marymount University > >>> > >>> > >>> > >>> > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: > >>> > >>> > also, how do I tag code in email so it holds its formatting? I tried > a few suggestions I found on the web but they aren't holding formatting or > i'm just doing it wrong ;-D > >>> > > >>> > Richard > >>> > > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> > wrote: > >>> > OK, have some updates and some suggestions: > >>> > > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing > report. We verified that XML match does indeed find 4066 unique matches - 75 > of which are not in the gdb and need to be. > >>> > > >>> > Dr. Dahlquist informed me that she was the one who completed the gene > db testing report, not a previous student of BIO367 and had already verified > which genes were missing and where they were to be found. I had (mistakenly) > assumed that since a student had performed the gene database testing I had > to redo all of the verification. > >>> > > >>> > So that said, of the 75 genes missing - 69 need to be included and 6 > excluded. > >>> > Per the gene db testing report: "69 of them have an "a", "b", or "d" > suffix. They are all found in the ORF tag and need to be included in the > gdb." > >>> > > >>> > To solve this we need to add additional search criteria into the M. > tuberculosis section in gmbuilder.properties below: > >>> > # Mycobacterium tuberculosis > >>> > > >>> > mycobacteriumtuberculosis_level_amount= > >>> > > >>> > 1 > >>> > > >>> > mycobacteriumtuberculosis_element_level0= > >>> > > >>> > uniprot/entry/gene/name&type&ordered locus > >>> > > >>> > mycobacteriumtuberculosis_query_level0= > >>> > > >>> > select count(*) from genenametype where type = 'ordered locus' and > value like 'Rv%'; > >>> > > >>> > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > Ordered Locus > >>> > SOLUTIONS: > >>> > > >>> > 1. So am i correct in my understanding that the second line is the > query used by TallyEngine to read the XML file? If so then this is the issue > we need to table for the moment until we get the gbd verified and > re-released. We will revisit this to discover why it is not only reporting > incorrectly but also why its added a second row of Ordered Locus on the > TallyEngine results page. > >>> > > >>> > 2. The third line is the SQL query used by postgres during the export > from XML to gdb. To find and get the ORF tagged genes could we not add the > following lines and change the count in the first line: > >>> > > >>> > > >>> > # Mycobacterium tuberculosis > >>> > > >>> > mycobacteriumtuberculosis_level_amount=2 > >>> > > >>> > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > >>> > > >>> > > >>> > mycobacteriumtuberculosis_query_level0= > >>> > > >>> > select count(*) from genenametype where type = 'ordered locus'; > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from > genenametype where type = 'orf'; > >>> > > >>> > > >>> > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > Ordered Locus > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > >>> > > >>> > > ---------------------------------------------------------------------------------------------------------------------------- > >>> > > >>> > Of course these queries would have be manually verified prior to > making these changes but this seems like we are moving in the right > direction. > >>> > > >>> > Richard > >>> > > >>> > > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> > wrote: > >>> > Just got done reading previous email and understand the change in > priority. > >>> > > >>> > Will work on the missing ID's for now and shelve the the TalleyEngine > issue for the moment. > >>> > > >>> > Also great about a more formalized weekly meeting. I was going to > suggest it myself as it has been slow going so far as maybe i'm a bit too > independent in this independent study class =D > >>> > > >>> > Will dig further into the missing ID's later tonight and during day > tomorrow and report back. > >>> > > >>> > Richard > >>> > > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio < > do...@lm...> wrote: > >>> > Hi Rich, > >>> > > >>> > No problem. The pertinent line you're referring to, for XML, is > this, right above the line you copied: > >>> > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > >>> > The slash-separated section is the "path" of XML tags leading to the > element of interest; then, after the ampersand, is a name/value pair for the > desired attribute to count. Note that there is no hint of a *content*-based > filter (nor is there the capability for one, as far as I can tell in the > code). By "content," I mean that we can't specify filters based on what's > *between* the tags. We can only go as far as filter by attribute value, > e.g., type="ordered locus". > >>> > > >>> > But anyway, as mentioned in the earlier e-mail, let's have the > missing IDs in the .gdb take precedence for now. Please take a look at the > tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how > the ID output can be customized, then let me know if you have any questions > or need to confirm anything. > >>> > > >>> > John David N. Dionisio, PhD > >>> > Associate Professor, Computer Science > >>> > Loyola Marymount University > >>> > > >>> > > >>> > > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: > >>> > > >>> > > Sorry been slammed with a programming assignment that kept needing > continued iteration and it has been all consuming until last night. But I > did get a chance to work with your comments and review the code again with a > different mind set. > >>> > > > >>> > > Yes, I examined the gmbuilder.properties file ( the query is also > in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a > previous email ) but I don't think I see what you mean regarding the XML > count. > >>> > > > >>> > > I understood that: mycobacteriumtuberculosis_query_level0=select > count(*) from genenametype where type = 'ordered locus' and value like > 'Rv%'; was the db query but don't see which is the XML count... or do they > share the same query and you are saying that XML count doesn't recognize and > therefore cannot use the 'Rv%' parameter? > >>> > > > >>> > > Richard > >>> > > > >>> > > > >>> > > > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < > do...@lm...> wrote: > >>> > > Hi Rich, > >>> > > > >>> > > Sorry for the delay. Had some distractions coming into the > weekend. > >>> > > > >>> > > You've looked at the code; have you looked at gmbuilder.properties? > (I may have mentioned it a few e-mails ago, just as you were starting to > dig into this) > >>> > > > >>> > > On the copy I have, the M. tuberculosis block looks like this > (indentation is mine to set it apart): > >>> > > > >>> > > # Mycobacterium tuberculosis > >>> > > mycobacteriumtuberculosis_level_amount=1 > >>> > > > >>> > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > > >>> > > mycobacteriumtuberculosis_query_level0=select count(*) from > genenametype where type = 'ordered locus' and value like 'Rv%'; > >>> > > > >>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > >>> > > > >>> > > There, I think, is the rub. Notice that the XML count does not > filter on RV%. The SQL query does. > >>> > > > >>> > > Unfortunately, I don't think the TallyEngine can include selective > filtering in the XML counts. If the need to do selective filtering on XML > is necessary, then I think we're looking at a new functionality for you to > implement (or, if this throws things off too much, this may have to be noted > somewhere, that the XML vs. database counts may be off because the database > count is doing some text-based filtering but the XML count does not). > >>> > > > >>> > > What does xmlpipedb-match say? That will at least tell you whether > the 'RV%' count is indeed correct. > >>> > > > >>> > > John David N. Dionisio, PhD > >>> > > Associate Professor, Computer Science > >>> > > Loyola Marymount University > >>> > > > >>> > > > >>> > > > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: > >>> > > > >>> > > > OK here is what I was able to put together from the past few > hours of code review: > >>> > > > > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> > > > -reveals that after the 2 System table modifications are made > adding species name and link, a PreparedStatement is instantiated which > builds and calls the base query. > >>> > > > > >>> > > > -The base query called is: ("SELECT value, type " + "FROM > genenametype INNER JOIN entrytype_genetype " + > "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type > = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") > >>> > > > > >>> > > > -So its looking in 'ordered locus' table/column for any tuple > that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? > . > >>> > > > The 'like' comparator and % usage are clear with the 'type' > entrytype_gene_hjid = ? > >>> > > > > >>> > > > -To me it seems the query makes sense so the problem is likely > elsewhere. > >>> > > > > >>> > > > GenMappBuilder.java: > >>> > > > -I found method doTallies() at code line 895 which: > >>> > > > Instantiates a Configuration called hibernateConfiguration and > assigns to it the current hibernate configuration > >>> > > > Validates database settings by analyzing hibernateConfiguration > >>> > > > Instantiates a CriterionList for uniprot and assigns to it > TallyType.UNIPROT > >>> > > > Instantiates a CriterionList for go and assigns to it > TallyType.GO > >>> > > > Determines if both xml files exist > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on > both xml files and their respective CriterionList > >>> > > > Results are then formatted for display in a table. > >>> > > > > >>> > > > -So enum TallyType which means that they are the only valid > datatypes which TallyEngine accepts... go to know ... > >>> > > > > >>> > > > -Based on the screen shot of Tally Engine it would seem that both > getTallyResultsXML() and getTallyResultsDatabase() are incorrectly > returning. Likely due to both using an incorrect query (as we previously > supposed). But where are the queries?... the more I dig the more I think > they are in the criterial all the work is done against. > >>> > > > > >>> > > > continuing the review: > >>> > > > getTallyResultsXML() calls Tally Engine instance method > getXmlFileCounts(xmlFile) > >>> > > > getTallyResultsDatabase() calls Tally Engine instance method > getDbcounts(new QueryEngine(hibernateConfiguration) > >>> > > > Both of these instanced methods originate from > TallyEngine.java... > >>> > > > > >>> > > > TallyEngine.java: > >>> > > > > >>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a > digester then processes against criteria... but this quickly becomes > confusing and is hard to follow > >>> > > > > >>> > > > getDbcounts() then starts a db session and executes a query but > then I also get a bit lost with my limited db knowledge. > >>> > > > > >>> > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > >>> > > > > >>> > > > OVERALL I think I'm getting closer to the issues but I still feel > as if I'm missing some understanding to proceed further. Can you pass along > some of that Dondi insight and steer me in the right direction? =D > >>> > > > > >>> > > > -DB Tally - Not having taken databases yet certainly is limiting > my ability determine where the "criteria" are being set and how they are > followed during session activities. Also is the query we have been looking > for this whole time in the criteria or someplace else? > >>> > > > > >>> > > > -XML Tally - again is the query contained within the criteria > that digestXmlFile() uses to parse? > >>> > > > > >>> > > > Richard > >>> > > > > >>> > > > > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio < > do...@lm...> wrote: > >>> > > > Right, schema issues are unlikely. Most count discrepancies like > this that I've seen have boiled down to forming the right query. Then, > knowing the right query (in both XML and SQL), it's a matter of making sure > that TallyEngine asks that same query. > >>> > > > > >>> > > > John David N. Dionisio, PhD > >>> > > > Associate Professor, Computer Science > >>> > > > Loyola Marymount University > >>> > > > > >>> > > > > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > >>> > > > > >>> > > > > OK, so based on your approach: > >>> > > > > > >>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match > and sql queries needed for the respective results as you requested. > >>> > > > > > >>> > > > > I was also thinking I may need to review the schema from xml > into postgres but the issue isn't likely a schema error. The error most > likely lies in how xmlpipedbutils queries the data from xml source and > writes to the tables what it returns? > >>> > > > > > >>> > > > > 2. I'll review the code: trace the entrance of tally engine in > the gmbuilder code then follow it through the xmlpipedbutils. > >>> > > > > > >>> > > > > Richard > >>> > > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio < > do...@lm...> wrote: > >>> > > > > Just wanted to confirm (since I wasn't sure in the first > e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils > in SourceForge's Subversion repo. > >>> > > > > > >>> > > > > John David N. Dionisio, PhD > >>> > > > > Associate Professor, Computer Science > >>> > > > > Loyola Marymount University > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: > >>> > > > > > >>> > > > > > Hi Dondi, > >>> > > > > > > >>> > > > > > So I'm at the point in working with M tuberculosis that I was > able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. > >>> > > > > > > >>> > > > > > gmb2b60 Results > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > Now the proverbial question - What next to solve the Ordered > Locus import/count issue? > >>> > > > > > > >>> > > > > > ********************************************** > >>> > > > > > Here is my thought process: > >>> > > > > > > >>> > > > > > Step 1: How does the import process work at the high level? > (obviously correct me if I'm wrong) > >>> > > > > > > >>> > > > > > I believe that basically as each XML tag is read, it is > placed in the proper Postgres table(s) based on some criteria. There is also > likely some sort of check that each individual tag is in valid XML format > unless we don't care at this stage (care at export) or maybe the parser just > skips over and goes on to the next . > >>> > > > > > > >>> > > > > > Step 2: What could be the problem? > >>> > > > > > > >>> > > > > > Either - > >>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? > >>> > > > > > b. Decision criteria of which table they should be added to? > >>> > > > > > > >>> > > > > > ********************************************** > >>> > > > > > > >>> > > > > > I read on the sourceforge wiki: > >>> > > > > > > >>> > > > > > XMLPipeDB has a modular architecture with three components > that may be used separately or together. XSD-to-DB reads an XSD (XML Schema > Definition) and automatically generates an SQL schema, Java classes, and > Hibernate mappings. XMLPipeDB Utilities provides functionality for > configuring the database, importing data, and performing queries. GenMAPP > Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible > Gene Databases based on data from UniProt and Gene Ontology (GO). > >>> > > > > > > >>> > > > > > So I should probably start with the XMLPipeDB Utilities which > are where? I don't see any in the basic distribution or are they not > standalone and called from the command line? > >>> > > > > > > >>> > > > > > Thanks! > >>> > > > > > > >>> > > > > > Richard > >>> > > > > > >>> > > > > > >>> > > > > <ATT00001..txt><ATT00002..txt> > >>> > > > > >>> > > > > >>> > > > > ------------------------------------------------------------------------------ > >>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel > Studio XE: > >>> > > > Pinpoint memory and threading errors before they happen. > >>> > > > Find and fix more than 250 security defects in the development > cycle. > >>> > > > Locate bottlenecks in serial and parallel code that limit > performance. > >>> > > > http://p.sf.net/sfu/intel-dev2devfeb > >>> > > > _______________________________________________ > >>> > > > xmlpipedb-developer mailing list > >>> > > > xml...@li... > >>> > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>> > > > > >>> > > > <ATT00001..txt><ATT00002..txt> > >>> > > > >>> > > > >>> > > > ------------------------------------------------------------------------------ > >>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel > Studio XE: > >>> > > Pinpoint memory and threading errors before they happen. > >>> > > Find and fix more than 250 security defects in the development > cycle. > >>> > > Locate bottlenecks in serial and parallel code that limit > performance. > >>> > > http://p.sf.net/sfu/intel-dev2devfeb > >>> > > _______________________________________________ > >>> > > xmlpipedb-developer mailing list > >>> > > xml...@li... > >>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>> > > > >>> > > <ATT00001..txt><ATT00002..txt> > >>> > > >>> > > >>> > > ------------------------------------------------------------------------------ > >>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio > XE: > >>> > Pinpoint memory and threading errors before they happen. > >>> > Find and fix more than 250 security defects in the development cycle. > >>> > Locate bottlenecks in serial and parallel code that limit > performance. > >>> > http://p.sf.net/sfu/intel-dev2devfeb > >>> > _______________________________________________ > >>> > xmlpipedb-developer mailing list > >>> > xml...@li... > >>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>> > > >>> > > >>> > > >>> > <ATT00001..txt><ATT00002..txt> > >>> > >>> > >>> > ------------------------------------------------------------------------------ > >>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio > XE: > >>> Pinpoint memory and threading errors before they happen. > >>> Find and fix more than 250 security defects in the development cycle. > >>> Locate bottlenecks in serial and parallel code that limit performance. > >>> http://p.sf.net/sfu/intel-dev2devfeb > >>> _______________________________________________ > >>> xmlpipedb-developer mailing list > >>> xml...@li... > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>> > >>> > >>> > >>> > ------------------------------------------------------------------------------ > >>> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > >>> Collect, index and harness all the fast moving IT data generated by > your > >>> applications, servers and devices whether physical, virtual or in the > cloud. > >>> Deliver compliance at lower cost and gain new business insights. > >>> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > >>> _______________________________________________ > >>> xmlpipedb-developer mailing list > >>> xml...@li... > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> > ------------------------------------------------------------------------------ > >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > >> Collect, index and harness all the fast moving IT data generated by your > >> applications, servers and devices whether physical, virtual or in the > cloud. > >> Deliver compliance at lower cost and gain new business insights. > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> > >> > ------------------------------------------------------------------------------ > >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > >> Collect, index and harness all the fast moving IT data generated by your > >> applications, servers and devices whether physical, virtual or in the > cloud. > >> Deliver compliance at lower cost and gain new business insights. > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > ------------------------------------------------------------------------------ > > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > > Collect, index and harness all the fast moving IT data generated by your > > applications, servers and devices whether physical, virtual or in the > cloud. > > Deliver compliance at lower cost and gain new business insights. > > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the > cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > |
From: John D. N. D. <do...@lm...> - 2011-02-22 06:04:59
|
Hi Rich, The second form should have worked actually. What exactly was the error? John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: > hmm not taking parenthesis where I thought they should go... syntax error > > select count (*) from genenametype where type = ('ordered locus' or 'ORF') and value like 'Rv%'; > also tried > select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value like 'Rv%'; > > > > > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous <rbr...@gm...> wrote: > ah yes... i see it... > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio <do...@lm...> wrote: > Watch your parentheses: "and" has greater precedence than "or" :) > > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> wrote: > >> OK, so here are my query results from raw SQL: >> >> 1. using: like 'Rv%' >> >> select count (*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >> returns 3988 >> >> select count (*) from genenametype where type = 'ORF' and value like 'Rv%'; >> returns 70 >> >> select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value like 'Rv%'; >> returns 7011 >> >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' >> >> select count (*) from genenametype where type = 'ordered locus' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> returns 3988 >> >> select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> returns 7011 >> >> select count (*) from genenametype where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> returns 70 >> >> Conclusions: >> >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 genes were were looking for plus one more (maybe the count for missing genes is off by 1?). >> >> 2. Combining the two types in a single query did not produce the results that I expected (7011? - how did that happen????) so this is likely not our solution... unless of course the query syntax isn't actually doing what I think it is... >> >> 3. I would think the best course of action is to serialy run two separate queries to capture all the required genes, then removing the one unneeded gene if its truly not wanted. >> >> What do you think? >> >> Richard >> >> >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...> wrote: >> I don't recall the exact details of the missing 69, but if your query successfully returns them in raw SQL, then this is worth a try. You can integrate into the same query as long as the same columns are returned, which is the case here AFAIK, so go ahead and extend the existing query. >> >> >> John David N. Dionisio, PhD >> Associate Professor, Computer Science >> Loyola Marymount University >> >> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: >> >>> So here is the appropriate code snippet from MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>> public >>> >>> TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) throws SQLException, InvalidParameterException { >>> >>> // Build the base query; we only use "ordered locus" and we only want >>> >>> // IDs that begin with "Rv." >>> PreparedStatement ps = ConnectionManager.getRelationalDBConnection().prepareStatement( >>> >>> "SELECT value, type " + >>> >>> "FROM genenametype INNER JOIN entrytype_genetype " + >>> >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >>> >>> "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?"); >>> ResultSet result; >>> >>> >>> >>> for (Row row : primarySystemTableManager.getRows()) { >>> ps.setInt(1, Integer.parseInt(row.getValue( >>> >>> "UID"))); >>> result = ps.executeQuery(); >>> >>> >>> >>> // We actually want to keep the case where multiple ordered locus >>> >>> // names appear. >>> >>> while (result.next()) { >>> >>> // We want this name to appear in the OrderedLocusNames >>> >>> // system table. >>> >>> for (String id : result.getString("value").split("/")) { >>> tableManager.submit( >>> >>> "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) }, { "UID", row.getValue("UID") } }); >>> } >>> >>> } >>> >>> } >>> >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >>> So now we want to build the base query which uses "ordered locus" and "orf" and we only want IDs that begin with "Rv". >>> >>> I know there are more comprehensive ways to search for gene ID's by matching gene ID prefix but "like Rv%" seemed to work thus far, we just need to tell it to search for XML tag type orf in addition to ordered locus. >>> >>> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and entrytype_gene_hjid = ? " >>> >>> Here is a stab at it.... This part of our class was right as the server went down and my submission for week 6 assignment I can't seem to find. >>> >>> Is it possible to have two different types in the same query or should we rewrite a separate query for the orf tag? >>> >>> Richard >>> >>> >>> >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote: >>> >>> thanks and will do as directed. >>> >>> My previous, last paragraph comment - A way for programming code in email holding its format in a mail message similarly to how you can post code on forum pages? >>> >>> <code> >>> blah >>> blah >>> blah >>> </code> >>> >>> thanks! >>> >>> Richard >>> >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <do...@lm...> wrote: >>> >>> Greetings, >>> >>> Actually, gmbuilder.properties is for the TallyEngine only. When dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to find those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile code, that needs to be changed. >>> >>> Your take on how gmbuilder.properties is used, however, is understandable. It makes sense to assume that the TallyEngine code *and* the ID export code are based on the same characterization of the needed IDs. This replication is originally a historical artifact: SpeciesProfile was done first, and then TallyEngine was done later by another student. >>> >>> However, there are other factors beyond history that sort of necessitate this duplication of desired IDs: (skip the two bullets below if you'd rather cut to the chase of the work to be done, and discuss design issues later) >>> >>> - The actual XML import code is a black box: this is the "canned" JAXB library actually in action, and not our code at all. Plus, the XML import code really does not filter (nor should it), since the goal of the XML->relational database step is to fully capture the XML data in the relational database. So, XML count is necessarily separated from XML import. >>> >>> - The notion of a declarative mechanism for extracting IDs from the relational database (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at the same time there is value in the arbitrary computation that can be done with Java (case in point: export two versions of an ID, with and without periods). This is not to say that it is impossible to do this declaratively, but let's just say that the procedural approach exists here and now, and a declarative approach will need more thought. >>> >>> These, and other factors, are good thoughts to hold onto and would be worthy of a good meeting discussion sometime, but bottom line for now: modifying the export behavior is a matter of editing the *SpeciesProfile* Java code, and not the gmbuilder.properties file. Turn your attention to that code. >>> >>> Now, as to annotating your code...I'd just put in code comments :) Or did you mean something else by tagging code in e-mail? >>> >>> John David N. Dionisio, PhD >>> Associate Professor, Computer Science >>> Loyola Marymount University >>> >>> >>> >>> >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >>> >>> > also, how do I tag code in email so it holds its formatting? I tried a few suggestions I found on the web but they aren't holding formatting or i'm just doing it wrong ;-D >>> > >>> > Richard >>> > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote: >>> > OK, have some updates and some suggestions: >>> > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing report. We verified that XML match does indeed find 4066 unique matches - 75 of which are not in the gdb and need to be. >>> > >>> > Dr. Dahlquist informed me that she was the one who completed the gene db testing report, not a previous student of BIO367 and had already verified which genes were missing and where they were to be found. I had (mistakenly) assumed that since a student had performed the gene database testing I had to redo all of the verification. >>> > >>> > So that said, of the 75 genes missing - 69 need to be included and 6 excluded. >>> > Per the gene db testing report: "69 of them have an "a", "b", or "d" suffix. They are all found in the ORF tag and need to be included in the gdb." >>> > >>> > To solve this we need to add additional search criteria into the M. tuberculosis section in gmbuilder.properties below: >>> > # Mycobacterium tuberculosis >>> > >>> > mycobacteriumtuberculosis_level_amount= >>> > >>> > 1 >>> > >>> > mycobacteriumtuberculosis_element_level0= >>> > >>> > uniprot/entry/gene/name&type&ordered locus >>> > >>> > mycobacteriumtuberculosis_query_level0= >>> > >>> > select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>> > >>> > Ordered Locus >>> > SOLUTIONS: >>> > >>> > 1. So am i correct in my understanding that the second line is the query used by TallyEngine to read the XML file? If so then this is the issue we need to table for the moment until we get the gbd verified and re-released. We will revisit this to discover why it is not only reporting incorrectly but also why its added a second row of Ordered Locus on the TallyEngine results page. >>> > >>> > 2. The third line is the SQL query used by postgres during the export from XML to gdb. To find and get the ORF tagged genes could we not add the following lines and change the count in the first line: >>> > >>> > >>> > # Mycobacterium tuberculosis >>> > >>> > mycobacteriumtuberculosis_level_amount=2 >>> > >>> > >>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus >>> > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >>> > >>> > >>> > mycobacteriumtuberculosis_query_level0= >>> > >>> > select count(*) from genenametype where type = 'ordered locus'; >>> > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf'; >>> > >>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>> > >>> > Ordered Locus >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >>> > >>> > ---------------------------------------------------------------------------------------------------------------------------- >>> > >>> > Of course these queries would have be manually verified prior to making these changes but this seems like we are moving in the right direction. >>> > >>> > Richard >>> > >>> > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: >>> > Just got done reading previous email and understand the change in priority. >>> > >>> > Will work on the missing ID's for now and shelve the the TalleyEngine issue for the moment. >>> > >>> > Also great about a more formalized weekly meeting. I was going to suggest it myself as it has been slow going so far as maybe i'm a bit too independent in this independent study class =D >>> > >>> > Will dig further into the missing ID's later tonight and during day tomorrow and report back. >>> > >>> > Richard >>> > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote: >>> > Hi Rich, >>> > >>> > No problem. The pertinent line you're referring to, for XML, is this, right above the line you copied: >>> > >>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus >>> > >>> > The slash-separated section is the "path" of XML tags leading to the element of interest; then, after the ampersand, is a name/value pair for the desired attribute to count. Note that there is no hint of a *content*-based filter (nor is there the capability for one, as far as I can tell in the code). By "content," I mean that we can't specify filters based on what's *between* the tags. We can only go as far as filter by attribute value, e.g., type="ordered locus". >>> > >>> > But anyway, as mentioned in the earlier e-mail, let's have the missing IDs in the .gdb take precedence for now. Please take a look at the tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how the ID output can be customized, then let me know if you have any questions or need to confirm anything. >>> > >>> > John David N. Dionisio, PhD >>> > Associate Professor, Computer Science >>> > Loyola Marymount University >>> > >>> > >>> > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >>> > >>> > > Sorry been slammed with a programming assignment that kept needing continued iteration and it has been all consuming until last night. But I did get a chance to work with your comments and review the code again with a different mind set. >>> > > >>> > > Yes, I examined the gmbuilder.properties file ( the query is also in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a previous email ) but I don't think I see what you mean regarding the XML count. >>> > > >>> > > I understood that: mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; was the db query but don't see which is the XML count... or do they share the same query and you are saying that XML count doesn't recognize and therefore cannot use the 'Rv%' parameter? >>> > > >>> > > Richard >>> > > >>> > > >>> > > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> wrote: >>> > > Hi Rich, >>> > > >>> > > Sorry for the delay. Had some distractions coming into the weekend. >>> > > >>> > > You've looked at the code; have you looked at gmbuilder.properties? (I may have mentioned it a few e-mails ago, just as you were starting to dig into this) >>> > > >>> > > On the copy I have, the M. tuberculosis block looks like this (indentation is mine to set it apart): >>> > > >>> > > # Mycobacterium tuberculosis >>> > > mycobacteriumtuberculosis_level_amount=1 >>> > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus >>> > > >>> > > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >>> > > >>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >>> > > >>> > > There, I think, is the rub. Notice that the XML count does not filter on RV%. The SQL query does. >>> > > >>> > > Unfortunately, I don't think the TallyEngine can include selective filtering in the XML counts. If the need to do selective filtering on XML is necessary, then I think we're looking at a new functionality for you to implement (or, if this throws things off too much, this may have to be noted somewhere, that the XML vs. database counts may be off because the database count is doing some text-based filtering but the XML count does not). >>> > > >>> > > What does xmlpipedb-match say? That will at least tell you whether the 'RV%' count is indeed correct. >>> > > >>> > > John David N. Dionisio, PhD >>> > > Associate Professor, Computer Science >>> > > Loyola Marymount University >>> > > >>> > > >>> > > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >>> > > >>> > > > OK here is what I was able to put together from the past few hours of code review: >>> > > > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>> > > > -reveals that after the 2 System table modifications are made adding species name and link, a PreparedStatement is instantiated which builds and calls the base query. >>> > > > >>> > > > -The base query called is: ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") >>> > > > >>> > > > -So its looking in 'ordered locus' table/column for any tuple that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . >>> > > > The 'like' comparator and % usage are clear with the 'type' entrytype_gene_hjid = ? >>> > > > >>> > > > -To me it seems the query makes sense so the problem is likely elsewhere. >>> > > > >>> > > > GenMappBuilder.java: >>> > > > -I found method doTallies() at code line 895 which: >>> > > > Instantiates a Configuration called hibernateConfiguration and assigns to it the current hibernate configuration >>> > > > Validates database settings by analyzing hibernateConfiguration >>> > > > Instantiates a CriterionList for uniprot and assigns to it TallyType.UNIPROT >>> > > > Instantiates a CriterionList for go and assigns to it TallyType.GO >>> > > > Determines if both xml files exist >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on both xml files and their respective CriterionList >>> > > > Results are then formatted for display in a table. >>> > > > >>> > > > -So enum TallyType which means that they are the only valid datatypes which TallyEngine accepts... go to know ... >>> > > > >>> > > > -Based on the screen shot of Tally Engine it would seem that both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly returning. Likely due to both using an incorrect query (as we previously supposed). But where are the queries?... the more I dig the more I think they are in the criterial all the work is done against. >>> > > > >>> > > > continuing the review: >>> > > > getTallyResultsXML() calls Tally Engine instance method getXmlFileCounts(xmlFile) >>> > > > getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new QueryEngine(hibernateConfiguration) >>> > > > Both of these instanced methods originate from TallyEngine.java... >>> > > > >>> > > > TallyEngine.java: >>> > > > >>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a digester then processes against criteria... but this quickly becomes confusing and is hard to follow >>> > > > >>> > > > getDbcounts() then starts a db session and executes a query but then I also get a bit lost with my limited db knowledge. >>> > > > >>> > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >>> > > > >>> > > > OVERALL I think I'm getting closer to the issues but I still feel as if I'm missing some understanding to proceed further. Can you pass along some of that Dondi insight and steer me in the right direction? =D >>> > > > >>> > > > -DB Tally - Not having taken databases yet certainly is limiting my ability determine where the "criteria" are being set and how they are followed during session activities. Also is the query we have been looking for this whole time in the criteria or someplace else? >>> > > > >>> > > > -XML Tally - again is the query contained within the criteria that digestXmlFile() uses to parse? >>> > > > >>> > > > Richard >>> > > > >>> > > > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> wrote: >>> > > > Right, schema issues are unlikely. Most count discrepancies like this that I've seen have boiled down to forming the right query. Then, knowing the right query (in both XML and SQL), it's a matter of making sure that TallyEngine asks that same query. >>> > > > >>> > > > John David N. Dionisio, PhD >>> > > > Associate Professor, Computer Science >>> > > > Loyola Marymount University >>> > > > >>> > > > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >>> > > > >>> > > > > OK, so based on your approach: >>> > > > > >>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and sql queries needed for the respective results as you requested. >>> > > > > >>> > > > > I was also thinking I may need to review the schema from xml into postgres but the issue isn't likely a schema error. The error most likely lies in how xmlpipedbutils queries the data from xml source and writes to the tables what it returns? >>> > > > > >>> > > > > 2. I'll review the code: trace the entrance of tally engine in the gmbuilder code then follow it through the xmlpipedbutils. >>> > > > > >>> > > > > Richard >>> > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <do...@lm...> wrote: >>> > > > > Just wanted to confirm (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in SourceForge's Subversion repo. >>> > > > > >>> > > > > John David N. Dionisio, PhD >>> > > > > Associate Professor, Computer Science >>> > > > > Loyola Marymount University >>> > > > > >>> > > > > >>> > > > > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >>> > > > > >>> > > > > > Hi Dondi, >>> > > > > > >>> > > > > > So I'm at the point in working with M tuberculosis that I was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. >>> > > > > > >>> > > > > > gmb2b60 Results >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > Now the proverbial question - What next to solve the Ordered Locus import/count issue? >>> > > > > > >>> > > > > > ********************************************** >>> > > > > > Here is my thought process: >>> > > > > > >>> > > > > > Step 1: How does the import process work at the high level? (obviously correct me if I'm wrong) >>> > > > > > >>> > > > > > I believe that basically as each XML tag is read, it is placed in the proper Postgres table(s) based on some criteria. There is also likely some sort of check that each individual tag is in valid XML format unless we don't care at this stage (care at export) or maybe the parser just skips over and goes on to the next . >>> > > > > > >>> > > > > > Step 2: What could be the problem? >>> > > > > > >>> > > > > > Either - >>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? >>> > > > > > b. Decision criteria of which table they should be added to? >>> > > > > > >>> > > > > > ********************************************** >>> > > > > > >>> > > > > > I read on the sourceforge wiki: >>> > > > > > >>> > > > > > XMLPipeDB has a modular architecture with three components that may be used separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and automatically generates an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB Utilities provides functionality for configuring the database, importing data, and performing queries. GenMAPP Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on data from UniProt and Gene Ontology (GO). >>> > > > > > >>> > > > > > So I should probably start with the XMLPipeDB Utilities which are where? I don't see any in the basic distribution or are they not standalone and called from the command line? >>> > > > > > >>> > > > > > Thanks! >>> > > > > > >>> > > > > > Richard >>> > > > > >>> > > > > >>> > > > > <ATT00001..txt><ATT00002..txt> >>> > > > >>> > > > >>> > > > ------------------------------------------------------------------------------ >>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >>> > > > Pinpoint memory and threading errors before they happen. >>> > > > Find and fix more than 250 security defects in the development cycle. >>> > > > Locate bottlenecks in serial and parallel code that limit performance. >>> > > > http://p.sf.net/sfu/intel-dev2devfeb >>> > > > _______________________________________________ >>> > > > xmlpipedb-developer mailing list >>> > > > xml...@li... >>> > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> > > > >>> > > > <ATT00001..txt><ATT00002..txt> >>> > > >>> > > >>> > > ------------------------------------------------------------------------------ >>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >>> > > Pinpoint memory and threading errors before they happen. >>> > > Find and fix more than 250 security defects in the development cycle. >>> > > Locate bottlenecks in serial and parallel code that limit performance. >>> > > http://p.sf.net/sfu/intel-dev2devfeb >>> > > _______________________________________________ >>> > > xmlpipedb-developer mailing list >>> > > xml...@li... >>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> > > >>> > > <ATT00001..txt><ATT00002..txt> >>> > >>> > >>> > ------------------------------------------------------------------------------ >>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >>> > Pinpoint memory and threading errors before they happen. >>> > Find and fix more than 250 security defects in the development cycle. >>> > Locate bottlenecks in serial and parallel code that limit performance. >>> > http://p.sf.net/sfu/intel-dev2devfeb >>> > _______________________________________________ >>> > xmlpipedb-developer mailing list >>> > xml...@li... >>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> > >>> > >>> > >>> > <ATT00001..txt><ATT00002..txt> >>> >>> >>> ------------------------------------------------------------------------------ >>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >>> Pinpoint memory and threading errors before they happen. >>> Find and fix more than 250 security defects in the development cycle. >>> Locate bottlenecks in serial and parallel code that limit performance. >>> http://p.sf.net/sfu/intel-dev2devfeb >>> _______________________________________________ >>> xmlpipedb-developer mailing list >>> xml...@li... >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >>> Collect, index and harness all the fast moving IT data generated by your >>> applications, servers and devices whether physical, virtual or in the cloud. >>> Deliver compliance at lower cost and gain new business insights. >>> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >>> _______________________________________________ >>> xmlpipedb-developer mailing list >>> xml...@li... >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> ------------------------------------------------------------------------------ >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >> Collect, index and harness all the fast moving IT data generated by your >> applications, servers and devices whether physical, virtual or in the cloud. >> Deliver compliance at lower cost and gain new business insights. >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> >> ------------------------------------------------------------------------------ >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >> Collect, index and harness all the fast moving IT data generated by your >> applications, servers and devices whether physical, virtual or in the cloud. >> Deliver compliance at lower cost and gain new business insights. >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > <ATT00001..txt><ATT00002..txt> |
From: Richard B. <rbr...@gm...> - 2011-02-22 06:01:09
|
hmm not taking parenthesis where I thought they should go... syntax error select count (*) from genenametype where type = ('ordered locus' or 'ORF') and value like 'Rv%'; also tried select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value like 'Rv%'; On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous <rbr...@gm...> wrote: > ah yes... i see it... > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio <do...@lm...>wrote: > >> Watch your parentheses: "and" has greater precedence than "or" :) >> >> >> John David N. Dionisio, PhD >> Associate Professor, Computer Science >> Loyola Marymount University >> >> >> On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> wrote: >> >> OK, so here are my query results from raw SQL: >> >> 1. using: like 'Rv%' >> >> select count (*) from genenametype where type = 'ordered locus' and value >> like 'Rv%'; >> returns 3988 >> >> select count (*) from genenametype where type = 'ORF' and value like >> 'Rv%'; >> returns 70 >> >> select count (*) from genenametype where type = 'ordered locus' or type = >> 'ORF' and value like 'Rv%'; >> returns 7011 >> >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' >> >> select count (*) from genenametype where type = 'ordered locus' and value >> ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> returns 3988 >> >> select count (*) from genenametype where type = 'ordered locus' or type = >> 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> returns 7011 >> >> select count (*) from genenametype where type = 'ORF' and value ~ >> '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >> returns 70 >> >> Conclusions: >> >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 genes >> were were looking for plus one more (maybe the count for missing genes is >> off by 1?). >> >> 2. Combining the two types in a single query did not produce the results >> that I expected (7011? - how did that happen????) so this is likely not our >> solution... unless of course the query syntax isn't actually doing what I >> think it is... >> >> 3. I would think the best course of action is to serialy run two separate >> queries to capture all the required genes, then removing the one unneeded >> gene if its truly not wanted. >> >> What do you think? >> >> Richard >> >> >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...>wrote: >> >>> I don't recall the exact details of the missing 69, but if your query >>> successfully returns them in raw SQL, then this is worth a try. You can >>> integrate into the same query as long as the same columns are returned, >>> which is the case here AFAIK, so go ahead and extend the existing query. >>> >>> >>> John David N. Dionisio, PhD >>> Associate Professor, Computer Science >>> Loyola Marymount University >>> >>> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: >>> >>> So here is the appropriate code snippet from >>> MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>> >>> * >>> >>> public >>> *TableManager getSystemTableManagerCustomizations(TableManager >>> tableManager, TableManager primarySystemTableManager, Date version) * >>> throws* SQLException, InvalidParameterException { >>> >>> // Build the base query; we only use "ordered locus" and we only want >>> >>> // IDs that begin with "*Rv*." >>> >>> PreparedStatement ps = ConnectionManager.*getRelationalDBConnection* >>> ().prepareStatement( >>> "SELECT value, type " + >>> >>> "FROM genenametype INNER JOIN entrytype_genetype " + >>> >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >>> >>> "WHERE type = 'ordered locus' and value like 'Rv%' and >>> entrytype_gene_hjid = ?"); >>> >>> ResultSet result; >>> >>> *for* (Row row : primarySystemTableManager.getRows()) { >>> >>> ps.setInt(1, Integer.*parseInt*(row.getValue( >>> "UID"))); >>> >>> result = ps.executeQuery(); >>> >>> // We actually want to keep the case where multiple ordered locus >>> >>> // names appear. >>> >>> *while* (result.next()) { >>> >>> // We want this name to appear in the OrderedLocusNames >>> >>> // system table. >>> >>> *for* (String id : result.getString("value").split("/")) { >>> >>> tableManager.submit( >>> "OrderedLocusNames", QueryType.*insert*, *new* String[][] { { "ID", id >>> }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", >>> GenMAPPBuilderUtilities.*getSystemsDateString*(version) }, { "UID", >>> row.getValue("UID") } }); >>> >>> } >>> >>> } >>> >>> } >>> >>> >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >>> So now we want to build the base query which uses "ordered locus" and >>> "orf" and we only want IDs that begin with "Rv". >>> >>> I know there are more comprehensive ways to search for gene ID's >>> by matching gene ID prefix but "like Rv%" seemed to work thus far, we just >>> need to tell it to search for XML tag type orf in addition to ordered locus. >>> >>> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and >>> entrytype_gene_hjid = ? " >>> >>> Here is a stab at it.... This part of our class was right as the server >>> went down and my submission for week 6 assignment I can't seem to find. >>> >>> Is it possible to have two different types in the same query or should we >>> rewrite a separate query for the orf tag? >>> >>> Richard >>> >>> >>> >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous < <rbr...@gm...> >>> rbr...@gm...> wrote: >>> >>> >>>> thanks and will do as directed. >>>> >>>> My previous, last paragraph comment - A way for programming code in >>>> email holding its format in a mail message similarly to how you can post >>>> code on forum pages? >>>> >>>> <code> >>>> blah >>>> blah >>>> blah >>>> </code> >>>> >>>> thanks! >>>> >>>> Richard >>>> >>>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <<do...@lm...> >>>> do...@lm...> wrote: >>>> >>>> >>>>> Greetings, >>>>> >>>>> Actually, gmbuilder.properties is for the TallyEngine only. When >>>>> dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to >>>>> find those 69 IDs, it is the SpeciesProfile code, and *only* the >>>>> SpeciesProfile code, that needs to be changed. >>>>> >>>>> Your take on how gmbuilder.properties is used, however, is >>>>> understandable. It makes sense to assume that the TallyEngine code *and* >>>>> the ID export code are based on the same characterization of the needed IDs. >>>>> This replication is originally a historical artifact: SpeciesProfile was >>>>> done first, and then TallyEngine was done later by another student. >>>>> >>>>> However, there are other factors beyond history that sort of >>>>> necessitate this duplication of desired IDs: (skip the two bullets below if >>>>> you'd rather cut to the chase of the work to be done, and discuss design >>>>> issues later) >>>>> >>>>> - The actual XML import code is a black box: this is the "canned" JAXB >>>>> library actually in action, and not our code at all. Plus, the XML import >>>>> code really does not filter (nor should it), since the goal of the >>>>> XML->relational database step is to fully capture the XML data in the >>>>> relational database. So, XML count is necessarily separated from XML >>>>> import. >>>>> >>>>> - The notion of a declarative mechanism for extracting IDs from the >>>>> relational database (which is what gmbuilder.properties/TallyEngine uses) is >>>>> interesting, but at the same time there is value in the arbitrary >>>>> computation that can be done with Java (case in point: export two versions >>>>> of an ID, with and without periods). This is not to say that it is >>>>> impossible to do this declaratively, but let's just say that the procedural >>>>> approach exists here and now, and a declarative approach will need more >>>>> thought. >>>>> >>>>> These, and other factors, are good thoughts to hold onto and would be >>>>> worthy of a good meeting discussion sometime, but bottom line for now: >>>>> modifying the export behavior is a matter of editing the *SpeciesProfile* >>>>> Java code, and not the gmbuilder.properties file. Turn your attention to >>>>> that code. >>>>> >>>>> Now, as to annotating your code...I'd just put in code comments :) Or >>>>> did you mean something else by tagging code in e-mail? >>>>> >>>>> John David N. Dionisio, PhD >>>>> Associate Professor, Computer Science >>>>> Loyola Marymount University >>>>> >>>>> >>>>> >>>>> >>>>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >>>>> >>>>> > also, how do I tag code in email so it holds its formatting? I tried >>>>> a few suggestions I found on the web but they aren't holding formatting or >>>>> i'm just doing it wrong ;-D >>>>> > >>>>> > Richard >>>>> > >>>>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <<rbr...@gm...> >>>>> rbr...@gm...> wrote: >>>>> > OK, have some updates and some suggestions: >>>>> > >>>>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing >>>>> report. We verified that XML match does indeed find 4066 unique matches - 75 >>>>> of which are not in the gdb and need to be. >>>>> > >>>>> > Dr. Dahlquist informed me that she was the one who completed the gene >>>>> db testing report, not a previous student of BIO367 and had already verified >>>>> which genes were missing and where they were to be found. I had (mistakenly) >>>>> assumed that since a student had performed the gene database testing I had >>>>> to redo all of the verification. >>>>> > >>>>> > So that said, of the 75 genes missing - 69 need to be included and 6 >>>>> excluded. >>>>> > Per the gene db testing report: "69 of them have an "a", "b", or "d" >>>>> suffix. They are all found in the ORF tag and need to be included in the >>>>> gdb." >>>>> > >>>>> > To solve this we need to add additional search criteria into the M. >>>>> tuberculosis section in gmbuilder.properties below: >>>>> > # Mycobacterium tuberculosis >>>>> > >>>>> > mycobacteriumtuberculosis_level_amount= >>>>> > >>>>> > 1 >>>>> > >>>>> > mycobacteriumtuberculosis_element_level0= >>>>> > >>>>> > uniprot/entry/gene/name&type&ordered locus >>>>> > >>>>> > mycobacteriumtuberculosis_query_level0= >>>>> > >>>>> > select count(*) from genenametype where type = 'ordered locus' and >>>>> value like 'Rv%'; >>>>> > >>>>> > mycobacteriumtuberculosis_table_name_level0= >>>>> > >>>>> > Ordered Locus >>>>> > SOLUTIONS: >>>>> > >>>>> > 1. So am i correct in my understanding that the second line is the >>>>> query used by TallyEngine to read the XML file? If so then this is the issue >>>>> we need to table for the moment until we get the gbd verified and >>>>> re-released. We will revisit this to discover why it is not only reporting >>>>> incorrectly but also why its added a second row of Ordered Locus on the >>>>> TallyEngine results page. >>>>> > >>>>> > 2. The third line is the SQL query used by postgres during the export >>>>> from XML to gdb. To find and get the ORF tagged genes could we not add the >>>>> following lines and change the count in the first line: >>>>> > >>>>> > >>>>> > # Mycobacterium tuberculosis >>>>> > >>>>> > mycobacteriumtuberculosis_level_amount=2 >>>>> > >>>>> > >>>>> > >>>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>>> locus >>>>> > >>>>> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >>>>> > >>>>> > >>>>> > mycobacteriumtuberculosis_query_level0= >>>>> > >>>>> > select count(*) from genenametype where type = 'ordered locus'; >>>>> > mycobacteriumtuberculosis_query_level1=select count(*) from >>>>> genenametype where type = 'orf'; >>>>> > >>>>> > >>>>> > mycobacteriumtuberculosis_table_name_level0= >>>>> > >>>>> > Ordered Locus >>>>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >>>>> > >>>>> > >>>>> ---------------------------------------------------------------------------------------------------------------------------- >>>>> > >>>>> > Of course these queries would have be manually verified prior to >>>>> making these changes but this seems like we are moving in the right >>>>> direction. >>>>> > >>>>> > Richard >>>>> > >>>>> > >>>>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <<rbr...@gm...> >>>>> rbr...@gm...> wrote: >>>>> > Just got done reading previous email and understand the change in >>>>> priority. >>>>> > >>>>> > Will work on the missing ID's for now and shelve the the TalleyEngine >>>>> issue for the moment. >>>>> > >>>>> > Also great about a more formalized weekly meeting. I was going to >>>>> suggest it myself as it has been slow going so far as maybe i'm a bit too >>>>> independent in this independent study class =D >>>>> > >>>>> > Will dig further into the missing ID's later tonight and during day >>>>> tomorrow and report back. >>>>> > >>>>> > Richard >>>>> > >>>>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <<do...@lm...> >>>>> do...@lm...> wrote: >>>>> > Hi Rich, >>>>> > >>>>> > No problem. The pertinent line you're referring to, for XML, is >>>>> this, right above the line you copied: >>>>> > >>>>> > >>>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>>> locus >>>>> > >>>>> > The slash-separated section is the "path" of XML tags leading to the >>>>> element of interest; then, after the ampersand, is a name/value pair for the >>>>> desired attribute to count. Note that there is no hint of a *content*-based >>>>> filter (nor is there the capability for one, as far as I can tell in the >>>>> code). By "content," I mean that we can't specify filters based on what's >>>>> *between* the tags. We can only go as far as filter by attribute value, >>>>> e.g., type="ordered locus". >>>>> > >>>>> > But anyway, as mentioned in the earlier e-mail, let's have the >>>>> missing IDs in the .gdb take precedence for now. Please take a look at the >>>>> tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how >>>>> the ID output can be customized, then let me know if you have any questions >>>>> or need to confirm anything. >>>>> > >>>>> > John David N. Dionisio, PhD >>>>> > Associate Professor, Computer Science >>>>> > Loyola Marymount University >>>>> > >>>>> > >>>>> > >>>>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >>>>> > >>>>> > > Sorry been slammed with a programming assignment that kept needing >>>>> continued iteration and it has been all consuming until last night. But I >>>>> did get a chance to work with your comments and review the code again with a >>>>> different mind set. >>>>> > > >>>>> > > Yes, I examined the gmbuilder.properties file ( the query is also >>>>> in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a >>>>> previous email ) but I don't think I see what you mean regarding the XML >>>>> count. >>>>> > > >>>>> > > I understood that: mycobacteriumtuberculosis_query_level0=select >>>>> count(*) from genenametype where type = 'ordered locus' and value like >>>>> 'Rv%'; was the db query but don't see which is the XML count... or do they >>>>> share the same query and you are saying that XML count doesn't recognize and >>>>> therefore cannot use the 'Rv%' parameter? >>>>> > > >>>>> > > Richard >>>>> > > >>>>> > > >>>>> > > >>>>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <<do...@lm...> >>>>> do...@lm...> wrote: >>>>> > > Hi Rich, >>>>> > > >>>>> > > Sorry for the delay. Had some distractions coming into the >>>>> weekend. >>>>> > > >>>>> > > You've looked at the code; have you looked at gmbuilder.properties? >>>>> (I may have mentioned it a few e-mails ago, just as you were starting to >>>>> dig into this) >>>>> > > >>>>> > > On the copy I have, the M. tuberculosis block looks like this >>>>> (indentation is mine to set it apart): >>>>> > > >>>>> > > # Mycobacterium tuberculosis >>>>> > > mycobacteriumtuberculosis_level_amount=1 >>>>> > > >>>>> > > >>>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>>> locus >>>>> > > >>>>> > > mycobacteriumtuberculosis_query_level0=select count(*) from >>>>> genenametype where type = 'ordered locus' and value like 'Rv%'; >>>>> > > >>>>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >>>>> > > >>>>> > > There, I think, is the rub. Notice that the XML count does not >>>>> filter on RV%. The SQL query does. >>>>> > > >>>>> > > Unfortunately, I don't think the TallyEngine can include selective >>>>> filtering in the XML counts. If the need to do selective filtering on XML >>>>> is necessary, then I think we're looking at a new functionality for you to >>>>> implement (or, if this throws things off too much, this may have to be noted >>>>> somewhere, that the XML vs. database counts may be off because the database >>>>> count is doing some text-based filtering but the XML count does not). >>>>> > > >>>>> > > What does xmlpipedb-match say? That will at least tell you whether >>>>> the 'RV%' count is indeed correct. >>>>> > > >>>>> > > John David N. Dionisio, PhD >>>>> > > Associate Professor, Computer Science >>>>> > > Loyola Marymount University >>>>> > > >>>>> > > >>>>> > > >>>>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >>>>> > > >>>>> > > > OK here is what I was able to put together from the past few >>>>> hours of code review: >>>>> > > > >>>>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>>>> > > > -reveals that after the 2 System table modifications are made >>>>> adding species name and link, a PreparedStatement is instantiated which >>>>> builds and calls the base query. >>>>> > > > >>>>> > > > -The base query called is: ("SELECT value, type " + "FROM >>>>> genenametype INNER JOIN entrytype_genetype " + >>>>> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type >>>>> = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") >>>>> > > > >>>>> > > > -So its looking in 'ordered locus' table/column for any tuple >>>>> that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? >>>>> . >>>>> > > > The 'like' comparator and % usage are clear with the 'type' >>>>> entrytype_gene_hjid = ? >>>>> > > > >>>>> > > > -To me it seems the query makes sense so the problem is likely >>>>> elsewhere. >>>>> > > > >>>>> > > > GenMappBuilder.java: >>>>> > > > -I found method doTallies() at code line 895 which: >>>>> > > > Instantiates a Configuration called hibernateConfiguration and >>>>> assigns to it the current hibernate configuration >>>>> > > > Validates database settings by analyzing hibernateConfiguration >>>>> > > > Instantiates a CriterionList for uniprot and assigns to it >>>>> TallyType.UNIPROT >>>>> > > > Instantiates a CriterionList for go and assigns to it >>>>> TallyType.GO >>>>> > > > Determines if both xml files exist >>>>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on >>>>> both xml files and their respective CriterionList >>>>> > > > Results are then formatted for display in a table. >>>>> > > > >>>>> > > > -So enum TallyType which means that they are the only valid >>>>> datatypes which TallyEngine accepts... go to know ... >>>>> > > > >>>>> > > > -Based on the screen shot of Tally Engine it would seem that both >>>>> getTallyResultsXML() and getTallyResultsDatabase() are incorrectly >>>>> returning. Likely due to both using an incorrect query (as we previously >>>>> supposed). But where are the queries?... the more I dig the more I think >>>>> they are in the criterial all the work is done against. >>>>> > > > >>>>> > > > continuing the review: >>>>> > > > getTallyResultsXML() calls Tally Engine instance method >>>>> getXmlFileCounts(xmlFile) >>>>> > > > getTallyResultsDatabase() calls Tally Engine instance method >>>>> getDbcounts(new QueryEngine(hibernateConfiguration) >>>>> > > > Both of these instanced methods originate from >>>>> TallyEngine.java... >>>>> > > > >>>>> > > > TallyEngine.java: >>>>> > > > >>>>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a >>>>> digester then processes against criteria... but this quickly becomes >>>>> confusing and is hard to follow >>>>> > > > >>>>> > > > getDbcounts() then starts a db session and executes a query but >>>>> then I also get a bit lost with my limited db knowledge. >>>>> > > > >>>>> > > > >>>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >>>>> > > > >>>>> > > > OVERALL I think I'm getting closer to the issues but I still feel >>>>> as if I'm missing some understanding to proceed further. Can you pass along >>>>> some of that Dondi insight and steer me in the right direction? =D >>>>> > > > >>>>> > > > -DB Tally - Not having taken databases yet certainly is limiting >>>>> my ability determine where the "criteria" are being set and how they are >>>>> followed during session activities. Also is the query we have been looking >>>>> for this whole time in the criteria or someplace else? >>>>> > > > >>>>> > > > -XML Tally - again is the query contained within the criteria >>>>> that digestXmlFile() uses to parse? >>>>> > > > >>>>> > > > Richard >>>>> > > > >>>>> > > > >>>>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <<do...@lm...> >>>>> do...@lm...> wrote: >>>>> > > > Right, schema issues are unlikely. Most count discrepancies like >>>>> this that I've seen have boiled down to forming the right query. Then, >>>>> knowing the right query (in both XML and SQL), it's a matter of making sure >>>>> that TallyEngine asks that same query. >>>>> > > > >>>>> > > > John David N. Dionisio, PhD >>>>> > > > Associate Professor, Computer Science >>>>> > > > Loyola Marymount University >>>>> > > > >>>>> > > > >>>>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >>>>> > > > >>>>> > > > > OK, so based on your approach: >>>>> > > > > >>>>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match >>>>> and sql queries needed for the respective results as you requested. >>>>> > > > > >>>>> > > > > I was also thinking I may need to review the schema from xml >>>>> into postgres but the issue isn't likely a schema error. The error most >>>>> likely lies in how xmlpipedbutils queries the data from xml source and >>>>> writes to the tables what it returns? >>>>> > > > > >>>>> > > > > 2. I'll review the code: trace the entrance of tally engine in >>>>> the gmbuilder code then follow it through the xmlpipedbutils. >>>>> > > > > >>>>> > > > > Richard >>>>> > > > > >>>>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <<do...@lm...> >>>>> do...@lm...> wrote: >>>>> > > > > Just wanted to confirm (since I wasn't sure in the first >>>>> e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils >>>>> in SourceForge's Subversion repo. >>>>> > > > > >>>>> > > > > John David N. Dionisio, PhD >>>>> > > > > Associate Professor, Computer Science >>>>> > > > > Loyola Marymount University >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >>>>> > > > > >>>>> > > > > > Hi Dondi, >>>>> > > > > > >>>>> > > > > > So I'm at the point in working with M tuberculosis that I was >>>>> able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. >>>>> > > > > > >>>>> > > > > > gmb2b60 Results >>>>> > > > > > >>>>> > > > > > >>>>> > > > > > >>>>> > > > > > Now the proverbial question - What next to solve the Ordered >>>>> Locus import/count issue? >>>>> > > > > > >>>>> > > > > > ********************************************** >>>>> > > > > > Here is my thought process: >>>>> > > > > > >>>>> > > > > > Step 1: How does the import process work at the high level? >>>>> (obviously correct me if I'm wrong) >>>>> > > > > > >>>>> > > > > > I believe that basically as each XML tag is read, it is >>>>> placed in the proper Postgres table(s) based on some criteria. There is also >>>>> likely some sort of check that each individual tag is in valid XML format >>>>> unless we don't care at this stage (care at export) or maybe the parser just >>>>> skips over and goes on to the next . >>>>> > > > > > >>>>> > > > > > Step 2: What could be the problem? >>>>> > > > > > >>>>> > > > > > Either - >>>>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? >>>>> > > > > > b. Decision criteria of which table they should be added to? >>>>> > > > > > >>>>> > > > > > ********************************************** >>>>> > > > > > >>>>> > > > > > I read on the sourceforge wiki: >>>>> > > > > > >>>>> > > > > > XMLPipeDB has a modular architecture with three components >>>>> that may be used separately or together. XSD-to-DB reads an XSD (XML Schema >>>>> Definition) and automatically generates an SQL schema, Java classes, and >>>>> Hibernate mappings. XMLPipeDB Utilities provides functionality for >>>>> configuring the database, importing data, and performing queries. GenMAPP >>>>> Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible >>>>> Gene Databases based on data from UniProt and Gene Ontology (GO). >>>>> > > > > > >>>>> > > > > > So I should probably start with the XMLPipeDB Utilities which >>>>> are where? I don't see any in the basic distribution or are they not >>>>> standalone and called from the command line? >>>>> > > > > > >>>>> > > > > > Thanks! >>>>> > > > > > >>>>> > > > > > Richard >>>>> > > > > >>>>> > > > > >>>>> > > > > <ATT00001..txt><ATT00002..txt> >>>>> > > > >>>>> > > > >>>>> > > > >>>>> ------------------------------------------------------------------------------ >>>>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel >>>>> Studio XE: >>>>> > > > Pinpoint memory and threading errors before they happen. >>>>> > > > Find and fix more than 250 security defects in the development >>>>> cycle. >>>>> > > > Locate bottlenecks in serial and parallel code that limit >>>>> performance. >>>>> > > > <http://p.sf.net/sfu/intel-dev2devfeb> >>>>> http://p.sf.net/sfu/intel-dev2devfeb >>>>> > > > _______________________________________________ >>>>> > > > xmlpipedb-developer mailing list >>>>> > > > <xml...@li...> >>>>> xml...@li... >>>>> > > > >>>>> <https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer> >>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>>> > > > >>>>> > > > <ATT00001..txt><ATT00002..txt> >>>>> > > >>>>> > > >>>>> > > >>>>> ------------------------------------------------------------------------------ >>>>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel >>>>> Studio XE: >>>>> > > Pinpoint memory and threading errors before they happen. >>>>> > > Find and fix more than 250 security defects in the development >>>>> cycle. >>>>> > > Locate bottlenecks in serial and parallel code that limit >>>>> performance. >>>>> > > <http://p.sf.net/sfu/intel-dev2devfeb> >>>>> http://p.sf.net/sfu/intel-dev2devfeb >>>>> > > _______________________________________________ >>>>> > > xmlpipedb-developer mailing list >>>>> > > <xml...@li...> >>>>> xml...@li... >>>>> > > <https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer> >>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>>> > > >>>>> > > <ATT00001..txt><ATT00002..txt> >>>>> > >>>>> > >>>>> > >>>>> ------------------------------------------------------------------------------ >>>>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >>>>> XE: >>>>> > Pinpoint memory and threading errors before they happen. >>>>> > Find and fix more than 250 security defects in the development cycle. >>>>> > Locate bottlenecks in serial and parallel code that limit >>>>> performance. >>>>> > <http://p.sf.net/sfu/intel-dev2devfeb> >>>>> http://p.sf.net/sfu/intel-dev2devfeb >>>>> > _______________________________________________ >>>>> > xmlpipedb-developer mailing list >>>>> > <xml...@li...> >>>>> xml...@li... >>>>> > <https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer> >>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>>> > >>>>> > >>>>> > >>>>> > <ATT00001..txt><ATT00002..txt> >>>>> >>>>> >>>>> >>>>> ------------------------------------------------------------------------------ >>>>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >>>>> XE: >>>>> Pinpoint memory and threading errors before they happen. >>>>> Find and fix more than 250 security defects in the development cycle. >>>>> Locate bottlenecks in serial and parallel code that limit performance. >>>>> <http://p.sf.net/sfu/intel-dev2devfeb> >>>>> http://p.sf.net/sfu/intel-dev2devfeb >>>>> _______________________________________________ >>>>> xmlpipedb-developer mailing list >>>>> <xml...@li...> >>>>> xml...@li... >>>>> <https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer> >>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>>> >>>>> >>>> >>>> >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >>> Collect, index and harness all the fast moving IT data generated by your >>> applications, servers and devices whether physical, virtual or in the >>> cloud. >>> Deliver compliance at lower cost and gain new business insights. >>> Free Software Download: <http://p.sf.net/sfu/splunk-dev2dev> >>> http://p.sf.net/sfu/splunk-dev2dev >>> >>> _______________________________________________ >>> xmlpipedb-developer mailing list >>> xml...@li... >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >>> Collect, index and harness all the fast moving IT data generated by your >>> applications, servers and devices whether physical, virtual or in the >>> cloud. >>> Deliver compliance at lower cost and gain new business insights. >>> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >>> _______________________________________________ >>> xmlpipedb-developer mailing list >>> xml...@li... >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> >>> >> >> ------------------------------------------------------------------------------ >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >> Collect, index and harness all the fast moving IT data generated by your >> applications, servers and devices whether physical, virtual or in the >> cloud. >> Deliver compliance at lower cost and gain new business insights. >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >> >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> >> >> ------------------------------------------------------------------------------ >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >> Collect, index and harness all the fast moving IT data generated by your >> applications, servers and devices whether physical, virtual or in the >> cloud. >> Deliver compliance at lower cost and gain new business insights. >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> > |
From: Richard B. <rbr...@gm...> - 2011-02-22 05:41:02
|
ah yes... i see it... On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio <do...@lm...>wrote: > Watch your parentheses: "and" has greater precedence than "or" :) > > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> wrote: > > OK, so here are my query results from raw SQL: > > 1. using: like 'Rv%' > > select count (*) from genenametype where type = 'ordered locus' and value > like 'Rv%'; > returns 3988 > > select count (*) from genenametype where type = 'ORF' and value like 'Rv%'; > returns 70 > > select count (*) from genenametype where type = 'ordered locus' or type = > 'ORF' and value like 'Rv%'; > returns 7011 > > 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' > > select count (*) from genenametype where type = 'ordered locus' and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 3988 > > select count (*) from genenametype where type = 'ordered locus' or type = > 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 7011 > > select count (*) from genenametype where type = 'ORF' and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 70 > > Conclusions: > > 1. It seems that querying for type = 'ORF' alone surfaces the 69 genes were > were looking for plus one more (maybe the count for missing genes is off > by 1?). > > 2. Combining the two types in a single query did not produce the results > that I expected (7011? - how did that happen????) so this is likely not our > solution... unless of course the query syntax isn't actually doing what I > think it is... > > 3. I would think the best course of action is to serialy run two separate > queries to capture all the required genes, then removing the one unneeded > gene if its truly not wanted. > > What do you think? > > Richard > > > On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...>wrote: > >> I don't recall the exact details of the missing 69, but if your query >> successfully returns them in raw SQL, then this is worth a try. You can >> integrate into the same query as long as the same columns are returned, >> which is the case here AFAIK, so go ahead and extend the existing query. >> >> >> John David N. Dionisio, PhD >> Associate Professor, Computer Science >> Loyola Marymount University >> >> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: >> >> So here is the appropriate code snippet from >> MycobacteriumTuberculosisUniProtSpeciesProfile.java: >> >> * >> >> public >> *TableManager getSystemTableManagerCustomizations(TableManager >> tableManager, TableManager primarySystemTableManager, Date version) * >> throws* SQLException, InvalidParameterException { >> >> // Build the base query; we only use "ordered locus" and we only want >> >> // IDs that begin with "*Rv*." >> >> PreparedStatement ps = ConnectionManager.*getRelationalDBConnection* >> ().prepareStatement( >> "SELECT value, type " + >> >> "FROM genenametype INNER JOIN entrytype_genetype " + >> >> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >> >> "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid >> = ?"); >> >> ResultSet result; >> >> *for* (Row row : primarySystemTableManager.getRows()) { >> >> ps.setInt(1, Integer.*parseInt*(row.getValue( >> "UID"))); >> >> result = ps.executeQuery(); >> >> // We actually want to keep the case where multiple ordered locus >> >> // names appear. >> >> *while* (result.next()) { >> >> // We want this name to appear in the OrderedLocusNames >> >> // system table. >> >> *for* (String id : result.getString("value").split("/")) { >> >> tableManager.submit( >> "OrderedLocusNames", QueryType.*insert*, *new* String[][] { { "ID", id }, >> { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", >> GenMAPPBuilderUtilities.*getSystemsDateString*(version) }, { "UID", >> row.getValue("UID") } }); >> >> } >> >> } >> >> } >> >> >> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >> So now we want to build the base query which uses "ordered locus" and >> "orf" and we only want IDs that begin with "Rv". >> >> I know there are more comprehensive ways to search for gene ID's >> by matching gene ID prefix but "like Rv%" seemed to work thus far, we just >> need to tell it to search for XML tag type orf in addition to ordered locus. >> >> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and >> entrytype_gene_hjid = ? " >> >> Here is a stab at it.... This part of our class was right as the server >> went down and my submission for week 6 assignment I can't seem to find. >> >> Is it possible to have two different types in the same query or should we >> rewrite a separate query for the orf tag? >> >> Richard >> >> >> >> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous < <rbr...@gm...> >> rbr...@gm...> wrote: >> >> >>> thanks and will do as directed. >>> >>> My previous, last paragraph comment - A way for programming code in email >>> holding its format in a mail message similarly to how you can post code on >>> forum pages? >>> >>> <code> >>> blah >>> blah >>> blah >>> </code> >>> >>> thanks! >>> >>> Richard >>> >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <<do...@lm...> >>> do...@lm...> wrote: >>> >>> >>>> Greetings, >>>> >>>> Actually, gmbuilder.properties is for the TallyEngine only. When >>>> dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to >>>> find those 69 IDs, it is the SpeciesProfile code, and *only* the >>>> SpeciesProfile code, that needs to be changed. >>>> >>>> Your take on how gmbuilder.properties is used, however, is >>>> understandable. It makes sense to assume that the TallyEngine code *and* >>>> the ID export code are based on the same characterization of the needed IDs. >>>> This replication is originally a historical artifact: SpeciesProfile was >>>> done first, and then TallyEngine was done later by another student. >>>> >>>> However, there are other factors beyond history that sort of necessitate >>>> this duplication of desired IDs: (skip the two bullets below if you'd rather >>>> cut to the chase of the work to be done, and discuss design issues later) >>>> >>>> - The actual XML import code is a black box: this is the "canned" JAXB >>>> library actually in action, and not our code at all. Plus, the XML import >>>> code really does not filter (nor should it), since the goal of the >>>> XML->relational database step is to fully capture the XML data in the >>>> relational database. So, XML count is necessarily separated from XML >>>> import. >>>> >>>> - The notion of a declarative mechanism for extracting IDs from the >>>> relational database (which is what gmbuilder.properties/TallyEngine uses) is >>>> interesting, but at the same time there is value in the arbitrary >>>> computation that can be done with Java (case in point: export two versions >>>> of an ID, with and without periods). This is not to say that it is >>>> impossible to do this declaratively, but let's just say that the procedural >>>> approach exists here and now, and a declarative approach will need more >>>> thought. >>>> >>>> These, and other factors, are good thoughts to hold onto and would be >>>> worthy of a good meeting discussion sometime, but bottom line for now: >>>> modifying the export behavior is a matter of editing the *SpeciesProfile* >>>> Java code, and not the gmbuilder.properties file. Turn your attention to >>>> that code. >>>> >>>> Now, as to annotating your code...I'd just put in code comments :) Or >>>> did you mean something else by tagging code in e-mail? >>>> >>>> John David N. Dionisio, PhD >>>> Associate Professor, Computer Science >>>> Loyola Marymount University >>>> >>>> >>>> >>>> >>>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >>>> >>>> > also, how do I tag code in email so it holds its formatting? I tried a >>>> few suggestions I found on the web but they aren't holding formatting or i'm >>>> just doing it wrong ;-D >>>> > >>>> > Richard >>>> > >>>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous < <rbr...@gm...> >>>> rbr...@gm...> wrote: >>>> > OK, have some updates and some suggestions: >>>> > >>>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing >>>> report. We verified that XML match does indeed find 4066 unique matches - 75 >>>> of which are not in the gdb and need to be. >>>> > >>>> > Dr. Dahlquist informed me that she was the one who completed the gene >>>> db testing report, not a previous student of BIO367 and had already verified >>>> which genes were missing and where they were to be found. I had (mistakenly) >>>> assumed that since a student had performed the gene database testing I had >>>> to redo all of the verification. >>>> > >>>> > So that said, of the 75 genes missing - 69 need to be included and 6 >>>> excluded. >>>> > Per the gene db testing report: "69 of them have an "a", "b", or "d" >>>> suffix. They are all found in the ORF tag and need to be included in the >>>> gdb." >>>> > >>>> > To solve this we need to add additional search criteria into the M. >>>> tuberculosis section in gmbuilder.properties below: >>>> > # Mycobacterium tuberculosis >>>> > >>>> > mycobacteriumtuberculosis_level_amount= >>>> > >>>> > 1 >>>> > >>>> > mycobacteriumtuberculosis_element_level0= >>>> > >>>> > uniprot/entry/gene/name&type&ordered locus >>>> > >>>> > mycobacteriumtuberculosis_query_level0= >>>> > >>>> > select count(*) from genenametype where type = 'ordered locus' and >>>> value like 'Rv%'; >>>> > >>>> > mycobacteriumtuberculosis_table_name_level0= >>>> > >>>> > Ordered Locus >>>> > SOLUTIONS: >>>> > >>>> > 1. So am i correct in my understanding that the second line is the >>>> query used by TallyEngine to read the XML file? If so then this is the issue >>>> we need to table for the moment until we get the gbd verified and >>>> re-released. We will revisit this to discover why it is not only reporting >>>> incorrectly but also why its added a second row of Ordered Locus on the >>>> TallyEngine results page. >>>> > >>>> > 2. The third line is the SQL query used by postgres during the export >>>> from XML to gdb. To find and get the ORF tagged genes could we not add the >>>> following lines and change the count in the first line: >>>> > >>>> > >>>> > # Mycobacterium tuberculosis >>>> > >>>> > mycobacteriumtuberculosis_level_amount=2 >>>> > >>>> > >>>> > >>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>> locus >>>> > >>>> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >>>> > >>>> > >>>> > mycobacteriumtuberculosis_query_level0= >>>> > >>>> > select count(*) from genenametype where type = 'ordered locus'; >>>> > mycobacteriumtuberculosis_query_level1=select count(*) from >>>> genenametype where type = 'orf'; >>>> > >>>> > >>>> > mycobacteriumtuberculosis_table_name_level0= >>>> > >>>> > Ordered Locus >>>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >>>> > >>>> > >>>> ---------------------------------------------------------------------------------------------------------------------------- >>>> > >>>> > Of course these queries would have be manually verified prior to >>>> making these changes but this seems like we are moving in the right >>>> direction. >>>> > >>>> > Richard >>>> > >>>> > >>>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous < <rbr...@gm...> >>>> rbr...@gm...> wrote: >>>> > Just got done reading previous email and understand the change in >>>> priority. >>>> > >>>> > Will work on the missing ID's for now and shelve the the TalleyEngine >>>> issue for the moment. >>>> > >>>> > Also great about a more formalized weekly meeting. I was going to >>>> suggest it myself as it has been slow going so far as maybe i'm a bit too >>>> independent in this independent study class =D >>>> > >>>> > Will dig further into the missing ID's later tonight and during day >>>> tomorrow and report back. >>>> > >>>> > Richard >>>> > >>>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <<do...@lm...> >>>> do...@lm...> wrote: >>>> > Hi Rich, >>>> > >>>> > No problem. The pertinent line you're referring to, for XML, is this, >>>> right above the line you copied: >>>> > >>>> > >>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>> locus >>>> > >>>> > The slash-separated section is the "path" of XML tags leading to the >>>> element of interest; then, after the ampersand, is a name/value pair for the >>>> desired attribute to count. Note that there is no hint of a *content*-based >>>> filter (nor is there the capability for one, as far as I can tell in the >>>> code). By "content," I mean that we can't specify filters based on what's >>>> *between* the tags. We can only go as far as filter by attribute value, >>>> e.g., type="ordered locus". >>>> > >>>> > But anyway, as mentioned in the earlier e-mail, let's have the missing >>>> IDs in the .gdb take precedence for now. Please take a look at the >>>> tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how >>>> the ID output can be customized, then let me know if you have any questions >>>> or need to confirm anything. >>>> > >>>> > John David N. Dionisio, PhD >>>> > Associate Professor, Computer Science >>>> > Loyola Marymount University >>>> > >>>> > >>>> > >>>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >>>> > >>>> > > Sorry been slammed with a programming assignment that kept needing >>>> continued iteration and it has been all consuming until last night. But I >>>> did get a chance to work with your comments and review the code again with a >>>> different mind set. >>>> > > >>>> > > Yes, I examined the gmbuilder.properties file ( the query is also in >>>> the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a >>>> previous email ) but I don't think I see what you mean regarding the XML >>>> count. >>>> > > >>>> > > I understood that: mycobacteriumtuberculosis_query_level0=select >>>> count(*) from genenametype where type = 'ordered locus' and value like >>>> 'Rv%'; was the db query but don't see which is the XML count... or do they >>>> share the same query and you are saying that XML count doesn't recognize and >>>> therefore cannot use the 'Rv%' parameter? >>>> > > >>>> > > Richard >>>> > > >>>> > > >>>> > > >>>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <<do...@lm...> >>>> do...@lm...> wrote: >>>> > > Hi Rich, >>>> > > >>>> > > Sorry for the delay. Had some distractions coming into the weekend. >>>> > > >>>> > > You've looked at the code; have you looked at gmbuilder.properties? >>>> (I may have mentioned it a few e-mails ago, just as you were starting to >>>> dig into this) >>>> > > >>>> > > On the copy I have, the M. tuberculosis block looks like this >>>> (indentation is mine to set it apart): >>>> > > >>>> > > # Mycobacterium tuberculosis >>>> > > mycobacteriumtuberculosis_level_amount=1 >>>> > > >>>> > > >>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>> locus >>>> > > >>>> > > mycobacteriumtuberculosis_query_level0=select count(*) from >>>> genenametype where type = 'ordered locus' and value like 'Rv%'; >>>> > > >>>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >>>> > > >>>> > > There, I think, is the rub. Notice that the XML count does not >>>> filter on RV%. The SQL query does. >>>> > > >>>> > > Unfortunately, I don't think the TallyEngine can include selective >>>> filtering in the XML counts. If the need to do selective filtering on XML >>>> is necessary, then I think we're looking at a new functionality for you to >>>> implement (or, if this throws things off too much, this may have to be noted >>>> somewhere, that the XML vs. database counts may be off because the database >>>> count is doing some text-based filtering but the XML count does not). >>>> > > >>>> > > What does xmlpipedb-match say? That will at least tell you whether >>>> the 'RV%' count is indeed correct. >>>> > > >>>> > > John David N. Dionisio, PhD >>>> > > Associate Professor, Computer Science >>>> > > Loyola Marymount University >>>> > > >>>> > > >>>> > > >>>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >>>> > > >>>> > > > OK here is what I was able to put together from the past few hours >>>> of code review: >>>> > > > >>>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>>> > > > -reveals that after the 2 System table modifications are made >>>> adding species name and link, a PreparedStatement is instantiated which >>>> builds and calls the base query. >>>> > > > >>>> > > > -The base query called is: ("SELECT value, type " + "FROM >>>> genenametype INNER JOIN entrytype_genetype " + >>>> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type >>>> = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") >>>> > > > >>>> > > > -So its looking in 'ordered locus' table/column for any tuple that >>>> starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . >>>> > > > The 'like' comparator and % usage are clear with the 'type' >>>> entrytype_gene_hjid = ? >>>> > > > >>>> > > > -To me it seems the query makes sense so the problem is likely >>>> elsewhere. >>>> > > > >>>> > > > GenMappBuilder.java: >>>> > > > -I found method doTallies() at code line 895 which: >>>> > > > Instantiates a Configuration called hibernateConfiguration and >>>> assigns to it the current hibernate configuration >>>> > > > Validates database settings by analyzing hibernateConfiguration >>>> > > > Instantiates a CriterionList for uniprot and assigns to it >>>> TallyType.UNIPROT >>>> > > > Instantiates a CriterionList for go and assigns to it TallyType.GO >>>> > > > Determines if both xml files exist >>>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on >>>> both xml files and their respective CriterionList >>>> > > > Results are then formatted for display in a table. >>>> > > > >>>> > > > -So enum TallyType which means that they are the only valid >>>> datatypes which TallyEngine accepts... go to know ... >>>> > > > >>>> > > > -Based on the screen shot of Tally Engine it would seem that both >>>> getTallyResultsXML() and getTallyResultsDatabase() are incorrectly >>>> returning. Likely due to both using an incorrect query (as we previously >>>> supposed). But where are the queries?... the more I dig the more I think >>>> they are in the criterial all the work is done against. >>>> > > > >>>> > > > continuing the review: >>>> > > > getTallyResultsXML() calls Tally Engine instance method >>>> getXmlFileCounts(xmlFile) >>>> > > > getTallyResultsDatabase() calls Tally Engine instance method >>>> getDbcounts(new QueryEngine(hibernateConfiguration) >>>> > > > Both of these instanced methods originate from TallyEngine.java... >>>> > > > >>>> > > > TallyEngine.java: >>>> > > > >>>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a >>>> digester then processes against criteria... but this quickly becomes >>>> confusing and is hard to follow >>>> > > > >>>> > > > getDbcounts() then starts a db session and executes a query but >>>> then I also get a bit lost with my limited db knowledge. >>>> > > > >>>> > > > >>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >>>> > > > >>>> > > > OVERALL I think I'm getting closer to the issues but I still feel >>>> as if I'm missing some understanding to proceed further. Can you pass along >>>> some of that Dondi insight and steer me in the right direction? =D >>>> > > > >>>> > > > -DB Tally - Not having taken databases yet certainly is limiting >>>> my ability determine where the "criteria" are being set and how they are >>>> followed during session activities. Also is the query we have been looking >>>> for this whole time in the criteria or someplace else? >>>> > > > >>>> > > > -XML Tally - again is the query contained within the criteria that >>>> digestXmlFile() uses to parse? >>>> > > > >>>> > > > Richard >>>> > > > >>>> > > > >>>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <<do...@lm...> >>>> do...@lm...> wrote: >>>> > > > Right, schema issues are unlikely. Most count discrepancies like >>>> this that I've seen have boiled down to forming the right query. Then, >>>> knowing the right query (in both XML and SQL), it's a matter of making sure >>>> that TallyEngine asks that same query. >>>> > > > >>>> > > > John David N. Dionisio, PhD >>>> > > > Associate Professor, Computer Science >>>> > > > Loyola Marymount University >>>> > > > >>>> > > > >>>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >>>> > > > >>>> > > > > OK, so based on your approach: >>>> > > > > >>>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and >>>> sql queries needed for the respective results as you requested. >>>> > > > > >>>> > > > > I was also thinking I may need to review the schema from xml >>>> into postgres but the issue isn't likely a schema error. The error most >>>> likely lies in how xmlpipedbutils queries the data from xml source and >>>> writes to the tables what it returns? >>>> > > > > >>>> > > > > 2. I'll review the code: trace the entrance of tally engine in >>>> the gmbuilder code then follow it through the xmlpipedbutils. >>>> > > > > >>>> > > > > Richard >>>> > > > > >>>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <<do...@lm...> >>>> do...@lm...> wrote: >>>> > > > > Just wanted to confirm (since I wasn't sure in the first e-mail) >>>> --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in >>>> SourceForge's Subversion repo. >>>> > > > > >>>> > > > > John David N. Dionisio, PhD >>>> > > > > Associate Professor, Computer Science >>>> > > > > Loyola Marymount University >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >>>> > > > > >>>> > > > > > Hi Dondi, >>>> > > > > > >>>> > > > > > So I'm at the point in working with M tuberculosis that I was >>>> able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. >>>> > > > > > >>>> > > > > > gmb2b60 Results >>>> > > > > > >>>> > > > > > >>>> > > > > > >>>> > > > > > Now the proverbial question - What next to solve the Ordered >>>> Locus import/count issue? >>>> > > > > > >>>> > > > > > ********************************************** >>>> > > > > > Here is my thought process: >>>> > > > > > >>>> > > > > > Step 1: How does the import process work at the high level? >>>> (obviously correct me if I'm wrong) >>>> > > > > > >>>> > > > > > I believe that basically as each XML tag is read, it is placed >>>> in the proper Postgres table(s) based on some criteria. There is also likely >>>> some sort of check that each individual tag is in valid XML format unless we >>>> don't care at this stage (care at export) or maybe the parser just skips >>>> over and goes on to the next . >>>> > > > > > >>>> > > > > > Step 2: What could be the problem? >>>> > > > > > >>>> > > > > > Either - >>>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? >>>> > > > > > b. Decision criteria of which table they should be added to? >>>> > > > > > >>>> > > > > > ********************************************** >>>> > > > > > >>>> > > > > > I read on the sourceforge wiki: >>>> > > > > > >>>> > > > > > XMLPipeDB has a modular architecture with three components >>>> that may be used separately or together. XSD-to-DB reads an XSD (XML Schema >>>> Definition) and automatically generates an SQL schema, Java classes, and >>>> Hibernate mappings. XMLPipeDB Utilities provides functionality for >>>> configuring the database, importing data, and performing queries. GenMAPP >>>> Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible >>>> Gene Databases based on data from UniProt and Gene Ontology (GO). >>>> > > > > > >>>> > > > > > So I should probably start with the XMLPipeDB Utilities which >>>> are where? I don't see any in the basic distribution or are they not >>>> standalone and called from the command line? >>>> > > > > > >>>> > > > > > Thanks! >>>> > > > > > >>>> > > > > > Richard >>>> > > > > >>>> > > > > >>>> > > > > <ATT00001..txt><ATT00002..txt> >>>> > > > >>>> > > > >>>> > > > >>>> ------------------------------------------------------------------------------ >>>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel >>>> Studio XE: >>>> > > > Pinpoint memory and threading errors before they happen. >>>> > > > Find and fix more than 250 security defects in the development >>>> cycle. >>>> > > > Locate bottlenecks in serial and parallel code that limit >>>> performance. >>>> > > > <http://p.sf.net/sfu/intel-dev2devfeb> >>>> http://p.sf.net/sfu/intel-dev2devfeb >>>> > > > _______________________________________________ >>>> > > > xmlpipedb-developer mailing list >>>> > > > <xml...@li...> >>>> xml...@li... >>>> > > > >>>> <https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer> >>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>> > > > >>>> > > > <ATT00001..txt><ATT00002..txt> >>>> > > >>>> > > >>>> > > >>>> ------------------------------------------------------------------------------ >>>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel >>>> Studio XE: >>>> > > Pinpoint memory and threading errors before they happen. >>>> > > Find and fix more than 250 security defects in the development >>>> cycle. >>>> > > Locate bottlenecks in serial and parallel code that limit >>>> performance. >>>> > > <http://p.sf.net/sfu/intel-dev2devfeb> >>>> http://p.sf.net/sfu/intel-dev2devfeb >>>> > > _______________________________________________ >>>> > > xmlpipedb-developer mailing list >>>> > > <xml...@li...> >>>> xml...@li... >>>> > > <https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer> >>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>> > > >>>> > > <ATT00001..txt><ATT00002..txt> >>>> > >>>> > >>>> > >>>> ------------------------------------------------------------------------------ >>>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >>>> XE: >>>> > Pinpoint memory and threading errors before they happen. >>>> > Find and fix more than 250 security defects in the development cycle. >>>> > Locate bottlenecks in serial and parallel code that limit performance. >>>> > <http://p.sf.net/sfu/intel-dev2devfeb> >>>> http://p.sf.net/sfu/intel-dev2devfeb >>>> > _______________________________________________ >>>> > xmlpipedb-developer mailing list >>>> > <xml...@li...> >>>> xml...@li... >>>> > <https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer> >>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>> > >>>> > >>>> > >>>> > <ATT00001..txt><ATT00002..txt> >>>> >>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >>>> XE: >>>> Pinpoint memory and threading errors before they happen. >>>> Find and fix more than 250 security defects in the development cycle. >>>> Locate bottlenecks in serial and parallel code that limit performance. >>>> <http://p.sf.net/sfu/intel-dev2devfeb> >>>> http://p.sf.net/sfu/intel-dev2devfeb >>>> _______________________________________________ >>>> xmlpipedb-developer mailing list >>>> <xml...@li...> >>>> xml...@li... >>>> <https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer> >>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>> >>>> >>> >>> >> >> >> >> ------------------------------------------------------------------------------ >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >> Collect, index and harness all the fast moving IT data generated by your >> applications, servers and devices whether physical, virtual or in the >> cloud. >> Deliver compliance at lower cost and gain new business insights. >> Free Software Download: <http://p.sf.net/sfu/splunk-dev2dev> >> http://p.sf.net/sfu/splunk-dev2dev >> >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> >> >> ------------------------------------------------------------------------------ >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >> Collect, index and harness all the fast moving IT data generated by your >> applications, servers and devices whether physical, virtual or in the >> cloud. >> Deliver compliance at lower cost and gain new business insights. >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the > cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the > cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: John D. N. D. <do...@lm...> - 2011-02-22 05:25:20
|
Watch your parentheses: "and" has greater precedence than "or" :) John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> wrote: > OK, so here are my query results from raw SQL: > > 1. using: like 'Rv%' > > select count (*) from genenametype where type = 'ordered locus' and value like 'Rv%'; > returns 3988 > > select count (*) from genenametype where type = 'ORF' and value like 'Rv%'; > returns 70 > > select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value like 'Rv%'; > returns 7011 > > 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' > > select count (*) from genenametype where type = 'ordered locus' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 3988 > > select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 7011 > > select count (*) from genenametype where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 70 > > Conclusions: > > 1. It seems that querying for type = 'ORF' alone surfaces the 69 genes were were looking for plus one more (maybe the count for missing genes is off by 1?). > > 2. Combining the two types in a single query did not produce the results that I expected (7011? - how did that happen????) so this is likely not our solution... unless of course the query syntax isn't actually doing what I think it is... > > 3. I would think the best course of action is to serialy run two separate queries to capture all the required genes, then removing the one unneeded gene if its truly not wanted. > > What do you think? > > Richard > > > On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...> wrote: > I don't recall the exact details of the missing 69, but if your query successfully returns them in raw SQL, then this is worth a try. You can integrate into the same query as long as the same columns are returned, which is the case here AFAIK, so go ahead and extend the existing query. > > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: > >> So here is the appropriate code snippet from MycobacteriumTuberculosisUniProtSpeciesProfile.java: >> public >> >> TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) throws SQLException, InvalidParameterException { >> // Build the base query; we only use "ordered locus" and we only want >> // IDs that begin with "Rv." >> PreparedStatement ps = ConnectionManager.getRelationalDBConnection().prepareStatement( >> >> "SELECT value, type " + >> "FROM genenametype INNER JOIN entrytype_genetype " + >> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >> "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?"); >> ResultSet result; >> >> for (Row row : primarySystemTableManager.getRows()) { >> ps.setInt(1, Integer.parseInt(row.getValue( >> >> "UID"))); >> result = ps.executeQuery(); >> >> // We actually want to keep the case where multiple ordered locus >> // names appear. >> while (result.next()) { >> // We want this name to appear in the OrderedLocusNames >> // system table. >> for (String id : result.getString("value").split("/")) { >> tableManager.submit( >> >> "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) }, { "UID", row.getValue("UID") } }); >> } >> >> } >> >> } >> >> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >> So now we want to build the base query which uses "ordered locus" and "orf" and we only want IDs that begin with "Rv". >> >> I know there are more comprehensive ways to search for gene ID's by matching gene ID prefix but "like Rv%" seemed to work thus far, we just need to tell it to search for XML tag type orf in addition to ordered locus. >> >> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and entrytype_gene_hjid = ? " >> >> Here is a stab at it.... This part of our class was right as the server went down and my submission for week 6 assignment I can't seem to find. >> >> Is it possible to have two different types in the same query or should we rewrite a separate query for the orf tag? >> >> Richard >> >> >> >> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote: >> >> thanks and will do as directed. >> >> My previous, last paragraph comment - A way for programming code in email holding its format in a mail message similarly to how you can post code on forum pages? >> >> <code> >> blah >> blah >> blah >> </code> >> >> thanks! >> >> Richard >> >> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <do...@lm...> wrote: >> >> Greetings, >> >> Actually, gmbuilder.properties is for the TallyEngine only. When dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to find those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile code, that needs to be changed. >> >> Your take on how gmbuilder.properties is used, however, is understandable. It makes sense to assume that the TallyEngine code *and* the ID export code are based on the same characterization of the needed IDs. This replication is originally a historical artifact: SpeciesProfile was done first, and then TallyEngine was done later by another student. >> >> However, there are other factors beyond history that sort of necessitate this duplication of desired IDs: (skip the two bullets below if you'd rather cut to the chase of the work to be done, and discuss design issues later) >> >> - The actual XML import code is a black box: this is the "canned" JAXB library actually in action, and not our code at all. Plus, the XML import code really does not filter (nor should it), since the goal of the XML->relational database step is to fully capture the XML data in the relational database. So, XML count is necessarily separated from XML import. >> >> - The notion of a declarative mechanism for extracting IDs from the relational database (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at the same time there is value in the arbitrary computation that can be done with Java (case in point: export two versions of an ID, with and without periods). This is not to say that it is impossible to do this declaratively, but let's just say that the procedural approach exists here and now, and a declarative approach will need more thought. >> >> These, and other factors, are good thoughts to hold onto and would be worthy of a good meeting discussion sometime, but bottom line for now: modifying the export behavior is a matter of editing the *SpeciesProfile* Java code, and not the gmbuilder.properties file. Turn your attention to that code. >> >> Now, as to annotating your code...I'd just put in code comments :) Or did you mean something else by tagging code in e-mail? >> >> John David N. Dionisio, PhD >> Associate Professor, Computer Science >> Loyola Marymount University >> >> >> >> >> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >> >> > also, how do I tag code in email so it holds its formatting? I tried a few suggestions I found on the web but they aren't holding formatting or i'm just doing it wrong ;-D >> > >> > Richard >> > >> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote: >> > OK, have some updates and some suggestions: >> > >> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing report. We verified that XML match does indeed find 4066 unique matches - 75 of which are not in the gdb and need to be. >> > >> > Dr. Dahlquist informed me that she was the one who completed the gene db testing report, not a previous student of BIO367 and had already verified which genes were missing and where they were to be found. I had (mistakenly) assumed that since a student had performed the gene database testing I had to redo all of the verification. >> > >> > So that said, of the 75 genes missing - 69 need to be included and 6 excluded. >> > Per the gene db testing report: "69 of them have an "a", "b", or "d" suffix. They are all found in the ORF tag and need to be included in the gdb." >> > >> > To solve this we need to add additional search criteria into the M. tuberculosis section in gmbuilder.properties below: >> > # Mycobacterium tuberculosis >> > >> > mycobacteriumtuberculosis_level_amount= >> > >> > 1 >> > >> > mycobacteriumtuberculosis_element_level0= >> > >> > uniprot/entry/gene/name&type&ordered locus >> > >> > mycobacteriumtuberculosis_query_level0= >> > >> > select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >> > >> > mycobacteriumtuberculosis_table_name_level0= >> > >> > Ordered Locus >> > SOLUTIONS: >> > >> > 1. So am i correct in my understanding that the second line is the query used by TallyEngine to read the XML file? If so then this is the issue we need to table for the moment until we get the gbd verified and re-released. We will revisit this to discover why it is not only reporting incorrectly but also why its added a second row of Ordered Locus on the TallyEngine results page. >> > >> > 2. The third line is the SQL query used by postgres during the export from XML to gdb. To find and get the ORF tagged genes could we not add the following lines and change the count in the first line: >> > >> > >> > # Mycobacterium tuberculosis >> > >> > mycobacteriumtuberculosis_level_amount=2 >> > >> > >> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus >> > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >> > >> > >> > mycobacteriumtuberculosis_query_level0= >> > >> > select count(*) from genenametype where type = 'ordered locus'; >> > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf'; >> > >> > >> > mycobacteriumtuberculosis_table_name_level0= >> > >> > Ordered Locus >> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >> > >> > ---------------------------------------------------------------------------------------------------------------------------- >> > >> > Of course these queries would have be manually verified prior to making these changes but this seems like we are moving in the right direction. >> > >> > Richard >> > >> > >> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: >> > Just got done reading previous email and understand the change in priority. >> > >> > Will work on the missing ID's for now and shelve the the TalleyEngine issue for the moment. >> > >> > Also great about a more formalized weekly meeting. I was going to suggest it myself as it has been slow going so far as maybe i'm a bit too independent in this independent study class =D >> > >> > Will dig further into the missing ID's later tonight and during day tomorrow and report back. >> > >> > Richard >> > >> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote: >> > Hi Rich, >> > >> > No problem. The pertinent line you're referring to, for XML, is this, right above the line you copied: >> > >> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus >> > >> > The slash-separated section is the "path" of XML tags leading to the element of interest; then, after the ampersand, is a name/value pair for the desired attribute to count. Note that there is no hint of a *content*-based filter (nor is there the capability for one, as far as I can tell in the code). By "content," I mean that we can't specify filters based on what's *between* the tags. We can only go as far as filter by attribute value, e.g., type="ordered locus". >> > >> > But anyway, as mentioned in the earlier e-mail, let's have the missing IDs in the .gdb take precedence for now. Please take a look at the tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how the ID output can be customized, then let me know if you have any questions or need to confirm anything. >> > >> > John David N. Dionisio, PhD >> > Associate Professor, Computer Science >> > Loyola Marymount University >> > >> > >> > >> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >> > >> > > Sorry been slammed with a programming assignment that kept needing continued iteration and it has been all consuming until last night. But I did get a chance to work with your comments and review the code again with a different mind set. >> > > >> > > Yes, I examined the gmbuilder.properties file ( the query is also in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a previous email ) but I don't think I see what you mean regarding the XML count. >> > > >> > > I understood that: mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; was the db query but don't see which is the XML count... or do they share the same query and you are saying that XML count doesn't recognize and therefore cannot use the 'Rv%' parameter? >> > > >> > > Richard >> > > >> > > >> > > >> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> wrote: >> > > Hi Rich, >> > > >> > > Sorry for the delay. Had some distractions coming into the weekend. >> > > >> > > You've looked at the code; have you looked at gmbuilder.properties? (I may have mentioned it a few e-mails ago, just as you were starting to dig into this) >> > > >> > > On the copy I have, the M. tuberculosis block looks like this (indentation is mine to set it apart): >> > > >> > > # Mycobacterium tuberculosis >> > > mycobacteriumtuberculosis_level_amount=1 >> > > >> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus >> > > >> > > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >> > > >> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >> > > >> > > There, I think, is the rub. Notice that the XML count does not filter on RV%. The SQL query does. >> > > >> > > Unfortunately, I don't think the TallyEngine can include selective filtering in the XML counts. If the need to do selective filtering on XML is necessary, then I think we're looking at a new functionality for you to implement (or, if this throws things off too much, this may have to be noted somewhere, that the XML vs. database counts may be off because the database count is doing some text-based filtering but the XML count does not). >> > > >> > > What does xmlpipedb-match say? That will at least tell you whether the 'RV%' count is indeed correct. >> > > >> > > John David N. Dionisio, PhD >> > > Associate Professor, Computer Science >> > > Loyola Marymount University >> > > >> > > >> > > >> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >> > > >> > > > OK here is what I was able to put together from the past few hours of code review: >> > > > >> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >> > > > -reveals that after the 2 System table modifications are made adding species name and link, a PreparedStatement is instantiated which builds and calls the base query. >> > > > >> > > > -The base query called is: ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") >> > > > >> > > > -So its looking in 'ordered locus' table/column for any tuple that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . >> > > > The 'like' comparator and % usage are clear with the 'type' entrytype_gene_hjid = ? >> > > > >> > > > -To me it seems the query makes sense so the problem is likely elsewhere. >> > > > >> > > > GenMappBuilder.java: >> > > > -I found method doTallies() at code line 895 which: >> > > > Instantiates a Configuration called hibernateConfiguration and assigns to it the current hibernate configuration >> > > > Validates database settings by analyzing hibernateConfiguration >> > > > Instantiates a CriterionList for uniprot and assigns to it TallyType.UNIPROT >> > > > Instantiates a CriterionList for go and assigns to it TallyType.GO >> > > > Determines if both xml files exist >> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on both xml files and their respective CriterionList >> > > > Results are then formatted for display in a table. >> > > > >> > > > -So enum TallyType which means that they are the only valid datatypes which TallyEngine accepts... go to know ... >> > > > >> > > > -Based on the screen shot of Tally Engine it would seem that both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly returning. Likely due to both using an incorrect query (as we previously supposed). But where are the queries?... the more I dig the more I think they are in the criterial all the work is done against. >> > > > >> > > > continuing the review: >> > > > getTallyResultsXML() calls Tally Engine instance method getXmlFileCounts(xmlFile) >> > > > getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new QueryEngine(hibernateConfiguration) >> > > > Both of these instanced methods originate from TallyEngine.java... >> > > > >> > > > TallyEngine.java: >> > > > >> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a digester then processes against criteria... but this quickly becomes confusing and is hard to follow >> > > > >> > > > getDbcounts() then starts a db session and executes a query but then I also get a bit lost with my limited db knowledge. >> > > > >> > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >> > > > >> > > > OVERALL I think I'm getting closer to the issues but I still feel as if I'm missing some understanding to proceed further. Can you pass along some of that Dondi insight and steer me in the right direction? =D >> > > > >> > > > -DB Tally - Not having taken databases yet certainly is limiting my ability determine where the "criteria" are being set and how they are followed during session activities. Also is the query we have been looking for this whole time in the criteria or someplace else? >> > > > >> > > > -XML Tally - again is the query contained within the criteria that digestXmlFile() uses to parse? >> > > > >> > > > Richard >> > > > >> > > > >> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> wrote: >> > > > Right, schema issues are unlikely. Most count discrepancies like this that I've seen have boiled down to forming the right query. Then, knowing the right query (in both XML and SQL), it's a matter of making sure that TallyEngine asks that same query. >> > > > >> > > > John David N. Dionisio, PhD >> > > > Associate Professor, Computer Science >> > > > Loyola Marymount University >> > > > >> > > > >> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >> > > > >> > > > > OK, so based on your approach: >> > > > > >> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and sql queries needed for the respective results as you requested. >> > > > > >> > > > > I was also thinking I may need to review the schema from xml into postgres but the issue isn't likely a schema error. The error most likely lies in how xmlpipedbutils queries the data from xml source and writes to the tables what it returns? >> > > > > >> > > > > 2. I'll review the code: trace the entrance of tally engine in the gmbuilder code then follow it through the xmlpipedbutils. >> > > > > >> > > > > Richard >> > > > > >> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <do...@lm...> wrote: >> > > > > Just wanted to confirm (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in SourceForge's Subversion repo. >> > > > > >> > > > > John David N. Dionisio, PhD >> > > > > Associate Professor, Computer Science >> > > > > Loyola Marymount University >> > > > > >> > > > > >> > > > > >> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >> > > > > >> > > > > > Hi Dondi, >> > > > > > >> > > > > > So I'm at the point in working with M tuberculosis that I was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. >> > > > > > >> > > > > > gmb2b60 Results >> > > > > > >> > > > > > >> > > > > > >> > > > > > Now the proverbial question - What next to solve the Ordered Locus import/count issue? >> > > > > > >> > > > > > ********************************************** >> > > > > > Here is my thought process: >> > > > > > >> > > > > > Step 1: How does the import process work at the high level? (obviously correct me if I'm wrong) >> > > > > > >> > > > > > I believe that basically as each XML tag is read, it is placed in the proper Postgres table(s) based on some criteria. There is also likely some sort of check that each individual tag is in valid XML format unless we don't care at this stage (care at export) or maybe the parser just skips over and goes on to the next . >> > > > > > >> > > > > > Step 2: What could be the problem? >> > > > > > >> > > > > > Either - >> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? >> > > > > > b. Decision criteria of which table they should be added to? >> > > > > > >> > > > > > ********************************************** >> > > > > > >> > > > > > I read on the sourceforge wiki: >> > > > > > >> > > > > > XMLPipeDB has a modular architecture with three components that may be used separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and automatically generates an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB Utilities provides functionality for configuring the database, importing data, and performing queries. GenMAPP Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on data from UniProt and Gene Ontology (GO). >> > > > > > >> > > > > > So I should probably start with the XMLPipeDB Utilities which are where? I don't see any in the basic distribution or are they not standalone and called from the command line? >> > > > > > >> > > > > > Thanks! >> > > > > > >> > > > > > Richard >> > > > > >> > > > > >> > > > > <ATT00001..txt><ATT00002..txt> >> > > > >> > > > >> > > > ------------------------------------------------------------------------------ >> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >> > > > Pinpoint memory and threading errors before they happen. >> > > > Find and fix more than 250 security defects in the development cycle. >> > > > Locate bottlenecks in serial and parallel code that limit performance. >> > > > http://p.sf.net/sfu/intel-dev2devfeb >> > > > _______________________________________________ >> > > > xmlpipedb-developer mailing list >> > > > xml...@li... >> > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > > > >> > > > <ATT00001..txt><ATT00002..txt> >> > > >> > > >> > > ------------------------------------------------------------------------------ >> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >> > > Pinpoint memory and threading errors before they happen. >> > > Find and fix more than 250 security defects in the development cycle. >> > > Locate bottlenecks in serial and parallel code that limit performance. >> > > http://p.sf.net/sfu/intel-dev2devfeb >> > > _______________________________________________ >> > > xmlpipedb-developer mailing list >> > > xml...@li... >> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > > >> > > <ATT00001..txt><ATT00002..txt> >> > >> > >> > ------------------------------------------------------------------------------ >> > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >> > Pinpoint memory and threading errors before they happen. >> > Find and fix more than 250 security defects in the development cycle. >> > Locate bottlenecks in serial and parallel code that limit performance. >> > http://p.sf.net/sfu/intel-dev2devfeb >> > _______________________________________________ >> > xmlpipedb-developer mailing list >> > xml...@li... >> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > >> > >> > >> > <ATT00001..txt><ATT00002..txt> >> >> >> ------------------------------------------------------------------------------ >> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >> Pinpoint memory and threading errors before they happen. >> Find and fix more than 250 security defects in the development cycle. >> Locate bottlenecks in serial and parallel code that limit performance. >> http://p.sf.net/sfu/intel-dev2devfeb >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> >> > >> ------------------------------------------------------------------------------ >> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk >> Collect, index and harness all the fast moving IT data generated by your >> applications, servers and devices whether physical, virtual or in the cloud. >> Deliver compliance at lower cost and gain new business insights. >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer |
From: Richard B. <rbr...@gm...> - 2011-02-22 04:00:07
|
OK, so here are my query results from raw SQL: 1. using: like 'Rv%' select count (*) from genenametype where type = 'ordered locus' and value like 'Rv%'; returns 3988 select count (*) from genenametype where type = 'ORF' and value like 'Rv%'; returns 70 select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value like 'Rv%'; returns 7011 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' select count (*) from genenametype where type = 'ordered locus' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; returns 3988 select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; returns 7011 select count (*) from genenametype where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; returns 70 Conclusions: 1. It seems that querying for type = 'ORF' alone surfaces the 69 genes were were looking for plus one more (maybe the count for missing genes is off by 1?). 2. Combining the two types in a single query did not produce the results that I expected (7011? - how did that happen????) so this is likely not our solution... unless of course the query syntax isn't actually doing what I think it is... 3. I would think the best course of action is to serialy run two separate queries to capture all the required genes, then removing the one unneeded gene if its truly not wanted. What do you think? Richard On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...>wrote: > I don't recall the exact details of the missing 69, but if your query > successfully returns them in raw SQL, then this is worth a try. You can > integrate into the same query as long as the same columns are returned, > which is the case here AFAIK, so go ahead and extend the existing query. > > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: > > So here is the appropriate code snippet from > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > > * > > public > *TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) * > throws* SQLException, InvalidParameterException { > > // Build the base query; we only use "ordered locus" and we only want > > // IDs that begin with "*Rv*." > > PreparedStatement ps = ConnectionManager.*getRelationalDBConnection* > ().prepareStatement( > "SELECT value, type " + > > "FROM genenametype INNER JOIN entrytype_genetype " + > > "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + > > "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid > = ?"); > > ResultSet result; > > *for* (Row row : primarySystemTableManager.getRows()) { > > ps.setInt(1, Integer.*parseInt*(row.getValue( > "UID"))); > > result = ps.executeQuery(); > > // We actually want to keep the case where multiple ordered locus > > // names appear. > > *while* (result.next()) { > > // We want this name to appear in the OrderedLocusNames > > // system table. > > *for* (String id : result.getString("value").split("/")) { > > tableManager.submit( > "OrderedLocusNames", QueryType.*insert*, *new* String[][] { { "ID", id }, > { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", > GenMAPPBuilderUtilities.*getSystemsDateString*(version) }, { "UID", > row.getValue("UID") } }); > > } > > } > > } > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > So now we want to build the base query which uses "ordered locus" and "orf" > and we only want IDs that begin with "Rv". > > I know there are more comprehensive ways to search for gene ID's > by matching gene ID prefix but "like Rv%" seemed to work thus far, we just > need to tell it to search for XML tag type orf in addition to ordered locus. > > "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and > entrytype_gene_hjid = ? " > > Here is a stab at it.... This part of our class was right as the server > went down and my submission for week 6 assignment I can't seem to find. > > Is it possible to have two different types in the same query or should we > rewrite a separate query for the orf tag? > > Richard > > > > On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...>wrote: > > >> thanks and will do as directed. >> >> My previous, last paragraph comment - A way for programming code in email >> holding its format in a mail message similarly to how you can post code on >> forum pages? >> >> <code> >> blah >> blah >> blah >> </code> >> >> thanks! >> >> Richard >> >> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <do...@lm... >> > wrote: >> >> >>> Greetings, >>> >>> Actually, gmbuilder.properties is for the TallyEngine only. When dealing >>> with .gdb exports, look *only* at the SpeciesProfile class. So, to find >>> those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile >>> code, that needs to be changed. >>> >>> Your take on how gmbuilder.properties is used, however, is >>> understandable. It makes sense to assume that the TallyEngine code *and* >>> the ID export code are based on the same characterization of the needed IDs. >>> This replication is originally a historical artifact: SpeciesProfile was >>> done first, and then TallyEngine was done later by another student. >>> >>> However, there are other factors beyond history that sort of necessitate >>> this duplication of desired IDs: (skip the two bullets below if you'd rather >>> cut to the chase of the work to be done, and discuss design issues later) >>> >>> - The actual XML import code is a black box: this is the "canned" JAXB >>> library actually in action, and not our code at all. Plus, the XML import >>> code really does not filter (nor should it), since the goal of the >>> XML->relational database step is to fully capture the XML data in the >>> relational database. So, XML count is necessarily separated from XML >>> import. >>> >>> - The notion of a declarative mechanism for extracting IDs from the >>> relational database (which is what gmbuilder.properties/TallyEngine uses) is >>> interesting, but at the same time there is value in the arbitrary >>> computation that can be done with Java (case in point: export two versions >>> of an ID, with and without periods). This is not to say that it is >>> impossible to do this declaratively, but let's just say that the procedural >>> approach exists here and now, and a declarative approach will need more >>> thought. >>> >>> These, and other factors, are good thoughts to hold onto and would be >>> worthy of a good meeting discussion sometime, but bottom line for now: >>> modifying the export behavior is a matter of editing the *SpeciesProfile* >>> Java code, and not the gmbuilder.properties file. Turn your attention to >>> that code. >>> >>> Now, as to annotating your code...I'd just put in code comments :) Or >>> did you mean something else by tagging code in e-mail? >>> >>> John David N. Dionisio, PhD >>> Associate Professor, Computer Science >>> Loyola Marymount University >>> >>> >>> >>> >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >>> >>> > also, how do I tag code in email so it holds its formatting? I tried a >>> few suggestions I found on the web but they aren't holding formatting or i'm >>> just doing it wrong ;-D >>> > >>> > Richard >>> > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> >>> wrote: >>> > OK, have some updates and some suggestions: >>> > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing >>> report. We verified that XML match does indeed find 4066 unique matches - 75 >>> of which are not in the gdb and need to be. >>> > >>> > Dr. Dahlquist informed me that she was the one who completed the gene >>> db testing report, not a previous student of BIO367 and had already verified >>> which genes were missing and where they were to be found. I had (mistakenly) >>> assumed that since a student had performed the gene database testing I had >>> to redo all of the verification. >>> > >>> > So that said, of the 75 genes missing - 69 need to be included and 6 >>> excluded. >>> > Per the gene db testing report: "69 of them have an "a", "b", or "d" >>> suffix. They are all found in the ORF tag and need to be included in the >>> gdb." >>> > >>> > To solve this we need to add additional search criteria into the M. >>> tuberculosis section in gmbuilder.properties below: >>> > # Mycobacterium tuberculosis >>> > >>> > mycobacteriumtuberculosis_level_amount= >>> > >>> > 1 >>> > >>> > mycobacteriumtuberculosis_element_level0= >>> > >>> > uniprot/entry/gene/name&type&ordered locus >>> > >>> > mycobacteriumtuberculosis_query_level0= >>> > >>> > select count(*) from genenametype where type = 'ordered locus' and >>> value like 'Rv%'; >>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>> > >>> > Ordered Locus >>> > SOLUTIONS: >>> > >>> > 1. So am i correct in my understanding that the second line is the >>> query used by TallyEngine to read the XML file? If so then this is the issue >>> we need to table for the moment until we get the gbd verified and >>> re-released. We will revisit this to discover why it is not only reporting >>> incorrectly but also why its added a second row of Ordered Locus on the >>> TallyEngine results page. >>> > >>> > 2. The third line is the SQL query used by postgres during the export >>> from XML to gdb. To find and get the ORF tagged genes could we not add the >>> following lines and change the count in the first line: >>> > >>> > >>> > # Mycobacterium tuberculosis >>> > >>> > mycobacteriumtuberculosis_level_amount=2 >>> > >>> > >>> > >>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>> locus >>> > >>> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >>> > >>> > >>> > mycobacteriumtuberculosis_query_level0= >>> > >>> > select count(*) from genenametype where type = 'ordered locus'; >>> > mycobacteriumtuberculosis_query_level1=select count(*) from >>> genenametype where type = 'orf'; >>> > >>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>> > >>> > Ordered Locus >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >>> > >>> > >>> ---------------------------------------------------------------------------------------------------------------------------- >>> > >>> > Of course these queries would have be manually verified prior to making >>> these changes but this seems like we are moving in the right direction. >>> > >>> > Richard >>> > >>> > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> >>> wrote: >>> > Just got done reading previous email and understand the change in >>> priority. >>> > >>> > Will work on the missing ID's for now and shelve the the TalleyEngine >>> issue for the moment. >>> > >>> > Also great about a more formalized weekly meeting. I was going to >>> suggest it myself as it has been slow going so far as maybe i'm a bit too >>> independent in this independent study class =D >>> > >>> > Will dig further into the missing ID's later tonight and during day >>> tomorrow and report back. >>> > >>> > Richard >>> > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> >>> wrote: >>> > Hi Rich, >>> > >>> > No problem. The pertinent line you're referring to, for XML, is this, >>> right above the line you copied: >>> > >>> > >>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>> locus >>> > >>> > The slash-separated section is the "path" of XML tags leading to the >>> element of interest; then, after the ampersand, is a name/value pair for the >>> desired attribute to count. Note that there is no hint of a *content*-based >>> filter (nor is there the capability for one, as far as I can tell in the >>> code). By "content," I mean that we can't specify filters based on what's >>> *between* the tags. We can only go as far as filter by attribute value, >>> e.g., type="ordered locus". >>> > >>> > But anyway, as mentioned in the earlier e-mail, let's have the missing >>> IDs in the .gdb take precedence for now. Please take a look at the >>> tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how >>> the ID output can be customized, then let me know if you have any questions >>> or need to confirm anything. >>> > >>> > John David N. Dionisio, PhD >>> > Associate Professor, Computer Science >>> > Loyola Marymount University >>> > >>> > >>> > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >>> > >>> > > Sorry been slammed with a programming assignment that kept needing >>> continued iteration and it has been all consuming until last night. But I >>> did get a chance to work with your comments and review the code again with a >>> different mind set. >>> > > >>> > > Yes, I examined the gmbuilder.properties file ( the query is also in >>> the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a >>> previous email ) but I don't think I see what you mean regarding the XML >>> count. >>> > > >>> > > I understood that: mycobacteriumtuberculosis_query_level0=select >>> count(*) from genenametype where type = 'ordered locus' and value like >>> 'Rv%'; was the db query but don't see which is the XML count... or do they >>> share the same query and you are saying that XML count doesn't recognize and >>> therefore cannot use the 'Rv%' parameter? >>> > > >>> > > Richard >>> > > >>> > > >>> > > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < >>> do...@lm...> wrote: >>> > > Hi Rich, >>> > > >>> > > Sorry for the delay. Had some distractions coming into the weekend. >>> > > >>> > > You've looked at the code; have you looked at gmbuilder.properties? >>> (I may have mentioned it a few e-mails ago, just as you were starting to >>> dig into this) >>> > > >>> > > On the copy I have, the M. tuberculosis block looks like this >>> (indentation is mine to set it apart): >>> > > >>> > > # Mycobacterium tuberculosis >>> > > mycobacteriumtuberculosis_level_amount=1 >>> > > >>> > > >>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>> locus >>> > > >>> > > mycobacteriumtuberculosis_query_level0=select count(*) from >>> genenametype where type = 'ordered locus' and value like 'Rv%'; >>> > > >>> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >>> > > >>> > > There, I think, is the rub. Notice that the XML count does not >>> filter on RV%. The SQL query does. >>> > > >>> > > Unfortunately, I don't think the TallyEngine can include selective >>> filtering in the XML counts. If the need to do selective filtering on XML >>> is necessary, then I think we're looking at a new functionality for you to >>> implement (or, if this throws things off too much, this may have to be noted >>> somewhere, that the XML vs. database counts may be off because the database >>> count is doing some text-based filtering but the XML count does not). >>> > > >>> > > What does xmlpipedb-match say? That will at least tell you whether >>> the 'RV%' count is indeed correct. >>> > > >>> > > John David N. Dionisio, PhD >>> > > Associate Professor, Computer Science >>> > > Loyola Marymount University >>> > > >>> > > >>> > > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >>> > > >>> > > > OK here is what I was able to put together from the past few hours >>> of code review: >>> > > > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>> > > > -reveals that after the 2 System table modifications are made >>> adding species name and link, a PreparedStatement is instantiated which >>> builds and calls the base query. >>> > > > >>> > > > -The base query called is: ("SELECT value, type " + "FROM >>> genenametype INNER JOIN entrytype_genetype " + >>> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type >>> = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") >>> > > > >>> > > > -So its looking in 'ordered locus' table/column for any tuple that >>> starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . >>> > > > The 'like' comparator and % usage are clear with the 'type' >>> entrytype_gene_hjid = ? >>> > > > >>> > > > -To me it seems the query makes sense so the problem is likely >>> elsewhere. >>> > > > >>> > > > GenMappBuilder.java: >>> > > > -I found method doTallies() at code line 895 which: >>> > > > Instantiates a Configuration called hibernateConfiguration and >>> assigns to it the current hibernate configuration >>> > > > Validates database settings by analyzing hibernateConfiguration >>> > > > Instantiates a CriterionList for uniprot and assigns to it >>> TallyType.UNIPROT >>> > > > Instantiates a CriterionList for go and assigns to it TallyType.GO >>> > > > Determines if both xml files exist >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on both >>> xml files and their respective CriterionList >>> > > > Results are then formatted for display in a table. >>> > > > >>> > > > -So enum TallyType which means that they are the only valid >>> datatypes which TallyEngine accepts... go to know ... >>> > > > >>> > > > -Based on the screen shot of Tally Engine it would seem that both >>> getTallyResultsXML() and getTallyResultsDatabase() are incorrectly >>> returning. Likely due to both using an incorrect query (as we previously >>> supposed). But where are the queries?... the more I dig the more I think >>> they are in the criterial all the work is done against. >>> > > > >>> > > > continuing the review: >>> > > > getTallyResultsXML() calls Tally Engine instance method >>> getXmlFileCounts(xmlFile) >>> > > > getTallyResultsDatabase() calls Tally Engine instance method >>> getDbcounts(new QueryEngine(hibernateConfiguration) >>> > > > Both of these instanced methods originate from TallyEngine.java... >>> > > > >>> > > > TallyEngine.java: >>> > > > >>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a >>> digester then processes against criteria... but this quickly becomes >>> confusing and is hard to follow >>> > > > >>> > > > getDbcounts() then starts a db session and executes a query but >>> then I also get a bit lost with my limited db knowledge. >>> > > > >>> > > > >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >>> > > > >>> > > > OVERALL I think I'm getting closer to the issues but I still feel >>> as if I'm missing some understanding to proceed further. Can you pass along >>> some of that Dondi insight and steer me in the right direction? =D >>> > > > >>> > > > -DB Tally - Not having taken databases yet certainly is limiting my >>> ability determine where the "criteria" are being set and how they are >>> followed during session activities. Also is the query we have been looking >>> for this whole time in the criteria or someplace else? >>> > > > >>> > > > -XML Tally - again is the query contained within the criteria that >>> digestXmlFile() uses to parse? >>> > > > >>> > > > Richard >>> > > > >>> > > > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio < >>> do...@lm...> wrote: >>> > > > Right, schema issues are unlikely. Most count discrepancies like >>> this that I've seen have boiled down to forming the right query. Then, >>> knowing the right query (in both XML and SQL), it's a matter of making sure >>> that TallyEngine asks that same query. >>> > > > >>> > > > John David N. Dionisio, PhD >>> > > > Associate Professor, Computer Science >>> > > > Loyola Marymount University >>> > > > >>> > > > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >>> > > > >>> > > > > OK, so based on your approach: >>> > > > > >>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and >>> sql queries needed for the respective results as you requested. >>> > > > > >>> > > > > I was also thinking I may need to review the schema from xml into >>> postgres but the issue isn't likely a schema error. The error most likely >>> lies in how xmlpipedbutils queries the data from xml source and writes to >>> the tables what it returns? >>> > > > > >>> > > > > 2. I'll review the code: trace the entrance of tally engine in >>> the gmbuilder code then follow it through the xmlpipedbutils. >>> > > > > >>> > > > > Richard >>> > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio < >>> do...@lm...> wrote: >>> > > > > Just wanted to confirm (since I wasn't sure in the first e-mail) >>> --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in >>> SourceForge's Subversion repo. >>> > > > > >>> > > > > John David N. Dionisio, PhD >>> > > > > Associate Professor, Computer Science >>> > > > > Loyola Marymount University >>> > > > > >>> > > > > >>> > > > > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >>> > > > > >>> > > > > > Hi Dondi, >>> > > > > > >>> > > > > > So I'm at the point in working with M tuberculosis that I was >>> able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. >>> > > > > > >>> > > > > > gmb2b60 Results >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > Now the proverbial question - What next to solve the Ordered >>> Locus import/count issue? >>> > > > > > >>> > > > > > ********************************************** >>> > > > > > Here is my thought process: >>> > > > > > >>> > > > > > Step 1: How does the import process work at the high level? >>> (obviously correct me if I'm wrong) >>> > > > > > >>> > > > > > I believe that basically as each XML tag is read, it is placed >>> in the proper Postgres table(s) based on some criteria. There is also likely >>> some sort of check that each individual tag is in valid XML format unless we >>> don't care at this stage (care at export) or maybe the parser just skips >>> over and goes on to the next . >>> > > > > > >>> > > > > > Step 2: What could be the problem? >>> > > > > > >>> > > > > > Either - >>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? >>> > > > > > b. Decision criteria of which table they should be added to? >>> > > > > > >>> > > > > > ********************************************** >>> > > > > > >>> > > > > > I read on the sourceforge wiki: >>> > > > > > >>> > > > > > XMLPipeDB has a modular architecture with three components that >>> may be used separately or together. XSD-to-DB reads an XSD (XML Schema >>> Definition) and automatically generates an SQL schema, Java classes, and >>> Hibernate mappings. XMLPipeDB Utilities provides functionality for >>> configuring the database, importing data, and performing queries. GenMAPP >>> Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible >>> Gene Databases based on data from UniProt and Gene Ontology (GO). >>> > > > > > >>> > > > > > So I should probably start with the XMLPipeDB Utilities which >>> are where? I don't see any in the basic distribution or are they not >>> standalone and called from the command line? >>> > > > > > >>> > > > > > Thanks! >>> > > > > > >>> > > > > > Richard >>> > > > > >>> > > > > >>> > > > > <ATT00001..txt><ATT00002..txt> >>> > > > >>> > > > >>> > > > >>> ------------------------------------------------------------------------------ >>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel >>> Studio XE: >>> > > > Pinpoint memory and threading errors before they happen. >>> > > > Find and fix more than 250 security defects in the development >>> cycle. >>> > > > Locate bottlenecks in serial and parallel code that limit >>> performance. >>> > > > http://p.sf.net/sfu/intel-dev2devfeb >>> > > > _______________________________________________ >>> > > > xmlpipedb-developer mailing list >>> > > > xml...@li... >>> > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> > > > >>> > > > <ATT00001..txt><ATT00002..txt> >>> > > >>> > > >>> > > >>> ------------------------------------------------------------------------------ >>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >>> XE: >>> > > Pinpoint memory and threading errors before they happen. >>> > > Find and fix more than 250 security defects in the development cycle. >>> > > Locate bottlenecks in serial and parallel code that limit >>> performance. >>> > > http://p.sf.net/sfu/intel-dev2devfeb >>> > > _______________________________________________ >>> > > xmlpipedb-developer mailing list >>> > > xml...@li... >>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> > > >>> > > <ATT00001..txt><ATT00002..txt> >>> > >>> > >>> > >>> ------------------------------------------------------------------------------ >>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >>> XE: >>> > Pinpoint memory and threading errors before they happen. >>> > Find and fix more than 250 security defects in the development cycle. >>> > Locate bottlenecks in serial and parallel code that limit performance. >>> > http://p.sf.net/sfu/intel-dev2devfeb >>> > _______________________________________________ >>> > xmlpipedb-developer mailing list >>> > xml...@li... >>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> > >>> > >>> > >>> > <ATT00001..txt><ATT00002..txt> >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >>> Pinpoint memory and threading errors before they happen. >>> Find and fix more than 250 security defects in the development cycle. >>> Locate bottlenecks in serial and parallel code that limit performance. >>> http://p.sf.net/sfu/intel-dev2devfeb >>> _______________________________________________ >>> xmlpipedb-developer mailing list >>> xml...@li... >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> >>> >> >> > > > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the > cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the > cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: John D. N. D. <do...@lm...> - 2011-02-22 01:17:55
|
I don't recall the exact details of the missing 69, but if your query successfully returns them in raw SQL, then this is worth a try. You can integrate into the same query as long as the same columns are returned, which is the case here AFAIK, so go ahead and extend the existing query. John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: > So here is the appropriate code snippet from MycobacteriumTuberculosisUniProtSpeciesProfile.java: > public > > TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) throws SQLException, InvalidParameterException { > // Build the base query; we only use "ordered locus" and we only want > // IDs that begin with "Rv." > PreparedStatement ps = ConnectionManager.getRelationalDBConnection().prepareStatement( > > "SELECT value, type " + > "FROM genenametype INNER JOIN entrytype_genetype " + > "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + > "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?"); > ResultSet result; > > for (Row row : primarySystemTableManager.getRows()) { > ps.setInt(1, Integer.parseInt(row.getValue( > > "UID"))); > result = ps.executeQuery(); > > // We actually want to keep the case where multiple ordered locus > // names appear. > while (result.next()) { > // We want this name to appear in the OrderedLocusNames > // system table. > for (String id : result.getString("value").split("/")) { > tableManager.submit( > > "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) }, { "UID", row.getValue("UID") } }); > } > > } > > } > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > So now we want to build the base query which uses "ordered locus" and "orf" and we only want IDs that begin with "Rv". > > I know there are more comprehensive ways to search for gene ID's by matching gene ID prefix but "like Rv%" seemed to work thus far, we just need to tell it to search for XML tag type orf in addition to ordered locus. > > "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and entrytype_gene_hjid = ? " > > Here is a stab at it.... This part of our class was right as the server went down and my submission for week 6 assignment I can't seem to find. > > Is it possible to have two different types in the same query or should we rewrite a separate query for the orf tag? > > Richard > > > > On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote: > > thanks and will do as directed. > > My previous, last paragraph comment - A way for programming code in email holding its format in a mail message similarly to how you can post code on forum pages? > > <code> > blah > blah > blah > </code> > > thanks! > > Richard > > On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <do...@lm...> wrote: > > Greetings, > > Actually, gmbuilder.properties is for the TallyEngine only. When dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to find those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile code, that needs to be changed. > > Your take on how gmbuilder.properties is used, however, is understandable. It makes sense to assume that the TallyEngine code *and* the ID export code are based on the same characterization of the needed IDs. This replication is originally a historical artifact: SpeciesProfile was done first, and then TallyEngine was done later by another student. > > However, there are other factors beyond history that sort of necessitate this duplication of desired IDs: (skip the two bullets below if you'd rather cut to the chase of the work to be done, and discuss design issues later) > > - The actual XML import code is a black box: this is the "canned" JAXB library actually in action, and not our code at all. Plus, the XML import code really does not filter (nor should it), since the goal of the XML->relational database step is to fully capture the XML data in the relational database. So, XML count is necessarily separated from XML import. > > - The notion of a declarative mechanism for extracting IDs from the relational database (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at the same time there is value in the arbitrary computation that can be done with Java (case in point: export two versions of an ID, with and without periods). This is not to say that it is impossible to do this declaratively, but let's just say that the procedural approach exists here and now, and a declarative approach will need more thought. > > These, and other factors, are good thoughts to hold onto and would be worthy of a good meeting discussion sometime, but bottom line for now: modifying the export behavior is a matter of editing the *SpeciesProfile* Java code, and not the gmbuilder.properties file. Turn your attention to that code. > > Now, as to annotating your code...I'd just put in code comments :) Or did you mean something else by tagging code in e-mail? > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > > > On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: > > > also, how do I tag code in email so it holds its formatting? I tried a few suggestions I found on the web but they aren't holding formatting or i'm just doing it wrong ;-D > > > > Richard > > > > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote: > > OK, have some updates and some suggestions: > > > > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing report. We verified that XML match does indeed find 4066 unique matches - 75 of which are not in the gdb and need to be. > > > > Dr. Dahlquist informed me that she was the one who completed the gene db testing report, not a previous student of BIO367 and had already verified which genes were missing and where they were to be found. I had (mistakenly) assumed that since a student had performed the gene database testing I had to redo all of the verification. > > > > So that said, of the 75 genes missing - 69 need to be included and 6 excluded. > > Per the gene db testing report: "69 of them have an "a", "b", or "d" suffix. They are all found in the ORF tag and need to be included in the gdb." > > > > To solve this we need to add additional search criteria into the M. tuberculosis section in gmbuilder.properties below: > > # Mycobacterium tuberculosis > > > > mycobacteriumtuberculosis_level_amount= > > > > 1 > > > > mycobacteriumtuberculosis_element_level0= > > > > uniprot/entry/gene/name&type&ordered locus > > > > mycobacteriumtuberculosis_query_level0= > > > > select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; > > > > mycobacteriumtuberculosis_table_name_level0= > > > > Ordered Locus > > SOLUTIONS: > > > > 1. So am i correct in my understanding that the second line is the query used by TallyEngine to read the XML file? If so then this is the issue we need to table for the moment until we get the gbd verified and re-released. We will revisit this to discover why it is not only reporting incorrectly but also why its added a second row of Ordered Locus on the TallyEngine results page. > > > > 2. The third line is the SQL query used by postgres during the export from XML to gdb. To find and get the ORF tagged genes could we not add the following lines and change the count in the first line: > > > > > > # Mycobacterium tuberculosis > > > > mycobacteriumtuberculosis_level_amount=2 > > > > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus > > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > > > > > > mycobacteriumtuberculosis_query_level0= > > > > select count(*) from genenametype where type = 'ordered locus'; > > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf'; > > > > > > mycobacteriumtuberculosis_table_name_level0= > > > > Ordered Locus > > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > > > > ---------------------------------------------------------------------------------------------------------------------------- > > > > Of course these queries would have be manually verified prior to making these changes but this seems like we are moving in the right direction. > > > > Richard > > > > > > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: > > Just got done reading previous email and understand the change in priority. > > > > Will work on the missing ID's for now and shelve the the TalleyEngine issue for the moment. > > > > Also great about a more formalized weekly meeting. I was going to suggest it myself as it has been slow going so far as maybe i'm a bit too independent in this independent study class =D > > > > Will dig further into the missing ID's later tonight and during day tomorrow and report back. > > > > Richard > > > > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote: > > Hi Rich, > > > > No problem. The pertinent line you're referring to, for XML, is this, right above the line you copied: > > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus > > > > The slash-separated section is the "path" of XML tags leading to the element of interest; then, after the ampersand, is a name/value pair for the desired attribute to count. Note that there is no hint of a *content*-based filter (nor is there the capability for one, as far as I can tell in the code). By "content," I mean that we can't specify filters based on what's *between* the tags. We can only go as far as filter by attribute value, e.g., type="ordered locus". > > > > But anyway, as mentioned in the earlier e-mail, let's have the missing IDs in the .gdb take precedence for now. Please take a look at the tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how the ID output can be customized, then let me know if you have any questions or need to confirm anything. > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > > > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: > > > > > Sorry been slammed with a programming assignment that kept needing continued iteration and it has been all consuming until last night. But I did get a chance to work with your comments and review the code again with a different mind set. > > > > > > Yes, I examined the gmbuilder.properties file ( the query is also in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a previous email ) but I don't think I see what you mean regarding the XML count. > > > > > > I understood that: mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; was the db query but don't see which is the XML count... or do they share the same query and you are saying that XML count doesn't recognize and therefore cannot use the 'Rv%' parameter? > > > > > > Richard > > > > > > > > > > > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> wrote: > > > Hi Rich, > > > > > > Sorry for the delay. Had some distractions coming into the weekend. > > > > > > You've looked at the code; have you looked at gmbuilder.properties? (I may have mentioned it a few e-mails ago, just as you were starting to dig into this) > > > > > > On the copy I have, the M. tuberculosis block looks like this (indentation is mine to set it apart): > > > > > > # Mycobacterium tuberculosis > > > mycobacteriumtuberculosis_level_amount=1 > > > > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus > > > > > > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; > > > > > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > > > > > > There, I think, is the rub. Notice that the XML count does not filter on RV%. The SQL query does. > > > > > > Unfortunately, I don't think the TallyEngine can include selective filtering in the XML counts. If the need to do selective filtering on XML is necessary, then I think we're looking at a new functionality for you to implement (or, if this throws things off too much, this may have to be noted somewhere, that the XML vs. database counts may be off because the database count is doing some text-based filtering but the XML count does not). > > > > > > What does xmlpipedb-match say? That will at least tell you whether the 'RV%' count is indeed correct. > > > > > > John David N. Dionisio, PhD > > > Associate Professor, Computer Science > > > Loyola Marymount University > > > > > > > > > > > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: > > > > > > > OK here is what I was able to put together from the past few hours of code review: > > > > > > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > > > > -reveals that after the 2 System table modifications are made adding species name and link, a PreparedStatement is instantiated which builds and calls the base query. > > > > > > > > -The base query called is: ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") > > > > > > > > -So its looking in 'ordered locus' table/column for any tuple that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . > > > > The 'like' comparator and % usage are clear with the 'type' entrytype_gene_hjid = ? > > > > > > > > -To me it seems the query makes sense so the problem is likely elsewhere. > > > > > > > > GenMappBuilder.java: > > > > -I found method doTallies() at code line 895 which: > > > > Instantiates a Configuration called hibernateConfiguration and assigns to it the current hibernate configuration > > > > Validates database settings by analyzing hibernateConfiguration > > > > Instantiates a CriterionList for uniprot and assigns to it TallyType.UNIPROT > > > > Instantiates a CriterionList for go and assigns to it TallyType.GO > > > > Determines if both xml files exist > > > > Then getTallyResultsXML and getTallyResultsDatabase are run on both xml files and their respective CriterionList > > > > Results are then formatted for display in a table. > > > > > > > > -So enum TallyType which means that they are the only valid datatypes which TallyEngine accepts... go to know ... > > > > > > > > -Based on the screen shot of Tally Engine it would seem that both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly returning. Likely due to both using an incorrect query (as we previously supposed). But where are the queries?... the more I dig the more I think they are in the criterial all the work is done against. > > > > > > > > continuing the review: > > > > getTallyResultsXML() calls Tally Engine instance method getXmlFileCounts(xmlFile) > > > > getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new QueryEngine(hibernateConfiguration) > > > > Both of these instanced methods originate from TallyEngine.java... > > > > > > > > TallyEngine.java: > > > > > > > > getXmlFileCounts() calls digestXmlFile() which instantiates a digester then processes against criteria... but this quickly becomes confusing and is hard to follow > > > > > > > > getDbcounts() then starts a db session and executes a query but then I also get a bit lost with my limited db knowledge. > > > > > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > > > > > > > OVERALL I think I'm getting closer to the issues but I still feel as if I'm missing some understanding to proceed further. Can you pass along some of that Dondi insight and steer me in the right direction? =D > > > > > > > > -DB Tally - Not having taken databases yet certainly is limiting my ability determine where the "criteria" are being set and how they are followed during session activities. Also is the query we have been looking for this whole time in the criteria or someplace else? > > > > > > > > -XML Tally - again is the query contained within the criteria that digestXmlFile() uses to parse? > > > > > > > > Richard > > > > > > > > > > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> wrote: > > > > Right, schema issues are unlikely. Most count discrepancies like this that I've seen have boiled down to forming the right query. Then, knowing the right query (in both XML and SQL), it's a matter of making sure that TallyEngine asks that same query. > > > > > > > > John David N. Dionisio, PhD > > > > Associate Professor, Computer Science > > > > Loyola Marymount University > > > > > > > > > > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > > > > > > > > > OK, so based on your approach: > > > > > > > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and sql queries needed for the respective results as you requested. > > > > > > > > > > I was also thinking I may need to review the schema from xml into postgres but the issue isn't likely a schema error. The error most likely lies in how xmlpipedbutils queries the data from xml source and writes to the tables what it returns? > > > > > > > > > > 2. I'll review the code: trace the entrance of tally engine in the gmbuilder code then follow it through the xmlpipedbutils. > > > > > > > > > > Richard > > > > > > > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <do...@lm...> wrote: > > > > > Just wanted to confirm (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in SourceForge's Subversion repo. > > > > > > > > > > John David N. Dionisio, PhD > > > > > Associate Professor, Computer Science > > > > > Loyola Marymount University > > > > > > > > > > > > > > > > > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: > > > > > > > > > > > Hi Dondi, > > > > > > > > > > > > So I'm at the point in working with M tuberculosis that I was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. > > > > > > > > > > > > gmb2b60 Results > > > > > > > > > > > > > > > > > > > > > > > > Now the proverbial question - What next to solve the Ordered Locus import/count issue? > > > > > > > > > > > > ********************************************** > > > > > > Here is my thought process: > > > > > > > > > > > > Step 1: How does the import process work at the high level? (obviously correct me if I'm wrong) > > > > > > > > > > > > I believe that basically as each XML tag is read, it is placed in the proper Postgres table(s) based on some criteria. There is also likely some sort of check that each individual tag is in valid XML format unless we don't care at this stage (care at export) or maybe the parser just skips over and goes on to the next . > > > > > > > > > > > > Step 2: What could be the problem? > > > > > > > > > > > > Either - > > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? > > > > > > b. Decision criteria of which table they should be added to? > > > > > > > > > > > > ********************************************** > > > > > > > > > > > > I read on the sourceforge wiki: > > > > > > > > > > > > XMLPipeDB has a modular architecture with three components that may be used separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and automatically generates an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB Utilities provides functionality for configuring the database, importing data, and performing queries. GenMAPP Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on data from UniProt and Gene Ontology (GO). > > > > > > > > > > > > So I should probably start with the XMLPipeDB Utilities which are where? I don't see any in the basic distribution or are they not standalone and called from the command line? > > > > > > > > > > > > Thanks! > > > > > > > > > > > > Richard > > > > > > > > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > > > > > > > ------------------------------------------------------------------------------ > > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > > > > Pinpoint memory and threading errors before they happen. > > > > Find and fix more than 250 security defects in the development cycle. > > > > Locate bottlenecks in serial and parallel code that limit performance. > > > > http://p.sf.net/sfu/intel-dev2devfeb > > > > _______________________________________________ > > > > xmlpipedb-developer mailing list > > > > xml...@li... > > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > > > > ------------------------------------------------------------------------------ > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > > > Pinpoint memory and threading errors before they happen. > > > Find and fix more than 250 security defects in the development cycle. > > > Locate bottlenecks in serial and parallel code that limit performance. > > > http://p.sf.net/sfu/intel-dev2devfeb > > > _______________________________________________ > > > xmlpipedb-developer mailing list > > > xml...@li... > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > ------------------------------------------------------------------------------ > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > > Pinpoint memory and threading errors before they happen. > > Find and fix more than 250 security defects in the development cycle. > > Locate bottlenecks in serial and parallel code that limit performance. > > http://p.sf.net/sfu/intel-dev2devfeb > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > ------------------------------------------------------------------------------ > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > Pinpoint memory and threading errors before they happen. > Find and fix more than 250 security defects in the development cycle. > Locate bottlenecks in serial and parallel code that limit performance. > http://p.sf.net/sfu/intel-dev2devfeb > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > ------------------------------------------------------------------------------ > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk > Collect, index and harness all the fast moving IT data generated by your > applications, servers and devices whether physical, virtual or in the cloud. > Deliver compliance at lower cost and gain new business insights. > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer |
From: Richard B. <rbr...@gm...> - 2011-02-21 23:56:37
|
So here is the appropriate code snippet from MycobacteriumTuberculosisUniProtSpeciesProfile.java: * public *TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) *throws* SQLException, InvalidParameterException { // Build the base query; we only use "ordered locus" and we only want // IDs that begin with "*Rv*." PreparedStatement ps = ConnectionManager.*getRelationalDBConnection* ().prepareStatement( "SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " + "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?"); ResultSet result; *for* (Row row : primarySystemTableManager.getRows()) { ps.setInt(1, Integer.*parseInt*(row.getValue( "UID"))); result = ps.executeQuery(); // We actually want to keep the case where multiple ordered locus // names appear. *while* (result.next()) { // We want this name to appear in the OrderedLocusNames // system table. *for* (String id : result.getString("value").split("/")) { tableManager.submit( "OrderedLocusNames", QueryType.*insert*, *new* String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", GenMAPPBuilderUtilities.*getSystemsDateString*(version) }, { "UID", row.getValue("UID") } }); } } } ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- So now we want to build the base query which uses "ordered locus" and "orf" and we only want IDs that begin with "Rv". I know there are more comprehensive ways to search for gene ID's by matching gene ID prefix but "like Rv%" seemed to work thus far, we just need to tell it to search for XML tag type orf in addition to ordered locus. "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and entrytype_gene_hjid = ? " Here is a stab at it.... This part of our class was right as the server went down and my submission for week 6 assignment I can't seem to find. Is it possible to have two different types in the same query or should we rewrite a separate query for the orf tag? Richard On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote: > thanks and will do as directed. > > My previous, last paragraph comment - A way for programming code in email > holding its format in a mail message similarly to how you can post code on > forum pages? > > <code> > blah > blah > blah > </code> > > thanks! > > Richard > > On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <do...@lm...>wrote: > > >> Greetings, >> >> Actually, gmbuilder.properties is for the TallyEngine only. When dealing >> with .gdb exports, look *only* at the SpeciesProfile class. So, to find >> those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile >> code, that needs to be changed. >> >> Your take on how gmbuilder.properties is used, however, is understandable. >> It makes sense to assume that the TallyEngine code *and* the ID export code >> are based on the same characterization of the needed IDs. This replication >> is originally a historical artifact: SpeciesProfile was done first, and then >> TallyEngine was done later by another student. >> >> However, there are other factors beyond history that sort of necessitate >> this duplication of desired IDs: (skip the two bullets below if you'd rather >> cut to the chase of the work to be done, and discuss design issues later) >> >> - The actual XML import code is a black box: this is the "canned" JAXB >> library actually in action, and not our code at all. Plus, the XML import >> code really does not filter (nor should it), since the goal of the >> XML->relational database step is to fully capture the XML data in the >> relational database. So, XML count is necessarily separated from XML >> import. >> >> - The notion of a declarative mechanism for extracting IDs from the >> relational database (which is what gmbuilder.properties/TallyEngine uses) is >> interesting, but at the same time there is value in the arbitrary >> computation that can be done with Java (case in point: export two versions >> of an ID, with and without periods). This is not to say that it is >> impossible to do this declaratively, but let's just say that the procedural >> approach exists here and now, and a declarative approach will need more >> thought. >> >> These, and other factors, are good thoughts to hold onto and would be >> worthy of a good meeting discussion sometime, but bottom line for now: >> modifying the export behavior is a matter of editing the *SpeciesProfile* >> Java code, and not the gmbuilder.properties file. Turn your attention to >> that code. >> >> Now, as to annotating your code...I'd just put in code comments :) Or >> did you mean something else by tagging code in e-mail? >> >> John David N. Dionisio, PhD >> Associate Professor, Computer Science >> Loyola Marymount University >> >> >> >> >> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >> >> > also, how do I tag code in email so it holds its formatting? I tried a >> few suggestions I found on the web but they aren't holding formatting or i'm >> just doing it wrong ;-D >> > >> > Richard >> > >> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> >> wrote: >> > OK, have some updates and some suggestions: >> > >> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing >> report. We verified that XML match does indeed find 4066 unique matches - 75 >> of which are not in the gdb and need to be. >> > >> > Dr. Dahlquist informed me that she was the one who completed the gene db >> testing report, not a previous student of BIO367 and had already verified >> which genes were missing and where they were to be found. I had (mistakenly) >> assumed that since a student had performed the gene database testing I had >> to redo all of the verification. >> > >> > So that said, of the 75 genes missing - 69 need to be included and 6 >> excluded. >> > Per the gene db testing report: "69 of them have an "a", "b", or "d" >> suffix. They are all found in the ORF tag and need to be included in the >> gdb." >> > >> > To solve this we need to add additional search criteria into the M. >> tuberculosis section in gmbuilder.properties below: >> > # Mycobacterium tuberculosis >> > >> > mycobacteriumtuberculosis_level_amount= >> > >> > 1 >> > >> > mycobacteriumtuberculosis_element_level0= >> > >> > uniprot/entry/gene/name&type&ordered locus >> > >> > mycobacteriumtuberculosis_query_level0= >> > >> > select count(*) from genenametype where type = 'ordered locus' and value >> like 'Rv%'; >> > >> > mycobacteriumtuberculosis_table_name_level0= >> > >> > Ordered Locus >> > SOLUTIONS: >> > >> > 1. So am i correct in my understanding that the second line is the query >> used by TallyEngine to read the XML file? If so then this is the issue we >> need to table for the moment until we get the gbd verified and re-released. >> We will revisit this to discover why it is not only reporting incorrectly >> but also why its added a second row of Ordered Locus on the TallyEngine >> results page. >> > >> > 2. The third line is the SQL query used by postgres during the export >> from XML to gdb. To find and get the ORF tagged genes could we not add the >> following lines and change the count in the first line: >> > >> > >> > # Mycobacterium tuberculosis >> > >> > mycobacteriumtuberculosis_level_amount=2 >> > >> > >> > >> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >> locus >> > >> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >> > >> > >> > mycobacteriumtuberculosis_query_level0= >> > >> > select count(*) from genenametype where type = 'ordered locus'; >> > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype >> where type = 'orf'; >> > >> > >> > mycobacteriumtuberculosis_table_name_level0= >> > >> > Ordered Locus >> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >> > >> > >> ---------------------------------------------------------------------------------------------------------------------------- >> > >> > Of course these queries would have be manually verified prior to making >> these changes but this seems like we are moving in the right direction. >> > >> > Richard >> > >> > >> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> >> wrote: >> > Just got done reading previous email and understand the change in >> priority. >> > >> > Will work on the missing ID's for now and shelve the the TalleyEngine >> issue for the moment. >> > >> > Also great about a more formalized weekly meeting. I was going to >> suggest it myself as it has been slow going so far as maybe i'm a bit too >> independent in this independent study class =D >> > >> > Will dig further into the missing ID's later tonight and during day >> tomorrow and report back. >> > >> > Richard >> > >> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> >> wrote: >> > Hi Rich, >> > >> > No problem. The pertinent line you're referring to, for XML, is this, >> right above the line you copied: >> > >> > >> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >> locus >> > >> > The slash-separated section is the "path" of XML tags leading to the >> element of interest; then, after the ampersand, is a name/value pair for the >> desired attribute to count. Note that there is no hint of a *content*-based >> filter (nor is there the capability for one, as far as I can tell in the >> code). By "content," I mean that we can't specify filters based on what's >> *between* the tags. We can only go as far as filter by attribute value, >> e.g., type="ordered locus". >> > >> > But anyway, as mentioned in the earlier e-mail, let's have the missing >> IDs in the .gdb take precedence for now. Please take a look at the >> tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how >> the ID output can be customized, then let me know if you have any questions >> or need to confirm anything. >> > >> > John David N. Dionisio, PhD >> > Associate Professor, Computer Science >> > Loyola Marymount University >> > >> > >> > >> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >> > >> > > Sorry been slammed with a programming assignment that kept needing >> continued iteration and it has been all consuming until last night. But I >> did get a chance to work with your comments and review the code again with a >> different mind set. >> > > >> > > Yes, I examined the gmbuilder.properties file ( the query is also in >> the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a >> previous email ) but I don't think I see what you mean regarding the XML >> count. >> > > >> > > I understood that: mycobacteriumtuberculosis_query_level0=select >> count(*) from genenametype where type = 'ordered locus' and value like >> 'Rv%'; was the db query but don't see which is the XML count... or do they >> share the same query and you are saying that XML count doesn't recognize and >> therefore cannot use the 'Rv%' parameter? >> > > >> > > Richard >> > > >> > > >> > > >> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < >> do...@lm...> wrote: >> > > Hi Rich, >> > > >> > > Sorry for the delay. Had some distractions coming into the weekend. >> > > >> > > You've looked at the code; have you looked at gmbuilder.properties? >> (I may have mentioned it a few e-mails ago, just as you were starting to >> dig into this) >> > > >> > > On the copy I have, the M. tuberculosis block looks like this >> (indentation is mine to set it apart): >> > > >> > > # Mycobacterium tuberculosis >> > > mycobacteriumtuberculosis_level_amount=1 >> > > >> > > >> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >> locus >> > > >> > > mycobacteriumtuberculosis_query_level0=select count(*) from >> genenametype where type = 'ordered locus' and value like 'Rv%'; >> > > >> > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >> > > >> > > There, I think, is the rub. Notice that the XML count does not filter >> on RV%. The SQL query does. >> > > >> > > Unfortunately, I don't think the TallyEngine can include selective >> filtering in the XML counts. If the need to do selective filtering on XML >> is necessary, then I think we're looking at a new functionality for you to >> implement (or, if this throws things off too much, this may have to be noted >> somewhere, that the XML vs. database counts may be off because the database >> count is doing some text-based filtering but the XML count does not). >> > > >> > > What does xmlpipedb-match say? That will at least tell you whether >> the 'RV%' count is indeed correct. >> > > >> > > John David N. Dionisio, PhD >> > > Associate Professor, Computer Science >> > > Loyola Marymount University >> > > >> > > >> > > >> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >> > > >> > > > OK here is what I was able to put together from the past few hours >> of code review: >> > > > >> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >> > > > -reveals that after the 2 System table modifications are made adding >> species name and link, a PreparedStatement is instantiated which builds and >> calls the base query. >> > > > >> > > > -The base query called is: ("SELECT value, type " + "FROM >> genenametype INNER JOIN entrytype_genetype " + >> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type >> = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") >> > > > >> > > > -So its looking in 'ordered locus' table/column for any tuple that >> starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . >> > > > The 'like' comparator and % usage are clear with the 'type' >> entrytype_gene_hjid = ? >> > > > >> > > > -To me it seems the query makes sense so the problem is likely >> elsewhere. >> > > > >> > > > GenMappBuilder.java: >> > > > -I found method doTallies() at code line 895 which: >> > > > Instantiates a Configuration called hibernateConfiguration and >> assigns to it the current hibernate configuration >> > > > Validates database settings by analyzing hibernateConfiguration >> > > > Instantiates a CriterionList for uniprot and assigns to it >> TallyType.UNIPROT >> > > > Instantiates a CriterionList for go and assigns to it TallyType.GO >> > > > Determines if both xml files exist >> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on both >> xml files and their respective CriterionList >> > > > Results are then formatted for display in a table. >> > > > >> > > > -So enum TallyType which means that they are the only valid >> datatypes which TallyEngine accepts... go to know ... >> > > > >> > > > -Based on the screen shot of Tally Engine it would seem that both >> getTallyResultsXML() and getTallyResultsDatabase() are incorrectly >> returning. Likely due to both using an incorrect query (as we previously >> supposed). But where are the queries?... the more I dig the more I think >> they are in the criterial all the work is done against. >> > > > >> > > > continuing the review: >> > > > getTallyResultsXML() calls Tally Engine instance method >> getXmlFileCounts(xmlFile) >> > > > getTallyResultsDatabase() calls Tally Engine instance method >> getDbcounts(new QueryEngine(hibernateConfiguration) >> > > > Both of these instanced methods originate from TallyEngine.java... >> > > > >> > > > TallyEngine.java: >> > > > >> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a >> digester then processes against criteria... but this quickly becomes >> confusing and is hard to follow >> > > > >> > > > getDbcounts() then starts a db session and executes a query but then >> I also get a bit lost with my limited db knowledge. >> > > > >> > > > >> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >> > > > >> > > > OVERALL I think I'm getting closer to the issues but I still feel as >> if I'm missing some understanding to proceed further. Can you pass along >> some of that Dondi insight and steer me in the right direction? =D >> > > > >> > > > -DB Tally - Not having taken databases yet certainly is limiting my >> ability determine where the "criteria" are being set and how they are >> followed during session activities. Also is the query we have been looking >> for this whole time in the criteria or someplace else? >> > > > >> > > > -XML Tally - again is the query contained within the criteria that >> digestXmlFile() uses to parse? >> > > > >> > > > Richard >> > > > >> > > > >> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio < >> do...@lm...> wrote: >> > > > Right, schema issues are unlikely. Most count discrepancies like >> this that I've seen have boiled down to forming the right query. Then, >> knowing the right query (in both XML and SQL), it's a matter of making sure >> that TallyEngine asks that same query. >> > > > >> > > > John David N. Dionisio, PhD >> > > > Associate Professor, Computer Science >> > > > Loyola Marymount University >> > > > >> > > > >> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >> > > > >> > > > > OK, so based on your approach: >> > > > > >> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and >> sql queries needed for the respective results as you requested. >> > > > > >> > > > > I was also thinking I may need to review the schema from xml into >> postgres but the issue isn't likely a schema error. The error most likely >> lies in how xmlpipedbutils queries the data from xml source and writes to >> the tables what it returns? >> > > > > >> > > > > 2. I'll review the code: trace the entrance of tally engine in the >> gmbuilder code then follow it through the xmlpipedbutils. >> > > > > >> > > > > Richard >> > > > > >> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio < >> do...@lm...> wrote: >> > > > > Just wanted to confirm (since I wasn't sure in the first e-mail) >> --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in >> SourceForge's Subversion repo. >> > > > > >> > > > > John David N. Dionisio, PhD >> > > > > Associate Professor, Computer Science >> > > > > Loyola Marymount University >> > > > > >> > > > > >> > > > > >> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >> > > > > >> > > > > > Hi Dondi, >> > > > > > >> > > > > > So I'm at the point in working with M tuberculosis that I was >> able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. >> > > > > > >> > > > > > gmb2b60 Results >> > > > > > >> > > > > > >> > > > > > >> > > > > > Now the proverbial question - What next to solve the Ordered >> Locus import/count issue? >> > > > > > >> > > > > > ********************************************** >> > > > > > Here is my thought process: >> > > > > > >> > > > > > Step 1: How does the import process work at the high level? >> (obviously correct me if I'm wrong) >> > > > > > >> > > > > > I believe that basically as each XML tag is read, it is placed >> in the proper Postgres table(s) based on some criteria. There is also likely >> some sort of check that each individual tag is in valid XML format unless we >> don't care at this stage (care at export) or maybe the parser just skips >> over and goes on to the next . >> > > > > > >> > > > > > Step 2: What could be the problem? >> > > > > > >> > > > > > Either - >> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? >> > > > > > b. Decision criteria of which table they should be added to? >> > > > > > >> > > > > > ********************************************** >> > > > > > >> > > > > > I read on the sourceforge wiki: >> > > > > > >> > > > > > XMLPipeDB has a modular architecture with three components that >> may be used separately or together. XSD-to-DB reads an XSD (XML Schema >> Definition) and automatically generates an SQL schema, Java classes, and >> Hibernate mappings. XMLPipeDB Utilities provides functionality for >> configuring the database, importing data, and performing queries. GenMAPP >> Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible >> Gene Databases based on data from UniProt and Gene Ontology (GO). >> > > > > > >> > > > > > So I should probably start with the XMLPipeDB Utilities which >> are where? I don't see any in the basic distribution or are they not >> standalone and called from the command line? >> > > > > > >> > > > > > Thanks! >> > > > > > >> > > > > > Richard >> > > > > >> > > > > >> > > > > <ATT00001..txt><ATT00002..txt> >> > > > >> > > > >> > > > >> ------------------------------------------------------------------------------ >> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel >> Studio XE: >> > > > Pinpoint memory and threading errors before they happen. >> > > > Find and fix more than 250 security defects in the development >> cycle. >> > > > Locate bottlenecks in serial and parallel code that limit >> performance. >> > > > http://p.sf.net/sfu/intel-dev2devfeb >> > > > _______________________________________________ >> > > > xmlpipedb-developer mailing list >> > > > xml...@li... >> > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > > > >> > > > <ATT00001..txt><ATT00002..txt> >> > > >> > > >> > > >> ------------------------------------------------------------------------------ >> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >> XE: >> > > Pinpoint memory and threading errors before they happen. >> > > Find and fix more than 250 security defects in the development cycle. >> > > Locate bottlenecks in serial and parallel code that limit performance. >> > > http://p.sf.net/sfu/intel-dev2devfeb >> > > _______________________________________________ >> > > xmlpipedb-developer mailing list >> > > xml...@li... >> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > > >> > > <ATT00001..txt><ATT00002..txt> >> > >> > >> > >> ------------------------------------------------------------------------------ >> > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >> XE: >> > Pinpoint memory and threading errors before they happen. >> > Find and fix more than 250 security defects in the development cycle. >> > Locate bottlenecks in serial and parallel code that limit performance. >> > http://p.sf.net/sfu/intel-dev2devfeb >> > _______________________________________________ >> > xmlpipedb-developer mailing list >> > xml...@li... >> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > >> > >> > >> > <ATT00001..txt><ATT00002..txt> >> >> >> >> ------------------------------------------------------------------------------ >> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >> Pinpoint memory and threading errors before they happen. >> Find and fix more than 250 security defects in the development cycle. >> Locate bottlenecks in serial and parallel code that limit performance. >> http://p.sf.net/sfu/intel-dev2devfeb >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> > > |
From: Richard B. <rbr...@gm...> - 2011-02-21 06:21:19
|
thanks and will do as directed. My previous, last paragraph comment - A way for programming code in email holding its format in a mail message similarly to how you can post code on forum pages? <code> blah blah blah </code> thanks! Richard On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <do...@lm...>wrote: > Greetings, > > Actually, gmbuilder.properties is for the TallyEngine only. When dealing > with .gdb exports, look *only* at the SpeciesProfile class. So, to find > those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile > code, that needs to be changed. > > Your take on how gmbuilder.properties is used, however, is understandable. > It makes sense to assume that the TallyEngine code *and* the ID export code > are based on the same characterization of the needed IDs. This replication > is originally a historical artifact: SpeciesProfile was done first, and then > TallyEngine was done later by another student. > > However, there are other factors beyond history that sort of necessitate > this duplication of desired IDs: (skip the two bullets below if you'd rather > cut to the chase of the work to be done, and discuss design issues later) > > - The actual XML import code is a black box: this is the "canned" JAXB > library actually in action, and not our code at all. Plus, the XML import > code really does not filter (nor should it), since the goal of the > XML->relational database step is to fully capture the XML data in the > relational database. So, XML count is necessarily separated from XML > import. > > - The notion of a declarative mechanism for extracting IDs from the > relational database (which is what gmbuilder.properties/TallyEngine uses) is > interesting, but at the same time there is value in the arbitrary > computation that can be done with Java (case in point: export two versions > of an ID, with and without periods). This is not to say that it is > impossible to do this declaratively, but let's just say that the procedural > approach exists here and now, and a declarative approach will need more > thought. > > These, and other factors, are good thoughts to hold onto and would be > worthy of a good meeting discussion sometime, but bottom line for now: > modifying the export behavior is a matter of editing the *SpeciesProfile* > Java code, and not the gmbuilder.properties file. Turn your attention to > that code. > > Now, as to annotating your code...I'd just put in code comments :) Or did > you mean something else by tagging code in e-mail? > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > > On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: > > > also, how do I tag code in email so it holds its formatting? I tried a > few suggestions I found on the web but they aren't holding formatting or i'm > just doing it wrong ;-D > > > > Richard > > > > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> > wrote: > > OK, have some updates and some suggestions: > > > > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing > report. We verified that XML match does indeed find 4066 unique matches - 75 > of which are not in the gdb and need to be. > > > > Dr. Dahlquist informed me that she was the one who completed the gene db > testing report, not a previous student of BIO367 and had already verified > which genes were missing and where they were to be found. I had (mistakenly) > assumed that since a student had performed the gene database testing I had > to redo all of the verification. > > > > So that said, of the 75 genes missing - 69 need to be included and 6 > excluded. > > Per the gene db testing report: "69 of them have an "a", "b", or "d" > suffix. They are all found in the ORF tag and need to be included in the > gdb." > > > > To solve this we need to add additional search criteria into the M. > tuberculosis section in gmbuilder.properties below: > > # Mycobacterium tuberculosis > > > > mycobacteriumtuberculosis_level_amount= > > > > 1 > > > > mycobacteriumtuberculosis_element_level0= > > > > uniprot/entry/gene/name&type&ordered locus > > > > mycobacteriumtuberculosis_query_level0= > > > > select count(*) from genenametype where type = 'ordered locus' and value > like 'Rv%'; > > > > mycobacteriumtuberculosis_table_name_level0= > > > > Ordered Locus > > SOLUTIONS: > > > > 1. So am i correct in my understanding that the second line is the query > used by TallyEngine to read the XML file? If so then this is the issue we > need to table for the moment until we get the gbd verified and re-released. > We will revisit this to discover why it is not only reporting incorrectly > but also why its added a second row of Ordered Locus on the TallyEngine > results page. > > > > 2. The third line is the SQL query used by postgres during the export > from XML to gdb. To find and get the ORF tagged genes could we not add the > following lines and change the count in the first line: > > > > > > # Mycobacterium tuberculosis > > > > mycobacteriumtuberculosis_level_amount=2 > > > > > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > > > > > > mycobacteriumtuberculosis_query_level0= > > > > select count(*) from genenametype where type = 'ordered locus'; > > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype > where type = 'orf'; > > > > > > mycobacteriumtuberculosis_table_name_level0= > > > > Ordered Locus > > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > > > > > ---------------------------------------------------------------------------------------------------------------------------- > > > > Of course these queries would have be manually verified prior to making > these changes but this seems like we are moving in the right direction. > > > > Richard > > > > > > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> > wrote: > > Just got done reading previous email and understand the change in > priority. > > > > Will work on the missing ID's for now and shelve the the TalleyEngine > issue for the moment. > > > > Also great about a more formalized weekly meeting. I was going to suggest > it myself as it has been slow going so far as maybe i'm a bit too > independent in this independent study class =D > > > > Will dig further into the missing ID's later tonight and during day > tomorrow and report back. > > > > Richard > > > > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> > wrote: > > Hi Rich, > > > > No problem. The pertinent line you're referring to, for XML, is this, > right above the line you copied: > > > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > > > The slash-separated section is the "path" of XML tags leading to the > element of interest; then, after the ampersand, is a name/value pair for the > desired attribute to count. Note that there is no hint of a *content*-based > filter (nor is there the capability for one, as far as I can tell in the > code). By "content," I mean that we can't specify filters based on what's > *between* the tags. We can only go as far as filter by attribute value, > e.g., type="ordered locus". > > > > But anyway, as mentioned in the earlier e-mail, let's have the missing > IDs in the .gdb take precedence for now. Please take a look at the > tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how > the ID output can be customized, then let me know if you have any questions > or need to confirm anything. > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > > > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: > > > > > Sorry been slammed with a programming assignment that kept needing > continued iteration and it has been all consuming until last night. But I > did get a chance to work with your comments and review the code again with a > different mind set. > > > > > > Yes, I examined the gmbuilder.properties file ( the query is also in > the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a > previous email ) but I don't think I see what you mean regarding the XML > count. > > > > > > I understood that: mycobacteriumtuberculosis_query_level0=select > count(*) from genenametype where type = 'ordered locus' and value like > 'Rv%'; was the db query but don't see which is the XML count... or do they > share the same query and you are saying that XML count doesn't recognize and > therefore cannot use the 'Rv%' parameter? > > > > > > Richard > > > > > > > > > > > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < > do...@lm...> wrote: > > > Hi Rich, > > > > > > Sorry for the delay. Had some distractions coming into the weekend. > > > > > > You've looked at the code; have you looked at gmbuilder.properties? (I > may have mentioned it a few e-mails ago, just as you were starting to dig > into this) > > > > > > On the copy I have, the M. tuberculosis block looks like this > (indentation is mine to set it apart): > > > > > > # Mycobacterium tuberculosis > > > mycobacteriumtuberculosis_level_amount=1 > > > > > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > > > > > mycobacteriumtuberculosis_query_level0=select count(*) from > genenametype where type = 'ordered locus' and value like 'Rv%'; > > > > > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > > > > > > There, I think, is the rub. Notice that the XML count does not filter > on RV%. The SQL query does. > > > > > > Unfortunately, I don't think the TallyEngine can include selective > filtering in the XML counts. If the need to do selective filtering on XML > is necessary, then I think we're looking at a new functionality for you to > implement (or, if this throws things off too much, this may have to be noted > somewhere, that the XML vs. database counts may be off because the database > count is doing some text-based filtering but the XML count does not). > > > > > > What does xmlpipedb-match say? That will at least tell you whether the > 'RV%' count is indeed correct. > > > > > > John David N. Dionisio, PhD > > > Associate Professor, Computer Science > > > Loyola Marymount University > > > > > > > > > > > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: > > > > > > > OK here is what I was able to put together from the past few hours of > code review: > > > > > > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > > > > -reveals that after the 2 System table modifications are made adding > species name and link, a PreparedStatement is instantiated which builds and > calls the base query. > > > > > > > > -The base query called is: ("SELECT value, type " + "FROM > genenametype INNER JOIN entrytype_genetype " + > "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type > = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") > > > > > > > > -So its looking in 'ordered locus' table/column for any tuple that > starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . > > > > The 'like' comparator and % usage are clear with the 'type' > entrytype_gene_hjid = ? > > > > > > > > -To me it seems the query makes sense so the problem is likely > elsewhere. > > > > > > > > GenMappBuilder.java: > > > > -I found method doTallies() at code line 895 which: > > > > Instantiates a Configuration called hibernateConfiguration and > assigns to it the current hibernate configuration > > > > Validates database settings by analyzing hibernateConfiguration > > > > Instantiates a CriterionList for uniprot and assigns to it > TallyType.UNIPROT > > > > Instantiates a CriterionList for go and assigns to it TallyType.GO > > > > Determines if both xml files exist > > > > Then getTallyResultsXML and getTallyResultsDatabase are run on both > xml files and their respective CriterionList > > > > Results are then formatted for display in a table. > > > > > > > > -So enum TallyType which means that they are the only valid datatypes > which TallyEngine accepts... go to know ... > > > > > > > > -Based on the screen shot of Tally Engine it would seem that both > getTallyResultsXML() and getTallyResultsDatabase() are incorrectly > returning. Likely due to both using an incorrect query (as we previously > supposed). But where are the queries?... the more I dig the more I think > they are in the criterial all the work is done against. > > > > > > > > continuing the review: > > > > getTallyResultsXML() calls Tally Engine instance method > getXmlFileCounts(xmlFile) > > > > getTallyResultsDatabase() calls Tally Engine instance method > getDbcounts(new QueryEngine(hibernateConfiguration) > > > > Both of these instanced methods originate from TallyEngine.java... > > > > > > > > TallyEngine.java: > > > > > > > > getXmlFileCounts() calls digestXmlFile() which instantiates a > digester then processes against criteria... but this quickly becomes > confusing and is hard to follow > > > > > > > > getDbcounts() then starts a db session and executes a query but then > I also get a bit lost with my limited db knowledge. > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > > > > > > > OVERALL I think I'm getting closer to the issues but I still feel as > if I'm missing some understanding to proceed further. Can you pass along > some of that Dondi insight and steer me in the right direction? =D > > > > > > > > -DB Tally - Not having taken databases yet certainly is limiting my > ability determine where the "criteria" are being set and how they are > followed during session activities. Also is the query we have been looking > for this whole time in the criteria or someplace else? > > > > > > > > -XML Tally - again is the query contained within the criteria that > digestXmlFile() uses to parse? > > > > > > > > Richard > > > > > > > > > > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio < > do...@lm...> wrote: > > > > Right, schema issues are unlikely. Most count discrepancies like > this that I've seen have boiled down to forming the right query. Then, > knowing the right query (in both XML and SQL), it's a matter of making sure > that TallyEngine asks that same query. > > > > > > > > John David N. Dionisio, PhD > > > > Associate Professor, Computer Science > > > > Loyola Marymount University > > > > > > > > > > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > > > > > > > > > OK, so based on your approach: > > > > > > > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and > sql queries needed for the respective results as you requested. > > > > > > > > > > I was also thinking I may need to review the schema from xml into > postgres but the issue isn't likely a schema error. The error most likely > lies in how xmlpipedbutils queries the data from xml source and writes to > the tables what it returns? > > > > > > > > > > 2. I'll review the code: trace the entrance of tally engine in the > gmbuilder code then follow it through the xmlpipedbutils. > > > > > > > > > > Richard > > > > > > > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio < > do...@lm...> wrote: > > > > > Just wanted to confirm (since I wasn't sure in the first e-mail) > --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in > SourceForge's Subversion repo. > > > > > > > > > > John David N. Dionisio, PhD > > > > > Associate Professor, Computer Science > > > > > Loyola Marymount University > > > > > > > > > > > > > > > > > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: > > > > > > > > > > > Hi Dondi, > > > > > > > > > > > > So I'm at the point in working with M tuberculosis that I was > able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. > > > > > > > > > > > > gmb2b60 Results > > > > > > > > > > > > > > > > > > > > > > > > Now the proverbial question - What next to solve the Ordered > Locus import/count issue? > > > > > > > > > > > > ********************************************** > > > > > > Here is my thought process: > > > > > > > > > > > > Step 1: How does the import process work at the high level? > (obviously correct me if I'm wrong) > > > > > > > > > > > > I believe that basically as each XML tag is read, it is placed in > the proper Postgres table(s) based on some criteria. There is also likely > some sort of check that each individual tag is in valid XML format unless we > don't care at this stage (care at export) or maybe the parser just skips > over and goes on to the next . > > > > > > > > > > > > Step 2: What could be the problem? > > > > > > > > > > > > Either - > > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? > > > > > > b. Decision criteria of which table they should be added to? > > > > > > > > > > > > ********************************************** > > > > > > > > > > > > I read on the sourceforge wiki: > > > > > > > > > > > > XMLPipeDB has a modular architecture with three components that > may be used separately or together. XSD-to-DB reads an XSD (XML Schema > Definition) and automatically generates an SQL schema, Java classes, and > Hibernate mappings. XMLPipeDB Utilities provides functionality for > configuring the database, importing data, and performing queries. GenMAPP > Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible > Gene Databases based on data from UniProt and Gene Ontology (GO). > > > > > > > > > > > > So I should probably start with the XMLPipeDB Utilities which are > where? I don't see any in the basic distribution or are they not standalone > and called from the command line? > > > > > > > > > > > > Thanks! > > > > > > > > > > > > Richard > > > > > > > > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > > > > > > > > ------------------------------------------------------------------------------ > > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio > XE: > > > > Pinpoint memory and threading errors before they happen. > > > > Find and fix more than 250 security defects in the development cycle. > > > > Locate bottlenecks in serial and parallel code that limit > performance. > > > > http://p.sf.net/sfu/intel-dev2devfeb > > > > _______________________________________________ > > > > xmlpipedb-developer mailing list > > > > xml...@li... > > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > > > > > ------------------------------------------------------------------------------ > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio > XE: > > > Pinpoint memory and threading errors before they happen. > > > Find and fix more than 250 security defects in the development cycle. > > > Locate bottlenecks in serial and parallel code that limit performance. > > > http://p.sf.net/sfu/intel-dev2devfeb > > > _______________________________________________ > > > xmlpipedb-developer mailing list > > > xml...@li... > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > > ------------------------------------------------------------------------------ > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > > Pinpoint memory and threading errors before they happen. > > Find and fix more than 250 security defects in the development cycle. > > Locate bottlenecks in serial and parallel code that limit performance. > > http://p.sf.net/sfu/intel-dev2devfeb > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > ------------------------------------------------------------------------------ > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > Pinpoint memory and threading errors before they happen. > Find and fix more than 250 security defects in the development cycle. > Locate bottlenecks in serial and parallel code that limit performance. > http://p.sf.net/sfu/intel-dev2devfeb > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > |
From: John D. N. D. <do...@lm...> - 2011-02-21 06:05:31
|
Greetings, Actually, gmbuilder.properties is for the TallyEngine only. When dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to find those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile code, that needs to be changed. Your take on how gmbuilder.properties is used, however, is understandable. It makes sense to assume that the TallyEngine code *and* the ID export code are based on the same characterization of the needed IDs. This replication is originally a historical artifact: SpeciesProfile was done first, and then TallyEngine was done later by another student. However, there are other factors beyond history that sort of necessitate this duplication of desired IDs: (skip the two bullets below if you'd rather cut to the chase of the work to be done, and discuss design issues later) - The actual XML import code is a black box: this is the "canned" JAXB library actually in action, and not our code at all. Plus, the XML import code really does not filter (nor should it), since the goal of the XML->relational database step is to fully capture the XML data in the relational database. So, XML count is necessarily separated from XML import. - The notion of a declarative mechanism for extracting IDs from the relational database (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at the same time there is value in the arbitrary computation that can be done with Java (case in point: export two versions of an ID, with and without periods). This is not to say that it is impossible to do this declaratively, but let's just say that the procedural approach exists here and now, and a declarative approach will need more thought. These, and other factors, are good thoughts to hold onto and would be worthy of a good meeting discussion sometime, but bottom line for now: modifying the export behavior is a matter of editing the *SpeciesProfile* Java code, and not the gmbuilder.properties file. Turn your attention to that code. Now, as to annotating your code...I'd just put in code comments :) Or did you mean something else by tagging code in e-mail? John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: > also, how do I tag code in email so it holds its formatting? I tried a few suggestions I found on the web but they aren't holding formatting or i'm just doing it wrong ;-D > > Richard > > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote: > OK, have some updates and some suggestions: > > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing report. We verified that XML match does indeed find 4066 unique matches - 75 of which are not in the gdb and need to be. > > Dr. Dahlquist informed me that she was the one who completed the gene db testing report, not a previous student of BIO367 and had already verified which genes were missing and where they were to be found. I had (mistakenly) assumed that since a student had performed the gene database testing I had to redo all of the verification. > > So that said, of the 75 genes missing - 69 need to be included and 6 excluded. > Per the gene db testing report: "69 of them have an "a", "b", or "d" suffix. They are all found in the ORF tag and need to be included in the gdb." > > To solve this we need to add additional search criteria into the M. tuberculosis section in gmbuilder.properties below: > # Mycobacterium tuberculosis > > mycobacteriumtuberculosis_level_amount= > > 1 > > mycobacteriumtuberculosis_element_level0= > > uniprot/entry/gene/name&type&ordered locus > > mycobacteriumtuberculosis_query_level0= > > select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; > > mycobacteriumtuberculosis_table_name_level0= > > Ordered Locus > SOLUTIONS: > > 1. So am i correct in my understanding that the second line is the query used by TallyEngine to read the XML file? If so then this is the issue we need to table for the moment until we get the gbd verified and re-released. We will revisit this to discover why it is not only reporting incorrectly but also why its added a second row of Ordered Locus on the TallyEngine results page. > > 2. The third line is the SQL query used by postgres during the export from XML to gdb. To find and get the ORF tagged genes could we not add the following lines and change the count in the first line: > > > # Mycobacterium tuberculosis > > mycobacteriumtuberculosis_level_amount=2 > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > > > mycobacteriumtuberculosis_query_level0= > > select count(*) from genenametype where type = 'ordered locus'; > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf'; > > > mycobacteriumtuberculosis_table_name_level0= > > Ordered Locus > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > > ---------------------------------------------------------------------------------------------------------------------------- > > Of course these queries would have be manually verified prior to making these changes but this seems like we are moving in the right direction. > > Richard > > > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: > Just got done reading previous email and understand the change in priority. > > Will work on the missing ID's for now and shelve the the TalleyEngine issue for the moment. > > Also great about a more formalized weekly meeting. I was going to suggest it myself as it has been slow going so far as maybe i'm a bit too independent in this independent study class =D > > Will dig further into the missing ID's later tonight and during day tomorrow and report back. > > Richard > > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote: > Hi Rich, > > No problem. The pertinent line you're referring to, for XML, is this, right above the line you copied: > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus > > The slash-separated section is the "path" of XML tags leading to the element of interest; then, after the ampersand, is a name/value pair for the desired attribute to count. Note that there is no hint of a *content*-based filter (nor is there the capability for one, as far as I can tell in the code). By "content," I mean that we can't specify filters based on what's *between* the tags. We can only go as far as filter by attribute value, e.g., type="ordered locus". > > But anyway, as mentioned in the earlier e-mail, let's have the missing IDs in the .gdb take precedence for now. Please take a look at the tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how the ID output can be customized, then let me know if you have any questions or need to confirm anything. > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: > > > Sorry been slammed with a programming assignment that kept needing continued iteration and it has been all consuming until last night. But I did get a chance to work with your comments and review the code again with a different mind set. > > > > Yes, I examined the gmbuilder.properties file ( the query is also in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a previous email ) but I don't think I see what you mean regarding the XML count. > > > > I understood that: mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; was the db query but don't see which is the XML count... or do they share the same query and you are saying that XML count doesn't recognize and therefore cannot use the 'Rv%' parameter? > > > > Richard > > > > > > > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> wrote: > > Hi Rich, > > > > Sorry for the delay. Had some distractions coming into the weekend. > > > > You've looked at the code; have you looked at gmbuilder.properties? (I may have mentioned it a few e-mails ago, just as you were starting to dig into this) > > > > On the copy I have, the M. tuberculosis block looks like this (indentation is mine to set it apart): > > > > # Mycobacterium tuberculosis > > mycobacteriumtuberculosis_level_amount=1 > > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus > > > > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; > > > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > > > > There, I think, is the rub. Notice that the XML count does not filter on RV%. The SQL query does. > > > > Unfortunately, I don't think the TallyEngine can include selective filtering in the XML counts. If the need to do selective filtering on XML is necessary, then I think we're looking at a new functionality for you to implement (or, if this throws things off too much, this may have to be noted somewhere, that the XML vs. database counts may be off because the database count is doing some text-based filtering but the XML count does not). > > > > What does xmlpipedb-match say? That will at least tell you whether the 'RV%' count is indeed correct. > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: > > > > > OK here is what I was able to put together from the past few hours of code review: > > > > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > > > -reveals that after the 2 System table modifications are made adding species name and link, a PreparedStatement is instantiated which builds and calls the base query. > > > > > > -The base query called is: ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") > > > > > > -So its looking in 'ordered locus' table/column for any tuple that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . > > > The 'like' comparator and % usage are clear with the 'type' entrytype_gene_hjid = ? > > > > > > -To me it seems the query makes sense so the problem is likely elsewhere. > > > > > > GenMappBuilder.java: > > > -I found method doTallies() at code line 895 which: > > > Instantiates a Configuration called hibernateConfiguration and assigns to it the current hibernate configuration > > > Validates database settings by analyzing hibernateConfiguration > > > Instantiates a CriterionList for uniprot and assigns to it TallyType.UNIPROT > > > Instantiates a CriterionList for go and assigns to it TallyType.GO > > > Determines if both xml files exist > > > Then getTallyResultsXML and getTallyResultsDatabase are run on both xml files and their respective CriterionList > > > Results are then formatted for display in a table. > > > > > > -So enum TallyType which means that they are the only valid datatypes which TallyEngine accepts... go to know ... > > > > > > -Based on the screen shot of Tally Engine it would seem that both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly returning. Likely due to both using an incorrect query (as we previously supposed). But where are the queries?... the more I dig the more I think they are in the criterial all the work is done against. > > > > > > continuing the review: > > > getTallyResultsXML() calls Tally Engine instance method getXmlFileCounts(xmlFile) > > > getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new QueryEngine(hibernateConfiguration) > > > Both of these instanced methods originate from TallyEngine.java... > > > > > > TallyEngine.java: > > > > > > getXmlFileCounts() calls digestXmlFile() which instantiates a digester then processes against criteria... but this quickly becomes confusing and is hard to follow > > > > > > getDbcounts() then starts a db session and executes a query but then I also get a bit lost with my limited db knowledge. > > > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > > > > > OVERALL I think I'm getting closer to the issues but I still feel as if I'm missing some understanding to proceed further. Can you pass along some of that Dondi insight and steer me in the right direction? =D > > > > > > -DB Tally - Not having taken databases yet certainly is limiting my ability determine where the "criteria" are being set and how they are followed during session activities. Also is the query we have been looking for this whole time in the criteria or someplace else? > > > > > > -XML Tally - again is the query contained within the criteria that digestXmlFile() uses to parse? > > > > > > Richard > > > > > > > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> wrote: > > > Right, schema issues are unlikely. Most count discrepancies like this that I've seen have boiled down to forming the right query. Then, knowing the right query (in both XML and SQL), it's a matter of making sure that TallyEngine asks that same query. > > > > > > John David N. Dionisio, PhD > > > Associate Professor, Computer Science > > > Loyola Marymount University > > > > > > > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > > > > > > > OK, so based on your approach: > > > > > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and sql queries needed for the respective results as you requested. > > > > > > > > I was also thinking I may need to review the schema from xml into postgres but the issue isn't likely a schema error. The error most likely lies in how xmlpipedbutils queries the data from xml source and writes to the tables what it returns? > > > > > > > > 2. I'll review the code: trace the entrance of tally engine in the gmbuilder code then follow it through the xmlpipedbutils. > > > > > > > > Richard > > > > > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <do...@lm...> wrote: > > > > Just wanted to confirm (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in SourceForge's Subversion repo. > > > > > > > > John David N. Dionisio, PhD > > > > Associate Professor, Computer Science > > > > Loyola Marymount University > > > > > > > > > > > > > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: > > > > > > > > > Hi Dondi, > > > > > > > > > > So I'm at the point in working with M tuberculosis that I was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. > > > > > > > > > > gmb2b60 Results > > > > > > > > > > > > > > > > > > > > Now the proverbial question - What next to solve the Ordered Locus import/count issue? > > > > > > > > > > ********************************************** > > > > > Here is my thought process: > > > > > > > > > > Step 1: How does the import process work at the high level? (obviously correct me if I'm wrong) > > > > > > > > > > I believe that basically as each XML tag is read, it is placed in the proper Postgres table(s) based on some criteria. There is also likely some sort of check that each individual tag is in valid XML format unless we don't care at this stage (care at export) or maybe the parser just skips over and goes on to the next . > > > > > > > > > > Step 2: What could be the problem? > > > > > > > > > > Either - > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? > > > > > b. Decision criteria of which table they should be added to? > > > > > > > > > > ********************************************** > > > > > > > > > > I read on the sourceforge wiki: > > > > > > > > > > XMLPipeDB has a modular architecture with three components that may be used separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and automatically generates an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB Utilities provides functionality for configuring the database, importing data, and performing queries. GenMAPP Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on data from UniProt and Gene Ontology (GO). > > > > > > > > > > So I should probably start with the XMLPipeDB Utilities which are where? I don't see any in the basic distribution or are they not standalone and called from the command line? > > > > > > > > > > Thanks! > > > > > > > > > > Richard > > > > > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > > > > ------------------------------------------------------------------------------ > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > > > Pinpoint memory and threading errors before they happen. > > > Find and fix more than 250 security defects in the development cycle. > > > Locate bottlenecks in serial and parallel code that limit performance. > > > http://p.sf.net/sfu/intel-dev2devfeb > > > _______________________________________________ > > > xmlpipedb-developer mailing list > > > xml...@li... > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > ------------------------------------------------------------------------------ > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > > Pinpoint memory and threading errors before they happen. > > Find and fix more than 250 security defects in the development cycle. > > Locate bottlenecks in serial and parallel code that limit performance. > > http://p.sf.net/sfu/intel-dev2devfeb > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > <ATT00001..txt><ATT00002..txt> > > > ------------------------------------------------------------------------------ > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > Pinpoint memory and threading errors before they happen. > Find and fix more than 250 security defects in the development cycle. > Locate bottlenecks in serial and parallel code that limit performance. > http://p.sf.net/sfu/intel-dev2devfeb > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > <ATT00001..txt><ATT00002..txt> |
From: Richard B. <rbr...@gm...> - 2011-02-21 05:38:09
|
also, how do I tag code in email so it holds its formatting? I tried a few suggestions I found on the web but they aren't holding formatting or i'm just doing it wrong ;-D Richard On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote: > OK, have some updates and some suggestions: > > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing > report. We verified that XML match does indeed find 4066 unique matches - 75 > of which are not in the gdb and need to be. > > Dr. Dahlquist informed me that she was the one who completed the gene db > testing report, not a previous student of BIO367 and had already verified > which genes were missing and where they were to be found. I had (mistakenly) > assumed that since a student had performed the gene database testing I had > to redo all of the verification. > > So that said, of the 75 genes missing - 69 need to be included and 6 > excluded. > Per the gene db testing report: "69 of them have an "a", "b", or "d" > suffix. They are all found in the ORF tag and need to be included in the > gdb." > > To solve this we need to add additional search criteria into the M. > tuberculosis section in gmbuilder.properties below: > > # *Mycobacterium* tuberculosis > > mycobacteriumtuberculosis_level_amount= > 1 > > mycobacteriumtuberculosis_element_level0= > *uniprot*/entry/gene/*name&type&ordered* locus > > mycobacteriumtuberculosis_query_level0= > select count(*) from *genenametype* where type = 'ordered locus' and value > like '*Rv*%'; > > mycobacteriumtuberculosis_table_name_level0= > Ordered Locus > SOLUTIONS: > > 1. So am i correct in my understanding that the second line is the query > used by TallyEngine to read the XML file? If so then this is the issue we > need to table for the moment until we get the gbd verified and re-released. > We will revisit this to discover why it is not only reporting incorrectly > but also why its added a second row of Ordered Locus on the TallyEngine > results page. > > 2. The third line is the SQL query used by postgres during the export from > XML to gdb. To find and get the ORF tagged genes could we not add the > following lines and change the count in the first line: > > > > # *Mycobacterium* tuberculosis > > mycobacteriumtuberculosis_level_amount=2 > > mycobacteriumtuberculosis_element_level0=*uniprot*/entry/gene/* > name&type&ordered* locus > > mycobacteriumtuberculosis_element_level1=*uniprot*/entry/gene/* > name&type&orf* > > mycobacteriumtuberculosis_query_level0= > select count(*) from *genenametype* where type = 'ordered locus'; > > mycobacteriumtuberculosis_query_level1=select count(*) from *genenametype* > where type = 'orf'; > > mycobacteriumtuberculosis_table_name_level0= > Ordered Locus > > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > > ---------------------------------------------------------------------------------------------------------------------------- > > Of course these queries would have be manually verified prior to making > these changes but this seems like we are moving in the right direction. > > Richard > > > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: > >> Just got done reading previous email and understand the change in >> priority. >> >> Will work on the missing ID's for now and shelve the the TalleyEngine >> issue for the moment. >> >> Also great about a more formalized weekly meeting. I was going to suggest >> it myself as it has been slow going so far as maybe i'm a bit too >> independent in this independent study class =D >> >> Will dig further into the missing ID's later tonight and during day >> tomorrow and report back. >> >> Richard >> >> On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...>wrote: >> >>> Hi Rich, >>> >>> No problem. The pertinent line you're referring to, for XML, is this, >>> right above the line you copied: >>> >>> >>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>> locus >>> >>> The slash-separated section is the "path" of XML tags leading to the >>> element of interest; then, after the ampersand, is a name/value pair for the >>> desired attribute to count. Note that there is no hint of a *content*-based >>> filter (nor is there the capability for one, as far as I can tell in the >>> code). By "content," I mean that we can't specify filters based on what's >>> *between* the tags. We can only go as far as filter by attribute value, >>> e.g., type="ordered locus". >>> >>> But anyway, as mentioned in the earlier e-mail, let's have the missing >>> IDs in the .gdb take precedence for now. Please take a look at the >>> tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how >>> the ID output can be customized, then let me know if you have any questions >>> or need to confirm anything. >>> >>> John David N. Dionisio, PhD >>> Associate Professor, Computer Science >>> Loyola Marymount University >>> >>> >>> >>> On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >>> >>> > Sorry been slammed with a programming assignment that kept needing >>> continued iteration and it has been all consuming until last night. But I >>> did get a chance to work with your comments and review the code again with a >>> different mind set. >>> > >>> > Yes, I examined the gmbuilder.properties file ( the query is also in >>> the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a >>> previous email ) but I don't think I see what you mean regarding the XML >>> count. >>> > >>> > I understood that: mycobacteriumtuberculosis_query_level0=select >>> count(*) from genenametype where type = 'ordered locus' and value like >>> 'Rv%'; was the db query but don't see which is the XML count... or do they >>> share the same query and you are saying that XML count doesn't recognize and >>> therefore cannot use the 'Rv%' parameter? >>> > >>> > Richard >>> > >>> > >>> > >>> > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < >>> do...@lm...> wrote: >>> > Hi Rich, >>> > >>> > Sorry for the delay. Had some distractions coming into the weekend. >>> > >>> > You've looked at the code; have you looked at gmbuilder.properties? (I >>> may have mentioned it a few e-mails ago, just as you were starting to dig >>> into this) >>> > >>> > On the copy I have, the M. tuberculosis block looks like this >>> (indentation is mine to set it apart): >>> > >>> > # Mycobacterium tuberculosis >>> > mycobacteriumtuberculosis_level_amount=1 >>> > >>> > >>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>> locus >>> > >>> > mycobacteriumtuberculosis_query_level0=select count(*) from >>> genenametype where type = 'ordered locus' and value like 'Rv%'; >>> > >>> > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >>> > >>> > There, I think, is the rub. Notice that the XML count does not filter >>> on RV%. The SQL query does. >>> > >>> > Unfortunately, I don't think the TallyEngine can include selective >>> filtering in the XML counts. If the need to do selective filtering on XML >>> is necessary, then I think we're looking at a new functionality for you to >>> implement (or, if this throws things off too much, this may have to be noted >>> somewhere, that the XML vs. database counts may be off because the database >>> count is doing some text-based filtering but the XML count does not). >>> > >>> > What does xmlpipedb-match say? That will at least tell you whether the >>> 'RV%' count is indeed correct. >>> > >>> > John David N. Dionisio, PhD >>> > Associate Professor, Computer Science >>> > Loyola Marymount University >>> > >>> > >>> > >>> > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >>> > >>> > > OK here is what I was able to put together from the past few hours of >>> code review: >>> > > >>> > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>> > > -reveals that after the 2 System table modifications are made adding >>> species name and link, a PreparedStatement is instantiated which builds and >>> calls the base query. >>> > > >>> > > -The base query called is: ("SELECT value, type " + "FROM >>> genenametype INNER JOIN entrytype_genetype " + >>> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type >>> = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") >>> > > >>> > > -So its looking in 'ordered locus' table/column for any tuple that >>> starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . >>> > > The 'like' comparator and % usage are clear with the 'type' >>> entrytype_gene_hjid = ? >>> > > >>> > > -To me it seems the query makes sense so the problem is likely >>> elsewhere. >>> > > >>> > > GenMappBuilder.java: >>> > > -I found method doTallies() at code line 895 which: >>> > > Instantiates a Configuration called hibernateConfiguration and >>> assigns to it the current hibernate configuration >>> > > Validates database settings by analyzing hibernateConfiguration >>> > > Instantiates a CriterionList for uniprot and assigns to it >>> TallyType.UNIPROT >>> > > Instantiates a CriterionList for go and assigns to it TallyType.GO >>> > > Determines if both xml files exist >>> > > Then getTallyResultsXML and getTallyResultsDatabase are run on both >>> xml files and their respective CriterionList >>> > > Results are then formatted for display in a table. >>> > > >>> > > -So enum TallyType which means that they are the only valid datatypes >>> which TallyEngine accepts... go to know ... >>> > > >>> > > -Based on the screen shot of Tally Engine it would seem that both >>> getTallyResultsXML() and getTallyResultsDatabase() are incorrectly >>> returning. Likely due to both using an incorrect query (as we previously >>> supposed). But where are the queries?... the more I dig the more I think >>> they are in the criterial all the work is done against. >>> > > >>> > > continuing the review: >>> > > getTallyResultsXML() calls Tally Engine instance method >>> getXmlFileCounts(xmlFile) >>> > > getTallyResultsDatabase() calls Tally Engine instance method >>> getDbcounts(new QueryEngine(hibernateConfiguration) >>> > > Both of these instanced methods originate from TallyEngine.java... >>> > > >>> > > TallyEngine.java: >>> > > >>> > > getXmlFileCounts() calls digestXmlFile() which instantiates a >>> digester then processes against criteria... but this quickly becomes >>> confusing and is hard to follow >>> > > >>> > > getDbcounts() then starts a db session and executes a query but then >>> I also get a bit lost with my limited db knowledge. >>> > > >>> > > >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >>> > > >>> > > OVERALL I think I'm getting closer to the issues but I still feel as >>> if I'm missing some understanding to proceed further. Can you pass along >>> some of that Dondi insight and steer me in the right direction? =D >>> > > >>> > > -DB Tally - Not having taken databases yet certainly is limiting my >>> ability determine where the "criteria" are being set and how they are >>> followed during session activities. Also is the query we have been looking >>> for this whole time in the criteria or someplace else? >>> > > >>> > > -XML Tally - again is the query contained within the criteria that >>> digestXmlFile() uses to parse? >>> > > >>> > > Richard >>> > > >>> > > >>> > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio < >>> do...@lm...> wrote: >>> > > Right, schema issues are unlikely. Most count discrepancies like >>> this that I've seen have boiled down to forming the right query. Then, >>> knowing the right query (in both XML and SQL), it's a matter of making sure >>> that TallyEngine asks that same query. >>> > > >>> > > John David N. Dionisio, PhD >>> > > Associate Professor, Computer Science >>> > > Loyola Marymount University >>> > > >>> > > >>> > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >>> > > >>> > > > OK, so based on your approach: >>> > > > >>> > > > 1. I'll start with reviewing the queries for xmlpipedb-match and >>> sql queries needed for the respective results as you requested. >>> > > > >>> > > > I was also thinking I may need to review the schema from xml into >>> postgres but the issue isn't likely a schema error. The error most likely >>> lies in how xmlpipedbutils queries the data from xml source and writes to >>> the tables what it returns? >>> > > > >>> > > > 2. I'll review the code: trace the entrance of tally engine in the >>> gmbuilder code then follow it through the xmlpipedbutils. >>> > > > >>> > > > Richard >>> > > > >>> > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio < >>> do...@lm...> wrote: >>> > > > Just wanted to confirm (since I wasn't sure in the first e-mail) >>> --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in >>> SourceForge's Subversion repo. >>> > > > >>> > > > John David N. Dionisio, PhD >>> > > > Associate Professor, Computer Science >>> > > > Loyola Marymount University >>> > > > >>> > > > >>> > > > >>> > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >>> > > > >>> > > > > Hi Dondi, >>> > > > > >>> > > > > So I'm at the point in working with M tuberculosis that I was >>> able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. >>> > > > > >>> > > > > gmb2b60 Results >>> > > > > >>> > > > > >>> > > > > >>> > > > > Now the proverbial question - What next to solve the Ordered >>> Locus import/count issue? >>> > > > > >>> > > > > ********************************************** >>> > > > > Here is my thought process: >>> > > > > >>> > > > > Step 1: How does the import process work at the high level? >>> (obviously correct me if I'm wrong) >>> > > > > >>> > > > > I believe that basically as each XML tag is read, it is placed in >>> the proper Postgres table(s) based on some criteria. There is also likely >>> some sort of check that each individual tag is in valid XML format unless we >>> don't care at this stage (care at export) or maybe the parser just skips >>> over and goes on to the next . >>> > > > > >>> > > > > Step 2: What could be the problem? >>> > > > > >>> > > > > Either - >>> > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? >>> > > > > b. Decision criteria of which table they should be added to? >>> > > > > >>> > > > > ********************************************** >>> > > > > >>> > > > > I read on the sourceforge wiki: >>> > > > > >>> > > > > XMLPipeDB has a modular architecture with three components that >>> may be used separately or together. XSD-to-DB reads an XSD (XML Schema >>> Definition) and automatically generates an SQL schema, Java classes, and >>> Hibernate mappings. XMLPipeDB Utilities provides functionality for >>> configuring the database, importing data, and performing queries. GenMAPP >>> Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible >>> Gene Databases based on data from UniProt and Gene Ontology (GO). >>> > > > > >>> > > > > So I should probably start with the XMLPipeDB Utilities which are >>> where? I don't see any in the basic distribution or are they not standalone >>> and called from the command line? >>> > > > > >>> > > > > Thanks! >>> > > > > >>> > > > > Richard >>> > > > >>> > > > >>> > > > <ATT00001..txt><ATT00002..txt> >>> > > >>> > > >>> > > >>> ------------------------------------------------------------------------------ >>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >>> XE: >>> > > Pinpoint memory and threading errors before they happen. >>> > > Find and fix more than 250 security defects in the development cycle. >>> > > Locate bottlenecks in serial and parallel code that limit >>> performance. >>> > > http://p.sf.net/sfu/intel-dev2devfeb >>> > > _______________________________________________ >>> > > xmlpipedb-developer mailing list >>> > > xml...@li... >>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> > > >>> > > <ATT00001..txt><ATT00002..txt> >>> > >>> > >>> > >>> ------------------------------------------------------------------------------ >>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >>> XE: >>> > Pinpoint memory and threading errors before they happen. >>> > Find and fix more than 250 security defects in the development cycle. >>> > Locate bottlenecks in serial and parallel code that limit performance. >>> > http://p.sf.net/sfu/intel-dev2devfeb >>> > _______________________________________________ >>> > xmlpipedb-developer mailing list >>> > xml...@li... >>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> > >>> > <ATT00001..txt><ATT00002..txt> >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >>> Pinpoint memory and threading errors before they happen. >>> Find and fix more than 250 security defects in the development cycle. >>> Locate bottlenecks in serial and parallel code that limit performance. >>> http://p.sf.net/sfu/intel-dev2devfeb >>> _______________________________________________ >>> xmlpipedb-developer mailing list >>> xml...@li... >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> >> >> > |
From: Richard B. <rbr...@gm...> - 2011-02-21 05:35:33
|
OK, have some updates and some suggestions: On Friday Dr. Dahlquist and I sat down and reviewed the gene testing report. We verified that XML match does indeed find 4066 unique matches - 75 of which are not in the gdb and need to be. Dr. Dahlquist informed me that she was the one who completed the gene db testing report, not a previous student of BIO367 and had already verified which genes were missing and where they were to be found. I had (mistakenly) assumed that since a student had performed the gene database testing I had to redo all of the verification. So that said, of the 75 genes missing - 69 need to be included and 6 excluded. Per the gene db testing report: "69 of them have an "a", "b", or "d" suffix. They are all found in the ORF tag and need to be included in the gdb." To solve this we need to add additional search criteria into the M. tuberculosis section in gmbuilder.properties below: # *Mycobacterium* tuberculosis mycobacteriumtuberculosis_level_amount=1 mycobacteriumtuberculosis_element_level0=*uniprot*/entry/gene/* name&type&ordered* locus mycobacteriumtuberculosis_query_level0=select count(*) from *genenametype* where type = 'ordered locus' and value like '*Rv*%'; mycobacteriumtuberculosis_table_name_level0=Ordered Locus SOLUTIONS: 1. So am i correct in my understanding that the second line is the query used by TallyEngine to read the XML file? If so then this is the issue we need to table for the moment until we get the gbd verified and re-released. We will revisit this to discover why it is not only reporting incorrectly but also why its added a second row of Ordered Locus on the TallyEngine results page. 2. The third line is the SQL query used by postgres during the export from XML to gdb. To find and get the ORF tagged genes could we not add the following lines and change the count in the first line: # *Mycobacterium* tuberculosis mycobacteriumtuberculosis_level_amount=2 mycobacteriumtuberculosis_element_level0=*uniprot*/entry/gene/* name&type&ordered* locus mycobacteriumtuberculosis_element_level1=*uniprot*/entry/gene/*name&type&orf * mycobacteriumtuberculosis_query_level0=select count(*) from *genenametype* where type = 'ordered locus'; mycobacteriumtuberculosis_query_level1=select count(*) from *genenametype* where type = 'orf'; mycobacteriumtuberculosis_table_name_level0=Ordered Locus mycobacteriumtuberculosis_table_name_level1=Ordered Locus ---------------------------------------------------------------------------------------------------------------------------- Of course these queries would have be manually verified prior to making these changes but this seems like we are moving in the right direction. Richard On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: > Just got done reading previous email and understand the change in priority. > > Will work on the missing ID's for now and shelve the the TalleyEngine issue > for the moment. > > Also great about a more formalized weekly meeting. I was going to suggest > it myself as it has been slow going so far as maybe i'm a bit too > independent in this independent study class =D > > Will dig further into the missing ID's later tonight and during day > tomorrow and report back. > > Richard > > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...>wrote: > >> Hi Rich, >> >> No problem. The pertinent line you're referring to, for XML, is this, >> right above the line you copied: >> >> >> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >> locus >> >> The slash-separated section is the "path" of XML tags leading to the >> element of interest; then, after the ampersand, is a name/value pair for the >> desired attribute to count. Note that there is no hint of a *content*-based >> filter (nor is there the capability for one, as far as I can tell in the >> code). By "content," I mean that we can't specify filters based on what's >> *between* the tags. We can only go as far as filter by attribute value, >> e.g., type="ordered locus". >> >> But anyway, as mentioned in the earlier e-mail, let's have the missing IDs >> in the .gdb take precedence for now. Please take a look at the >> tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how >> the ID output can be customized, then let me know if you have any questions >> or need to confirm anything. >> >> John David N. Dionisio, PhD >> Associate Professor, Computer Science >> Loyola Marymount University >> >> >> >> On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >> >> > Sorry been slammed with a programming assignment that kept needing >> continued iteration and it has been all consuming until last night. But I >> did get a chance to work with your comments and review the code again with a >> different mind set. >> > >> > Yes, I examined the gmbuilder.properties file ( the query is also in the >> MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a >> previous email ) but I don't think I see what you mean regarding the XML >> count. >> > >> > I understood that: mycobacteriumtuberculosis_query_level0=select >> count(*) from genenametype where type = 'ordered locus' and value like >> 'Rv%'; was the db query but don't see which is the XML count... or do they >> share the same query and you are saying that XML count doesn't recognize and >> therefore cannot use the 'Rv%' parameter? >> > >> > Richard >> > >> > >> > >> > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> >> wrote: >> > Hi Rich, >> > >> > Sorry for the delay. Had some distractions coming into the weekend. >> > >> > You've looked at the code; have you looked at gmbuilder.properties? (I >> may have mentioned it a few e-mails ago, just as you were starting to dig >> into this) >> > >> > On the copy I have, the M. tuberculosis block looks like this >> (indentation is mine to set it apart): >> > >> > # Mycobacterium tuberculosis >> > mycobacteriumtuberculosis_level_amount=1 >> > >> > >> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >> locus >> > >> > mycobacteriumtuberculosis_query_level0=select count(*) from >> genenametype where type = 'ordered locus' and value like 'Rv%'; >> > >> > mycobacteriumtuberculosis_table_name_level0=Ordered Locus >> > >> > There, I think, is the rub. Notice that the XML count does not filter >> on RV%. The SQL query does. >> > >> > Unfortunately, I don't think the TallyEngine can include selective >> filtering in the XML counts. If the need to do selective filtering on XML >> is necessary, then I think we're looking at a new functionality for you to >> implement (or, if this throws things off too much, this may have to be noted >> somewhere, that the XML vs. database counts may be off because the database >> count is doing some text-based filtering but the XML count does not). >> > >> > What does xmlpipedb-match say? That will at least tell you whether the >> 'RV%' count is indeed correct. >> > >> > John David N. Dionisio, PhD >> > Associate Professor, Computer Science >> > Loyola Marymount University >> > >> > >> > >> > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >> > >> > > OK here is what I was able to put together from the past few hours of >> code review: >> > > >> > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >> > > -reveals that after the 2 System table modifications are made adding >> species name and link, a PreparedStatement is instantiated which builds and >> calls the base query. >> > > >> > > -The base query called is: ("SELECT value, type " + "FROM genenametype >> INNER JOIN entrytype_genetype " + "ON(entrytype_genetype_name_hjid = >> entrytype_genetype.hjid) " + "WHERE type = 'ordered locus' and value like >> 'Rv%' and entrytype_gene_hjid = ?") >> > > >> > > -So its looking in 'ordered locus' table/column for any tuple that >> starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . >> > > The 'like' comparator and % usage are clear with the 'type' >> entrytype_gene_hjid = ? >> > > >> > > -To me it seems the query makes sense so the problem is likely >> elsewhere. >> > > >> > > GenMappBuilder.java: >> > > -I found method doTallies() at code line 895 which: >> > > Instantiates a Configuration called hibernateConfiguration and assigns >> to it the current hibernate configuration >> > > Validates database settings by analyzing hibernateConfiguration >> > > Instantiates a CriterionList for uniprot and assigns to it >> TallyType.UNIPROT >> > > Instantiates a CriterionList for go and assigns to it TallyType.GO >> > > Determines if both xml files exist >> > > Then getTallyResultsXML and getTallyResultsDatabase are run on both >> xml files and their respective CriterionList >> > > Results are then formatted for display in a table. >> > > >> > > -So enum TallyType which means that they are the only valid datatypes >> which TallyEngine accepts... go to know ... >> > > >> > > -Based on the screen shot of Tally Engine it would seem that both >> getTallyResultsXML() and getTallyResultsDatabase() are incorrectly >> returning. Likely due to both using an incorrect query (as we previously >> supposed). But where are the queries?... the more I dig the more I think >> they are in the criterial all the work is done against. >> > > >> > > continuing the review: >> > > getTallyResultsXML() calls Tally Engine instance method >> getXmlFileCounts(xmlFile) >> > > getTallyResultsDatabase() calls Tally Engine instance method >> getDbcounts(new QueryEngine(hibernateConfiguration) >> > > Both of these instanced methods originate from TallyEngine.java... >> > > >> > > TallyEngine.java: >> > > >> > > getXmlFileCounts() calls digestXmlFile() which instantiates a digester >> then processes against criteria... but this quickly becomes confusing and is >> hard to follow >> > > >> > > getDbcounts() then starts a db session and executes a query but then I >> also get a bit lost with my limited db knowledge. >> > > >> > > >> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >> > > >> > > OVERALL I think I'm getting closer to the issues but I still feel as >> if I'm missing some understanding to proceed further. Can you pass along >> some of that Dondi insight and steer me in the right direction? =D >> > > >> > > -DB Tally - Not having taken databases yet certainly is limiting my >> ability determine where the "criteria" are being set and how they are >> followed during session activities. Also is the query we have been looking >> for this whole time in the criteria or someplace else? >> > > >> > > -XML Tally - again is the query contained within the criteria that >> digestXmlFile() uses to parse? >> > > >> > > Richard >> > > >> > > >> > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> >> wrote: >> > > Right, schema issues are unlikely. Most count discrepancies like this >> that I've seen have boiled down to forming the right query. Then, knowing >> the right query (in both XML and SQL), it's a matter of making sure that >> TallyEngine asks that same query. >> > > >> > > John David N. Dionisio, PhD >> > > Associate Professor, Computer Science >> > > Loyola Marymount University >> > > >> > > >> > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >> > > >> > > > OK, so based on your approach: >> > > > >> > > > 1. I'll start with reviewing the queries for xmlpipedb-match and sql >> queries needed for the respective results as you requested. >> > > > >> > > > I was also thinking I may need to review the schema from xml into >> postgres but the issue isn't likely a schema error. The error most likely >> lies in how xmlpipedbutils queries the data from xml source and writes to >> the tables what it returns? >> > > > >> > > > 2. I'll review the code: trace the entrance of tally engine in the >> gmbuilder code then follow it through the xmlpipedbutils. >> > > > >> > > > Richard >> > > > >> > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio < >> do...@lm...> wrote: >> > > > Just wanted to confirm (since I wasn't sure in the first e-mail) --- >> the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in >> SourceForge's Subversion repo. >> > > > >> > > > John David N. Dionisio, PhD >> > > > Associate Professor, Computer Science >> > > > Loyola Marymount University >> > > > >> > > > >> > > > >> > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >> > > > >> > > > > Hi Dondi, >> > > > > >> > > > > So I'm at the point in working with M tuberculosis that I was able >> to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. >> > > > > >> > > > > gmb2b60 Results >> > > > > >> > > > > >> > > > > >> > > > > Now the proverbial question - What next to solve the Ordered Locus >> import/count issue? >> > > > > >> > > > > ********************************************** >> > > > > Here is my thought process: >> > > > > >> > > > > Step 1: How does the import process work at the high level? >> (obviously correct me if I'm wrong) >> > > > > >> > > > > I believe that basically as each XML tag is read, it is placed in >> the proper Postgres table(s) based on some criteria. There is also likely >> some sort of check that each individual tag is in valid XML format unless we >> don't care at this stage (care at export) or maybe the parser just skips >> over and goes on to the next . >> > > > > >> > > > > Step 2: What could be the problem? >> > > > > >> > > > > Either - >> > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? >> > > > > b. Decision criteria of which table they should be added to? >> > > > > >> > > > > ********************************************** >> > > > > >> > > > > I read on the sourceforge wiki: >> > > > > >> > > > > XMLPipeDB has a modular architecture with three components that >> may be used separately or together. XSD-to-DB reads an XSD (XML Schema >> Definition) and automatically generates an SQL schema, Java classes, and >> Hibernate mappings. XMLPipeDB Utilities provides functionality for >> configuring the database, importing data, and performing queries. GenMAPP >> Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible >> Gene Databases based on data from UniProt and Gene Ontology (GO). >> > > > > >> > > > > So I should probably start with the XMLPipeDB Utilities which are >> where? I don't see any in the basic distribution or are they not standalone >> and called from the command line? >> > > > > >> > > > > Thanks! >> > > > > >> > > > > Richard >> > > > >> > > > >> > > > <ATT00001..txt><ATT00002..txt> >> > > >> > > >> > > >> ------------------------------------------------------------------------------ >> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >> XE: >> > > Pinpoint memory and threading errors before they happen. >> > > Find and fix more than 250 security defects in the development cycle. >> > > Locate bottlenecks in serial and parallel code that limit performance. >> > > http://p.sf.net/sfu/intel-dev2devfeb >> > > _______________________________________________ >> > > xmlpipedb-developer mailing list >> > > xml...@li... >> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > > >> > > <ATT00001..txt><ATT00002..txt> >> > >> > >> > >> ------------------------------------------------------------------------------ >> > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >> XE: >> > Pinpoint memory and threading errors before they happen. >> > Find and fix more than 250 security defects in the development cycle. >> > Locate bottlenecks in serial and parallel code that limit performance. >> > http://p.sf.net/sfu/intel-dev2devfeb >> > _______________________________________________ >> > xmlpipedb-developer mailing list >> > xml...@li... >> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > >> > <ATT00001..txt><ATT00002..txt> >> >> >> >> ------------------------------------------------------------------------------ >> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >> Pinpoint memory and threading errors before they happen. >> Find and fix more than 250 security defects in the development cycle. >> Locate bottlenecks in serial and parallel code that limit performance. >> http://p.sf.net/sfu/intel-dev2devfeb >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > > |
From: Richard B. <rbr...@gm...> - 2011-02-18 03:47:21
|
Just got done reading previous email and understand the change in priority. Will work on the missing ID's for now and shelve the the TalleyEngine issue for the moment. Also great about a more formalized weekly meeting. I was going to suggest it myself as it has been slow going so far as maybe i'm a bit too independent in this independent study class =D Will dig further into the missing ID's later tonight and during day tomorrow and report back. Richard On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...>wrote: > Hi Rich, > > No problem. The pertinent line you're referring to, for XML, is this, > right above the line you copied: > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > The slash-separated section is the "path" of XML tags leading to the > element of interest; then, after the ampersand, is a name/value pair for the > desired attribute to count. Note that there is no hint of a *content*-based > filter (nor is there the capability for one, as far as I can tell in the > code). By "content," I mean that we can't specify filters based on what's > *between* the tags. We can only go as far as filter by attribute value, > e.g., type="ordered locus". > > But anyway, as mentioned in the earlier e-mail, let's have the missing IDs > in the .gdb take precedence for now. Please take a look at the > tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how > the ID output can be customized, then let me know if you have any questions > or need to confirm anything. > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: > > > Sorry been slammed with a programming assignment that kept needing > continued iteration and it has been all consuming until last night. But I > did get a chance to work with your comments and review the code again with a > different mind set. > > > > Yes, I examined the gmbuilder.properties file ( the query is also in the > MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a > previous email ) but I don't think I see what you mean regarding the XML > count. > > > > I understood that: mycobacteriumtuberculosis_query_level0=select count(*) > from genenametype where type = 'ordered locus' and value like 'Rv%'; was > the db query but don't see which is the XML count... or do they share the > same query and you are saying that XML count doesn't recognize and therefore > cannot use the 'Rv%' parameter? > > > > Richard > > > > > > > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> > wrote: > > Hi Rich, > > > > Sorry for the delay. Had some distractions coming into the weekend. > > > > You've looked at the code; have you looked at gmbuilder.properties? (I > may have mentioned it a few e-mails ago, just as you were starting to dig > into this) > > > > On the copy I have, the M. tuberculosis block looks like this > (indentation is mine to set it apart): > > > > # Mycobacterium tuberculosis > > mycobacteriumtuberculosis_level_amount=1 > > > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > > > > mycobacteriumtuberculosis_query_level0=select count(*) from > genenametype where type = 'ordered locus' and value like 'Rv%'; > > > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > > > > There, I think, is the rub. Notice that the XML count does not filter on > RV%. The SQL query does. > > > > Unfortunately, I don't think the TallyEngine can include selective > filtering in the XML counts. If the need to do selective filtering on XML > is necessary, then I think we're looking at a new functionality for you to > implement (or, if this throws things off too much, this may have to be noted > somewhere, that the XML vs. database counts may be off because the database > count is doing some text-based filtering but the XML count does not). > > > > What does xmlpipedb-match say? That will at least tell you whether the > 'RV%' count is indeed correct. > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: > > > > > OK here is what I was able to put together from the past few hours of > code review: > > > > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > > > -reveals that after the 2 System table modifications are made adding > species name and link, a PreparedStatement is instantiated which builds and > calls the base query. > > > > > > -The base query called is: ("SELECT value, type " + "FROM genenametype > INNER JOIN entrytype_genetype " + "ON(entrytype_genetype_name_hjid = > entrytype_genetype.hjid) " + "WHERE type = 'ordered locus' and value like > 'Rv%' and entrytype_gene_hjid = ?") > > > > > > -So its looking in 'ordered locus' table/column for any tuple that > starts with Rv (followed by any substring) and entrytype_gene_hjid = ? . > > > The 'like' comparator and % usage are clear with the 'type' > entrytype_gene_hjid = ? > > > > > > -To me it seems the query makes sense so the problem is likely > elsewhere. > > > > > > GenMappBuilder.java: > > > -I found method doTallies() at code line 895 which: > > > Instantiates a Configuration called hibernateConfiguration and assigns > to it the current hibernate configuration > > > Validates database settings by analyzing hibernateConfiguration > > > Instantiates a CriterionList for uniprot and assigns to it > TallyType.UNIPROT > > > Instantiates a CriterionList for go and assigns to it TallyType.GO > > > Determines if both xml files exist > > > Then getTallyResultsXML and getTallyResultsDatabase are run on both xml > files and their respective CriterionList > > > Results are then formatted for display in a table. > > > > > > -So enum TallyType which means that they are the only valid datatypes > which TallyEngine accepts... go to know ... > > > > > > -Based on the screen shot of Tally Engine it would seem that both > getTallyResultsXML() and getTallyResultsDatabase() are incorrectly > returning. Likely due to both using an incorrect query (as we previously > supposed). But where are the queries?... the more I dig the more I think > they are in the criterial all the work is done against. > > > > > > continuing the review: > > > getTallyResultsXML() calls Tally Engine instance method > getXmlFileCounts(xmlFile) > > > getTallyResultsDatabase() calls Tally Engine instance method > getDbcounts(new QueryEngine(hibernateConfiguration) > > > Both of these instanced methods originate from TallyEngine.java... > > > > > > TallyEngine.java: > > > > > > getXmlFileCounts() calls digestXmlFile() which instantiates a digester > then processes against criteria... but this quickly becomes confusing and is > hard to follow > > > > > > getDbcounts() then starts a db session and executes a query but then I > also get a bit lost with my limited db knowledge. > > > > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > > > > > OVERALL I think I'm getting closer to the issues but I still feel as if > I'm missing some understanding to proceed further. Can you pass along some > of that Dondi insight and steer me in the right direction? =D > > > > > > -DB Tally - Not having taken databases yet certainly is limiting my > ability determine where the "criteria" are being set and how they are > followed during session activities. Also is the query we have been looking > for this whole time in the criteria or someplace else? > > > > > > -XML Tally - again is the query contained within the criteria that > digestXmlFile() uses to parse? > > > > > > Richard > > > > > > > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> > wrote: > > > Right, schema issues are unlikely. Most count discrepancies like this > that I've seen have boiled down to forming the right query. Then, knowing > the right query (in both XML and SQL), it's a matter of making sure that > TallyEngine asks that same query. > > > > > > John David N. Dionisio, PhD > > > Associate Professor, Computer Science > > > Loyola Marymount University > > > > > > > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > > > > > > > OK, so based on your approach: > > > > > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and sql > queries needed for the respective results as you requested. > > > > > > > > I was also thinking I may need to review the schema from xml into > postgres but the issue isn't likely a schema error. The error most likely > lies in how xmlpipedbutils queries the data from xml source and writes to > the tables what it returns? > > > > > > > > 2. I'll review the code: trace the entrance of tally engine in the > gmbuilder code then follow it through the xmlpipedbutils. > > > > > > > > Richard > > > > > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio < > do...@lm...> wrote: > > > > Just wanted to confirm (since I wasn't sure in the first e-mail) --- > the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in > SourceForge's Subversion repo. > > > > > > > > John David N. Dionisio, PhD > > > > Associate Professor, Computer Science > > > > Loyola Marymount University > > > > > > > > > > > > > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: > > > > > > > > > Hi Dondi, > > > > > > > > > > So I'm at the point in working with M tuberculosis that I was able > to exactly reproduce Dr. Dahlquist's problematic TallyEngine results. > > > > > > > > > > gmb2b60 Results > > > > > > > > > > > > > > > > > > > > Now the proverbial question - What next to solve the Ordered Locus > import/count issue? > > > > > > > > > > ********************************************** > > > > > Here is my thought process: > > > > > > > > > > Step 1: How does the import process work at the high level? > (obviously correct me if I'm wrong) > > > > > > > > > > I believe that basically as each XML tag is read, it is placed in > the proper Postgres table(s) based on some criteria. There is also likely > some sort of check that each individual tag is in valid XML format unless we > don't care at this stage (care at export) or maybe the parser just skips > over and goes on to the next . > > > > > > > > > > Step 2: What could be the problem? > > > > > > > > > > Either - > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)? > > > > > b. Decision criteria of which table they should be added to? > > > > > > > > > > ********************************************** > > > > > > > > > > I read on the sourceforge wiki: > > > > > > > > > > XMLPipeDB has a modular architecture with three components that may > be used separately or together. XSD-to-DB reads an XSD (XML Schema > Definition) and automatically generates an SQL schema, Java classes, and > Hibernate mappings. XMLPipeDB Utilities provides functionality for > configuring the database, importing data, and performing queries. GenMAPP > Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible > Gene Databases based on data from UniProt and Gene Ontology (GO). > > > > > > > > > > So I should probably start with the XMLPipeDB Utilities which are > where? I don't see any in the basic distribution or are they not standalone > and called from the command line? > > > > > > > > > > Thanks! > > > > > > > > > > Richard > > > > > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > > > > > ------------------------------------------------------------------------------ > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio > XE: > > > Pinpoint memory and threading errors before they happen. > > > Find and fix more than 250 security defects in the development cycle. > > > Locate bottlenecks in serial and parallel code that limit performance. > > > http://p.sf.net/sfu/intel-dev2devfeb > > > _______________________________________________ > > > xmlpipedb-developer mailing list > > > xml...@li... > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > <ATT00001..txt><ATT00002..txt> > > > > > > > ------------------------------------------------------------------------------ > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > > Pinpoint memory and threading errors before they happen. > > Find and fix more than 250 security defects in the development cycle. > > Locate bottlenecks in serial and parallel code that limit performance. > > http://p.sf.net/sfu/intel-dev2devfeb > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > <ATT00001..txt><ATT00002..txt> > > > > ------------------------------------------------------------------------------ > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: > Pinpoint memory and threading errors before they happen. > Find and fix more than 250 security defects in the development cycle. > Locate bottlenecks in serial and parallel code that limit performance. > http://p.sf.net/sfu/intel-dev2devfeb > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > |