Re: [XMLPipeDB-developer] 499 - PROBLEM - M tuberculosis xml tag importation
Brought to you by:
kdahlquist,
zugzugglug
From: Richard B. <rbr...@gm...> - 2011-02-23 19:46:46
|
Great! Now the thumbs up from Dondi and we can get cracking. I'll dig into sql syntax for negation of a specific tuple while we wait for his response. Richard On Wed, Feb 23, 2011 at 11:37 AM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > That sounds good to me. > > Best, > Dr. D > > > At 11:16 AM 2/23/2011, you wrote: > > Hmm... possible that the method to split up records has been globally > implemented for export from Postgres? > > Since the records are automatically splitting when exported from postgres > to gdb file, all we need to do is exclude Rv3346/55c 'ORF' from "select > count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') > and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';" and we should be set. > > That seem correct Dondi? > > Richard > > On Wed, Feb 23, 2011 at 11:04 AM, Kam Dahlquist <kda...@lm...> > wrote: > Hi, > > I checked these out. They all should be separated and kept as separate > OrderedLocusNames records. In fact, the gdb already has them separated into > individual records, so it is only the tally for Postgres that is off, they > were correctly exported into the gdb. > > Best, > Dr. D > > > At 07:31 PM 2/22/2011, Richard Brous wrote: > > OK I found the culprits: > > There are 3 ordered locus gene ID's that have slashes. As was previously > discussed it seems XML Match is counting each one twice which would inflate > the count by 3. > > Rv2561/Rv2562 'ordered locus' > Rv2880c/Rv2879c 'ordered locus' > Rv3021c/Rv3022c 'ordered locus' > > And as was previously noted, Rv3346/55c 'ORF' should be excluded. > > This will bring the both XML and sql db queries counts in sync at 4057. > > By my math: > > ------------------------------------------------------------------------------------------------------------ > XML tags read by Match: > 4066 unique matches - 6 genes excluded by Dr. D - 3 duplicates caused by > slashes = total 4057 genes > > ------------------------------------------------------------------------------------------------------------ > sql query of the db tables: > 3988 ordered locus + 69 ORF = total 4057 genes > ------------------------------------------------------------------------------------------------------------ > > > TO MOVE FORWARD: > > 1. We need to decide how to address the slash ID's (should we keep and > split into separate tuples or should they be excluded) > > 2. Adjust the queries to reflect what is needed and test with raw sql. > > 3. Update the queries on > MycobacteriumTuberculosisUniProtSpeciesProfile.java > > 4. Export a new Mtb gene database for testing. > > > Richard > > > > On Tue, Feb 22, 2011 at 5:12 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > I don't know why the numbers are off. I think the only way to find out is > to get the results of the Postgres query and line it up next to the match > results and see what is different. If you can send me the lists of IDs from > the Postgres query and the match results, I can check them. > Best, Dr. D > > At 04:57 PM 2/22/2011, you wrote: > > Here is where I am on the numbers: ------------------------------------------------------------------------------------------------------------ > XML tags read by Match: 4066 unique matches - 6 genes excluded by Dr. D = > total 4060 genes ------------------------------------------------------------------------------------------------------------ > sql query of the db tables: 3988 ordered locus + 69 ORF = total 4057 genes > ------------------------------------------------------------------------------------------------------------ > So by my count we are off by 3 genes Maybe XML Match is counting slashed > genes as 2 separate genes? IE. so if it encountered 3 slashed genes it > would in effect be counting those 3 as 6? Richard > On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, I compared this list of gene IDs with the list I had on the Testing > Report on the wiki and found the following. Your list has 70 IDs. 69 of > them are identical to what is in my list of IDs in the Testing report. Indeed, > the only difference is the one with the slash. "Rv3346/55c" I looked this > ID up at UniProt.org as Rv3346. It appears in the record for UniProt ID > O50384. For that UniProt record, the ID referred to in the OrderedLocus tag > is Rv3355c which does appear in the gdb already. Looking up Rv3346 on > Tuberculist and the Stanford TB database, Rv3346 is not a real gene ID. So > this entire record with the slash in it needs to be excluded and all the > rest of them need to be captured. > I think then, that the numbers should add up correctly, is this true? Best, > Dr. D > At 01:09 PM 2/22/2011, Richard Brous wrote: > > attachement now included... On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous > <rbr...@gm...> wrote: Here is an export of the genes found using: > select * from genenametype where type = 'ORF' and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also attached as a csv file. > 647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 > 5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 > 647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 > 647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 > 647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 > 647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 > 211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 > 648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 > 648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 > 648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 > 648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 > 648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 > 649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 > 649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 > 649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 > 649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 > 650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 > 174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 > 174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 > 312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 > 331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 > 445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 > 621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 > 622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 > 622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 > 622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 > 622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 > 623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 > 623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 > 623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 > 623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 > 624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 > 624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 > 625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 > 625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 > 626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 > 626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 > 626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 > 626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 > 627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 > 627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 > 627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 > 627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 > 627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 > 627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 > 627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 > 628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 > 629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 > 629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 > 629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 > 630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 > 630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 > 631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 > 632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 > 632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 > 633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 > 633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 > 645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 > 645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 > 645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 > 645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 > 645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 > 646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 > 646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 > 646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 > 646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 > 646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 > 646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 > 647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 > 647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 *****The > item of note I see is the gene with the slash separating gene id's which > refer to the same gene. Richard On Mon, Feb 21, 2011 at 11:11 PM, > Richard Brous <rbr...@gm...> wrote: Understood. I'll check in with > Dr. D in the afternoon tomorrow and discuss. Richard > > On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <do...@lm...> > wrote: Hi Rich, Addressing the release business first, let's put it this > way: if the remaining loose ends can be addressed by tomorrow, we can > probably wait until then. If unexpected snags are encountered, then it > would be worthwhile to release whatever you have. With that in mind, > considering that you pretty much know the patterns of the IDs that are > needed, I think it will only take a little digital forensic work now to > figure out exactly which IDs are still needed. Once you know what those > are, you should: 1. Find where they are in the XML file. 2. Knowing the > XML location, find the corresponding table in the relational database (table > names are generally derived from tag/element names). 3. Knowing the table > in the database, write or extend the SpeciesProfile query to retrieve that > data. For the ID that must *not* be included, again it's a matter of > tracking down what this ID is. Knowing this straggler, you can then consult > with Dr. Dahlquist if this ID is truly a unique one-off, or is > representative of a pattern that we'll want to exclude. Either way, this ID > can be omitted by using "not" or "<>" or possibly "not like" or "not ~" > (check PostgreSQL where clause syntax to see where the negation can be > applied). John David N. Dionisio, PhD Associate Professor, Computer > Science Loyola Marymount University > On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: > actually i had a typo > (emailing from desktop system but testing on my laptop... typed correctly > here but wrong in pgadmin) but the results make much more sense now. > > > > select count (*) from genenametype where (type = 'ordered locus' or type = > 'ORF') and value like 'Rv%'; > returns 4058 > > select count (*) from > genenametype where (type = 'ordered locus' or type = 'ORF') and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 4058 > > > -------------------------------------------------------------------------------------------------------------------------- > > > > Continuing forward - > > The testing report says that 4066 unique > matches exist in XML but 6 of them were eliminated by Dr. D leaving the > desired number at 4060. > > So now we are only 2 genes short with the > query returning 4058... which is also (conveniently) the sum of the two > separate queries of 'ordered locus' and 'ORF' respectively. > > But > recall that Dr. D said that only 69 genes of the missing 75 were tagged > 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. Adding > that into missing genes puts us 3 short... > > Should I make the changes > to the code and export a gdb so that analysis can be done or wait until we > work this through further? > > Richard > > > > On Mon, Feb 21, 2011 at > 10:04 PM, John David N. Dionisio <do...@lm...> wrote: > Hi Rich, > > The > second form should have worked actually. What exactly was the error? > > > John David N. Dionisio, PhD > Associate Professor, Computer Science > > Loyola Marymount University > > > > On Feb 22, 2011, at 1:01 AM, Richard > Brous wrote: > > > hmm not taking parenthesis where I thought they should > go... syntax error > > > > select count (*) from genenametype where type = > ('ordered locus' or 'ORF') and value like 'Rv%'; > > also tried > > > select count (*) from genenametype where (type = 'ordered locus' or type = > 'ORF') and value like 'Rv%'; > > > > > > > > > > > > On Mon, Feb 21, 2011 > at 9:40 PM, Richard Brous <rbr...@gm...> wrote: > > ah yes... i see > it... > > > > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio > <do...@lm...> wrote: > > Watch your parentheses: "and" has greater > precedence than "or" :) > > > > > > John David N. Dionisio, PhD > > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> > wrote: > > > >> OK, so here are my query results from raw SQL: > >> > >> > 1. using: like 'Rv%' > >> > >> select count (*) from genenametype where > type = 'ordered locus' and value like 'Rv%'; > >> returns 3988 > >> > >> > select count (*) from genenametype where type = 'ORF' and value like 'Rv%'; > > >> returns 70 > >> > >> select count (*) from genenametype where type = > 'ordered locus' or type = 'ORF' and value like 'Rv%'; > >> returns 7011 > > >> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' > > >> > >> select count (*) from genenametype where type = 'ordered locus' > and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 3988 > >> > >> > select count (*) from genenametype where type = 'ordered locus' or type = > 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 7011 > >> > > >> select count (*) from genenametype where type = 'ORF' and value ~ > '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 70 > >> > >> Conclusions: > > >> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 > genes were were looking for plus one more (maybe the count for missing genes > is off by 1?). > >> > >> 2. Combining the two types in a single query did > not produce the results that I expected (7011? - how did that happen????) so > this is likely not our solution... unless of course the query syntax isn't > actually doing what I think it is... > >> > >> 3. I would think the best > course of action is to serialy run two separate queries to capture all the > required genes, then removing the one unneeded gene if its truly not wanted. > > >> > >> What do you think? > >> > >> Richard > >> > >> > >> On Mon, Feb > 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...> wrote: > >> I > don't recall the exact details of the missing 69, but if your query > successfully returns them in raw SQL, then this is worth a try. You can > integrate into the same query as long as the same columns are returned, > which is the case here AFAIK, so go ahead and extend the existing query. > > >> > >> > >> John David N. Dionisio, PhD > >> Associate Professor, > Computer Science > >> Loyola Marymount University > >> > >> On Feb 21, > 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: > >> > >>> So > here is the appropriate code snippet from > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> public > >>> > > >>> TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) throws > SQLException, InvalidParameterException { > >>> > >>> // Build the base > query; we only use "ordered locus" and we only want > >>> > >>> // IDs > that begin with "Rv." > >>> PreparedStatement ps = > ConnectionManager.getRelationalDBConnection().prepareStatement( > >>> > > >>> "SELECT value, type " + > >>> > >>> "FROM genenametype INNER JOIN > entrytype_genetype " + > >>> > >>> "ON (entrytype_genetype_name_hjid = > entrytype_genetype.hjid) " + > >>> > >>> "WHERE type = 'ordered locus' and > value like 'Rv%' and entrytype_gene_hjid = ?"); > >>> ResultSet result; > > >>> > >>> > >>> > >>> for (Row row : primarySystemTableManager.getRows()) > { > >>> ps.setInt(1, Integer.parseInt(row.getValue( > >>> > >>> "UID"))); > > >>> result = ps.executeQuery(); > >>> > >>> > >>> > >>> // We actually > want to keep the case where multiple ordered locus > >>> > >>> // names > appear. > >>> > >>> while (result.next()) { > >>> > >>> // We want this > name to appear in the OrderedLocusNames > >>> > >>> // system table. > >>> > > >>> for (String id : result.getString("value").split("/")) { > >>> > tableManager.submit( > >>> > >>> "OrderedLocusNames", QueryType.insert, > new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, > { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) }, { > "UID", row.getValue("UID") } }); > >>> } > >>> > >>> } > >>> > >>> } > >>> > > >>> > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > >>> So now we want to build the base query which uses "ordered locus" and > "orf" and we only want IDs that begin with "Rv". > >>> > >>> I know there > are more comprehensive ways to search for gene ID's by matching gene ID > prefix but "like Rv%" seemed to work thus far, we just need to tell it to > search for XML tag type orf in addition to ordered locus. > >>> > >>> > "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and > entrytype_gene_hjid = ? " > >>> > >>> Here is a stab at it.... This part > of our class was right as the server went down and my submission for week 6 > assignment I can't seem to find. > >>> > >>> Is it possible to have two > different types in the same query or should we rewrite a separate query for > the orf tag? > >>> > >>> Richard > >>> > >>> > >>> > >>> On Sun, Feb 20, > 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote: > >>> > >>> > thanks and will do as directed. > >>> > >>> My previous, last paragraph > comment - A way for programming code in email holding its format in a mail > message similarly to how you can post code on forum pages? > >>> > >>> > <code> > >>> blah > >>> blah > >>> blah > >>> </code> > >>> > >>> thanks! > > >>> > >>> Richard > >>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David > N. Dionisio <do...@lm...> wrote: > >>> > >>> Greetings, > >>> > >>> > Actually, gmbuilder.properties is for the TallyEngine only. When dealing > with .gdb exports, look *only* at the SpeciesProfile class. So, to find > those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile > code, that needs to be changed. > >>> > >>> Your take on how > gmbuilder.properties is used, however, is understandable. It makes sense to > assume that the TallyEngine code *and* the ID export code are based on the > same characterization of the needed IDs. This replication is originally a > historical artifact: SpeciesProfile was done first, and then TallyEngine was > done later by another student. > >>> > >>> However, there are other > factors beyond history that sort of necessitate this duplication of desired > IDs: (skip the two bullets below if you'd rather cut to the chase of the > work to be done, and discuss design issues later) > >>> > >>> - The actual > XML import code is a black box: this is the "canned" JAXB library actually > in action, and not our code at all. Plus, the XML import code really does > not filter (nor should it), since the goal of the XML->relational database > step is to fully capture the XML data in the relational database. So, XML > count is necessarily separated from XML import. > >>> > >>> - The notion > of a declarative mechanism for extracting IDs from the relational database > (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at > the same time there is value in the arbitrary computation that can be done > with Java (case in point: export two versions of an ID, with and without > periods). This is not to say that it is impossible to do this > declaratively, but let's just say that the procedural approach exists here > and now, and a declarative approach will need more thought. > >>> > >>> > These, and other factors, are good thoughts to hold onto and would be worthy > of a good meeting discussion sometime, but bottom line for now: modifying > the export behavior is a matter of editing the *SpeciesProfile* Java code, > and not the gmbuilder.properties file. Turn your attention to that code. > > >>> > >>> Now, as to annotating your code...I'd just put in code comments > :) Or did you mean something else by tagging code in e-mail? > >>> > >>> > John David N. Dionisio, PhD > >>> Associate Professor, Computer Science > > >>> Loyola Marymount University > >>> > >>> > >>> > >>> > >>> On Feb 21, > 2011, at 12:38 AM, Richard Brous wrote: > >>> > >>> > also, how do I tag > code in email so it holds its formatting? I tried a few suggestions I found > on the web but they aren't holding formatting or i'm just doing it wrong ;-D > > >>> > > >>> > Richard > >>> > > >>> > On Sun, Feb 20, 2011 at 9:35 PM, > Richard Brous <rbr...@gm...> wrote: > >>> > OK, have some updates > and some suggestions: > >>> > > >>> > On Friday Dr. Dahlquist and I sat > down and reviewed the gene testing report. We verified that XML match does > indeed find 4066 unique matches - 75 of which are not in the gdb and need to > be. > >>> > > >>> > Dr. Dahlquist informed me that she was the one who > completed the gene db testing report, not a previous student of BIO367 and > had already verified which genes were missing and where they were to be > found. I had (mistakenly) assumed that since a student had performed the > gene database testing I had to redo all of the verification. > >>> > > >>> > > So that said, of the 75 genes missing - 69 need to be included and 6 > excluded. > >>> > Per the gene db testing report: "69 of them have an "a", > "b", or "d" suffix. They are all found in the ORF tag and need to be > included in the gdb." > >>> > > >>> > To solve this we need to add > additional search criteria into the M. tuberculosis section in > gmbuilder.properties below: > >>> > # Mycobacterium tuberculosis > >>> > > > >>> > mycobacteriumtuberculosis_level_amount= > >>> > > >>> > 1 > >>> > > > >>> > mycobacteriumtuberculosis_element_level0= > >>> > > >>> > > uniprot/entry/gene/name&type&ordered locus > >>> > > >>> > > mycobacteriumtuberculosis_query_level0= > >>> > > >>> > select count(*) > from genenametype where type = 'ordered locus' and value like 'Rv%'; > >>> > > > >>> > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > > Ordered Locus > >>> > SOLUTIONS: > >>> > > >>> > 1. So am i correct in my > understanding that the second line is the query used by TallyEngine to read > the XML file? If so then this is the issue we need to table for the moment > until we get the gbd verified and re-released. We will revisit this to > discover why it is not only reporting incorrectly but also why its added a > second row of Ordered Locus on the TallyEngine results page. > >>> > > >>> > > 2. The third line is the SQL query used by postgres during the export from > XML to gdb. To find and get the ORF tagged genes could we not add the > following lines and change the count in the first line: > >>> > > >>> > > > >>> > # Mycobacterium tuberculosis > >>> > > >>> > > mycobacteriumtuberculosis_level_amount=2 > >>> > > >>> > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf > > >>> > > >>> > > >>> > mycobacteriumtuberculosis_query_level0= > >>> > > > >>> > select count(*) from genenametype where type = 'ordered locus'; > > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from > genenametype where type = 'orf'; > >>> > > >>> > > >>> > > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > Ordered Locus > > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > >>> > > > >>> > > ---------------------------------------------------------------------------------------------------------------------------- > > >>> > > >>> > Of course these queries would have be manually verified > prior to making these changes but this seems like we are moving in the right > direction. > >>> > > >>> > Richard > >>> > > >>> > > >>> > On Thu, Feb 17, > 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: > >>> > Just > got done reading previous email and understand the change in priority. > > >>> > > >>> > Will work on the missing ID's for now and shelve the the > TalleyEngine issue for the moment. > >>> > > >>> > Also great about a more > formalized weekly meeting. I was going to suggest it myself as it has been > slow going so far as maybe i'm a bit too independent in this independent > study class =D > >>> > > >>> > Will dig further into the missing ID's > later tonight and during day tomorrow and report back. > >>> > > >>> > > Richard > >>> > > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. > Dionisio <do...@lm...> wrote: > >>> > Hi Rich, > >>> > > >>> > No > problem. The pertinent line you're referring to, for XML, is this, right > above the line you copied: > >>> > > >>> > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > >>> > The slash-separated section is the "path" of XML > tags leading to the element of interest; then, after the ampersand, is a > name/value pair for the desired attribute to count. Note that there is no > hint of a *content*-based filter (nor is there the capability for one, as > far as I can tell in the code). By "content," I mean that we can't specify > filters based on what's *between* the tags. We can only go as far as filter > by attribute value, e.g., type="ordered locus". > >>> > > >>> > But > anyway, as mentioned in the earlier e-mail, let's have the missing IDs in > the .gdb take precedence for now. Please take a look at the tuberculosis, > A. thaliana, and P. falciparum profiles to get an idea for how the ID output > can be customized, then let me know if you have any questions or need to > confirm anything. > >>> > > >>> > John David N. Dionisio, PhD > >>> > > Associate Professor, Computer Science > >>> > Loyola Marymount University > > >>> > > >>> > > >>> > > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous > wrote: > >>> > > >>> > > Sorry been slammed with a programming assignment > that kept needing continued iteration and it has been all consuming until > last night. But I did get a chance to work with your comments and review the > code again with a different mind set. > >>> > > > >>> > > Yes, I examined > the gmbuilder.properties file ( the query is also in the > MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a > previous email ) but I don't think I see what you mean regarding the XML > count. > >>> > > > >>> > > I understood that: > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype > where type = 'ordered locus' and value like 'Rv%'; was the db query but > don't see which is the XML count... or do they share the same query and you > are saying that XML count doesn't recognize and therefore cannot use the > 'Rv%' parameter? > >>> > > > >>> > > Richard > >>> > > > >>> > > > >>> > > > > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio < > do...@lm...> wrote: > >>> > > Hi Rich, > >>> > > > >>> > > Sorry for the > delay. Had some distractions coming into the weekend. > >>> > > > >>> > > > You've looked at the code; have you looked at gmbuilder.properties? (I may > have mentioned it a few e-mails ago, just as you were starting to dig into > this) > >>> > > > >>> > > On the copy I have, the M. tuberculosis block > looks like this (indentation is mine to set it apart): > >>> > > > >>> > > > # Mycobacterium tuberculosis > >>> > > > mycobacteriumtuberculosis_level_amount=1 > >>> > > > >>> > > > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered > locus > >>> > > > >>> > > > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype > where type = 'ordered locus' and value like 'Rv%'; > >>> > > > >>> > > > mycobacteriumtuberculosis_table_name_level0=Ordered Locus > >>> > > > > >>> > > There, I think, is the rub. Notice that the XML count does > not filter on RV%. The SQL query does. > >>> > > > >>> > > Unfortunately, > I don't think the TallyEngine can include selective filtering in the XML > counts. If the need to do selective filtering on XML is necessary, then I > think we're looking at a new functionality for you to implement (or, if this > throws things off too much, this may have to be noted somewhere, that the > XML vs. database counts may be off because the database count is doing some > text-based filtering but the XML count does not). > >>> > > > >>> > > What > does xmlpipedb-match say? That will at least tell you whether the 'RV%' > count is indeed correct. > >>> > > > >>> > > John David N. Dionisio, PhD > > >>> > > Associate Professor, Computer Science > >>> > > Loyola Marymount > University > >>> > > > >>> > > > >>> > > > >>> > > On Feb 11, 2011, at > 4:52 PM, Richard Brous wrote: > >>> > > > >>> > > > OK here is what I was > able to put together from the past few hours of code review: > >>> > > > > > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> > > > > -reveals that after the 2 System table modifications are made adding species > name and link, a PreparedStatement is instantiated which builds and calls > the base query. > >>> > > > > >>> > > > -The base query called is: > ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " > + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE > type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") > > >>> > > > > >>> > > > -So its looking in 'ordered locus' table/column for > any tuple that starts with Rv (followed by any substring) and > entrytype_gene_hjid = ? . > >>> > > > The 'like' comparator and % usage > are clear with the 'type' entrytype_gene_hjid = ? > >>> > > > > >>> > > > > -To me it seems the query makes sense so the problem is likely elsewhere. > > >>> > > > > >>> > > > GenMappBuilder.java: > >>> > > > -I found method > doTallies() at code line 895 which: > >>> > > > Instantiates a > Configuration called hibernateConfiguration and assigns to it the current > hibernate configuration > >>> > > > Validates database settings by > analyzing hibernateConfiguration > >>> > > > Instantiates a CriterionList > for uniprot and assigns to it TallyType.UNIPROT > >>> > > > Instantiates a > CriterionList for go and assigns to it TallyType.GO > >>> > > > Determines > if both xml files exist > >>> > > > Then getTallyResultsXML and > getTallyResultsDatabase are run on both xml files and their respective > CriterionList > >>> > > > Results are then formatted for display in a > table. > >>> > > > > >>> > > > -So enum TallyType which means that they > are the only valid datatypes which TallyEngine accepts... go to know ... > > >>> > > > > >>> > > > -Based on the screen shot of Tally Engine it would > seem that both getTallyResultsXML() and getTallyResultsDatabase() are > incorrectly returning. Likely due to both using an incorrect query (as we > previously supposed). But where are the queries?... the more I dig the more > I think they are in the criterial all the work is done against. > >>> > > > > > >>> > > > continuing the review: > >>> > > > getTallyResultsXML() > calls Tally Engine instance method getXmlFileCounts(xmlFile) > >>> > > > > getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new > QueryEngine(hibernateConfiguration) > >>> > > > Both of these instanced > methods originate from TallyEngine.java... > >>> > > > > >>> > > > > TallyEngine.java: > >>> > > > > >>> > > > getXmlFileCounts() calls > digestXmlFile() which instantiates a digester then processes against > criteria... but this quickly becomes confusing and is hard to follow > >>> > > > > > >>> > > > getDbcounts() then starts a db session and executes a > query but then I also get a bit lost with my limited db knowledge. > >>> > > > > > >>> > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > >>> > > > > >>> > > > OVERALL I think I'm getting closer to the issues > but I still feel as if I'm missing some understanding to proceed further. > Can you pass along some of that Dondi insight and steer me in the right > direction? =D > >>> > > > > >>> > > > -DB Tally - Not having taken > databases yet certainly is limiting my ability determine where the > "criteria" are being set and how they are followed during session > activities. Also is the query we have been looking for this whole time in > the criteria or someplace else? > >>> > > > > >>> > > > -XML Tally - again > is the query contained within the criteria that digestXmlFile() uses to > parse? > >>> > > > > >>> > > > Richard > >>> > > > > >>> > > > > >>> > > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> > wrote: > >>> > > > Right, schema issues are unlikely. Most count > discrepancies like this that I've seen have boiled down to forming the right > query. Then, knowing the right query (in both XML and SQL), it's a matter > of making sure that TallyEngine asks that same query. > >>> > > > > >>> > > > > John David N. Dionisio, PhD > >>> > > > Associate Professor, Computer > Science > >>> > > > Loyola Marymount University > >>> > > > > >>> > > > > > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > >>> > > > > > >>> > > > > OK, so based on your approach: > >>> > > > > > >>> > > > > 1. > I'll start with reviewing the queries for xmlpipedb-match and sql queries > needed for the respective results as you requested. > >>> > > > > > >>> > > > > > I was also thinking I may need to review the schema from xml into > postgres but the issue isn't likely a schema error. The error most likely > lies in how xmlpipedbutils queries the data from xml source and writes to > the tables what it returns? > >>> > > > > > >>> > > > > 2. I'll review the > code: trace the entrance of tally engine in the gmbuilder code then follow > it through the xmlpipedbutils. > >>> > > > > > >>> > > > > Richard > >>> > > > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. > Dionisio <do...@lm...> wrote: > >>> > > > > Just wanted to confirm > (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source > code is in trunk/xmlpipedbutils in SourceForge's Subversion repo. > >>> > > > > > > >>> > > > > John David N. Dionisio, PhD > >>> > > > > Associate > Professor, Computer Science > >>> > > > > Loyola Marymount University > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > On Feb 5, 2011, at > 10:02 AM, Richard Brous wrote: > >>> > > > > > >>> > > > > > Hi Dondi, > > >>> > > > > > > >>> > > > > > So I'm at the point in working with M > tuberculosis that I was able to exactly reproduce Dr. Dahlquist's > problematic TallyEngine results. > >>> > > > > > > >>> > > > > > gmb2b60 > Results > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > Now the proverbial question - What next to solve the Ordered Locus > import/count issue? > >>> > > > > > > >>> > > > > > > ********************************************** > >>> > > > > > Here is my > thought process: > >>> > > > > > > >>> > > > > > Step 1: How does the > import process work at the high level? (obviously correct me if I'm wrong) > > >>> > > > > > > >>> > > > > > I believe that basically as each XML tag is > read, it is placed in the proper Postgres table(s) based on some criteria. > There is also likely some sort of check that each individual tag is in valid > XML format unless we don't care at this stage (care at export) or maybe the > parser just skips over and goes on to the next . > >>> > > > > > > >>> > > > > > > Step 2: What could be the problem? > >>> > > > > > > >>> > > > > > > Either - > >>> > > > > > a. XML tags are being parsed incorrectly > (ignored/skipped)? > >>> > > > > > b. Decision criteria of which table > they should be added to? > >>> > > > > > > >>> > > > > > > ********************************************** > >>> > > > > > > >>> > > > > > > I read on the sourceforge wiki: > >>> > > > > > > >>> > > > > > > XMLPipeDB has a modular architecture with three components that may be used > separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and > automatically generates an SQL schema, Java classes, and Hibernate mappings. > XMLPipeDB Utilities provides functionality for configuring the database, > importing data, and performing queries. GenMAPP Builder is based on the > XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on > data from UniProt and Gene Ontology (GO). > >>> > > > > > > >>> > > > > > > So I should probably start with the XMLPipeDB Utilities which are where? I > don't see any in the basic distribution or are they not standalone and > called from the command line? > >>> > > > > > > >>> > > > > > Thanks! > > >>> > > > > > > >>> > > > > > Richard > > > > > > ------------------------------------------------------------------------------ > Free Software Download: Index, Search & Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT > data > generated by your applications, servers and devices whether physical, > virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > ------------------------------------------------------------------------------ > Free Software Download: Index, Search & Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT > data > generated by your applications, servers and devices whether physical, > virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > ------------------------------------------------------------------------------ > Free Software Download: Index, Search & Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT > data > generated by your applications, servers and devices whether physical, > virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |