Re: [XMLPipeDB-developer] 499 - PROBLEM - M tuberculosis xml tag importation
Brought to you by:
kdahlquist,
zugzugglug
From: Kam D. <kda...@lm...> - 2011-02-23 19:37:42
|
Hi, That sounds good to me. Best, Dr. D At 11:16 AM 2/23/2011, you wrote: >Hmm... possible that the method to split up records has been >globally implemented for export from Postgres? > >Since the records are automatically splitting when exported from >postgres to gdb file, all we need to do is exclude Rv3346/55c 'ORF' >from "select count (*) from genenametype where (type = 'ordered >locus' or type = 'ORF') and value ~ >'[Rr][Vv][0-9][0-9][0-9][0-9]*';" and we should be set. > >That seem correct Dondi? > >Richard > >On Wed, Feb 23, 2011 at 11:04 AM, Kam Dahlquist ><<mailto:kda...@lm...>kda...@lm...> wrote: >Hi, > >I checked these out. They all should be separated and kept as >separate OrderedLocusNames records. In fact, the gdb already has >them separated into individual records, so it is only the tally for >Postgres that is off, they were correctly exported into the gdb. > >Best, >Dr. D > > >At 07:31 PM 2/22/2011, Richard Brous wrote: >>OK I found the culprits: >> >>There are 3 ordered locus gene ID's that have slashes. As was >>previously discussed it seems XML Match is counting each one twice >>which would inflate the count by 3. >> >>Rv2561/Rv2562 'ordered locus' >>Rv2880c/Rv2879c 'ordered locus' >>Rv3021c/Rv3022c 'ordered locus' >> >>And as was previously noted, Rv3346/55c 'ORF' should be excluded. >> >>This will bring the both XML and sql db queries counts in sync at 4057. >> >>By my math: >>------------------------------------------------------------------------------------------------------------ >>XML tags read by Match: >>4066 unique matches - 6 genes excluded by Dr. D - 3 duplicates >>caused by slashes = total 4057 genes >>------------------------------------------------------------------------------------------------------------ >>sql query of the db tables: >>3988 ordered locus + 69 ORF = total 4057 genes >>------------------------------------------------------------------------------------------------------------ >> >> >>TO MOVE FORWARD: >> >>1. We need to decide how to address the slash ID's (should we keep >>and split into separate tuples or should they be excluded) >> >>2. Adjust the queries to reflect what is needed and test with raw sql. >> >>3. Update the queries on MycobacteriumTuberculosisUniProtSpeciesProfile.java >> >>4. Export a new Mtb gene database for testing. >> >> >>Richard >> >> >> >>On Tue, Feb 22, 2011 at 5:12 PM, Kam Dahlquist >><<mailto:kda...@lm...>kda...@lm...> wrote: >>Hi, >>I don't know why the numbers are off. I think the only way to find >>out is to get the results of the Postgres query and line it up next >>to the match results and see what is different. If you can send me >>the lists of IDs from the Postgres query and the match results, I >>can check them. >>Best, >>Dr. D >> >>At 04:57 PM 2/22/2011, you wrote: >>>Here is where I am on the numbers: >>>------------------------------------------------------------------------------------------------------------ >>> >>>XML tags read by Match: >>>4066 unique matches - 6 genes excluded by Dr. D = total 4060 genes >>>------------------------------------------------------------------------------------------------------------ >>> >>>sql query of the db tables: >>>3988 ordered locus + 69 ORF = total 4057 genes >>>------------------------------------------------------------------------------------------------------------ >>> >>>So by my count we are off by 3 genes >>> >>>Maybe XML Match is counting slashed genes as 2 separate genes? >>> >>>IE. so if it encountered 3 slashed genes it would in effect be >>>counting those 3 as 6? >>> >>>Richard >>>On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist >>><<mailto:kda...@lm...>kda...@lm...> wrote: >>>Hi, >>>I compared this list of gene IDs with the list I had on the >>>Testing Report on the wiki and found the following. >>>Your list has 70 IDs. 69 of them are identical to what is in my >>>list of IDs in the Testing report. >>>Indeed, the only difference is the one with the >>>slash. "Rv3346/55c" I looked this ID up at UniProt.org as >>>Rv3346. It appears in the record for UniProt ID O50384. For that >>>UniProt record, the ID referred to in the OrderedLocus tag is >>>Rv3355c which does appear in the gdb already. Looking up Rv3346 >>>on Tuberculist and the Stanford TB database, Rv3346 is not a real >>>gene ID. So this entire record with the slash in it needs to be >>>excluded and all the rest of them need to be captured. >>>I think then, that the numbers should add up correctly, is this true? >>>Best, >>>Dr. D >>>At 01:09 PM 2/22/2011, Richard Brous wrote: >>>>attachement now included... >>>>On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous >>>><<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>>Here is an export of the genes found using: select * from >>>>genenametype where type = 'ORF' and value ~ >>>>'[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also attached as a csv file. >>>> >>>>647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2 >>>>5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4 >>>>647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2 >>>>647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2 >>>>647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2 >>>>647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2 >>>>211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3 >>>>648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2 >>>>648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2 >>>>648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2 >>>>648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2 >>>>648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2 >>>>649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2 >>>>649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2 >>>>649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2 >>>>649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2 >>>>650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2 >>>>174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2 >>>>174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3 >>>>312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2 >>>>331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2 >>>>445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2 >>>>621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1 >>>>622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1 >>>>622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1 >>>>622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2 >>>>622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2 >>>>623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2 >>>>623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2 >>>>623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1 >>>>623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1 >>>>624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1 >>>>624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1 >>>>625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2 >>>>625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2 >>>>626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1 >>>>626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1 >>>>626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1 >>>>626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1 >>>>627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1 >>>>627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1 >>>>627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1 >>>>627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1 >>>>627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1 >>>>627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1 >>>>627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1 >>>>628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1 >>>>629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1 >>>>629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1 >>>>629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1 >>>>630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1 >>>>630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2 >>>>631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2 >>>>632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1 >>>>632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1 >>>>633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1 >>>>633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1 >>>>645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2 >>>>645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2 >>>>645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2 >>>>645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2 >>>>645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2 >>>>646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2 >>>>646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2 >>>>646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2 >>>>646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2 >>>>646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2 >>>>646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2 >>>>647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2 >>>>647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 >>>>*****The item of note I see is the gene with the slash separating >>>>gene id's which refer to the same gene. >>>> >>>>Richard >>>> >>>>On Mon, Feb 21, 2011 at 11:11 PM, Richard Brous >>>><<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>>Understood. >>>> >>>>I'll check in with Dr. D in the afternoon tomorrow and discuss. >>>> >>>>Richard >>>> >>>>On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio >>>><<mailto:do...@lm...>do...@lm...> wrote: >>>>Hi Rich, >>>>Addressing the release business first, let's put it this way: if >>>>the remaining loose ends can be addressed by tomorrow, we can >>>>probably wait until then. If unexpected snags are encountered, >>>>then it would be worthwhile to release whatever you have. >>>>With that in mind, considering that you pretty much know the >>>>patterns of the IDs that are needed, I think it will only take a >>>>little digital forensic work now to figure out exactly which IDs >>>>are still needed. Once you know what those are, you should: >>>>1. Find where they are in the XML file. >>>>2. Knowing the XML location, find the corresponding table in the >>>>relational database (table names are generally derived from >>>>tag/element names). >>>>3. Knowing the table in the database, write or extend the >>>>SpeciesProfile query to retrieve that data. >>>>For the ID that must *not* be included, again it's a matter of >>>>tracking down what this ID is. Knowing this straggler, you can >>>>then consult with Dr. Dahlquist if this ID is truly a unique >>>>one-off, or is representative of a pattern that we'll want to >>>>exclude. Either way, this ID can be omitted by using "not" or >>>>"<>" or possibly "not like" or "not ~" (check PostgreSQL where >>>>clause syntax to see where the negation can be applied). >>>>John David N. Dionisio, PhD >>>>Associate Professor, Computer Science >>>>Loyola Marymount University >>>>On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: >>>> > actually i had a typo (emailing from desktop system but >>>> testing on my laptop... typed correctly here but wrong in >>>> pgadmin) but the results make much more sense now. >>>> > >>>> > >>>> > select count (*) from genenametype where (type = 'ordered >>>> locus' or type = 'ORF') and value like 'Rv%'; >>>> > returns 4058 >>>> > >>>> > select count (*) from genenametype where (type = 'ordered >>>> locus' or type = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>>> > returns 4058 >>>> > >>>> > >>>> -------------------------------------------------------------------------------------------------------------------------- >>>> >>>> > >>>> > >>>> > Continuing forward - >>>> > >>>> > The testing report says that 4066 unique matches exist in XML >>>> but 6 of them were eliminated by Dr. D leaving the desired number at 4060. >>>> > >>>> > So now we are only 2 genes short with the query returning >>>> 4058... which is also (conveniently) the sum of the two separate >>>> queries of 'ordered locus' and 'ORF' respectively. >>>> > >>>> > But recall that Dr. D said that only 69 genes of the missing >>>> 75 were tagged 'ORF' but we seem to have 1 extra gene tagged >>>> 'ORF' than we expected. Adding that into missing genes puts us 3 short... >>>> > >>>> > Should I make the changes to the code and export a gdb so that >>>> analysis can be done or wait until we work this through further? >>>> > >>>> > Richard >>>> > >>>> > >>>> > >>>> > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio >>>> <<mailto:do...@lm...>do...@lm...> wrote: >>>> > Hi Rich, >>>> > >>>> > The second form should have worked actually. What exactly was >>>> the error? >>>> > >>>> > John David N. Dionisio, PhD >>>> > Associate Professor, Computer Science >>>> > Loyola Marymount University >>>> > >>>> > >>>> > >>>> > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote: >>>> > >>>> > > hmm not taking parenthesis where I thought they should go... >>>> syntax error >>>> > > >>>> > > select count (*) from genenametype where type = ('ordered >>>> locus' or 'ORF') and value like 'Rv%'; >>>> > > also tried >>>> > > select count (*) from genenametype where (type = 'ordered >>>> locus' or type = 'ORF') and value like 'Rv%'; >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > > ah yes... i see it... >>>> > > >>>> > > >>>> > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio >>>> <<mailto:do...@lm...>do...@lm...> wrote: >>>> > > Watch your parentheses: "and" has greater precedence than "or" :) >>>> > > >>>> > > >>>> > > John David N. Dionisio, PhD >>>> > > Associate Professor, Computer Science >>>> > > Loyola Marymount University >>>> > > >>>> > > >>>> > > On Feb 21, 2011, at 7:59 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > > >>>> > >> OK, so here are my query results from raw SQL: >>>> > >> >>>> > >> 1. using: like 'Rv%' >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ordered >>>> locus' and value like 'Rv%'; >>>> > >> returns 3988 >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ORF' and >>>> value like 'Rv%'; >>>> > >> returns 70 >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ordered >>>> locus' or type = 'ORF' and value like 'Rv%'; >>>> > >> returns 7011 >>>> > >> >>>> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ordered >>>> locus' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>>> > >> returns 3988 >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ordered >>>> locus' or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>>> > >> returns 7011 >>>> > >> >>>> > >> select count (*) from genenametype where type = 'ORF' and >>>> value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; >>>> > >> returns 70 >>>> > >> >>>> > >> Conclusions: >>>> > >> >>>> > >> 1. It seems that querying for type = 'ORF' alone surfaces >>>> the 69 genes were were looking for plus one more (maybe the >>>> count for missing genes is off by 1?). >>>> > >> >>>> > >> 2. Combining the two types in a single query did not >>>> produce the results that I expected (7011? - how did that >>>> happen????) so this is likely not our solution... unless of >>>> course the query syntax isn't actually doing what I think it is... >>>> > >> >>>> > >> 3. I would think the best course of action is to serialy >>>> run two separate queries to capture all the required genes, then >>>> removing the one unneeded gene if its truly not wanted. >>>> > >> >>>> > >> What do you think? >>>> > >> >>>> > >> Richard >>>> > >> >>>> > >> >>>> > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio >>>> <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >> I don't recall the exact details of the missing 69, but if >>>> your query successfully returns them in raw SQL, then this is >>>> worth a try. You can integrate into the same query as long as >>>> the same columns are returned, which is the case here AFAIK, so >>>> go ahead and extend the existing query. >>>> > >> >>>> > >> >>>> > >> John David N. Dionisio, PhD >>>> > >> Associate Professor, Computer Science >>>> > >> Loyola Marymount University >>>> > >> >>>> > >> On Feb 21, 2011, at 6:56 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > >> >>>> > >>> So here is the appropriate code snippet from >>>> MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>>> > >>> public >>>> > >>> >>>> > >>> TableManager >>>> getSystemTableManagerCustomizations(TableManager tableManager, >>>> TableManager primarySystemTableManager, Date version) throws >>>> SQLException, InvalidParameterException { >>>> > >>> >>>> > >>> // Build the base query; we only use "ordered locus" and >>>> we only want >>>> > >>> >>>> > >>> // IDs that begin with "Rv." >>>> > >>> PreparedStatement ps = >>>> ConnectionManager.getRelationalDBConnection().prepareStatement( >>>> > >>> >>>> > >>> "SELECT value, type " + >>>> > >>> >>>> > >>> "FROM genenametype INNER JOIN entrytype_genetype " + >>>> > >>> >>>> > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >>>> > >>> >>>> > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and >>>> entrytype_gene_hjid = ?"); >>>> > >>> ResultSet result; >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> for (Row row : primarySystemTableManager.getRows()) { >>>> > >>> ps.setInt(1, Integer.parseInt(row.getValue( >>>> > >>> >>>> > >>> "UID"))); >>>> > >>> result = ps.executeQuery(); >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> // We actually want to keep the case where multiple ordered locus >>>> > >>> >>>> > >>> // names appear. >>>> > >>> >>>> > >>> while (result.next()) { >>>> > >>> >>>> > >>> // We want this name to appear in the OrderedLocusNames >>>> > >>> >>>> > >>> // system table. >>>> > >>> >>>> > >>> for (String id : result.getString("value").split("/")) { >>>> > >>> tableManager.submit( >>>> > >>> >>>> > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { >>>> "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { >>>> "\"Date\"", >>>> GenMAPPBuilderUtilities.getSystemsDateString(version) }, { >>>> "UID", row.getValue("UID") } }); >>>> > >>> } >>>> > >>> >>>> > >>> } >>>> > >>> >>>> > >>> } >>>> > >>> >>>> > >>> >>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >>>> >>>> > >>> So now we want to build the base query which uses "ordered >>>> locus" and "orf" and we only want IDs that begin with "Rv". >>>> > >>> >>>> > >>> I know there are more comprehensive ways to search for >>>> gene ID's by matching gene ID prefix but "like Rv%" seemed to >>>> work thus far, we just need to tell it to search for XML tag >>>> type orf in addition to ordered locus. >>>> > >>> >>>> > >>> "WHERE type = 'ordered locus' and type = 'orf' and value >>>> like 'Rv%' and entrytype_gene_hjid = ? " >>>> > >>> >>>> > >>> Here is a stab at it.... This part of our class was right >>>> as the server went down and my submission for week 6 assignment >>>> I can't seem to find. >>>> > >>> >>>> > >>> Is it possible to have two different types in the same >>>> query or should we rewrite a separate query for the orf tag? >>>> > >>> >>>> > >>> Richard >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > >>> >>>> > >>> thanks and will do as directed. >>>> > >>> >>>> > >>> My previous, last paragraph comment - A way for >>>> programming code in email holding its format in a mail message >>>> similarly to how you can post code on forum pages? >>>> > >>> >>>> > >>> <code> >>>> > >>> blah >>>> > >>> blah >>>> > >>> blah >>>> > >>> </code> >>>> > >>> >>>> > >>> thanks! >>>> > >>> >>>> > >>> Richard >>>> > >>> >>>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio >>>> <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >>> >>>> > >>> Greetings, >>>> > >>> >>>> > >>> Actually, gmbuilder.properties is for the TallyEngine >>>> only. When dealing with .gdb exports, look *only* at the >>>> SpeciesProfile class. So, to find those 69 IDs, it is the >>>> SpeciesProfile code, and *only* the SpeciesProfile code, that >>>> needs to be changed. >>>> > >>> >>>> > >>> Your take on how gmbuilder.properties is used, however, is >>>> understandable. It makes sense to assume that the TallyEngine >>>> code *and* the ID export code are based on the same >>>> characterization of the needed IDs. This replication is >>>> originally a historical artifact: SpeciesProfile was done first, >>>> and then TallyEngine was done later by another student. >>>> > >>> >>>> > >>> However, there are other factors beyond history that sort >>>> of necessitate this duplication of desired IDs: (skip the two >>>> bullets below if you'd rather cut to the chase of the work to be >>>> done, and discuss design issues later) >>>> > >>> >>>> > >>> - The actual XML import code is a black box: this is the >>>> "canned" JAXB library actually in action, and not our code at >>>> all. Plus, the XML import code really does not filter (nor >>>> should it), since the goal of the XML->relational database step >>>> is to fully capture the XML data in the relational >>>> database. So, XML count is necessarily separated from XML import. >>>> > >>> >>>> > >>> - The notion of a declarative mechanism for extracting IDs >>>> from the relational database (which is what >>>> gmbuilder.properties/TallyEngine uses) is interesting, but at >>>> the same time there is value in the arbitrary computation that >>>> can be done with Java (case in point: export two versions of an >>>> ID, with and without periods). This is not to say that it is >>>> impossible to do this declaratively, but let's just say that the >>>> procedural approach exists here and now, and a declarative >>>> approach will need more thought. >>>> > >>> >>>> > >>> These, and other factors, are good thoughts to hold onto >>>> and would be worthy of a good meeting discussion sometime, but >>>> bottom line for now: modifying the export behavior is a matter >>>> of editing the *SpeciesProfile* Java code, and not the >>>> gmbuilder.properties file. Turn your attention to that code. >>>> > >>> >>>> > >>> Now, as to annotating your code...I'd just put in code >>>> comments :) Or did you mean something else by tagging code in e-mail? >>>> > >>> >>>> > >>> John David N. Dionisio, PhD >>>> > >>> Associate Professor, Computer Science >>>> > >>> Loyola Marymount University >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote: >>>> > >>> >>>> > >>> > also, how do I tag code in email so it holds its >>>> formatting? I tried a few suggestions I found on the web but >>>> they aren't holding formatting or i'm just doing it wrong ;-D >>>> > >>> > >>>> > >>> > Richard >>>> > >>> > >>>> > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > >>> > OK, have some updates and some suggestions: >>>> > >>> > >>>> > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the >>>> gene testing report. We verified that XML match does indeed find >>>> 4066 unique matches - 75 of which are not in the gdb and need to be. >>>> > >>> > >>>> > >>> > Dr. Dahlquist informed me that she was the one who >>>> completed the gene db testing report, not a previous student of >>>> BIO367 and had already verified which genes were missing and >>>> where they were to be found. I had (mistakenly) assumed that >>>> since a student had performed the gene database testing I had to >>>> redo all of the verification. >>>> > >>> > >>>> > >>> > So that said, of the 75 genes missing - 69 need to be >>>> included and 6 excluded. >>>> > >>> > Per the gene db testing report: "69 of them have an "a", >>>> "b", or "d" suffix. They are all found in the ORF tag and need >>>> to be included in the gdb." >>>> > >>> > >>>> > >>> > To solve this we need to add additional search criteria >>>> into the M. tuberculosis section in gmbuilder.properties below: >>>> > >>> > # Mycobacterium tuberculosis >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_level_amount= >>>> > >>> > >>>> > >>> > 1 >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_element_level0= >>>> > >>> > >>>> > >>> > uniprot/entry/gene/name&type&ordered locus >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_query_level0= >>>> > >>> > >>>> > >>> > select count(*) from genenametype where type = 'ordered >>>> locus' and value like 'Rv%'; >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>>> > >>> > >>>> > >>> > Ordered Locus >>>> > >>> > SOLUTIONS: >>>> > >>> > >>>> > >>> > 1. So am i correct in my understanding that the second >>>> line is the query used by TallyEngine to read the XML file? If >>>> so then this is the issue we need to table for the moment until >>>> we get the gbd verified and re-released. We will revisit this to >>>> discover why it is not only reporting incorrectly but also why >>>> its added a second row of Ordered Locus on the TallyEngine results page. >>>> > >>> > >>>> > >>> > 2. The third line is the SQL query used by postgres >>>> during the export from XML to gdb. To find and get the ORF >>>> tagged genes could we not add the following lines and change the >>>> count in the first line: >>>> > >>> > >>>> > >>> > >>>> > >>> > # Mycobacterium tuberculosis >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_level_amount=2 >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>> locus >>>> > >>> > >>>> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >>>> > >>> > >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_query_level0= >>>> > >>> > >>>> > >>> > select count(*) from genenametype where type = 'ordered locus'; >>>> > >>> > mycobacteriumtuberculosis_query_level1=select count(*) >>>> from genenametype where type = 'orf'; >>>> > >>> > >>>> > >>> > >>>> > >>> > mycobacteriumtuberculosis_table_name_level0= >>>> > >>> > >>>> > >>> > Ordered Locus >>>> > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus >>>> > >>> > >>>> > >>> > >>>> ---------------------------------------------------------------------------------------------------------------------------- >>>> >>>> > >>> > >>>> > >>> > Of course these queries would have be manually verified >>>> prior to making these changes but this seems like we are moving >>>> in the right direction. >>>> > >>> > >>>> > >>> > Richard >>>> > >>> > >>>> > >>> > >>>> > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous >>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>> > >>> > Just got done reading previous email and understand the >>>> change in priority. >>>> > >>> > >>>> > >>> > Will work on the missing ID's for now and shelve the the >>>> TalleyEngine issue for the moment. >>>> > >>> > >>>> > >>> > Also great about a more formalized weekly meeting. I was >>>> going to suggest it myself as it has been slow going so far as >>>> maybe i'm a bit too independent in this independent study class =D >>>> > >>> > >>>> > >>> > Will dig further into the missing ID's later tonight and >>>> during day tomorrow and report back. >>>> > >>> > >>>> > >>> > Richard >>>> > >>> > >>>> > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio >>>> <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >>> > Hi Rich, >>>> > >>> > >>>> > >>> > No problem. The pertinent line you're referring to, for >>>> XML, is this, right above the line you copied: >>>> > >>> > >>>> > >>> > >>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>> locus >>>> > >>> > >>>> > >>> > The slash-separated section is the "path" of XML tags >>>> leading to the element of interest; then, after the ampersand, >>>> is a name/value pair for the desired attribute to count. Note >>>> that there is no hint of a *content*-based filter (nor is there >>>> the capability for one, as far as I can tell in the code). By >>>> "content," I mean that we can't specify filters based on what's >>>> *between* the tags. We can only go as far as filter by >>>> attribute value, e.g., type="ordered locus". >>>> > >>> > >>>> > >>> > But anyway, as mentioned in the earlier e-mail, let's >>>> have the missing IDs in the .gdb take precedence for >>>> now. Please take a look at the tuberculosis, A. thaliana, and >>>> P. falciparum profiles to get an idea for how the ID output can >>>> be customized, then let me know if you have any questions or >>>> need to confirm anything. >>>> > >>> > >>>> > >>> > John David N. Dionisio, PhD >>>> > >>> > Associate Professor, Computer Science >>>> > >>> > Loyola Marymount University >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote: >>>> > >>> > >>>> > >>> > > Sorry been slammed with a programming assignment that >>>> kept needing continued iteration and it has been all consuming >>>> until last night. But I did get a chance to work with your >>>> comments and review the code again with a different mind set. >>>> > >>> > > >>>> > >>> > > Yes, I examined the gmbuilder.properties file ( the >>>> query is also in the >>>> MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned >>>> in a previous email ) but I don't think I see what you mean >>>> regarding the XML count. >>>> > >>> > > >>>> > >>> > > I understood that: >>>> mycobacteriumtuberculosis_query_level0=select count(*) from >>>> genenametype where type = 'ordered locus' and value like >>>> 'Rv%'; was the db query but don't see which is the XML count... >>>> or do they share the same query and you are saying that XML >>>> count doesn't recognize and therefore cannot use the 'Rv%' parameter? >>>> > >>> > > >>>> > >>> > > Richard >>>> > >>> > > >>>> > >>> > > >>>> > >>> > > >>>> > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. >>>> Dionisio <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >>> > > Hi Rich, >>>> > >>> > > >>>> > >>> > > Sorry for the delay. Had some distractions coming >>>> into the weekend. >>>> > >>> > > >>>> > >>> > > You've looked at the code; have you looked at >>>> gmbuilder.properties? (I may have mentioned it a few e-mails >>>> ago, just as you were starting to dig into this) >>>> > >>> > > >>>> > >>> > > On the copy I have, the M. tuberculosis block looks >>>> like this (indentation is mine to set it apart): >>>> > >>> > > >>>> > >>> > > # Mycobacterium tuberculosis >>>> > >>> > > mycobacteriumtuberculosis_level_amount=1 >>>> > >>> > > >>>> > >>> > > >>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered >>>> locus >>>> > >>> > > >>>> > >>> > > mycobacteriumtuberculosis_query_level0=select >>>> count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; >>>> > >>> > > >>>> > >>> > > >>>> mycobacteriumtuberculosis_table_name_level0=Ordered Locus >>>> > >>> > > >>>> > >>> > > There, I think, is the rub. Notice that the XML count >>>> does not filter on RV%. The SQL query does. >>>> > >>> > > >>>> > >>> > > Unfortunately, I don't think the TallyEngine can >>>> include selective filtering in the XML counts. If the need to >>>> do selective filtering on XML is necessary, then I think we're >>>> looking at a new functionality for you to implement (or, if this >>>> throws things off too much, this may have to be noted somewhere, >>>> that the XML vs. database counts may be off because the database >>>> count is doing some text-based filtering but the XML count does not). >>>> > >>> > > >>>> > >>> > > What does xmlpipedb-match say? That will at least >>>> tell you whether the 'RV%' count is indeed correct. >>>> > >>> > > >>>> > >>> > > John David N. Dionisio, PhD >>>> > >>> > > Associate Professor, Computer Science >>>> > >>> > > Loyola Marymount University >>>> > >>> > > >>>> > >>> > > >>>> > >>> > > >>>> > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote: >>>> > >>> > > >>>> > >>> > > > OK here is what I was able to put together from the >>>> past few hours of code review: >>>> > >>> > > > >>>> > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: >>>> > >>> > > > -reveals that after the 2 System table modifications >>>> are made adding species name and link, a PreparedStatement is >>>> instantiated which builds and calls the base query. >>>> > >>> > > > >>>> > >>> > > > -The base query called is: ("SELECT value, type " + >>>> "FROM genenametype INNER JOIN entrytype_genetype " + >>>> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + >>>> "WHERE type = 'ordered locus' and value like 'Rv%' and >>>> entrytype_gene_hjid = ?") >>>> > >>> > > > >>>> > >>> > > > -So its looking in 'ordered locus' table/column for >>>> any tuple that starts with Rv (followed by any substring) and >>>> entrytype_gene_hjid = ? . >>>> > >>> > > > The 'like' comparator and % usage are clear with the >>>> 'type' entrytype_gene_hjid = ? >>>> > >>> > > > >>>> > >>> > > > -To me it seems the query makes sense so the problem >>>> is likely elsewhere. >>>> > >>> > > > >>>> > >>> > > > GenMappBuilder.java: >>>> > >>> > > > -I found method doTallies() at code line 895 which: >>>> > >>> > > > Instantiates a Configuration called >>>> hibernateConfiguration and assigns to it the current hibernate configuration >>>> > >>> > > > Validates database settings by analyzing >>>> hibernateConfiguration >>>> > >>> > > > Instantiates a CriterionList for uniprot and assigns >>>> to it TallyType.UNIPROT >>>> > >>> > > > Instantiates a CriterionList for go and assigns to >>>> it TallyType.GO >>>> > >>> > > > Determines if both xml files exist >>>> > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase >>>> are run on both xml files and their respective CriterionList >>>> > >>> > > > Results are then formatted for display in a table. >>>> > >>> > > > >>>> > >>> > > > -So enum TallyType which means that they are the >>>> only valid datatypes which TallyEngine accepts... go to know ... >>>> > >>> > > > >>>> > >>> > > > -Based on the screen shot of Tally Engine it would >>>> seem that both getTallyResultsXML() and >>>> getTallyResultsDatabase() are incorrectly returning. Likely due >>>> to both using an incorrect query (as we previously supposed). >>>> But where are the queries?... the more I dig the more I think >>>> they are in the criterial all the work is done against. >>>> > >>> > > > >>>> > >>> > > > continuing the review: >>>> > >>> > > > getTallyResultsXML() calls Tally Engine instance >>>> method getXmlFileCounts(xmlFile) >>>> > >>> > > > getTallyResultsDatabase() calls Tally Engine >>>> instance method getDbcounts(new QueryEngine(hibernateConfiguration) >>>> > >>> > > > Both of these instanced methods originate from >>>> TallyEngine.java... >>>> > >>> > > > >>>> > >>> > > > TallyEngine.java: >>>> > >>> > > > >>>> > >>> > > > getXmlFileCounts() calls digestXmlFile() which >>>> instantiates a digester then processes against criteria... but >>>> this quickly becomes confusing and is hard to follow >>>> > >>> > > > >>>> > >>> > > > getDbcounts() then starts a db session and executes >>>> a query but then I also get a bit lost with my limited db knowledge. >>>> > >>> > > > >>>> > >>> > > > >>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >>>> >>>> > >>> > > > >>>> > >>> > > > OVERALL I think I'm getting closer to the issues but >>>> I still feel as if I'm missing some understanding to proceed >>>> further. Can you pass along some of that Dondi insight and steer >>>> me in the right direction? =D >>>> > >>> > > > >>>> > >>> > > > -DB Tally - Not having taken databases yet certainly >>>> is limiting my ability determine where the "criteria" are being >>>> set and how they are followed during session activities. Also is >>>> the query we have been looking for this whole time in the >>>> criteria or someplace else? >>>> > >>> > > > >>>> > >>> > > > -XML Tally - again is the query contained within the >>>> criteria that digestXmlFile() uses to parse? >>>> > >>> > > > >>>> > >>> > > > Richard >>>> > >>> > > > >>>> > >>> > > > >>>> > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. >>>> Dionisio <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >>> > > > Right, schema issues are unlikely. Most count >>>> discrepancies like this that I've seen have boiled down to >>>> forming the right query. Then, knowing the right query (in both >>>> XML and SQL), it's a matter of making sure that TallyEngine asks >>>> that same query. >>>> > >>> > > > >>>> > >>> > > > John David N. Dionisio, PhD >>>> > >>> > > > Associate Professor, Computer Science >>>> > >>> > > > Loyola Marymount University >>>> > >>> > > > >>>> > >>> > > > >>>> > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >>>> > >>> > > > >>>> > >>> > > > > OK, so based on your approach: >>>> > >>> > > > > >>>> > >>> > > > > 1. I'll start with reviewing the queries for >>>> xmlpipedb-match and sql queries needed for the respective >>>> results as you requested. >>>> > >>> > > > > >>>> > >>> > > > > I was also thinking I may need to review the >>>> schema from xml into postgres but the issue isn't likely a >>>> schema error. The error most likely lies in how xmlpipedbutils >>>> queries the data from xml source and writes to the tables what it returns? >>>> > >>> > > > > >>>> > >>> > > > > 2. I'll review the code: trace the entrance of >>>> tally engine in the gmbuilder code then follow it through the xmlpipedbutils. >>>> > >>> > > > > >>>> > >>> > > > > Richard >>>> > >>> > > > > >>>> > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. >>>> Dionisio <<mailto:do...@lm...>do...@lm...> wrote: >>>> > >>> > > > > Just wanted to confirm (since I wasn't sure in the >>>> first e-mail) --- the XMLPipeDB Utilities source code is in >>>> trunk/xmlpipedbutils in SourceForge's Subversion repo. >>>> > >>> > > > > >>>> > >>> > > > > John David N. Dionisio, PhD >>>> > >>> > > > > Associate Professor, Computer Science >>>> > >>> > > > > Loyola Marymount University >>>> > >>> > > > > >>>> > >>> > > > > >>>> > >>> > > > > >>>> > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >>>> > >>> > > > > >>>> > >>> > > > > > Hi Dondi, >>>> > >>> > > > > > >>>> > >>> > > > > > So I'm at the point in working with M >>>> tuberculosis that I was able to exactly reproduce Dr. >>>> Dahlquist's problematic TallyEngine results. >>>> > >>> > > > > > >>>> > >>> > > > > > gmb2b60 Results >>>> > >>> > > > > > >>>> > >>> > > > > > >>>> > >>> > > > > > >>>> > >>> > > > > > Now the proverbial question - What next to solve >>>> the Ordered Locus import/count issue? >>>> > >>> > > > > > >>>> > >>> > > > > > ********************************************** >>>> > >>> > > > > > Here is my thought process: >>>> > >>> > > > > > >>>> > >>> > > > > > Step 1: How does the import process work at the >>>> high level? (obviously correct me if I'm wrong) >>>> > >>> > > > > > >>>> > >>> > > > > > I believe that basically as each XML tag is >>>> read, it is placed in the proper Postgres table(s) based on some >>>> criteria. There is also likely some sort of check that each >>>> individual tag is in valid XML format unless we don't care at >>>> this stage (care at export) or maybe the parser just skips over >>>> and goes on to the next . >>>> > >>> > > > > > >>>> > >>> > > > > > Step 2: What could be the problem? >>>> > >>> > > > > > >>>> > >>> > > > > > Either - >>>> > >>> > > > > > a. XML tags are being parsed incorrectly >>>> (ignored/skipped)? >>>> > >>> > > > > > b. Decision criteria of which table they should >>>> be added to? >>>> > >>> > > > > > >>>> > >>> > > > > > ********************************************** >>>> > >>> > > > > > >>>> > >>> > > > > > I read on the sourceforge wiki: >>>> > >>> > > > > > >>>> > >>> > > > > > XMLPipeDB has a modular architecture with three >>>> components that may be used separately or together. XSD-to-DB >>>> reads an XSD (XML Schema Definition) and automatically generates >>>> an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB >>>> Utilities provides functionality for configuring the database, >>>> importing data, and performing queries. GenMAPP Builder is based >>>> on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene >>>> Databases based on data from UniProt and Gene Ontology (GO). >>>> > >>> > > > > > >>>> > >>> > > > > > So I should probably start with the XMLPipeDB >>>> Utilities which are where? I don't see any in the basic >>>> distribution or are they not standalone and called from the command line? >>>> > >>> > > > > > >>>> > >>> > > > > > Thanks! >>>> > >>> > > > > > >>>> > >>> > > > > > Richard > > > >------------------------------------------------------------------------------ >Free Software Download: Index, Search & Analyze Logs and other IT data in >Real-Time with Splunk. Collect, index and harness all the fast moving IT data >generated by your applications, servers and devices whether physical, virtual >or in the cloud. Deliver compliance at lower cost and gain new business >insights. ><http://p.sf.net/sfu/splunk-dev2dev>http://p.sf.net/sfu/splunk-dev2dev >_______________________________________________ >xmlpipedb-developer mailing list ><mailto:xml...@li...>xml...@li... >https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > >------------------------------------------------------------------------------ >Free Software Download: Index, Search & Analyze Logs and other IT data in >Real-Time with Splunk. Collect, index and harness all the fast moving IT data >generated by your applications, servers and devices whether physical, virtual >or in the cloud. Deliver compliance at lower cost and gain new business >insights. ><http://p.sf.net/sfu/splunk-dev2dev>http://p.sf.net/sfu/splunk-dev2dev >_______________________________________________ >xmlpipedb-developer mailing list ><mailto:xml...@li...>xml...@li... >https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > |