Re: [XMLPipeDB-developer] 499 - PROBLEM - M tuberculosis xml tag importation
From: Richard B. <rbr...@gm...> - 2011-02-22 21:09:31
Here is an export of the genes found using:

select * from genenametype where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';

and also attached as a csv file.

647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2
5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4
647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2
647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2
647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2
647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2
211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3
648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2
648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2
648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2
648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2
648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2
649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2
649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2
649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2
649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2
650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2
174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2
174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3
312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2
331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2
445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2
621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1
622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1
622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1
622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2
622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2
623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2
623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2
623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1
623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1
624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1
624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1
625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2
625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2
626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1
626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1
626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1
626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1
627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1
627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1
627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1
627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1
627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1
627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1
627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1
628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1
629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1
629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1
629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1
630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1
630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2
631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2
632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1
632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1
633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1
633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1
645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2
645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2
645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2
645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2
645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2
646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2
646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2
646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2
646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2
646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2
646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2
647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2
647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2

*****The item of note I see is the gene with the slash separating gene IDs that refer to the same gene.

Richard

On Mon, Feb 21, 2011 at 11:11 PM, Richard Brous <rbr...@gm...> wrote:

> Understood.
>
> I'll check in with Dr. D in the afternoon tomorrow and discuss.
>
> Richard
>
> On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <do...@lm...> wrote:
>
>> Hi Rich,
>>
>> Addressing the release business first, let's put it this way: if the remaining loose ends can be addressed by tomorrow, we can probably wait until then. If unexpected snags are encountered, then it would be worthwhile to release whatever you have.
>>
>> With that in mind, considering that you pretty much know the patterns of the IDs that are needed, I think it will only take a little digital forensic work now to figure out exactly which IDs are still needed. Once you know what those are, you should:
>>
>> 1. Find where they are in the XML file.
>> 2. Knowing the XML location, find the corresponding table in the relational database (table names are generally derived from tag/element names).
>> 3. Knowing the table in the database, write or extend the SpeciesProfile query to retrieve that data.
>>
>> For the ID that must *not* be included, again it's a matter of tracking down what this ID is. Knowing this straggler, you can then consult with Dr. Dahlquist if this ID is truly a unique one-off, or is representative of a pattern that we'll want to exclude. Either way, this ID can be omitted by using "not" or "<>" or possibly "not like" or "not ~" (check PostgreSQL where clause syntax to see where the negation can be applied).
>>
>> John David N. Dionisio, PhD
>> Associate Professor, Computer Science
>> Loyola Marymount University
>>
>> On Feb 22, 2011, at 1:37 AM, Richard Brous wrote:
>>
>> > actually i had a typo (emailing from desktop system but testing on my laptop... typed correctly here but wrong in pgadmin) but the results make much more sense now.
>> >
>> > select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value like 'Rv%';
>> > returns 4058
>> >
>> > select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
>> > returns 4058
>> >
>> > ------------------------------------------------------------
>> >
>> > Continuing forward -
>> >
>> > The testing report says that 4066 unique matches exist in XML but 6 of them were eliminated by Dr. D leaving the desired number at 4060.
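
For reference, the difference between the unparenthesized and parenthesized forms of the query can be reproduced outside the database. What follows is only a minimal sketch: Java's && binds tighter than ||, just as SQL's AND binds tighter than OR, so the first filter keeps every 'ordered locus' row no matter what its value is. The class name, the GeneName holder and the five sample rows are all made up for illustration and are not taken from the real genenametype table.

```java
import java.util.Arrays;
import java.util.List;

final class PrecedenceSketch {
    // Hypothetical stand-in for one genenametype row (only the two columns the queries test).
    static final class GeneName {
        final String type;
        final String value;

        GeneName(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    // Mirrors: where type = 'ordered locus' or type = 'ORF' and value like 'Rv%'
    // Parsed as: type = 'ordered locus' OR (type = 'ORF' AND value like 'Rv%')
    static long countUnparenthesized(List<GeneName> rows) {
        return rows.stream()
                .filter(r -> r.type.equals("ordered locus")
                        || r.type.equals("ORF") && r.value.startsWith("Rv"))
                .count();
    }

    // Mirrors: where (type = 'ordered locus' or type = 'ORF') and value like 'Rv%'
    static long countParenthesized(List<GeneName> rows) {
        return rows.stream()
                .filter(r -> (r.type.equals("ordered locus") || r.type.equals("ORF"))
                        && r.value.startsWith("Rv"))
                .count();
    }

    // Made-up rows: two ordered locus names, two ORF names, one unrelated type.
    static List<GeneName> sampleRows() {
        return Arrays.asList(
                new GeneName("ordered locus", "Rv0001"),
                new GeneName("ordered locus", "MT0001"),
                new GeneName("ORF", "Rv1990A"),
                new GeneName("ORF", "MT9999"),
                new GeneName("primary", "dnaA"));
    }

    public static void main(String[] args) {
        List<GeneName> rows = sampleRows();
        System.out.println("unparenthesized: " + countUnparenthesized(rows));
        System.out.println("parenthesized:   " + countParenthesized(rows));
    }
}
```

On the sample rows the unparenthesized filter counts 3 and the parenthesized one counts 2, because the "MT0001" ordered locus row slips through the first form. That is the same effect behind the 7011 versus 4058 results reported in this thread.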
>> >
>> > So now we are only 2 genes short with the query returning 4058... which is also (conveniently) the sum of the two separate queries of 'ordered locus' and 'ORF' respectively.
>> >
>> > But recall that Dr. D said that only 69 genes of the missing 75 were tagged 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. Adding that into missing genes puts us 3 short...
>> >
>> > Should I make the changes to the code and export a gdb so that analysis can be done or wait until we work this through further?
>> >
>> > Richard
>> >
>> > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio <do...@lm...> wrote:
>> > Hi Rich,
>> >
>> > The second form should have worked actually. What exactly was the error?
>> >
>> > John David N. Dionisio, PhD
>> > Associate Professor, Computer Science
>> > Loyola Marymount University
>> >
>> > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote:
>> >
>> > > hmm not taking parenthesis where I thought they should go... syntax error
>> > >
>> > > select count (*) from genenametype where type = ('ordered locus' or 'ORF') and value like 'Rv%';
>> > > also tried
>> > > select count (*) from genenametype where (type = 'ordered locus' or type = 'ORF') and value like 'Rv%';
>> > >
>> > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous <rbr...@gm...> wrote:
>> > > ah yes... i see it...
>> > >
>> > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio <do...@lm...> wrote:
>> > > Watch your parentheses: "and" has greater precedence than "or" :)
>> > >
>> > > John David N. Dionisio, PhD
>> > > Associate Professor, Computer Science
>> > > Loyola Marymount University
>> > >
>> > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...> wrote:
>> > >
>> > >> OK, so here are my query results from raw SQL:
>> > >>
>> > >> 1. using: like 'Rv%'
>> > >>
>> > >> select count (*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>> > >> returns 3988
>> > >>
>> > >> select count (*) from genenametype where type = 'ORF' and value like 'Rv%';
>> > >> returns 70
>> > >>
>> > >> select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value like 'Rv%';
>> > >> returns 7011
>> > >>
>> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'
>> > >>
>> > >> select count (*) from genenametype where type = 'ordered locus' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
>> > >> returns 3988
>> > >>
>> > >> select count (*) from genenametype where type = 'ordered locus' or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
>> > >> returns 7011
>> > >>
>> > >> select count (*) from genenametype where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
>> > >> returns 70
>> > >>
>> > >> Conclusions:
>> > >>
>> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69 genes we were looking for plus one more (maybe the count for missing genes is off by 1?).
>> > >>
>> > >> 2. Combining the two types in a single query did not produce the results that I expected (7011? - how did that happen????) so this is likely not our solution... unless of course the query syntax isn't actually doing what I think it is...
>> > >>
>> > >> 3. I would think the best course of action is to serially run two separate queries to capture all the required genes, then remove the one unneeded gene if it's truly not wanted.
>> > >>
>> > >> What do you think?
>> > >>
>> > >> Richard
>> > >>
>> > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...> wrote:
>> > >> I don't recall the exact details of the missing 69, but if your query successfully returns them in raw SQL, then this is worth a try. You can integrate into the same query as long as the same columns are returned, which is the case here AFAIK, so go ahead and extend the existing query.
>> > >>
>> > >> John David N. Dionisio, PhD
>> > >> Associate Professor, Computer Science
>> > >> Loyola Marymount University
>> > >>
>> > >> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote:
>> > >>
>> > >>> So here is the appropriate code snippet from MycobacteriumTuberculosisUniProtSpeciesProfile.java:
>> > >>>
>> > >>> public TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) throws SQLException, InvalidParameterException {
>> > >>>     // Build the base query; we only use "ordered locus" and we only want
>> > >>>     // IDs that begin with "Rv."
>> > >>>     PreparedStatement ps = ConnectionManager.getRelationalDBConnection().prepareStatement(
>> > >>>         "SELECT value, type " +
>> > >>>         "FROM genenametype INNER JOIN entrytype_genetype " +
>> > >>>         "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " +
>> > >>>         "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?");
>> > >>>     ResultSet result;
>> > >>>
>> > >>>     for (Row row : primarySystemTableManager.getRows()) {
>> > >>>         ps.setInt(1, Integer.parseInt(row.getValue("UID")));
>> > >>>         result = ps.executeQuery();
>> > >>>
>> > >>>         // We actually want to keep the case where multiple ordered locus
>> > >>>         // names appear.
>> > >>>         while (result.next()) {
>> > >>>             // We want this name to appear in the OrderedLocusNames
>> > >>>             // system table.
>> > >>>             for (String id : result.getString("value").split("/")) {
>> > >>>                 tableManager.submit("OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) }, { "UID", row.getValue("UID") } });
>> > >>>             }
>> > >>>         }
>> > >>>     }
>> > >>>
>> > >>> ------------------------------------------------------------
>> > >>> So now we want to build the base query which uses "ordered locus" and "orf" and we only want IDs that begin with "Rv".
>> > >>>
>> > >>> I know there are more comprehensive ways to search for gene IDs by matching gene ID prefix but "like Rv%" seemed to work thus far, we just need to tell it to search for XML tag type orf in addition to ordered locus.
>> > >>>
>> > >>> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and entrytype_gene_hjid = ? "
>> > >>>
>> > >>> Here is a stab at it.... This part of our class was right as the server went down and my submission for week 6 assignment I can't seem to find.
>> > >>>
>> > >>> Is it possible to have two different types in the same query or should we rewrite a separate query for the orf tag?
>> > >>>
>> > >>> Richard
>> > >>>
>> > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote:
>> > >>>
>> > >>> thanks and will do as directed.
>> > >>>
>> > >>> My previous, last paragraph comment - A way for programming code in email holding its format in a mail message similarly to how you can post code on forum pages?
>> > >>>
>> > >>> <code>
>> > >>> blah
>> > >>> blah
>> > >>> blah
>> > >>> </code>
>> > >>>
>> > >>> thanks!
>> > >>>
>> > >>> Richard
>> > >>>
>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <do...@lm...> wrote:
>> > >>>
>> > >>> Greetings,
>> > >>>
>> > >>> Actually, gmbuilder.properties is for the TallyEngine only. When dealing with .gdb exports, look *only* at the SpeciesProfile class. So, to find those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile code, that needs to be changed.
>> > >>>
>> > >>> Your take on how gmbuilder.properties is used, however, is understandable. It makes sense to assume that the TallyEngine code *and* the ID export code are based on the same characterization of the needed IDs. This replication is originally a historical artifact: SpeciesProfile was done first, and then TallyEngine was done later by another student.
>> > >>>
>> > >>> However, there are other factors beyond history that sort of necessitate this duplication of desired IDs: (skip the two bullets below if you'd rather cut to the chase of the work to be done, and discuss design issues later)
>> > >>>
>> > >>> - The actual XML import code is a black box: this is the "canned" JAXB library actually in action, and not our code at all. Plus, the XML import code really does not filter (nor should it), since the goal of the XML->relational database step is to fully capture the XML data in the relational database. So, XML count is necessarily separated from XML import.
>> > >>>
>> > >>> - The notion of a declarative mechanism for extracting IDs from the relational database (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at the same time there is value in the arbitrary computation that can be done with Java (case in point: export two versions of an ID, with and without periods). This is not to say that it is impossible to do this declaratively, but let's just say that the procedural approach exists here and now, and a declarative approach will need more thought.
>> > >>>
>> > >>> These, and other factors, are good thoughts to hold onto and would be worthy of a good meeting discussion sometime, but bottom line for now: modifying the export behavior is a matter of editing the *SpeciesProfile* Java code, and not the gmbuilder.properties file. Turn your attention to that code.
>> > >>>
>> > >>> Now, as to annotating your code...I'd just put in code comments :) Or did you mean something else by tagging code in e-mail?
>> > >>>
>> > >>> John David N. Dionisio, PhD
>> > >>> Associate Professor, Computer Science
>> > >>> Loyola Marymount University
>> > >>>
>> > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote:
>> > >>>
>> > >>> > also, how do I tag code in email so it holds its formatting? I tried a few suggestions I found on the web but they aren't holding formatting or i'm just doing it wrong ;-D
>> > >>> >
>> > >>> > Richard
>> > >>> >
>> > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote:
>> > >>> > OK, have some updates and some suggestions:
>> > >>> >
>> > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing report. We verified that XML match does indeed find 4066 unique matches - 75 of which are not in the gdb and need to be.
>> > >>> >
>> > >>> > Dr. Dahlquist informed me that she was the one who completed the gene db testing report, not a previous student of BIO367 and had already verified which genes were missing and where they were to be found. I had (mistakenly) assumed that since a student had performed the gene database testing I had to redo all of the verification.
>> > >>> >
>> > >>> > So that said, of the 75 genes missing - 69 need to be included and 6 excluded.
>> > >>> > Per the gene db testing report: "69 of them have an "a", "b", or "d" suffix. They are all found in the ORF tag and need to be included in the gdb."
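
As a side note on the suffixed IDs quoted from the testing report, here is a small sketch of how such values could be told apart from other Rv names. It assumes a suffixed ID is "Rv", four digits and one trailing letter, which is inferred from the examples in this thread rather than from any specification; the class and method names are hypothetical, and "Rv0001" is included only as an unsuffixed illustration. It also runs the regular expression used elsewhere in this thread with find(), since PostgreSQL's ~ operator likewise matches anywhere in the string.

```java
import java.util.regex.Pattern;

final class LocusIdSketch {
    // Assumed shape of a suffixed ID: "Rv", exactly four digits, one trailing letter.
    private static final Pattern SUFFIXED =
            Pattern.compile("^[Rr][Vv][0-9]{4}[A-Za-z]$");

    // The pattern quoted in this thread. The trailing * applies only to the last
    // digit class, so it requires three digits and allows any number more.
    private static final Pattern THREAD_PATTERN =
            Pattern.compile("[Rr][Vv][0-9][0-9][0-9][0-9]*");

    // True only when the whole ID has the assumed suffixed shape.
    static boolean hasLetterSuffix(String id) {
        return SUFFIXED.matcher(id).matches();
    }

    // find() succeeds if the pattern occurs anywhere in the ID (unanchored match).
    static boolean matchesThreadPattern(String id) {
        return THREAD_PATTERN.matcher(id).find();
    }

    public static void main(String[] args) {
        String[] ids = { "Rv1990A", "Rv2307D", "Rv3346/55c", "Rv0001" };
        for (String id : ids) {
            System.out.println(id
                    + " suffixed=" + hasLetterSuffix(id)
                    + " threadPattern=" + matchesThreadPattern(id));
        }
    }
}
```

Under these assumptions the slash-separated value "Rv3346/55c" passes the thread's pattern but not the suffix check. In the export at the top of this thread it appears to be the only one of the 70 rows that does not have the four-digits-plus-letter shape.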
>> > >>> >
>> > >>> > To solve this we need to add additional search criteria into the M. tuberculosis section in gmbuilder.properties below:
>> > >>> >
>> > >>> > # Mycobacterium tuberculosis
>> > >>> > mycobacteriumtuberculosis_level_amount=1
>> > >>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> > >>> > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>> > >>> > mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>> > >>> >
>> > >>> > SOLUTIONS:
>> > >>> >
>> > >>> > 1. So am i correct in my understanding that the second line is the query used by TallyEngine to read the XML file? If so then this is the issue we need to table for the moment until we get the gdb verified and re-released. We will revisit this to discover why it is not only reporting incorrectly but also why it's added a second row of Ordered Locus on the TallyEngine results page.
>> > >>> >
>> > >>> > 2. The third line is the SQL query used by postgres during the export from XML to gdb. To find and get the ORF tagged genes could we not add the following lines and change the count in the first line:
>> > >>> >
>> > >>> > # Mycobacterium tuberculosis
>> > >>> > mycobacteriumtuberculosis_level_amount=2
>> > >>> >
>> > >>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> > >>> > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf
>> > >>> >
>> > >>> > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus';
>> > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf';
>> > >>> >
>> > >>> > mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>> > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus
>> > >>> >
>> > >>> > ------------------------------------------------------------
>> > >>> >
>> > >>> > Of course these queries would have to be manually verified prior to making these changes but this seems like we are moving in the right direction.
>> > >>> >
>> > >>> > Richard
>> > >>> >
>> > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote:
>> > >>> > Just got done reading previous email and understand the change in priority.
>> > >>> >
>> > >>> > Will work on the missing IDs for now and shelve the TallyEngine issue for the moment.
>> > >>> >
>> > >>> > Also great about a more formalized weekly meeting. I was going to suggest it myself as it has been slow going so far as maybe i'm a bit too independent in this independent study class =D
>> > >>> >
>> > >>> > Will dig further into the missing IDs later tonight and during day tomorrow and report back.
>> > >>> >
>> > >>> > Richard
>> > >>> >
>> > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote:
>> > >>> > Hi Rich,
>> > >>> >
>> > >>> > No problem. The pertinent line you're referring to, for XML, is this, right above the line you copied:
>> > >>> >
>> > >>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> > >>> >
>> > >>> > The slash-separated section is the "path" of XML tags leading to the element of interest; then, after the ampersand, is a name/value pair for the desired attribute to count. Note that there is no hint of a *content*-based filter (nor is there the capability for one, as far as I can tell in the code). By "content," I mean that we can't specify filters based on what's *between* the tags. We can only go as far as filter by attribute value, e.g., type="ordered locus".
>> > >>> >
>> > >>> > But anyway, as mentioned in the earlier e-mail, let's have the missing IDs in the .gdb take precedence for now. Please take a look at the tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how the ID output can be customized, then let me know if you have any questions or need to confirm anything.
>> > >>> >
>> > >>> > John David N. Dionisio, PhD
>> > >>> > Associate Professor, Computer Science
>> > >>> > Loyola Marymount University
>> > >>> >
>> > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote:
>> > >>> >
>> > >>> > > Sorry been slammed with a programming assignment that kept needing continued iteration and it has been all consuming until last night. But I did get a chance to work with your comments and review the code again with a different mind set.
>> > >>> > >
>> > >>> > > Yes, I examined the gmbuilder.properties file (the query is also in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a previous email) but I don't think I see what you mean regarding the XML count.
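
As a reading aid for the element_level0 format described in this thread (a slash-separated tag path, then an attribute name and value after the ampersands), here is a rough sketch that splits such a line into its parts. It is not the TallyEngine parser, and the class, field and method names are hypothetical. It only illustrates that the line carries a path and one attribute name/value pair, with nowhere to express a filter such as 'Rv%' on the text between the tags.

```java
final class ElementSpecSketch {
    final String[] path;          // XML tags leading to the element, outermost first
    final String attributeName;   // attribute to test, e.g. "type"
    final String attributeValue;  // required attribute value, e.g. "ordered locus"

    private ElementSpecSketch(String[] path, String attributeName, String attributeValue) {
        this.path = path;
        this.attributeName = attributeName;
        this.attributeValue = attributeValue;
    }

    // Splits "tag/tag/tag&name&value" into its three parts.
    // The limit of 3 keeps any further '&' characters inside the value.
    static ElementSpecSketch parse(String spec) {
        String[] parts = spec.split("&", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected path&name&value but got: " + spec);
        }
        return new ElementSpecSketch(parts[0].split("/"), parts[1], parts[2]);
    }

    public static void main(String[] args) {
        ElementSpecSketch spec = parse("uniprot/entry/gene/name&type&ordered locus");
        System.out.println("element:   " + spec.path[spec.path.length - 1]);
        System.out.println("attribute: " + spec.attributeName + "=\"" + spec.attributeValue + "\"");
    }
}
```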
>> > >>> > >
>> > >>> > > I understood that: mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; was the db query but don't see which is the XML count... or do they share the same query and you are saying that XML count doesn't recognize and therefore cannot use the 'Rv%' parameter?
>> > >>> > >
>> > >>> > > Richard
>> > >>> > >
>> > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> wrote:
>> > >>> > > Hi Rich,
>> > >>> > >
>> > >>> > > Sorry for the delay. Had some distractions coming into the weekend.
>> > >>> > >
>> > >>> > > You've looked at the code; have you looked at gmbuilder.properties? (I may have mentioned it a few e-mails ago, just as you were starting to dig into this)
>> > >>> > >
>> > >>> > > On the copy I have, the M. tuberculosis block looks like this (indentation is mine to set it apart):
>> > >>> > >
>> > >>> > >     # Mycobacterium tuberculosis
>> > >>> > >     mycobacteriumtuberculosis_level_amount=1
>> > >>> > >
>> > >>> > >     mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> > >>> > >
>> > >>> > >     mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>> > >>> > >
>> > >>> > >     mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>> > >>> > >
>> > >>> > > There, I think, is the rub. Notice that the XML count does not filter on RV%. The SQL query does.
>> > >>> > >
>> > >>> > > Unfortunately, I don't think the TallyEngine can include selective filtering in the XML counts. If selective filtering on XML is necessary, then I think we're looking at a new functionality for you to implement (or, if this throws things off too much, this may have to be noted somewhere, that the XML vs. database counts may be off because the database count is doing some text-based filtering but the XML count does not).
>> > >>> > >
>> > >>> > > What does xmlpipedb-match say? That will at least tell you whether the 'RV%' count is indeed correct.
>> > >>> > >
>> > >>> > > John David N. Dionisio, PhD
>> > >>> > > Associate Professor, Computer Science
>> > >>> > > Loyola Marymount University
>> > >>> > >
>> > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote:
>> > >>> > >
>> > >>> > > > OK here is what I was able to put together from the past few hours of code review:
>> > >>> > > >
>> > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java:
>> > >>> > > > -reveals that after the 2 System table modifications are made adding species name and link, a PreparedStatement is instantiated which builds and calls the base query.
>> > >>> > > >
>> > >>> > > > -The base query called is: ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype " + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?")
>> > >>> > > >
>> > >>> > > > -So it's looking in 'ordered locus' table/column for any tuple that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? .
>> > >>> > > > The 'like' comparator and % usage are clear with the 'type' entrytype_gene_hjid = ?
>> > >>> > > >
>> > >>> > > > -To me it seems the query makes sense so the problem is likely elsewhere.
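
To make the "starts with Rv (followed by any substring)" reading concrete, here is a minimal sketch that translates a LIKE pattern into a Java regular expression and checks values against it. It assumes only the two standard LIKE wildcards (% for any run of characters, _ for exactly one) and does not handle LIKE escape characters; the class and method names are hypothetical. Like PostgreSQL's LIKE, the check is case-sensitive and must cover the whole value, so 'Rv%' and 'RV%' are different patterns.

```java
import java.util.regex.Pattern;

final class LikeSketch {
    // Translates a SQL LIKE pattern into a regex: % becomes ".*", _ becomes ".",
    // and every other character is quoted so it matches literally.
    static Pattern likeToRegex(String like) {
        StringBuilder regex = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%') {
                regex.append(".*");
            } else if (c == '_') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // matches() requires the pattern to cover the entire value, as LIKE does.
    static boolean like(String value, String pattern) {
        return likeToRegex(pattern).matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(like("Rv1990A", "Rv%"));
        System.out.println(like("RV1990A", "Rv%"));
        System.out.println(like("dnaA", "Rv%"));
    }
}
```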
>> > >>> > > > >> > >>> > > > GenMappBuilder.java: >> > >>> > > > -I found method doTallies() at code line 895 which: >> > >>> > > > Instantiates a Configuration called hibernateConfiguration and >> assigns to it the current hibernate configuration >> > >>> > > > Validates database settings by analyzing >> hibernateConfiguration >> > >>> > > > Instantiates a CriterionList for uniprot and assigns to it >> TallyType.UNIPROT >> > >>> > > > Instantiates a CriterionList for go and assigns to it >> TallyType.GO >> > >>> > > > Determines if both xml files exist >> > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on >> both xml files and their respective CriterionList >> > >>> > > > Results are then formatted for display in a table. >> > >>> > > > >> > >>> > > > -So enum TallyType which means that they are the only valid >> datatypes which TallyEngine accepts... go to know ... >> > >>> > > > >> > >>> > > > -Based on the screen shot of Tally Engine it would seem that >> both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly >> returning. Likely due to both using an incorrect query (as we previously >> supposed). But where are the queries?... the more I dig the more I think >> they are in the criterial all the work is done against. >> > >>> > > > >> > >>> > > > continuing the review: >> > >>> > > > getTallyResultsXML() calls Tally Engine instance method >> getXmlFileCounts(xmlFile) >> > >>> > > > getTallyResultsDatabase() calls Tally Engine instance method >> getDbcounts(new QueryEngine(hibernateConfiguration) >> > >>> > > > Both of these instanced methods originate from >> TallyEngine.java... >> > >>> > > > >> > >>> > > > TallyEngine.java: >> > >>> > > > >> > >>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a >> digester then processes against criteria... 
but this quickly becomes >> confusing and is hard to follow >> > >>> > > > >> > >>> > > > getDbcounts() then starts a db session and executes a query >> but then I also get a bit lost with my limited db knowledge. >> > >>> > > > >> > >>> > > > >> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >> > >>> > > > >> > >>> > > > OVERALL I think I'm getting closer to the issues but I still >> feel as if I'm missing some understanding to proceed further. Can you pass >> along some of that Dondi insight and steer me in the right direction? =D >> > >>> > > > >> > >>> > > > -DB Tally - Not having taken databases yet certainly is >> limiting my ability determine where the "criteria" are being set and how >> they are followed during session activities. Also is the query we have been >> looking for this whole time in the criteria or someplace else? >> > >>> > > > >> > >>> > > > -XML Tally - again is the query contained within the criteria >> that digestXmlFile() uses to parse? >> > >>> > > > >> > >>> > > > Richard >> > >>> > > > >> > >>> > > > >> > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio < >> do...@lm...> wrote: >> > >>> > > > Right, schema issues are unlikely. Most count discrepancies >> like this that I've seen have boiled down to forming the right query. Then, >> knowing the right query (in both XML and SQL), it's a matter of making sure >> that TallyEngine asks that same query. >> > >>> > > > >> > >>> > > > John David N. Dionisio, PhD >> > >>> > > > Associate Professor, Computer Science >> > >>> > > > Loyola Marymount University >> > >>> > > > >> > >>> > > > >> > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: >> > >>> > > > >> > >>> > > > > OK, so based on your approach: >> > >>> > > > > >> > >>> > > > > 1. 
I'll start with reviewing the queries for xmlpipedb-match >> and sql queries needed for the respective results as you requested. >> > >>> > > > > >> > >>> > > > > I was also thinking I may need to review the schema from xml >> into postgres but the issue isn't likely a schema error. The error most >> likely lies in how xmlpipedbutils queries the data from xml source and >> writes to the tables what it returns? >> > >>> > > > > >> > >>> > > > > 2. I'll review the code: trace the entrance of tally engine >> in the gmbuilder code then follow it through the xmlpipedbutils. >> > >>> > > > > >> > >>> > > > > Richard >> > >>> > > > > >> > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio < >> do...@lm...> wrote: >> > >>> > > > > Just wanted to confirm (since I wasn't sure in the first >> e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils >> in SourceForge's Subversion repo. >> > >>> > > > > >> > >>> > > > > John David N. Dionisio, PhD >> > >>> > > > > Associate Professor, Computer Science >> > >>> > > > > Loyola Marymount University >> > >>> > > > > >> > >>> > > > > >> > >>> > > > > >> > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote: >> > >>> > > > > >> > >>> > > > > > Hi Dondi, >> > >>> > > > > > >> > >>> > > > > > So I'm at the point in working with M tuberculosis that I >> was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine >> results. >> > >>> > > > > > >> > >>> > > > > > gmb2b60 Results >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > Now the proverbial question - What next to solve the >> Ordered Locus import/count issue? >> > >>> > > > > > >> > >>> > > > > > ********************************************** >> > >>> > > > > > Here is my thought process: >> > >>> > > > > > >> > >>> > > > > > Step 1: How does the import process work at the high >> level? 
(obviously correct me if I'm wrong)
>> > >>> > > > > >
>> > >>> > > > > > I believe that basically as each XML tag is read, it is placed in the proper Postgres table(s) based on some criteria. There is also likely some sort of check that each individual tag is in valid XML format, unless we don't care at this stage (care at export), or maybe the parser just skips over and goes on to the next.
>> > >>> > > > > >
>> > >>> > > > > > Step 2: What could be the problem?
>> > >>> > > > > >
>> > >>> > > > > > Either -
>> > >>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)?
>> > >>> > > > > > b. Decision criteria for which table they should be added to?
>> > >>> > > > > >
>> > >>> > > > > > **********************************************
>> > >>> > > > > >
>> > >>> > > > > > I read on the SourceForge wiki:
>> > >>> > > > > >
>> > >>> > > > > > XMLPipeDB has a modular architecture with three components that may be used separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and automatically generates an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB Utilities provides functionality for configuring the database, importing data, and performing queries. GenMAPP Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on data from UniProt and Gene Ontology (GO).
>> > >>> > > > > >
>> > >>> > > > > > So I should probably start with the XMLPipeDB Utilities, which are where? I don't see any in the basic distribution, or are they not standalone and called from the command line?
>> > >>> > > > > >
>> > >>> > > > > > Thanks!
>> > >>> > > > > >
>> > >>> > > > > > Richard
>> > >>> > > > >
>> > >>> > > > > <ATT00001..txt><ATT00002..txt>
>> > >>> > > >
>> > >>> > > > ------------------------------------------------------------------------------
>> > >>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
>> > >>> > > > Pinpoint memory and threading errors before they happen.
>> > >>> > > > Find and fix more than 250 security defects in the development cycle.
>> > >>> > > > Locate bottlenecks in serial and parallel code that limit performance.
>> > >>> > > > http://p.sf.net/sfu/intel-dev2devfeb
>> > >>> > > > _______________________________________________
>> > >>> > > > xmlpipedb-developer mailing list
>> > >>> > > > xml...@li...
>> > >>> > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>> > >>>
>> > >>> ------------------------------------------------------------------------------
>> > >>> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk
>> > >>> Collect, index and harness all the fast moving IT data generated by your
>> > >>> applications, servers and devices whether physical, virtual or in the cloud.
>> > >>> Deliver compliance at lower cost and gain new business insights.
>> > >>> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev
>> > >>> _______________________________________________
>> > >>> xmlpipedb-developer mailing list
>> > >>> xml...@li...
>> > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
|