Re: [XMLPipeDB-developer] 499 - PROBLEM - M tuberculosis xml tag importation
From: Richard B. <rbr...@gm...> - 2011-02-21 23:56:37
So here is the appropriate code snippet from
MycobacteriumTuberculosisUniProtSpeciesProfile.java:

    public TableManager getSystemTableManagerCustomizations(TableManager tableManager,
            TableManager primarySystemTableManager, Date version)
            throws SQLException, InvalidParameterException {
        // Build the base query; we only use "ordered locus" and we only want
        // IDs that begin with "Rv."
        PreparedStatement ps = ConnectionManager.getRelationalDBConnection().prepareStatement(
            "SELECT value, type " +
            "FROM genenametype INNER JOIN entrytype_genetype " +
            "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " +
            "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?");
        ResultSet result;
        for (Row row : primarySystemTableManager.getRows()) {
            ps.setInt(1, Integer.parseInt(row.getValue("UID")));
            result = ps.executeQuery();

            // We actually want to keep the case where multiple ordered locus
            // names appear.
            while (result.next()) {
                // We want this name to appear in the OrderedLocusNames
                // system table.
                for (String id : result.getString("value").split("/")) {
                    tableManager.submit("OrderedLocusNames", QueryType.insert, new String[][] {
                        { "ID", id },
                        { "Species", "|" + getSpeciesName() + "|" },
                        { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) },
                        { "UID", row.getValue("UID") }
                    });
                }
            }
        }

----------------------------------------------------------------------

So now we want to build the base query so that it uses both "ordered locus"
and "orf", and we still only want IDs that begin with "Rv". I know there are
more comprehensive ways to search for gene IDs by matching the gene ID prefix,
but "like 'Rv%'" has worked thus far; we just need to tell it to search for
the XML tag type orf in addition to ordered locus.

    "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and entrytype_gene_hjid = ?"

Here is a stab at it... This part of our class was right as the server went
down, and I can't seem to find my submission for the week 6 assignment.

Is it possible to have two different types in the same query, or should we
write a separate query for the orf tag?

Richard

On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote:

> thanks and will do as directed.
>
> My comment in the last paragraph of my previous message: is there a way to
> make program code in an email hold its formatting, similar to how you can
> post code on forum pages?
>
> <code>
> blah
> blah
> blah
> </code>
>
> thanks!
>
> Richard
>
> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <do...@lm...> wrote:
>
>> Greetings,
>>
>> Actually, gmbuilder.properties is for the TallyEngine only. When dealing
>> with .gdb exports, look *only* at the SpeciesProfile class. So, to find
>> those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile
>> code, that needs to be changed.
>>
>> Your take on how gmbuilder.properties is used, however, is understandable.
>> It makes sense to assume that the TallyEngine code *and* the ID export code
>> are based on the same characterization of the needed IDs. This replication
>> is originally a historical artifact: SpeciesProfile was done first, and then
>> TallyEngine was done later by another student.
>>
>> However, there are other factors beyond history that sort of necessitate
>> this duplication of desired IDs: (skip the two bullets below if you'd rather
>> cut to the chase of the work to be done, and discuss design issues later)
>>
>> - The actual XML import code is a black box: this is the "canned" JAXB
>> library actually in action, and not our code at all. Plus, the XML import
>> code really does not filter (nor should it), since the goal of the
>> XML->relational database step is to fully capture the XML data in the
>> relational database. So, XML count is necessarily separated from XML
>> import.
>>
>> - The notion of a declarative mechanism for extracting IDs from the
>> relational database (which is what gmbuilder.properties/TallyEngine uses) is
>> interesting, but at the same time there is value in the arbitrary
>> computation that can be done with Java (case in point: export two versions
>> of an ID, with and without periods). This is not to say that it is
>> impossible to do this declaratively, but let's just say that the procedural
>> approach exists here and now, and a declarative approach will need more
>> thought.
>>
>> These, and other factors, are good thoughts to hold onto and would be
>> worthy of a good meeting discussion sometime, but bottom line for now:
>> modifying the export behavior is a matter of editing the *SpeciesProfile*
>> Java code, and not the gmbuilder.properties file. Turn your attention to
>> that code.
>>
>> Now, as to annotating your code...I'd just put in code comments :) Or
>> did you mean something else by tagging code in e-mail?
>>
>> John David N. Dionisio, PhD
>> Associate Professor, Computer Science
>> Loyola Marymount University
>>
>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote:
>>
>> > also, how do I tag code in email so it holds its formatting? I tried a
>> > few suggestions I found on the web but they aren't holding formatting or
>> > I'm just doing it wrong ;-D
>> >
>> > Richard
>> >
>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote:
>> > OK, have some updates and some suggestions:
>> >
>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing
>> > report. We verified that XML match does indeed find 4066 unique matches,
>> > 75 of which are not in the gdb and need to be.
>> >
>> > Dr. Dahlquist informed me that she was the one who completed the gene db
>> > testing report, not a previous student of BIO367, and had already
>> > verified which genes were missing and where they were to be found. I had
>> > (mistakenly) assumed that since a student had performed the gene database
>> > testing I had to redo all of the verification.
>> >
>> > So that said, of the 75 genes missing, 69 need to be included and 6
>> > excluded.
>> > Per the gene db testing report: "69 of them have an "a", "b", or "d"
>> > suffix. They are all found in the ORF tag and need to be included in the
>> > gdb."
>> >
>> > To solve this we need to add additional search criteria into the M.
>> > tuberculosis section in gmbuilder.properties below:
>> >
>> > # Mycobacterium tuberculosis
>> > mycobacteriumtuberculosis_level_amount=1
>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>> > mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>> >
>> > SOLUTIONS:
>> >
>> > 1. So am I correct in my understanding that the second line is the query
>> > used by TallyEngine to read the XML file? If so then this is the issue we
>> > need to table for the moment until we get the gdb verified and
>> > re-released. We will revisit this to discover why it is not only
>> > reporting incorrectly but also why it's added a second row of Ordered
>> > Locus on the TallyEngine results page.
>> >
>> > 2. The third line is the SQL query used by postgres during the export
>> > from XML to gdb.
>> > To find and get the ORF tagged genes could we not add the following
>> > lines and change the count in the first line:
>> >
>> > # Mycobacterium tuberculosis
>> > mycobacteriumtuberculosis_level_amount=2
>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf
>> > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus';
>> > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf';
>> > mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus
>> >
>> > ----------------------------------------------------------------------
>> >
>> > Of course these queries would have to be manually verified prior to
>> > making these changes but this seems like we are moving in the right
>> > direction.
>> >
>> > Richard
>> >
>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote:
>> > Just got done reading previous email and understand the change in
>> > priority.
>> >
>> > Will work on the missing IDs for now and shelve the TallyEngine issue
>> > for the moment.
>> >
>> > Also great about a more formalized weekly meeting. I was going to suggest
>> > it myself as it has been slow going so far as maybe I'm a bit too
>> > independent in this independent study class =D
>> >
>> > Will dig further into the missing IDs later tonight and during the day
>> > tomorrow and report back.
>> >
>> > Richard
>> >
>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote:
>> > Hi Rich,
>> >
>> > No problem.
>> > The pertinent line you're referring to, for XML, is this, right above
>> > the line you copied:
>> >
>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> >
>> > The slash-separated section is the "path" of XML tags leading to the
>> > element of interest; then, after the ampersand, is a name/value pair for
>> > the desired attribute to count. Note that there is no hint of a
>> > *content*-based filter (nor is there the capability for one, as far as I
>> > can tell in the code). By "content," I mean that we can't specify filters
>> > based on what's *between* the tags. We can only go as far as filtering by
>> > attribute value, e.g., type="ordered locus".
>> >
>> > But anyway, as mentioned in the earlier e-mail, let's have the missing
>> > IDs in the .gdb take precedence for now. Please take a look at the
>> > tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for
>> > how the ID output can be customized, then let me know if you have any
>> > questions or need to confirm anything.
>> >
>> > John David N. Dionisio, PhD
>> > Associate Professor, Computer Science
>> > Loyola Marymount University
>> >
>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote:
>> >
>> > > Sorry, been slammed with a programming assignment that kept needing
>> > > continued iteration and it has been all-consuming until last night. But
>> > > I did get a chance to work with your comments and review the code again
>> > > with a different mindset.
>> > >
>> > > Yes, I examined the gmbuilder.properties file (the query is also in the
>> > > MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a
>> > > previous email) but I don't think I see what you mean regarding the XML
>> > > count.
>> > >
>> > > I understood that:
>> > >
>> > > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>> > >
>> > > was the db query but don't see which is the XML count... or do they
>> > > share the same query and you are saying that XML count doesn't
>> > > recognize and therefore cannot use the 'Rv%' parameter?
>> > >
>> > > Richard
>> > >
>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> wrote:
>> > > Hi Rich,
>> > >
>> > > Sorry for the delay. Had some distractions coming into the weekend.
>> > >
>> > > You've looked at the code; have you looked at gmbuilder.properties? (I
>> > > may have mentioned it a few e-mails ago, just as you were starting to
>> > > dig into this)
>> > >
>> > > On the copy I have, the M. tuberculosis block looks like this
>> > > (indentation is mine to set it apart):
>> > >
>> > >     # Mycobacterium tuberculosis
>> > >     mycobacteriumtuberculosis_level_amount=1
>> > >     mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> > >     mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>> > >     mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>> > >
>> > > There, I think, is the rub. Notice that the XML count does not filter
>> > > on Rv%. The SQL query does.
>> > >
>> > > Unfortunately, I don't think the TallyEngine can include selective
>> > > filtering in the XML counts. If selective filtering on XML is
>> > > necessary, then I think we're looking at a new functionality for you to
>> > > implement (or, if this throws things off too much, this may have to be
>> > > noted somewhere, that the XML vs. database counts may be off because
>> > > the database count is doing some text-based filtering but the XML count
>> > > does not).
>> > >
>> > > What does xmlpipedb-match say? That will at least tell you whether the
>> > > 'Rv%' count is indeed correct.
>> > >
>> > > John David N. Dionisio, PhD
>> > > Associate Professor, Computer Science
>> > > Loyola Marymount University
>> > >
>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote:
>> > >
>> > > > OK here is what I was able to put together from the past few hours of
>> > > > code review:
>> > > >
>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java:
>> > > > - reveals that after the 2 System table modifications are made adding
>> > > > species name and link, a PreparedStatement is instantiated which
>> > > > builds and calls the base query.
>> > > >
>> > > > - The base query called is: ("SELECT value, type " + "FROM
>> > > > genenametype INNER JOIN entrytype_genetype " +
>> > > > "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " +
>> > > > "WHERE type = 'ordered locus' and value like 'Rv%' and
>> > > > entrytype_gene_hjid = ?")
>> > > >
>> > > > - So it's looking in the 'ordered locus' table/column for any tuple
>> > > > that starts with Rv (followed by any substring) and
>> > > > entrytype_gene_hjid = ?.
>> > > > The 'like' comparator and % usage are clear with the 'type'
>> > > > entrytype_gene_hjid = ?
>> > > >
>> > > > - To me it seems the query makes sense so the problem is likely
>> > > > elsewhere.
>> > > >
>> > > > GenMappBuilder.java:
>> > > > - I found method doTallies() at code line 895 which:
>> > > >   Instantiates a Configuration called hibernateConfiguration and
>> > > >   assigns to it the current hibernate configuration
>> > > >   Validates database settings by analyzing hibernateConfiguration
>> > > >   Instantiates a CriterionList for uniprot and assigns to it TallyType.UNIPROT
>> > > >   Instantiates a CriterionList for go and assigns to it TallyType.GO
>> > > >   Determines if both xml files exist
>> > > >   Then getTallyResultsXML and getTallyResultsDatabase are run on both
>> > > >   xml files and their respective CriterionList
>> > > >   Results are then formatted for display in a table.
>> > > >
>> > > > - So TallyType is an enum, which means its values are the only valid
>> > > > types that TallyEngine accepts... good to know ...
>> > > >
>> > > > - Based on the screenshot of Tally Engine it would seem that both
>> > > > getTallyResultsXML() and getTallyResultsDatabase() are returning
>> > > > incorrect results, likely due to both using an incorrect query (as we
>> > > > previously supposed). But where are the queries?... the more I dig
>> > > > the more I think they are in the criteria all the work is done
>> > > > against.
>> > > >
>> > > > continuing the review:
>> > > > getTallyResultsXML() calls Tally Engine instance method
>> > > > getXmlFileCounts(xmlFile)
>> > > > getTallyResultsDatabase() calls Tally Engine instance method
>> > > > getDbcounts(new QueryEngine(hibernateConfiguration))
>> > > > Both of these instance methods originate from TallyEngine.java...
>> > > >
>> > > > TallyEngine.java:
>> > > >
>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a
>> > > > digester then processes against criteria... but this quickly becomes
>> > > > confusing and is hard to follow
>> > > >
>> > > > getDbcounts() then starts a db session and executes a query but then
>> > > > I also get a bit lost with my limited db knowledge.
>> > > >
>> > > > ----------------------------------------------------------------------
>> > > >
>> > > > OVERALL I think I'm getting closer to the issues but I still feel as
>> > > > if I'm missing some understanding to proceed further. Can you pass
>> > > > along some of that Dondi insight and steer me in the right direction?
>> > > > =D
>> > > >
>> > > > - DB Tally - Not having taken databases yet certainly is limiting my
>> > > > ability to determine where the "criteria" are being set and how they
>> > > > are followed during session activities. Also is the query we have
>> > > > been looking for this whole time in the criteria or someplace else?
>> > > >
>> > > > - XML Tally - again is the query contained within the criteria that
>> > > > digestXmlFile() uses to parse?
>> > > >
>> > > > Richard
>> > > >
>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> wrote:
>> > > > Right, schema issues are unlikely. Most count discrepancies like this
>> > > > that I've seen have boiled down to forming the right query. Then,
>> > > > knowing the right query (in both XML and SQL), it's a matter of
>> > > > making sure that TallyEngine asks that same query.
>> > > >
>> > > > John David N. Dionisio, PhD
>> > > > Associate Professor, Computer Science
>> > > > Loyola Marymount University
>> > > >
>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote:
>> > > >
>> > > > > OK, so based on your approach:
>> > > > >
>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and
>> > > > > sql queries needed for the respective results as you requested.
>> > > > >
>> > > > > I was also thinking I may need to review the schema from xml into
>> > > > > postgres but the issue isn't likely a schema error. The error most
>> > > > > likely lies in how xmlpipedbutils queries the data from xml source
>> > > > > and writes to the tables what it returns?
>> > > > >
>> > > > > 2. I'll review the code: trace the entrance of tally engine in the
>> > > > > gmbuilder code then follow it through the xmlpipedbutils.
>> > > > >
>> > > > > Richard
>> > > > >
>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <do...@lm...> wrote:
>> > > > > Just wanted to confirm (since I wasn't sure in the first e-mail)
>> > > > > --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils
>> > > > > in SourceForge's Subversion repo.
>> > > > >
>> > > > > John David N. Dionisio, PhD
>> > > > > Associate Professor, Computer Science
>> > > > > Loyola Marymount University
>> > > > >
>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote:
>> > > > >
>> > > > > > Hi Dondi,
>> > > > > >
>> > > > > > So I'm at the point in working with M tuberculosis that I was
>> > > > > > able to exactly reproduce Dr. Dahlquist's problematic
>> > > > > > TallyEngine results.
>> > > > > >
>> > > > > > gmb2b60 Results
>> > > > > >
>> > > > > > Now the proverbial question - What next to solve the Ordered
>> > > > > > Locus import/count issue?
>> > > > > >
>> > > > > > **********************************************
>> > > > > > Here is my thought process:
>> > > > > >
>> > > > > > Step 1: How does the import process work at the high level?
>> > > > > > (obviously correct me if I'm wrong)
>> > > > > >
>> > > > > > I believe that basically as each XML tag is read, it is placed in
>> > > > > > the proper Postgres table(s) based on some criteria. There is
>> > > > > > also likely some sort of check that each individual tag is in
>> > > > > > valid XML format unless we don't care at this stage (care at
>> > > > > > export) or maybe the parser just skips over and goes on to the
>> > > > > > next.
>> > > > > >
>> > > > > > Step 2: What could be the problem?
>> > > > > >
>> > > > > > Either -
>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)?
>> > > > > > b. Decision criteria of which table they should be added to?
>> > > > > >
>> > > > > > **********************************************
>> > > > > >
>> > > > > > I read on the sourceforge wiki:
>> > > > > >
>> > > > > > XMLPipeDB has a modular architecture with three components that
>> > > > > > may be used separately or together. XSD-to-DB reads an XSD (XML
>> > > > > > Schema Definition) and automatically generates an SQL schema,
>> > > > > > Java classes, and Hibernate mappings. XMLPipeDB Utilities
>> > > > > > provides functionality for configuring the database, importing
>> > > > > > data, and performing queries. GenMAPP Builder is based on the
>> > > > > > XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases
>> > > > > > based on data from UniProt and Gene Ontology (GO).
>> > > > > >
>> > > > > > So I should probably start with the XMLPipeDB Utilities which are
>> > > > > > where? I don't see any in the basic distribution or are they not
>> > > > > > standalone and called from the command line?
>> > > > > >
>> > > > > > Thanks!
>> > > > > >
>> > > > > > Richard
>> > > >
>> > > > ------------------------------------------------------------------------------
>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
>> > > > Pinpoint memory and threading errors before they happen.
>> > > > Find and fix more than 250 security defects in the development cycle.
>> > > > Locate bottlenecks in serial and parallel code that limit performance.
>> > > > http://p.sf.net/sfu/intel-dev2devfeb
>> > > > _______________________________________________
>> > > > xmlpipedb-developer mailing list
>> > > > xml...@li...
>> > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
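
On the question at the top of this message (one query for two types, or two
separate queries): a single genenametype row carries one type value, so a
condition written as type = 'ordered locus' and type = 'orf' can never be true
and would return no rows. A minimal sketch of one alternative, assuming the
same tables and columns as the query quoted from the SpeciesProfile class, is
to accept either type with IN (...). The class OrfQuerySketch, the GeneName
holder, and the sample rows are made up for illustration and are not part of
GenMAPP Builder; the in-memory filter only mimics the two WHERE clauses so the
difference can be seen without a database.

```java
import java.util.ArrayList;
import java.util.List;

class OrfQuerySketch {

    // Suggested query text: same tables and columns as the query quoted in the
    // thread, with the single-type test replaced by IN (...). This has not
    // been run against the real schema.
    static final String QUERY =
        "SELECT value, type "
        + "FROM genenametype INNER JOIN entrytype_genetype "
        + "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) "
        + "WHERE type IN ('ordered locus', 'orf') "
        + "AND value LIKE 'Rv%' AND entrytype_gene_hjid = ?";

    // Hypothetical in-memory stand-in for one genenametype row.
    static final class GeneName {
        final String value;
        final String type;

        GeneName(String value, String type) {
            this.value = value;
            this.type = type;
        }
    }

    // Mirrors: type = 'ordered locus' AND type = 'orf' AND value LIKE 'Rv%'.
    // One row has exactly one type, so this is false for every row.
    static boolean matchesAnd(GeneName g) {
        return g.type.equals("ordered locus") && g.type.equals("orf")
            && g.value.startsWith("Rv");
    }

    // Mirrors: type IN ('ordered locus', 'orf') AND value LIKE 'Rv%'.
    static boolean matchesIn(GeneName g) {
        return (g.type.equals("ordered locus") || g.type.equals("orf"))
            && g.value.startsWith("Rv");
    }

    // Illustrative rows only; they are not taken from the UniProt export.
    static List<GeneName> sampleRows() {
        List<GeneName> rows = new ArrayList<>();
        rows.add(new GeneName("Rv0001", "ordered locus"));
        rows.add(new GeneName("Rv0078A", "orf"));
        rows.add(new GeneName("MT0001", "ordered locus"));
        rows.add(new GeneName("dnaA", "primary"));
        return rows;
    }

    // Returns the IDs kept by the chosen filter, in input order.
    static List<String> select(List<GeneName> rows, boolean useIn) {
        List<String> ids = new ArrayList<>();
        for (GeneName g : rows) {
            if (useIn ? matchesIn(g) : matchesAnd(g)) {
                ids.add(g.value);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println("AND form: " + select(sampleRows(), false));
        System.out.println("IN form:  " + select(sampleRows(), true));
    }
}
```

Running main prints an empty list for the AND form and [Rv0001, Rv0078A] for
the IN form. If the IN version holds up against the real database, the QUERY
string could stand in for the one passed to prepareStatement in the
SpeciesProfile class; whether the 'orf' values need any further handling there
is not something this sketch checks.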