Re: [XMLPipeDB-developer] 499 - PROBLEM - M tuberculosis xml tag importation
From: Richard B. <rbr...@gm...> - 2011-02-21 05:38:09
also, how do I tag code in email so it holds its formatting? I tried a few suggestions I found on the web but they aren't holding formatting or I'm just doing it wrong ;-D

Richard

On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote:

> OK, have some updates and some suggestions:
>
> On Friday Dr. Dahlquist and I sat down and reviewed the gene testing report. We verified that XML match does indeed find 4066 unique matches - 75 of which are not in the gdb and need to be.
>
> Dr. Dahlquist informed me that she was the one who completed the gene db testing report, not a previous student of BIO367, and had already verified which genes were missing and where they were to be found. I had (mistakenly) assumed that since a student had performed the gene database testing I had to redo all of the verification.
>
> So that said, of the 75 genes missing - 69 need to be included and 6 excluded.
> Per the gene db testing report: "69 of them have an "a", "b", or "d" suffix. They are all found in the ORF tag and need to be included in the gdb."
>
> To solve this we need to add additional search criteria into the M. tuberculosis section in gmbuilder.properties below:
>
> # Mycobacterium tuberculosis
> mycobacteriumtuberculosis_level_amount=1
> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
> mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
> mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>
> SOLUTIONS:
>
> 1. So am I correct in my understanding that the second line is the query used by TallyEngine to read the XML file? If so then this is the issue we need to table for the moment until we get the gdb verified and re-released. We will revisit this to discover why it is not only reporting incorrectly but also why it's added a second row of Ordered Locus on the TallyEngine results page.
>
> 2. The third line is the SQL query used by postgres during the export from XML to gdb. To find and get the ORF tagged genes could we not add the following lines and change the count in the first line:
>
> # Mycobacterium tuberculosis
> mycobacteriumtuberculosis_level_amount=2
> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf
> mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus';
> mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf';
> mycobacteriumtuberculosis_table_name_level0=Ordered Locus
> mycobacteriumtuberculosis_table_name_level1=Ordered Locus
>
> ------------------------------------------------------------
>
> Of course these queries would have to be manually verified prior to making these changes but this seems like we are moving in the right direction.
>
> Richard
>
> On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote:
>
>> Just got done reading previous email and understand the change in priority.
>>
>> Will work on the missing IDs for now and shelve the TallyEngine issue for the moment.
>>
>> Also great about a more formalized weekly meeting. I was going to suggest it myself as it has been slow going so far as maybe I'm a bit too independent in this independent study class =D
>>
>> Will dig further into the missing IDs later tonight and during the day tomorrow and report back.
>>
>> Richard
>>
>> On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote:
>>
>>> Hi Rich,
>>>
>>> No problem.
>>> The pertinent line you're referring to, for XML, is this, right above the line you copied:
>>>
>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>>>
>>> The slash-separated section is the "path" of XML tags leading to the element of interest; then, after the ampersand, is a name/value pair for the desired attribute to count. Note that there is no hint of a *content*-based filter (nor is there the capability for one, as far as I can tell in the code). By "content," I mean that we can't specify filters based on what's *between* the tags. We can only go as far as filter by attribute value, e.g., type="ordered locus".
>>>
>>> But anyway, as mentioned in the earlier e-mail, let's have the missing IDs in the .gdb take precedence for now. Please take a look at the tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how the ID output can be customized, then let me know if you have any questions or need to confirm anything.
>>>
>>> John David N. Dionisio, PhD
>>> Associate Professor, Computer Science
>>> Loyola Marymount University
>>>
>>> On Feb 17, 2011, at 3:04 PM, Richard Brous wrote:
>>>
>>> > Sorry been slammed with a programming assignment that kept needing continued iteration and it has been all consuming until last night. But I did get a chance to work with your comments and review the code again with a different mind set.
>>> >
>>> > Yes, I examined the gmbuilder.properties file ( the query is also in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a previous email ) but I don't think I see what you mean regarding the XML count.
>>> >
>>> > I understood that: mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%'; was the db query but don't see which is the XML count...
>>> > or do they share the same query and you are saying that XML count doesn't recognize and therefore cannot use the 'Rv%' parameter?
>>> >
>>> > Richard
>>> >
>>> > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> wrote:
>>> > Hi Rich,
>>> >
>>> > Sorry for the delay. Had some distractions coming into the weekend.
>>> >
>>> > You've looked at the code; have you looked at gmbuilder.properties? (I may have mentioned it a few e-mails ago, just as you were starting to dig into this)
>>> >
>>> > On the copy I have, the M. tuberculosis block looks like this (indentation is mine to set it apart):
>>> >
>>> >     # Mycobacterium tuberculosis
>>> >     mycobacteriumtuberculosis_level_amount=1
>>> >     mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>>> >     mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>>> >     mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>>> >
>>> > There, I think, is the rub. Notice that the XML count does not filter on Rv%. The SQL query does.
>>> >
>>> > Unfortunately, I don't think the TallyEngine can include selective filtering in the XML counts. If selective filtering on XML is necessary, then I think we're looking at a new functionality for you to implement (or, if this throws things off too much, this may have to be noted somewhere, that the XML vs. database counts may be off because the database count is doing some text-based filtering but the XML count does not).
>>> >
>>> > What does xmlpipedb-match say? That will at least tell you whether the 'Rv%' count is indeed correct.
>>> >
>>> > John David N. Dionisio, PhD
>>> > Associate Professor, Computer Science
>>> > Loyola Marymount University
>>> >
>>> > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote:
>>> >
>>> > > OK here is what I was able to put together from the past few hours of code review:
>>> > >
>>> > > MycobacteriumTuberculosisUniProtSpeciesProfile.java:
>>> > > -reveals that after the 2 System table modifications are made adding species name and link, a PreparedStatement is instantiated which builds and calls the base query.
>>> > >
>>> > > -The base query called is:
>>> > > ("SELECT value, type " +
>>> > > "FROM genenametype INNER JOIN entrytype_genetype " +
>>> > > "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " +
>>> > > "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?")
>>> > >
>>> > > -So it's looking in 'ordered locus' table/column for any tuple that starts with Rv (followed by any substring) and entrytype_gene_hjid = ? .
>>> > > The 'like' comparator and % usage are clear with the 'type' entrytype_gene_hjid = ?
>>> > >
>>> > > -To me it seems the query makes sense so the problem is likely elsewhere.
>>> > >
>>> > > GenMappBuilder.java:
>>> > > -I found method doTallies() at code line 895 which:
>>> > > Instantiates a Configuration called hibernateConfiguration and assigns to it the current hibernate configuration
>>> > > Validates database settings by analyzing hibernateConfiguration
>>> > > Instantiates a CriterionList for uniprot and assigns to it TallyType.UNIPROT
>>> > > Instantiates a CriterionList for go and assigns to it TallyType.GO
>>> > > Determines if both xml files exist
>>> > > Then getTallyResultsXML and getTallyResultsDatabase are run on both xml files and their respective CriterionList
>>> > > Results are then formatted for display in a table.
>>> > >
>>> > > -So enum TallyType which means that they are the only valid datatypes which TallyEngine accepts... good to know ...
>>> > >
>>> > > -Based on the screen shot of Tally Engine it would seem that both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly returning. Likely due to both using an incorrect query (as we previously supposed). But where are the queries?... the more I dig the more I think they are in the criteria all the work is done against.
>>> > >
>>> > > continuing the review:
>>> > > getTallyResultsXML() calls Tally Engine instance method getXmlFileCounts(xmlFile)
>>> > > getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new QueryEngine(hibernateConfiguration))
>>> > > Both of these instance methods originate from TallyEngine.java...
>>> > >
>>> > > TallyEngine.java:
>>> > >
>>> > > getXmlFileCounts() calls digestXmlFile() which instantiates a digester then processes against criteria... but this quickly becomes confusing and is hard to follow
>>> > >
>>> > > getDbcounts() then starts a db session and executes a query but then I also get a bit lost with my limited db knowledge.
>>> > >
>>> > > ------------------------------------------------------------
>>> > >
>>> > > OVERALL I think I'm getting closer to the issues but I still feel as if I'm missing some understanding to proceed further. Can you pass along some of that Dondi insight and steer me in the right direction? =D
>>> > >
>>> > > -DB Tally - Not having taken databases yet certainly is limiting my ability to determine where the "criteria" are being set and how they are followed during session activities. Also is the query we have been looking for this whole time in the criteria or someplace else?
>>> > >
>>> > > -XML Tally - again is the query contained within the criteria that digestXmlFile() uses to parse?
>>> > >
>>> > > Richard
>>> > >
>>> > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> wrote:
>>> > > Right, schema issues are unlikely. Most count discrepancies like this that I've seen have boiled down to forming the right query. Then, knowing the right query (in both XML and SQL), it's a matter of making sure that TallyEngine asks that same query.
>>> > >
>>> > > John David N. Dionisio, PhD
>>> > > Associate Professor, Computer Science
>>> > > Loyola Marymount University
>>> > >
>>> > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote:
>>> > >
>>> > > > OK, so based on your approach:
>>> > > >
>>> > > > 1. I'll start with reviewing the queries for xmlpipedb-match and sql queries needed for the respective results as you requested.
>>> > > >
>>> > > > I was also thinking I may need to review the schema from xml into postgres but the issue isn't likely a schema error. The error most likely lies in how xmlpipedbutils queries the data from xml source and writes to the tables what it returns?
>>> > > >
>>> > > > 2. I'll review the code: trace the entrance of tally engine in the gmbuilder code then follow it through the xmlpipedbutils.
>>> > > >
>>> > > > Richard
>>> > > >
>>> > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <do...@lm...> wrote:
>>> > > > Just wanted to confirm (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in SourceForge's Subversion repo.
>>> > > >
>>> > > > John David N. Dionisio, PhD
>>> > > > Associate Professor, Computer Science
>>> > > > Loyola Marymount University
>>> > > >
>>> > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote:
>>> > > >
>>> > > > > Hi Dondi,
>>> > > > >
>>> > > > > So I'm at the point in working with M tuberculosis that I was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results.
>>> > > > >
>>> > > > > gmb2b60 Results
>>> > > > >
>>> > > > > Now the proverbial question - What next to solve the Ordered Locus import/count issue?
>>> > > > >
>>> > > > > **********************************************
>>> > > > > Here is my thought process:
>>> > > > >
>>> > > > > Step 1: How does the import process work at the high level? (obviously correct me if I'm wrong)
>>> > > > >
>>> > > > > I believe that basically as each XML tag is read, it is placed in the proper Postgres table(s) based on some criteria. There is also likely some sort of check that each individual tag is in valid XML format unless we don't care at this stage (care at export) or maybe the parser just skips over and goes on to the next .
>>> > > > >
>>> > > > > Step 2: What could be the problem?
>>> > > > >
>>> > > > > Either -
>>> > > > > a. XML tags are being parsed incorrectly (ignored/skipped)?
>>> > > > > b. Decision criteria of which table they should be added to?
>>> > > > >
>>> > > > > **********************************************
>>> > > > >
>>> > > > > I read on the sourceforge wiki:
>>> > > > >
>>> > > > > XMLPipeDB has a modular architecture with three components that may be used separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and automatically generates an SQL schema, Java classes, and Hibernate mappings. XMLPipeDB Utilities provides functionality for configuring the database, importing data, and performing queries. GenMAPP Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on data from UniProt and Gene Ontology (GO).
>>> > > > >
>>> > > > > So I should probably start with the XMLPipeDB Utilities which are where? I don't see any in the basic distribution or are they not standalone and called from the command line?
>>> > > > >
>>> > > > > Thanks!
>>> > > > >
>>> > > > > Richard
>>> > >
>>> > > ------------------------------------------------------------------------------
>>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
>>> > > Pinpoint memory and threading errors before they happen.
>>> > > Find and fix more than 250 security defects in the development cycle.
>>> > > Locate bottlenecks in serial and parallel code that limit performance.
>>> > > http://p.sf.net/sfu/intel-dev2devfeb
>>> > > _______________________________________________
>>> > > xmlpipedb-developer mailing list
>>> > > xml...@li...
>>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
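The count mismatch discussed in this thread (an XML count that can only match on the type attribute versus a database count that also filters on value like 'Rv%') can be reproduced on a toy scale. The sketch below is in Python and is not TallyEngine or GenMAPP Builder code. The sample XML, its gene IDs, the "ORF" attribute spelling, and the helper names gene_names, xml_count, and db_count are all made up for illustration. Real UniProt XML also carries a namespace that is omitted here, and SQLite stands in for PostgreSQL.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical miniature of a UniProt-style file; every ID below is invented.
SAMPLE_XML = """
<uniprot>
  <entry><gene>
    <name type="ordered locus">Rv0001</name>
  </gene></entry>
  <entry><gene>
    <name type="ordered locus">Rv0002</name>
    <name type="ordered locus">MT0003</name>
  </gene></entry>
  <entry><gene>
    <name type="ORF">Rv0004a</name>
  </gene></entry>
</uniprot>
""".strip()


def gene_names(xml_text):
    """Return (type, value) pairs for every uniprot/entry/gene/name element."""
    root = ET.fromstring(xml_text)
    return [(n.get("type"), n.text) for n in root.findall("./entry/gene/name")]


def xml_count(names, wanted_type):
    """Attribute-only count: looks at type=..., never at the text between the tags."""
    return sum(1 for name_type, _ in names if name_type == wanted_type)


def db_count(names, sql):
    """Load the pairs into an in-memory table and run a count query against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table genenametype (type text, value text)")
    conn.executemany("insert into genenametype values (?, ?)", names)
    (count,) = conn.execute(sql).fetchone()
    conn.close()
    return count


names = gene_names(SAMPLE_XML)

# XML side: counts every name with type="ordered locus", including MT0003.
xml_ordered = xml_count(names, "ordered locus")

# Database side, with the same content filter as the query quoted in the thread.
db_ordered = db_count(
    names,
    "select count(*) from genenametype "
    "where type = 'ordered locus' and value like 'Rv%'",
)

# Database side without the content filter: agrees with the XML count again.
db_unfiltered = db_count(
    names, "select count(*) from genenametype where type = 'ordered locus'"
)

print("XML count (attribute only):", xml_ordered)
print("DB count (with like 'Rv%'):", db_ordered)
print("DB count (no value filter):", db_unfiltered)
```

On this sample the attribute-only XML count is 3 while the filtered database count is 2, and dropping the value filter brings the two back into agreement. That is the same trade-off described above: either the XML side gains content-based filtering, or the difference between the two counts has to be documented.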