Re: [XMLPipeDB-developer] 499 - PROBLEM - M tuberculosis xml tag importation


OK, have some updates and some suggestions:

On Friday Dr. Dahlquist and I sat down and reviewed the gene testing report.
We verified that XML match does indeed find 4066 unique matches - 75 of
which are not in the gdb and need to be.

Dr. Dahlquist informed me that she was the one who completed the gene db
testing report, not a previous student of BIO367 and had already verified
which genes were missing and where they were to be found. I had (mistakenly)
assumed that since a student had performed the gene database testing I had
to redo all of the verification.

So that said, of the 75 genes missing - 69 need to be included and 6
excluded.
Per the gene db testing report: "69 of them have an "a", "b", or "d" suffix.
They are all found in the ORF tag and need to be included in the gdb."
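
As a rough illustration of the bookkeeping above, here is a minimal sketch that diffs two ID sets and separates out the suffixed IDs. The IDs and the assumed ID shape (Rv, digits, optional "c", then an a/b/d suffix in either case) are made up for the example; the real lists would come from the xmlpipedb-match output and the exported gdb.

```python
import re

# Hypothetical IDs for illustration only; not taken from the real data.
xml_ids = {"Rv0001", "Rv0002", "Rv0003a", "Rv0004b", "Rv0005d", "Rv0006x"}
gdb_ids = {"Rv0001", "Rv0002"}

# Assumed ID shape: "Rv", digits, optional "c", then an a/b/d suffix (any case).
SUFFIXED = re.compile(r"^Rv\d+c?[abd]$", re.IGNORECASE)


def split_missing(xml_ids, gdb_ids):
    """Return (suffixed IDs missing from the gdb, all other missing IDs)."""
    missing = xml_ids - gdb_ids
    include = sorted(i for i in missing if SUFFIXED.match(i))
    other = sorted(missing - set(include))
    return include, other


include, other = split_missing(xml_ids, gdb_ids)
print("suffixed, candidates to include:", include)
print("other missing IDs to review by hand:", other)
```

The "other" bucket is only a review list; which IDs actually get excluded is a judgment from the testing report, not something the pattern decides.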

To solve this we need to add additional search criteria into the M.
tuberculosis section in gmbuilder.properties below:

# Mycobacterium tuberculosis

mycobacteriumtuberculosis_level_amount=1

mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus

mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';

mycobacteriumtuberculosis_table_name_level0=Ordered Locus
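
Going by Dondi's description further down the thread (a slash-separated tag path, then an attribute name/value pair, with no filter on the text between the tags), the element line and the query line can disagree. The sketch below is not TallyEngine code: it uses Python's ElementTree on a made-up, namespace-free XML sample, and parse_spec and matching_elements are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a UniProt XML file. The IDs are made up, and the
# real file uses an XML namespace that is omitted here for brevity.
SAMPLE_XML = """
<uniprot>
  <entry><gene>
    <name type="primary">dnaA</name>
    <name type="ordered locus">Rv0001</name>
  </gene></entry>
  <entry><gene>
    <name type="ordered locus">Rv0002</name>
    <name type="ordered locus">MT0002</name>
  </gene></entry>
</uniprot>
"""

SPEC = "uniprot/entry/gene/name&type&ordered locus"


def parse_spec(spec):
    """Split 'path&attribute&value' into its three parts."""
    path, attr, value = spec.split("&")
    return path.split("/"), attr, value


def matching_elements(root, spec):
    """Return elements on the path whose attribute equals the value."""
    tags, attr, value = parse_spec(spec)
    # tags[0] names the root element; the rest is relative to it.
    relative_path = "/".join(tags[1:])
    return [e for e in root.findall(relative_path) if e.get(attr) == value]


root = ET.fromstring(SAMPLE_XML.strip())
elements = matching_elements(root, SPEC)

# Attribute filter only, which is all the element line can express.
xml_count = len(elements)
# Same elements with a content filter added, mirroring "value like 'Rv%'".
rv_count = sum(1 for e in elements if (e.text or "").startswith("Rv"))

print("attribute-only count:", xml_count)
print("count with Rv prefix filter:", rv_count)
```

On this sample the two counts differ by the one non-Rv ordered locus, which is the kind of mismatch an attribute-only XML count and a filtered SQL count would produce.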
SOLUTIONS:

1. So am I correct in my understanding that the second line (the
element_level0 line) is the query used by TallyEngine to read the XML file?
If so, then this is the issue we need to table for the moment until we get
the gdb verified and re-released. We will revisit this to discover not only
why it is reporting incorrectly but also why it has added a second row of
Ordered Locus on the TallyEngine results page.

2. The third line is the SQL query used by postgres during the export from
XML to gdb. To find and get the ORF tagged genes could we not add the
following lines and change the count in the first line:

# Mycobacterium tuberculosis

mycobacteriumtuberculosis_level_amount=2

mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus

mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf

mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus';

mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf';

mycobacteriumtuberculosis_table_name_level0=Ordered Locus

mycobacteriumtuberculosis_table_name_level1=Ordered Locus
----------------------------------------------------------------------------------------------------------------------------

Of course these queries would have to be manually verified prior to making
these changes, but this seems like we are moving in the right direction.
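
One possible way to do that manual check is to run the current and proposed queries side by side on a small table first. The sketch below uses an in-memory SQLite database as a stand-in for Postgres, with made-up rows; the extra "level1 with Rv filter" query is not part of the proposal and is included only for comparison. Two assumptions to keep in mind: SQLite's LIKE is case-insensitive for ASCII by default while PostgreSQL's is case-sensitive, and the 'orf' spelling follows the proposal above and should be checked against how the type value is actually stored (it may be upper case).

```python
import sqlite3

# Made-up rows standing in for the genenametype table.
ROWS = [
    ("ordered locus", "Rv0001"),
    ("ordered locus", "Rv0002"),
    ("ordered locus", "MT0002"),
    ("orf", "Rv0003a"),
    ("orf", "MTCY01.04"),
    ("primary", "dnaA"),
]

QUERIES = {
    "current level0": (
        "select count(*) from genenametype "
        "where type = 'ordered locus' and value like 'Rv%'"
    ),
    "proposed level0": (
        "select count(*) from genenametype where type = 'ordered locus'"
    ),
    "proposed level1": (
        "select count(*) from genenametype where type = 'orf'"
    ),
    # Not in the proposal; shown only to compare against proposed level1.
    "level1 with Rv filter": (
        "select count(*) from genenametype "
        "where type = 'orf' and value like 'Rv%'"
    ),
}


def run_counts(rows, queries):
    """Load rows into a scratch table and return {label: count}."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table genenametype (type text, value text)")
        conn.executemany("insert into genenametype values (?, ?)", rows)
        return {
            label: conn.execute(sql).fetchone()[0]
            for label, sql in queries.items()
        }
    finally:
        conn.close()


counts = run_counts(ROWS, QUERIES)
for label, n in counts.items():
    print(f"{label}: {n}")
```

On these toy rows, dropping the 'Rv%' condition raises the level0 count, so if the real table holds non-Rv ordered locus or ORF names the proposed queries would count those too.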

Richard

On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote:

> Just got done reading previous email and understand the change in priority.
>
> Will work on the missing ID's for now and shelve the TallyEngine issue
> for the moment.
>
> Also great about a more formalized weekly meeting. I was going to suggest
> it myself, as it has been slow going so far; maybe I'm a bit too
> independent in this independent study class =D
>
> Will dig further into the missing ID's later tonight and during day
> tomorrow and report back.
>
> Richard
>
>   On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote:
>
>> Hi Rich,
>>
>> No problem.  The pertinent line you're referring to, for XML, is this,
>> right above the line you copied:
>>
>>
>>  mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered
>> locus
>>
>> The slash-separated section is the "path" of XML tags leading to the
>> element of interest; then, after the ampersand, is a name/value pair for the
>> desired attribute to count.  Note that there is no hint of a *content*-based
>> filter (nor is there the capability for one, as far as I can tell in the
>> code).  By "content," I mean that we can't specify filters based on what's
>> *between* the tags.  We can only go as far as filter by attribute value,
>> e.g., type="ordered locus".
>>
>> But anyway, as mentioned in the earlier e-mail, let's have the missing IDs
>> in the .gdb take precedence for now.  Please take a look at the
>> tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how
>> the ID output can be customized, then let me know if you have any questions
>> or need to confirm anything.
>>
>> John David N. Dionisio, PhD
>> Associate Professor, Computer Science
>> Loyola Marymount University
>>
>>
>>
>>  On Feb 17, 2011, at 3:04 PM, Richard Brous wrote:
>>
>> > Sorry been slammed with a programming assignment that kept needing
>> continued iteration and it has been all consuming until last night. But I
>> did get a chance to work with your comments and review the code again with a
>> different mind set.
>> >
>> > Yes, I examined the gmbuilder.properties file ( the query is also in the
>> MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a
>> previous email ) but I don't think I see what you mean regarding the XML
>> count.
>> >
>> > I understood that: mycobacteriumtuberculosis_query_level0=select
>> count(*) from genenametype where type = 'ordered locus' and value like
>> 'Rv%';  was the db query but don't see which is the XML count... or do they
>> share the same query and you are saying that XML count doesn't recognize and
>> therefore cannot use the 'Rv%' parameter?
>> >
>> > Richard
>> >
>> >
>> >
>> > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...>
>> wrote:
>> > Hi Rich,
>> >
>> > Sorry for the delay.  Had some distractions coming into the weekend.
>> >
>> > You've looked at the code; have you looked at gmbuilder.properties?  (I
>> may have mentioned it a few e-mails ago, just as you were starting to dig
>> into this)
>> >
>> > On the copy I have, the M. tuberculosis block looks like this
>> (indentation is mine to set it apart):
>> >
>> >        # Mycobacterium tuberculosis
>> >        mycobacteriumtuberculosis_level_amount=1
>> >
>> >
>>  mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered
>> locus
>> >
>> >        mycobacteriumtuberculosis_query_level0=select count(*) from
>> genenametype where type = 'ordered locus' and value like 'Rv%';
>> >
>> >        mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>> >
>> > There, I think, is the rub.  Notice that the XML count does not filter
>> on Rv%.  The SQL query does.
>> >
>> > Unfortunately, I don't think the TallyEngine can include selective
>> filtering in the XML counts.  If selective filtering on XML
>> is necessary, then I think we're looking at new functionality for you to
>> implement (or, if this throws things off too much, this may have to be noted
>> somewhere, that the XML vs. database counts may be off because the database
>> count is doing some text-based filtering but the XML count does not).
>> >
>> > What does xmlpipedb-match say?  That will at least tell you whether the
>> 'Rv%' count is indeed correct.
>> >
>> > John David N. Dionisio, PhD
>> > Associate Professor, Computer Science
>> > Loyola Marymount University
>> >
>> >
>> >
>> > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote:
>> >
>> > > OK here is what I was able to put together from the past few hours of
>> code review:
>> > >
>> > > MycobacteriumTuberculosisUniProtSpeciesProfile.java:
>> > > -reveals that after the 2 System table modifications are made adding
>> species name and link, a PreparedStatement is instantiated which builds and
>> calls the base query.
>> > >
>> > > -The base query called is: ("SELECT value, type " + "FROM genenametype
>> INNER JOIN entrytype_genetype " + "ON(entrytype_genetype_name_hjid =
>> entrytype_genetype.hjid) " + "WHERE type = 'ordered locus' and value like
>> 'Rv%' and entrytype_gene_hjid = ?")
>> > >
>> > > -So it's looking for rows of type 'ordered locus' whose value
>> starts with Rv (followed by any substring) and entrytype_gene_hjid = ? .
>> > > The 'like' comparator and % usage are clear with the 'type'
>> entrytype_gene_hjid = ?
>> > >
>> > > -To me it seems the query makes sense so the problem is likely
>> elsewhere.
>> > >
>> > > GenMappBuilder.java:
>> > > -I found method doTallies() at code line 895 which:
>> > > Instantiates a Configuration called hibernateConfiguration and assigns
>> to it the current hibernate configuration
>> > > Validates database settings by analyzing hibernateConfiguration
>> > > Instantiates a CriterionList for uniprot and assigns to it
>> TallyType.UNIPROT
>> > > Instantiates a CriterionList for go and assigns to it TallyType.GO
>> > > Determines if both xml files exist
>> > > Then getTallyResultsXML and getTallyResultsDatabase are run on both
>> xml files and their respective CriterionList
>> > > Results are then formatted for display in a table.
>> > >
>> > > -So TallyType is an enum, which means that its values are the only valid types
>> which TallyEngine accepts... good to know ...
>> > >
>> > > -Based on the screen shot of Tally Engine it would seem that both
>> getTallyResultsXML() and getTallyResultsDatabase() are returning
>> incorrect results, likely due to both using an incorrect query (as we previously
>> supposed). But where are the queries?... the more I dig the more I think
>> they are in the criteria that all the work is done against.
>> > >
>> > > continuing the review:
>> > > getTallyResultsXML() calls Tally Engine instance method
>> getXmlFileCounts(xmlFile)
>> > > getTallyResultsDatabase() calls Tally Engine instance method
>> getDbcounts(new QueryEngine(hibernateConfiguration))
>> > > Both of these instanced methods originate from TallyEngine.java...
>> > >
>> > > TallyEngine.java:
>> > >
>> > > getXmlFileCounts() calls digestXmlFile() which instantiates a digester
>> then processes against criteria... but this quickly becomes confusing and is
>> hard to follow
>> > >
>> > > getDbcounts() then starts a db session and executes a query but then I
>> also get a bit lost with my limited db knowledge.
>> > >
>> > >
>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> > >
>> > > OVERALL I think I'm getting closer to the issues but I still feel as
>> if I'm missing some understanding to proceed further. Can you pass along
>> some of that Dondi insight and steer me in the right direction? =D
>> > >
>> > > -DB Tally - Not having taken databases yet certainly is limiting my
>> ability to determine where the "criteria" are being set and how they are
>> followed during session activities. Also is the query we have been looking
>> for this whole time in the criteria or someplace else?
>> > >
>> > > -XML Tally - again is the query contained within the criteria that
>> digestXmlFile() uses to parse?
>> > >
>> > > Richard
>> > >
>> > >
>> > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...>
>> wrote:
>> > > Right, schema issues are unlikely.  Most count discrepancies like this
>> that I've seen have boiled down to forming the right query.  Then, knowing
>> the right query (in both XML and SQL), it's a matter of making sure that
>> TallyEngine asks that same query.
>> > >
>> > > John David N. Dionisio, PhD
>> > > Associate Professor, Computer Science
>> > > Loyola Marymount University
>> > >
>> > >
>> > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote:
>> > >
>> > > > OK, so based on your approach:
>> > > >
>> > > > 1. I'll start with reviewing the queries for xmlpipedb-match and sql
>> queries needed for the respective results as you requested.
>> > > >
>> > > > I was also thinking I may need to review the schema from xml into
>> postgres but the issue isn't likely a schema error. The error most likely
>> lies in how xmlpipedbutils queries the data from xml source and writes to
>> the tables what it returns?
>> > > >
>> > > > 2. I'll review the code: trace the entrance of tally engine in the
>> gmbuilder code then follow it through the xmlpipedbutils.
>> > > >
>> > > > Richard
>> > > >
>> > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <
>> do...@lm...> wrote:
>> > > > Just wanted to confirm (since I wasn't sure in the first e-mail) ---
>> the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in
>> SourceForge's Subversion repo.
>> > > >
>> > > > John David N. Dionisio, PhD
>> > > > Associate Professor, Computer Science
>> > > > Loyola Marymount University
>> > > >
>> > > >
>> > > >
>> > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote:
>> > > >
>> > > > > Hi Dondi,
>> > > > >
>> > > > > So I'm at the point in working with M tuberculosis that I was able
>> to exactly reproduce Dr. Dahlquist's problematic TallyEngine results.
>> > > > >
>> > > > > gmb2b60 Results
>> > > > >
>> > > > >
>> > > > >
>> > > > > Now the proverbial question - What next to solve the Ordered Locus
>> import/count issue?
>> > > > >
>> > > > > **********************************************
>> > > > > Here is my thought process:
>> > > > >
>> > > > > Step 1: How does the import process work at the high level?
>> (obviously correct me if I'm wrong)
>> > > > >
>> > > > > I believe that basically as each XML tag is read, it is placed in
>> the proper Postgres table(s) based on some criteria. There is also likely
>> some sort of check that each individual tag is in valid XML format unless we
>> don't care at this stage (care at export) or maybe the parser just skips
>> over and goes on to the next .
>> > > > >
>> > > > > Step 2: What could be the problem?
>> > > > >
>> > > > > Either -
>> > > > > a. XML tags are being parsed incorrectly (ignored/skipped)?
>> > > > > b. Decision criteria of which table they should be added to?
>> > > > >
>> > > > > **********************************************
>> > > > >
>> > > > > I read on the sourceforge wiki:
>> > > > >
>> > > > > XMLPipeDB has a modular architecture with three components that
>> may be used separately or together. XSD-to-DB reads an XSD (XML Schema
>> Definition) and automatically generates an SQL schema, Java classes, and
>> Hibernate mappings. XMLPipeDB Utilities provides functionality for
>> configuring the database, importing data, and performing queries. GenMAPP
>> Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible
>> Gene Databases based on data from UniProt and Gene Ontology (GO).
>> > > > >
>> > > > > So I should probably start with the XMLPipeDB Utilities which are
>> where? I don't see any in the basic distribution or are they not standalone
>> and called from the command line?
>> > > > >
>> > > > > Thanks!
>> > > > >
>> > > > > Richard
>> > > >
>> > > >
>> > >
>> > >
>> > >
>> ------------------------------------------------------------------------------
>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio
>> XE:
>> > > Pinpoint memory and threading errors before they happen.
>> > > Find and fix more than 250 security defects in the development cycle.
>> > > Locate bottlenecks in serial and parallel code that limit performance.
>> > > http://p.sf.net/sfu/intel-dev2devfeb
>> > > _______________________________________________
>> > > xmlpipedb-developer mailing list
>> > > xml...@li...
>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>> > >
>> >
>> >
>> >
>>
>>
>>
>>
>
>