Re: [XMLPipeDB-developer] 499 - PROBLEM - M tuberculosis xml tag importation


Also, how do I tag code in email so it holds its formatting? I tried a few
suggestions I found on the web, but they aren't holding the formatting, or I'm
just doing it wrong ;-D

Richard

On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote:

> OK, have some updates and some suggestions:
>
> On Friday Dr. Dahlquist and I sat down and reviewed the gene testing
> report. We verified that XML match does indeed find 4066 unique matches - 75
> of which are not in the gdb and need to be.
>
> Dr. Dahlquist informed me that she was the one who completed the gene db
> testing report, not a previous student of BIO367 and had already verified
> which genes were missing and where they were to be found. I had (mistakenly)
> assumed that since a student had performed the gene database testing I had
> to redo all of the verification.
>
> So that said, of the 75 genes missing - 69 need to be included and 6
> excluded.
> Per the gene db testing report: "69 of them have an "a", "b", or "d"
> suffix. They are all found in the ORF tag and need to be included in the
> gdb."
>
> To solve this we need to add additional search criteria into the M.
> tuberculosis section in gmbuilder.properties below:
>
> # Mycobacterium tuberculosis
>
> mycobacteriumtuberculosis_level_amount=1
>
> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>
> mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>
> mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>
> SOLUTIONS:
>
> 1. So am I correct in my understanding that the second line is the query
> used by TallyEngine to read the XML file? If so, then this is the issue we
> need to table for the moment until we get the gdb verified and re-released.
> We will revisit this to discover not only why it is reporting incorrectly
> but also why it has added a second row of Ordered Locus on the TallyEngine
> results page.
>
> 2. The third line is the SQL query used by postgres during the export from
> XML to gdb. To find and get the ORF tagged genes could we not add the
> following lines and change the count in the first line:
>
>
>
> # Mycobacterium tuberculosis
>
> mycobacteriumtuberculosis_level_amount=2
>
> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>
> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf
>
> mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus';
>
> mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf';
>
> mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>
> mycobacteriumtuberculosis_table_name_level1=Ordered Locus
>
> ----------------------------------------------------------------------------------------------------------------------------
>
> Of course these queries would have to be manually verified prior to making
> these changes, but this seems like we are moving in the right direction.
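One way to do that manual verification is to run the current and proposed count queries side by side against a small table. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the PostgreSQL import database, models only the two columns the queries touch, and every row is made up. The upper-case 'ORF' spelling is an assumption based on how the UniProt XML schema appears to spell that gene name type; if the value is stored as it appears in the XML, a case-sensitive comparison against 'orf' would match nothing, so this is worth checking against the real data.

```python
import sqlite3

# Stand-in for the PostgreSQL import database. Only the two columns used by
# the count queries are modeled, and all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genenametype (value TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO genenametype (value, type) VALUES (?, ?)",
    [
        ("Rv0001", "ordered locus"),
        ("Rv0002", "ordered locus"),
        ("MT0003", "ordered locus"),  # hypothetical ordered locus without the Rv prefix
        ("Rv0004a", "ORF"),           # hypothetical suffixed ID stored under the ORF type
        ("dnaA", "primary"),
    ],
)

QUERIES = {
    "current level0": "select count(*) from genenametype "
                      "where type = 'ordered locus' and value like 'Rv%'",
    "proposed level0": "select count(*) from genenametype "
                       "where type = 'ordered locus'",
    "proposed level1": "select count(*) from genenametype where type = 'orf'",
    # Variant with the upper-case spelling (an assumption, see the note above).
    "level1 with 'ORF'": "select count(*) from genenametype where type = 'ORF'",
}

counts = {label: conn.execute(sql).fetchone()[0] for label, sql in QUERIES.items()}

for label, count in counts.items():
    print(f"{label}: {count}")
```

On this made-up data, dropping the 'Rv%' filter raises the level0 count from 2 to 3, and the lower-case 'orf' query counts 0 rows while the 'ORF' variant counts 1. Note that SQLite's LIKE is case-insensitive for ASCII by default whereas PostgreSQL's is case-sensitive, so the real check should be run in PostgreSQL.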
>
> Richard
>
>
> On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote:
>
>> Just got done reading previous email and understand the change in
>> priority.
>>
>> Will work on the missing IDs for now and shelve the TallyEngine
>> issue for the moment.
>>
>> Also great about a more formalized weekly meeting. I was going to suggest
>> it myself, as it has been slow going so far; maybe I'm a bit too
>> independent in this independent study class =D
>>
>> Will dig further into the missing IDs later tonight and during the day
>> tomorrow and report back.
>>
>> Richard
>>
>>   On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote:
>>
>>> Hi Rich,
>>>
>>> No problem.  The pertinent line you're referring to, for XML, is this,
>>> right above the line you copied:
>>>
>>>
>>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>>>
>>> The slash-separated section is the "path" of XML tags leading to the
>>> element of interest; then, after the ampersand, is a name/value pair for the
>>> desired attribute to count.  Note that there is no hint of a *content*-based
>>> filter (nor is there the capability for one, as far as I can tell in the
>>> code).  By "content," I mean that we can't specify filters based on what's
>>> *between* the tags.  We can only go as far as filter by attribute value,
>>> e.g., type="ordered locus".
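The criterion format described above (a slash-separated tag path, then an attribute name and value after the ampersands) can be illustrated with a short sketch. This is not TallyEngine's actual parser; it is a minimal Python reading of the description, and the function names and the sample XML are hypothetical. Real UniProt XML also declares a namespace, which this sample omits.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample shaped like the path in the criterion.
SAMPLE_XML = """
<uniprot>
  <entry>
    <gene>
      <name type="primary">dnaA</name>
      <name type="ordered locus">Rv0001</name>
    </gene>
  </entry>
  <entry>
    <gene>
      <name type="ordered locus">MT0003</name>
      <name type="ORF">Rv0004a</name>
    </gene>
  </entry>
</uniprot>
"""


def parse_criterion(criterion):
    """Split 'tag/path&attribute&value' into (list of tags, attribute, value)."""
    path, attr_name, attr_value = criterion.split("&")
    return path.split("/"), attr_name, attr_value


def count_matches(xml_text, criterion):
    """Count elements on the path whose attribute equals the given value."""
    tags, attr_name, attr_value = parse_criterion(criterion)
    root = ET.fromstring(xml_text.strip())
    if root.tag != tags[0]:
        return 0
    relative_path = "/".join(tags[1:])  # path below the root element
    return sum(
        1
        for element in root.findall(relative_path)
        if element.get(attr_name) == attr_value
    )


ordered_locus_count = count_matches(
    SAMPLE_XML, "uniprot/entry/gene/name&type&ordered locus"
)
print(ordered_locus_count)
```

Only the attribute is compared, so the made-up MT0003 entry is counted alongside Rv0001: nothing in the criterion looks at the text between the tags.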
>>>
>>> But anyway, as mentioned in the earlier e-mail, let's have the missing
>>> IDs in the .gdb take precedence for now.  Please take a look at the
>>> tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how
>>> the ID output can be customized, then let me know if you have any questions
>>> or need to confirm anything.
>>>
>>> John David N. Dionisio, PhD
>>> Associate Professor, Computer Science
>>> Loyola Marymount University
>>>
>>>
>>>
>>>  On Feb 17, 2011, at 3:04 PM, Richard Brous wrote:
>>>
>>> > Sorry been slammed with a programming assignment that kept needing
>>> continued iteration and it has been all consuming until last night. But I
>>> did get a chance to work with your comments and review the code again with a
>>> different mind set.
>>> >
>>> > Yes, I examined the gmbuilder.properties file ( the query is also in
>>> the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a
>>> previous email ) but I don't think I see what you mean regarding the XML
>>> count.
>>> >
>>> > I understood that: mycobacteriumtuberculosis_query_level0=select
>>> count(*) from genenametype where type = 'ordered locus' and value like
>>> 'Rv%';  was the db query but don't see which is the XML count... or do they
>>> share the same query and you are saying that XML count doesn't recognize and
>>> therefore cannot use the 'Rv%' parameter?
>>> >
>>> > Richard
>>> >
>>> >
>>> >
>>> > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <
>>> do...@lm...> wrote:
>>> > Hi Rich,
>>> >
>>> > Sorry for the delay.  Had some distractions coming into the weekend.
>>> >
>>> > You've looked at the code; have you looked at gmbuilder.properties?  (I
>>> may have mentioned it a few e-mails ago, just as you were starting to dig
>>> into this)
>>> >
>>> > On the copy I have, the M. tuberculosis block looks like this
>>> (indentation is mine to set it apart):
>>> >
>>> >        # Mycobacterium tuberculosis
>>> >        mycobacteriumtuberculosis_level_amount=1
>>> >
>>> >        mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>>> >
>>> >        mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>>> >
>>> >        mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>>> >
>>> > There, I think, is the rub.  Notice that the XML count does not filter
>>> on 'Rv%'.  The SQL query does.
>>> >
>>> > Unfortunately, I don't think the TallyEngine can include selective
>>> filtering in the XML counts.  If selective filtering on XML is
>>> necessary, then I think we're looking at new functionality for you to
>>> implement (or, if this throws things off too much, it may have to be noted
>>> somewhere that the XML vs. database counts may be off because the database
>>> count is doing some text-based filtering but the XML count is not).
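To make the mismatch concrete, here is a minimal sketch of what an attribute-only count versus an attribute-plus-content count could look like. It is written in Python purely for illustration and is not how TallyEngine works; the function name, the value_prefix parameter, and the sample XML are all hypothetical, and the sample omits the namespace that real UniProt XML carries.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample data.
SAMPLE_XML = """
<uniprot>
  <entry><gene><name type="ordered locus">Rv0001</name></gene></entry>
  <entry><gene><name type="ordered locus">Rv0002c</name></gene></entry>
  <entry><gene><name type="ordered locus">MT0003</name></gene></entry>
</uniprot>
"""


def count_gene_names(xml_text, type_value, value_prefix=None):
    """Count <name> elements by type attribute, optionally also by text prefix."""
    root = ET.fromstring(xml_text.strip())
    total = 0
    for name in root.iter("name"):
        if name.get("type") != type_value:
            continue
        text = (name.text or "").strip()
        # The prefix test is the content-based filter the XML count lacks today.
        if value_prefix is None or text.startswith(value_prefix):
            total += 1
    return total


# Attribute-only count, comparable to the XML criterion.
xml_style_count = count_gene_names(SAMPLE_XML, "ordered locus")
# Attribute plus prefix, comparable to the SQL query with value like 'Rv%'.
sql_style_count = count_gene_names(SAMPLE_XML, "ordered locus", value_prefix="Rv")

print(xml_style_count, sql_style_count)
```

On this made-up data the two counts are 3 and 2, which is the kind of gap to expect whenever some ordered locus names do not start with Rv.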
>>> >
>>> > What does xmlpipedb-match say?  That will at least tell you whether the
>>> 'Rv%' count is indeed correct.
>>> >
>>> > John David N. Dionisio, PhD
>>> > Associate Professor, Computer Science
>>> > Loyola Marymount University
>>> >
>>> >
>>> >
>>> > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote:
>>> >
>>> > > OK here is what I was able to put together from the past few hours of
>>> code review:
>>> > >
>>> > > MycobacteriumTuberculosisUniProtSpeciesProfile.java:
>>> > > -reveals that after the 2 System table modifications are made adding
>>> species name and link, a PreparedStatement is instantiated which builds and
>>> calls the base query.
>>> > >
>>> > > -The base query called is: ("SELECT value, type " + "FROM
>>> genenametype INNER JOIN entrytype_genetype " +
>>> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type
>>> = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?")
>>> > >
>>> > > -So it's looking in the 'ordered locus' table/column for any tuple that
>>> starts with Rv (followed by any substring) and has entrytype_gene_hjid = ? .
>>> > > The 'like' comparator and % usage are clear with the 'type'
>>> entrytype_gene_hjid = ?
>>> > >
>>> > > -To me it seems the query makes sense so the problem is likely
>>> elsewhere.
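For readers less familiar with SQL, the base query can be tried out on a toy schema. The sketch below uses an in-memory SQLite database instead of PostgreSQL; which table owns which column is an assumption inferred from the column names in the query, and all rows are invented. It shows that 'ordered locus' is a value of the type column rather than a table, and that the ? placeholder is filled in per gene when the statement is executed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout: genenametype rows point at entrytype_genetype rows, which in
# turn carry the gene id that the '?' placeholder is bound to.
conn.executescript("""
    CREATE TABLE entrytype_genetype (
        hjid INTEGER PRIMARY KEY,
        entrytype_gene_hjid INTEGER
    );
    CREATE TABLE genenametype (
        value TEXT,
        type TEXT,
        entrytype_genetype_name_hjid INTEGER
    );
    INSERT INTO entrytype_genetype VALUES (10, 1), (20, 2);
    INSERT INTO genenametype VALUES
        ('dnaA',   'primary',       10),
        ('Rv0001', 'ordered locus', 10),
        ('MT0003', 'ordered locus', 20);
""")

# The base query quoted from the species profile, reflowed onto several lines.
BASE_QUERY = (
    "SELECT value, type "
    "FROM genenametype INNER JOIN entrytype_genetype "
    "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) "
    "WHERE type = 'ordered locus' AND value LIKE 'Rv%' "
    "AND entrytype_gene_hjid = ?"
)

# Bind the placeholder once per (hypothetical) gene id.
rows_for_gene_1 = conn.execute(BASE_QUERY, (1,)).fetchall()
rows_for_gene_2 = conn.execute(BASE_QUERY, (2,)).fetchall()

print(rows_for_gene_1)
print(rows_for_gene_2)
```

Gene 1 yields its Rv0001 row, while gene 2 yields nothing because its only ordered locus name does not start with Rv. Any ID stored under a different type, such as ORF, would be skipped in the same way.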
>>> > >
>>> > > GenMappBuilder.java:
>>> > > -I found method doTallies() at code line 895 which:
>>> > > Instantiates a Configuration called hibernateConfiguration and
>>> assigns to it the current hibernate configuration
>>> > > Validates database settings by analyzing hibernateConfiguration
>>> > > Instantiates a CriterionList for uniprot and assigns to it
>>> TallyType.UNIPROT
>>> > > Instantiates a CriterionList for go and assigns to it TallyType.GO
>>> > > Determines if both xml files exist
>>> > > Then getTallyResultsXML and getTallyResultsDatabase are run on both
>>> xml files and their respective CriterionList
>>> > > Results are then formatted for display in a table.
>>> > >
>>> > > -So TallyType is an enum, which means UNIPROT and GO are the only valid
>>> types that TallyEngine accepts... good to know ...
>>> > >
>>> > > -Based on the screen shot of Tally Engine it would seem that both
>>> getTallyResultsXML() and getTallyResultsDatabase() are returning
>>> incorrect results, likely due to both using an incorrect query (as we
>>> previously supposed). But where are the queries?... the more I dig the more
>>> I think they are in the criteria that all the work is done against.
>>> > >
>>> > > continuing the review:
>>> > > getTallyResultsXML() calls the TallyEngine instance method
>>> getXmlFileCounts(xmlFile)
>>> > > getTallyResultsDatabase() calls the TallyEngine instance method
>>> getDbcounts(new QueryEngine(hibernateConfiguration))
>>> > > Both of these instance methods come from TallyEngine.java...
>>> > >
>>> > > TallyEngine.java:
>>> > >
>>> > > getXmlFileCounts() calls digestXmlFile() which instantiates a
>>> digester then processes against criteria... but this quickly becomes
>>> confusing and is hard to follow
>>> > >
>>> > > getDbcounts() then starts a db session and executes a query but then
>>> I also get a bit lost with my limited db knowledge.
>>> > >
>>> > >
>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>> > >
>>> > > OVERALL I think I'm getting closer to the issues but I still feel as
>>> if I'm missing some understanding to proceed further. Can you pass along
>>> some of that Dondi insight and steer me in the right direction? =D
>>> > >
>>> > > -DB Tally - Not having taken databases yet certainly is limiting my
>>> ability to determine where the "criteria" are being set and how they are
>>> followed during session activities. Also is the query we have been looking
>>> for this whole time in the criteria or someplace else?
>>> > >
>>> > > -XML Tally - again is the query contained within the criteria that
>>> digestXmlFile() uses to parse?
>>> > >
>>> > > Richard
>>> > >
>>> > >
>>> > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <
>>> do...@lm...> wrote:
>>> > > Right, schema issues are unlikely.  Most count discrepancies like
>>> this that I've seen have boiled down to forming the right query.  Then,
>>> knowing the right query (in both XML and SQL), it's a matter of making sure
>>> that TallyEngine asks that same query.
>>> > >
>>> > > John David N. Dionisio, PhD
>>> > > Associate Professor, Computer Science
>>> > > Loyola Marymount University
>>> > >
>>> > >
>>> > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote:
>>> > >
>>> > > > OK, so based on your approach:
>>> > > >
>>> > > > 1. I'll start with reviewing the queries for xmlpipedb-match and
>>> sql queries needed for the respective results as you requested.
>>> > > >
>>> > > > I was also thinking I may need to review the schema from xml into
>>> postgres but the issue isn't likely a schema error. The error most likely
>>> lies in how xmlpipedbutils queries the data from xml source and writes to
>>> the tables what it returns?
>>> > > >
>>> > > > 2. I'll review the code: trace the entrance of tally engine in the
>>> gmbuilder code then follow it through the xmlpipedbutils.
>>> > > >
>>> > > > Richard
>>> > > >
>>> > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <
>>> do...@lm...> wrote:
>>> > > > Just wanted to confirm (since I wasn't sure in the first e-mail)
>>> --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in
>>> SourceForge's Subversion repo.
>>> > > >
>>> > > > John David N. Dionisio, PhD
>>> > > > Associate Professor, Computer Science
>>> > > > Loyola Marymount University
>>> > > >
>>> > > >
>>> > > >
>>> > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote:
>>> > > >
>>> > > > > Hi Dondi,
>>> > > > >
>>> > > > > So I'm at the point in working with M tuberculosis that I was
>>> able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results.
>>> > > > >
>>> > > > > gmb2b60 Results
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > Now the proverbial question - What next to solve the Ordered
>>> Locus import/count issue?
>>> > > > >
>>> > > > > **********************************************
>>> > > > > Here is my thought process:
>>> > > > >
>>> > > > > Step 1: How does the import process work at the high level?
>>> (obviously correct me if I'm wrong)
>>> > > > >
>>> > > > > I believe that basically as each XML tag is read, it is placed in
>>> the proper Postgres table(s) based on some criteria. There is also likely
>>> some sort of check that each individual tag is in valid XML format unless we
>>> don't care at this stage (care at export) or maybe the parser just skips
>>> over and goes on to the next one.
>>> > > > >
>>> > > > > Step 2: What could be the problem?
>>> > > > >
>>> > > > > Either -
>>> > > > > a. XML tags are being parsed incorrectly (ignored/skipped)?
>>> > > > > b. Decision criteria of which table they should be added to?
>>> > > > >
>>> > > > > **********************************************
>>> > > > >
>>> > > > > I read on the sourceforge wiki:
>>> > > > >
>>> > > > > XMLPipeDB has a modular architecture with three components that
>>> may be used separately or together. XSD-to-DB reads an XSD (XML Schema
>>> Definition) and automatically generates an SQL schema, Java classes, and
>>> Hibernate mappings. XMLPipeDB Utilities provides functionality for
>>> configuring the database, importing data, and performing queries. GenMAPP
>>> Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible
>>> Gene Databases based on data from UniProt and Gene Ontology (GO).
>>> > > > >
>>> > > > > So I should probably start with the XMLPipeDB Utilities which are
>>> where? I don't see any in the basic distribution or are they not standalone
>>> and called from the command line?
>>> > > > >
>>> > > > > Thanks!
>>> > > > >
>>> > > > > Richard
>>> > > >
>>> > > >
>>> > >
>>> > >
>>> > >
>>> ------------------------------------------------------------------------------
>>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio
>>> XE:
>>> > > Pinpoint memory and threading errors before they happen.
>>> > > Find and fix more than 250 security defects in the development cycle.
>>> > > Locate bottlenecks in serial and parallel code that limit
>>> performance.
>>> > > http://p.sf.net/sfu/intel-dev2devfeb
>>> > > _______________________________________________
>>> > > xmlpipedb-developer mailing list
>>> > > xml...@li...
>>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>>> > >
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>>
>>
>>
>