Re: [XMLPipeDB-developer] 499 - PROBLEM - M tuberculosis xml tag importation

So here is the appropriate code snippet from
MycobacteriumTuberculosisUniProtSpeciesProfile.java:

    public TableManager getSystemTableManagerCustomizations(TableManager tableManager,
            TableManager primarySystemTableManager, Date version)
            throws SQLException, InvalidParameterException {
        // Build the base query; we only use "ordered locus" and we only want
        // IDs that begin with "Rv."
        PreparedStatement ps = ConnectionManager.getRelationalDBConnection().prepareStatement(
            "SELECT value, type " +
            "FROM genenametype INNER JOIN entrytype_genetype " +
            "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " +
            "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?");
        ResultSet result;
        for (Row row : primarySystemTableManager.getRows()) {
            ps.setInt(1, Integer.parseInt(row.getValue("UID")));
            result = ps.executeQuery();

            // We actually want to keep the case where multiple ordered locus
            // names appear.
            while (result.next()) {
                // We want this name to appear in the OrderedLocusNames
                // system table.
                for (String id : result.getString("value").split("/")) {
                    tableManager.submit("OrderedLocusNames", QueryType.insert, new String[][] {
                        { "ID", id },
                        { "Species", "|" + getSpeciesName() + "|" },
                        { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) },
                        { "UID", row.getValue("UID") }
                    });
                }
            }
        }

----------------------------------------------------------------------
So now we want to build the base query so that it uses both "ordered locus"
and "orf", and we still only want IDs that begin with "Rv".

I know there are more comprehensive ways to search for gene IDs by matching
the gene ID prefix, but like 'Rv%' has worked thus far; we just need to tell
it to search for the XML tag type orf in addition to ordered locus.

Here is a stab at it:

"WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and
entrytype_gene_hjid = ? "

This part of our class came right as the server went down, and I can't seem
to find my submission for the week 6 assignment.

Is it possible to have two different types in the same query, or should we
write a separate query for the orf tag?
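
For what it's worth, a rough sketch of how both types might fit in one query. As far as I can tell, a single row has only one type, so joining the two tests with "and" can never match; "or" (or the standard SQL "IN" list) is probably what is needed. The table and column names are taken from the snippet above; the class name, helper methods, and sample rows are made up for illustration, and the in-memory check only stands in for the WHERE clause so it can be tried without a database.

```java
import java.util.ArrayList;
import java.util.List;

class OrderedLocusOrOrfQuery {

    // Same base query as the species profile, but the type test accepts either
    // tag type. Equivalent form: (type = 'ordered locus' OR type = 'orf').
    static final String QUERY =
        "SELECT value, type " +
        "FROM genenametype INNER JOIN entrytype_genetype " +
        "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " +
        "WHERE type IN ('ordered locus', 'orf') " +
        "AND value LIKE 'Rv%' AND entrytype_gene_hjid = ?";

    // Hypothetical sample rows, each { type, value }; not real data.
    static final String[][] SAMPLE_ROWS = {
        { "ordered locus", "Rv0001" },
        { "orf", "Rv0002a" },
        { "primary", "dnaA" },
        { "orf", "MT0003" }
    };

    // In-memory stand-in for: type IN ('ordered locus', 'orf') AND value LIKE 'Rv%'
    static boolean matchesEither(String type, String value) {
        return ("ordered locus".equals(type) || "orf".equals(type))
            && value.startsWith("Rv");
    }

    // In-memory stand-in for the draft: type = 'ordered locus' AND type = 'orf' ...
    // One row cannot have both types, so this is false for every row.
    static boolean matchesBoth(String type, String value) {
        return "ordered locus".equals(type) && "orf".equals(type)
            && value.startsWith("Rv");
    }

    // Collect the values of the rows that pass the chosen test.
    static List<String> filter(String[][] rows, boolean useEither) {
        List<String> ids = new ArrayList<>();
        for (String[] row : rows) {
            boolean keep = useEither
                ? matchesEither(row[0], row[1])
                : matchesBoth(row[0], row[1]);
            if (keep) {
                ids.add(row[1]);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(filter(SAMPLE_ROWS, true));   // rows kept by the IN/OR form
        System.out.println(filter(SAMPLE_ROWS, false));  // rows kept by the AND draft
    }
}
```

If this holds up, only the WHERE line of the existing prepareStatement call would change, and a separate query for the orf tag would not be needed.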

Richard

On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote:

> thanks and will do as directed.
>
> About the last paragraph of my previous email: is there a way for
> programming code in an email to hold its format, similar to how you can
> post code on forum pages?
>
> <code>
> blah
> blah
> blah
> </code>
>
> thanks!
>
> Richard
>
> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <do...@lm...> wrote:
>
>
>> Greetings,
>>
>> Actually, gmbuilder.properties is for the TallyEngine only.  When dealing
>> with .gdb exports, look *only* at the SpeciesProfile class.  So, to find
>> those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile
>> code, that needs to be changed.
>>
>> Your take on how gmbuilder.properties is used, however, is understandable.
>>  It makes sense to assume that the TallyEngine code *and* the ID export code
>> are based on the same characterization of the needed IDs.  This replication
>> is originally a historical artifact: SpeciesProfile was done first, and then
>> TallyEngine was done later by another student.
>>
>> However, there are other factors beyond history that sort of necessitate
>> this duplication of desired IDs: (skip the two bullets below if you'd rather
>> cut to the chase of the work to be done, and discuss design issues later)
>>
>> - The actual XML import code is a black box: this is the "canned" JAXB
>> library actually in action, and not our code at all.  Plus, the XML import
>> code really does not filter (nor should it), since the goal of the
>> XML->relational database step is to fully capture the XML data in the
>> relational database.  So, XML count is necessarily separated from XML
>> import.
>>
>> - The notion of a declarative mechanism for extracting IDs from the
>> relational database (which is what gmbuilder.properties/TallyEngine uses) is
>> interesting, but at the same time there is value in the arbitrary
>> computation that can be done with Java (case in point: export two versions
>> of an ID, with and without periods).  This is not to say that it is
>> impossible to do this declaratively, but let's just say that the procedural
>> approach exists here and now, and a declarative approach will need more
>> thought.
>>
>> These, and other factors, are good thoughts to hold onto and would be
>> worthy of a good meeting discussion sometime, but bottom line for now:
>> modifying the export behavior is a matter of editing the *SpeciesProfile*
>> Java code, and not the gmbuilder.properties file.  Turn your attention to
>> that code.
>>
>> Now, as to annotating your code...I'd just put in code comments  :)  Or
>> did you mean something else by tagging code in e-mail?
>>
>>  John David N. Dionisio, PhD
>> Associate Professor, Computer Science
>> Loyola Marymount University
>>
>>
>>
>>
>>  On Feb 21, 2011, at 12:38 AM, Richard Brous wrote:
>>
>> > also, how do I tag code in email so it holds its formatting? I tried a
>> > few suggestions I found on the web, but they aren't holding formatting or
>> > I'm just doing it wrong ;-D
>> >
>> > Richard
>> >
>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <rbr...@gm...> wrote:
>> > OK, have some updates and some suggestions:
>> >
>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene testing
>> > report. We verified that XML match does indeed find 4066 unique matches, 75
>> > of which are not in the gdb and need to be.
>> >
>> > Dr. Dahlquist informed me that she was the one who completed the gene db
>> > testing report, not a previous student of BIO367, and had already verified
>> > which genes were missing and where they were to be found. I had (mistakenly)
>> > assumed that since a student had performed the gene database testing I had
>> > to redo all of the verification.
>> >
>> > So that said, of the 75 genes missing, 69 need to be included and 6
>> > excluded.
>> > Per the gene db testing report: "69 of them have an "a", "b", or "d"
>> > suffix. They are all found in the ORF tag and need to be included in the
>> > gdb."
>> >
>> > To solve this we need to add additional search criteria into the M.
>> > tuberculosis section in gmbuilder.properties below:
>> > # Mycobacterium tuberculosis
>> > mycobacteriumtuberculosis_level_amount=1
>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>> > mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>> >
>> > SOLUTIONS:
>> >
>> > 1. So am I correct in my understanding that the second line is the query
>> > used by TallyEngine to read the XML file? If so, then this is the issue we
>> > need to table for the moment until we get the gdb verified and re-released.
>> > We will revisit this to discover not only why it is reporting incorrectly
>> > but also why it has added a second row of Ordered Locus on the TallyEngine
>> > results page.
>> >
>> > 2. The third line is the SQL query used by postgres during the export
>> > from XML to gdb. To find and get the ORF tagged genes, could we not add the
>> > following lines and change the count in the first line:
>> >
>> >
>> > # Mycobacterium tuberculosis
>> > mycobacteriumtuberculosis_level_amount=2
>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf
>> > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus';
>> > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf';
>> > mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus
>> >
>> >
>> > ----------------------------------------------------------------------
>> >
>> > Of course these queries would have to be manually verified prior to making
>> > these changes, but this seems like we are moving in the right direction.
>> >
>> > Richard
>> >
>> >
>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote:
>> > Just got done reading previous email and understand the change in
>> > priority.
>> >
>> > Will work on the missing IDs for now and shelve the TallyEngine issue
>> > for the moment.
>> >
>> > Also great about a more formalized weekly meeting. I was going to
>> > suggest it myself as it has been slow going so far; maybe I'm a bit too
>> > independent in this independent study class =D
>> >
>> > Will dig further into the missing IDs later tonight and during the day
>> > tomorrow and report back.
>> >
>> > Richard
>> >
>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <do...@lm...> wrote:
>> > Hi Rich,
>> >
>> > No problem.  The pertinent line you're referring to, for XML, is this,
>> > right above the line you copied:
>> >
>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> >
>> > The slash-separated section is the "path" of XML tags leading to the
>> > element of interest; then, after the ampersand, is a name/value pair for the
>> > desired attribute to count.  Note that there is no hint of a *content*-based
>> > filter (nor is there the capability for one, as far as I can tell in the
>> > code).  By "content," I mean that we can't specify filters based on what's
>> > *between* the tags.  We can only go as far as filter by attribute value,
>> > e.g., type="ordered locus".
>> >
>> > But anyway, as mentioned in the earlier e-mail, let's have the missing
>> > IDs in the .gdb take precedence for now.  Please take a look at the
>> > tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how
>> > the ID output can be customized, then let me know if you have any questions
>> > or need to confirm anything.
>> >
>> > John David N. Dionisio, PhD
>> > Associate Professor, Computer Science
>> > Loyola Marymount University
>> >
>> >
>> >
>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote:
>> >
>> > > Sorry, been slammed with a programming assignment that kept needing
>> > > continued iteration, and it has been all-consuming until last night. But I
>> > > did get a chance to work with your comments and review the code again with
>> > > a different mindset.
>> > >
>> > > Yes, I examined the gmbuilder.properties file (the query is also in
>> > > MycobacteriumTuberculosisUniProtSpeciesProfile, which I mentioned in a
>> > > previous email) but I don't think I see what you mean regarding the XML
>> > > count.
>> > >
>> > > I understood that
>> > >
>> > >        mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>> > >
>> > > was the db query, but I don't see which is the XML count... or do they
>> > > share the same query, and you are saying that the XML count doesn't
>> > > recognize and therefore cannot use the 'Rv%' parameter?
>> > >
>> > > Richard
>> > >
>> > >
>> > >
>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <do...@lm...> wrote:
>> > > Hi Rich,
>> > >
>> > > Sorry for the delay.  Had some distractions coming into the weekend.
>> > >
>> > > You've looked at the code; have you looked at gmbuilder.properties?
>> > > (I may have mentioned it a few e-mails ago, just as you were starting to
>> > > dig into this)
>> > >
>> > > On the copy I have, the M. tuberculosis block looks like this
>> > > (indentation is mine to set it apart):
>> > >
>> > >        # Mycobacterium tuberculosis
>> > >        mycobacteriumtuberculosis_level_amount=1
>> > >        mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
>> > >        mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
>> > >        mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>> > >
>> > > There, I think, is the rub.  Notice that the XML count does not filter
>> > > on Rv%.  The SQL query does.
>> > >
>> > > Unfortunately, I don't think the TallyEngine can include selective
>> > > filtering in the XML counts.  If selective filtering on XML is necessary,
>> > > then I think we're looking at a new functionality for you to implement
>> > > (or, if this throws things off too much, this may have to be noted
>> > > somewhere, that the XML vs. database counts may be off because the database
>> > > count is doing some text-based filtering but the XML count does not).
>> > >
>> > > What does xmlpipedb-match say?  That will at least tell you whether
>> > > the 'Rv%' count is indeed correct.
>> > >
>> > > John David N. Dionisio, PhD
>> > > Associate Professor, Computer Science
>> > > Loyola Marymount University
>> > >
>> > >
>> > >
>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote:
>> > >
>> > > > OK here is what I was able to put together from the past few hours
>> of code review:
>> > > >
>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java:
>> > > > -reveals that after the 2 System table modifications are made adding
>> species name and link, a PreparedStatement is instantiated which builds and
>> calls the base query.
>> > > >
>> > > > -The base query called is: ("SELECT value, type " + "FROM
>> genenametype INNER JOIN entrytype_genetype " +
>> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type
>> = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?")
>> > > >
>> > > > -So its looking in 'ordered locus' table/column for any tuple that
>> starts with Rv (followed by any substring) and entrytype_gene_hjid = ? .
>> > > > The 'like' comparator and % usage are clear with the 'type'
>> entrytype_gene_hjid = ?
>> > > >
>> > > > -To me it seems the query makes sense so the problem is likely
>> elsewhere.
>> > > >
>> > > > GenMappBuilder.java:
>> > > > -I found method doTallies() at code line 895 which:
>> > > >   - Instantiates a Configuration called hibernateConfiguration and
>> > > >     assigns to it the current hibernate configuration
>> > > >   - Validates database settings by analyzing hibernateConfiguration
>> > > >   - Instantiates a CriterionList for uniprot and assigns to it
>> > > >     TallyType.UNIPROT
>> > > >   - Instantiates a CriterionList for go and assigns to it TallyType.GO
>> > > >   - Determines if both xml files exist
>> > > >   - Then getTallyResultsXML and getTallyResultsDatabase are run on both
>> > > >     xml files and their respective CriterionList
>> > > >   - Results are then formatted for display in a table.
>> > > >
>> > > > -So TallyType is an enum, which means that its values are the only valid
>> > > > datatypes which TallyEngine accepts... good to know...
>> > > >
>> > > > -Based on the screen shot of Tally Engine it would seem that both
>> > > > getTallyResultsXML() and getTallyResultsDatabase() are returning
>> > > > incorrectly. Likely due to both using an incorrect query (as we previously
>> > > > supposed). But where are the queries?... the more I dig the more I think
>> > > > they are in the criteria all the work is done against.
>> > > >
>> > > > continuing the review:
>> > > > getTallyResultsXML() calls Tally Engine instance method
>> > > > getXmlFileCounts(xmlFile)
>> > > > getTallyResultsDatabase() calls Tally Engine instance method
>> > > > getDbcounts(new QueryEngine(hibernateConfiguration))
>> > > > Both of these instance methods originate from TallyEngine.java...
>> > > >
>> > > > TallyEngine.java:
>> > > >
>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a
>> > > > digester then processes against criteria... but this quickly becomes
>> > > > confusing and is hard to follow
>> > > >
>> > > > getDbcounts() then starts a db session and executes a query, but then
>> > > > I also get a bit lost with my limited db knowledge.
>> > > >
>> > > > ----------------------------------------------------------------------
>> > > >
>> > > > OVERALL I think I'm getting closer to the issues but I still feel as
>> > > > if I'm missing some understanding to proceed further. Can you pass along
>> > > > some of that Dondi insight and steer me in the right direction? =D
>> > > >
>> > > > -DB Tally - Not having taken databases yet certainly is limiting my
>> > > > ability to determine where the "criteria" are being set and how they are
>> > > > followed during session activities. Also is the query we have been looking
>> > > > for this whole time in the criteria or someplace else?
>> > > >
>> > > > -XML Tally - again is the query contained within the criteria that
>> > > > digestXmlFile() uses to parse?
>> > > >
>> > > > Richard
>> > > >
>> > > >
>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...> wrote:
>> > > > Right, schema issues are unlikely.  Most count discrepancies like
>> > > > this that I've seen have boiled down to forming the right query.  Then,
>> > > > knowing the right query (in both XML and SQL), it's a matter of making sure
>> > > > that TallyEngine asks that same query.
>> > > >
>> > > > John David N. Dionisio, PhD
>> > > > Associate Professor, Computer Science
>> > > > Loyola Marymount University
>> > > >
>> > > >
>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote:
>> > > >
>> > > > > OK, so based on your approach:
>> > > > >
>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match and the
>> > > > > SQL queries needed for the respective results, as you requested.
>> > > > >
>> > > > > I was also thinking I may need to review the schema from XML into
>> > > > > postgres, but the issue isn't likely a schema error. The error most
>> > > > > likely lies in how xmlpipedbutils queries the data from the XML source
>> > > > > and writes what it returns to the tables?
>> > > > >
>> > > > > 2. I'll review the code: trace the entrance of the tally engine in the
>> > > > > gmbuilder code, then follow it through xmlpipedbutils.
>> > > > >
>> > > > > Richard
>> > > > >
>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <do...@lm...> wrote:
>> > > > > Just wanted to confirm (since I wasn't sure in the first e-mail):
>> > > > > the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils in
>> > > > > SourceForge's Subversion repo.
>> > > > >
>> > > > > John David N. Dionisio, PhD
>> > > > > Associate Professor, Computer Science
>> > > > > Loyola Marymount University
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote:
>> > > > >
>> > > > > > Hi Dondi,
>> > > > > >
>> > > > > > So I'm at the point in working with M tuberculosis that I was
>> > > > > > able to exactly reproduce Dr. Dahlquist's problematic TallyEngine results.
>> > > > > >
>> > > > > > gmb2b60 Results
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > Now the proverbial question: what next to solve the Ordered
>> > > > > > Locus import/count issue?
>> > > > > >
>> > > > > > **********************************************
>> > > > > > Here is my thought process:
>> > > > > >
>> > > > > > Step 1: How does the import process work at the high level?
>> > > > > > (obviously correct me if I'm wrong)
>> > > > > >
>> > > > > > I believe that basically as each XML tag is read, it is placed
>> > > > > > in the proper Postgres table(s) based on some criteria. There is also
>> > > > > > likely some sort of check that each individual tag is in valid XML
>> > > > > > format, unless we don't care at this stage (and care at export), or
>> > > > > > maybe the parser just skips over and goes on to the next one.
>> > > > > >
>> > > > > > Step 2: What could be the problem?
>> > > > > >
>> > > > > > Either -
>> > > > > > a. XML tags are being parsed incorrectly (ignored/skipped)?
>> > > > > > b. Decision criteria of which table they should be added to?
>> > > > > >
>> > > > > > **********************************************
>> > > > > >
>> > > > > > I read on the sourceforge wiki:
>> > > > > >
>> > > > > > XMLPipeDB has a modular architecture with three components that
>> > > > > > may be used separately or together. XSD-to-DB reads an XSD (XML Schema
>> > > > > > Definition) and automatically generates an SQL schema, Java classes, and
>> > > > > > Hibernate mappings. XMLPipeDB Utilities provides functionality for
>> > > > > > configuring the database, importing data, and performing queries. GenMAPP
>> > > > > > Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible
>> > > > > > Gene Databases based on data from UniProt and Gene Ontology (GO).
>> > > > > >
>> > > > > > So I should probably start with the XMLPipeDB Utilities, which
>> > > > > > are where? I don't see any in the basic distribution, or are they not
>> > > > > > standalone and called from the command line?
>> > > > > >
>> > > > > > Thanks!
>> > > > > >
>> > > > > > Richard
>> > > > >
>> ------------------------------------------------------------------------------
>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel
>> Studio XE:
>> > > > Pinpoint memory and threading errors before they happen.
>> > > > Find and fix more than 250 security defects in the development
>> cycle.
>> > > > Locate bottlenecks in serial and parallel code that limit
>> performance.
>> > > > http://p.sf.net/sfu/intel-dev2devfeb
>> > > > _______________________________________________
>> > > > xmlpipedb-developer mailing list
>> > > > xml...@li...
>> > > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer