Re: [XMLPipeDB-developer] 499 - PROBLEM - M tuberculosis xml tag importation

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Great!

Now the thumbs up from Dondi and we can get cracking. I'll dig into sql
syntax for negation of a specific tuple while we wait for his response.

Richard

On Wed, Feb 23, 2011 at 11:37 AM, Kam Dahlquist <kda...@lm...> wrote:

> Hi,
>
> That sounds good to me.
>
> Best,
> Dr. D
>
>
> At 11:16 AM 2/23/2011, you wrote:
>
> Hmm... possible that the method to split up records has been globally
> implemented for export from Postgres?
>
> Since the records are automatically splitting when exported from postgres
> to gdb file, all we need to do is exclude Rv3346/55c 'ORF' from "select
> count (*) from genenametype where (type = 'ordered locus' or type = 'ORF')
> and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';" and we should be set.
>
> That seem correct Dondi?
>
> Richard
>
> On Wed, Feb 23, 2011 at 11:04 AM, Kam Dahlquist <kda...@lm...>
> wrote:
>  Hi,
>
> I checked these out.  They all should be separated and kept as separate
> OrderedLocusNames records.  In fact, the gdb already has them separated into
> individual records, so it is only the tally for Postgres that is off, they
> were correctly exported into the gdb.
>
> Best,
> Dr. D
>
>
> At 07:31 PM 2/22/2011, Richard Brous wrote:
>
> OK I found the culprits:
>
> There are 3 ordered locus gene ID's that have slashes. As was previously
> discussed it seems XML Match is counting each one twice which would inflate
> the count by 3.
>
> Rv2561/Rv2562       'ordered locus'
> Rv2880c/Rv2879c   'ordered locus'
> Rv3021c/Rv3022c   'ordered locus'
>
> And as was previously noted, Rv3346/55c   'ORF' should be excluded.
>
> This will bring the both XML and sql db queries counts in sync at 4057.
>
> By my math:
>
> ------------------------------------------------------------------------------------------------------------
> XML tags read by Match:
> 4066 unique matches - 6 genes excluded by Dr. D - 3 duplicates caused by
> slashes = total 4057 genes
>
> ------------------------------------------------------------------------------------------------------------
> sql query of the db tables:
> 3988 ordered locus + 69 ORF = total 4057 genes
> ------------------------------------------------------------------------------------------------------------
>
>
> TO MOVE FORWARD:
>
> 1. We need to decide how to address the slash ID's (should we keep and
> split into separate tuples or should they be excluded)
>
> 2. Adjust the queries to reflect what is needed and test with raw sql.
>
> 3. Update the queries on
> MycobacteriumTuberculosisUniProtSpeciesProfile.java
>
> 4. Export a new Mtb gene database for testing.
>
>
> Richard
>
>
>
> On Tue, Feb 22, 2011 at 5:12 PM, Kam Dahlquist <kda...@lm...> wrote:
> Hi,
> I don't know why the numbers are off.  I think the only way to find out is
> to get the results of the Postgres query and line it up next to the match
> results and see what is different.  If you can send me the lists of IDs from
> the Postgres query and the match results, I can check them.
> Best, Dr. D
>
> At 04:57 PM 2/22/2011, you wrote:
>
> Here is where I am on the numbers: ------------------------------------------------------------------------------------------------------------
> XML tags read by Match: 4066 unique matches - 6 genes excluded by Dr. D =
> total 4060 genes ------------------------------------------------------------------------------------------------------------
> sql query of the db tables: 3988 ordered locus + 69 ORF = total 4057 genes
> ------------------------------------------------------------------------------------------------------------
> So by my count we are off by 3 genes   Maybe XML Match is counting slashed
> genes as 2 separate genes?   IE. so if it encountered 3 slashed genes it
> would in effect be counting those 3 as 6?   Richard
> On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist <kda...@lm...> wrote:
> Hi, I compared this list of gene IDs with the list I had on the Testing
> Report on the wiki and found the following. Your list has 70 IDs.  69 of
> them are identical to what is in my list of IDs in the Testing report. Indeed,
> the only difference is the one with the slash.  "Rv3346/55c"  I looked this
> ID up at UniProt.org as Rv3346.  It appears in the record for UniProt ID
> O50384.  For that UniProt record, the ID referred to in the OrderedLocus tag
> is Rv3355c which does appear in the gdb already.  Looking up Rv3346 on
> Tuberculist and the Stanford TB database, Rv3346 is not a real gene ID.  So
> this entire record with the slash in it needs to be excluded and all the
> rest of them need to be captured.
> I think then, that the numbers should add up correctly, is this true? Best,
> Dr. D
> At 01:09 PM 2/22/2011, Richard Brous wrote:
>
> attachement now included... On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous
> <rbr...@gm...> wrote: Here is an export of the genes found using:
> select * from genenametype where type = 'ORF' and value ~
> '[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also attached as a csv file.
> 647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2
> 5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4
> 647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2
> 647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2
> 647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2
> 647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2
> 211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3
> 648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2
> 648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2
> 648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2
> 648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2
> 648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2
> 649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2
> 649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2
> 649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2
> 649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2
> 650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2
> 174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2
> 174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3
> 312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2
> 331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2
> 445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2
> 621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1
> 622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1
> 622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1
> 622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2
> 622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2
> 623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2
> 623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2
> 623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1
> 623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1
> 624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1
> 624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1
> 625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2
> 625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2
> 626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1
> 626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1
> 626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1
> 626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1
> 627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1
> 627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1
> 627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1
> 627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1
> 627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1
> 627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1
> 627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1
> 628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1
> 629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1
> 629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1
> 629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1
> 630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1
> 630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2
> 631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2
> 632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1
> 632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1
> 633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1
> 633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1
> 645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2
> 645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2
> 645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2
> 645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2
> 645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2
> 646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2
> 646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2
> 646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2
> 646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2
> 646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2
> 646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2
> 647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2
> 647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2 *****The
> item of note I see is the gene with the slash separating gene id's which
> refer to the same gene.   Richard   On Mon, Feb 21, 2011 at 11:11 PM,
> Richard Brous <rbr...@gm...> wrote: Understood.   I'll check in with
> Dr. D in the afternoon tomorrow and discuss.   Richard
>
> On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <do...@lm...>
> wrote: Hi Rich, Addressing the release business first, let's put it this
> way: if the remaining loose ends can be addressed by tomorrow, we can
> probably wait until then.  If unexpected snags are encountered, then it
> would be worthwhile to release whatever you have. With that in mind,
> considering that you pretty much know the patterns of the IDs that are
> needed, I think it will only take a little digital forensic work now to
> figure out exactly which IDs are still needed.  Once you know what those
> are, you should: 1. Find where they are in the XML file. 2. Knowing the
> XML location, find the corresponding table in the relational database (table
> names are generally derived from tag/element names). 3. Knowing the table
> in the database, write or extend the SpeciesProfile query to retrieve that
> data. For the ID that must *not* be included, again it's a matter of
> tracking down what this ID is.  Knowing this straggler, you can then consult
> with Dr. Dahlquist if this ID is truly a unique one-off, or is
> representative of a pattern that we'll want to exclude.  Either way, this ID
> can be omitted by using "not" or "<>" or possibly "not like" or "not ~"
> (check PostgreSQL where clause syntax to see where the negation can be
> applied). John David N. Dionisio, PhD Associate Professor, Computer
> Science Loyola Marymount University
> On Feb 22, 2011, at 1:37 AM, Richard Brous wrote: > actually i had a typo
> (emailing from desktop system but testing on my laptop... typed correctly
> here but wrong in pgadmin) but the results make much more sense now. > > >
> select count (*) from genenametype where (type = 'ordered locus' or type =
> 'ORF') and value like 'Rv%'; > returns 4058 > > select count (*) from
> genenametype where (type = 'ordered locus' or type = 'ORF') and value ~
> '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > returns 4058 > >
> --------------------------------------------------------------------------------------------------------------------------
> > > > Continuing forward - > > The testing report says that 4066 unique
> matches exist in XML but 6 of them were eliminated by Dr. D leaving the
> desired number at 4060. > > So now we are only 2 genes short with the
> query returning 4058... which is also (conveniently) the sum of the two
> separate queries of 'ordered locus' and 'ORF' respectively. > >  But
> recall that Dr. D said that only 69 genes of the missing 75 were tagged
> 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected. Adding
> that into missing genes puts us 3 short... > > Should I make the changes
> to the code and export a gdb so that analysis can be done or wait until we
> work this through further? > > Richard > > > > On Mon, Feb 21, 2011 at
> 10:04 PM, John David N. Dionisio <do...@lm...> wrote: > Hi Rich, > > The
> second form should have worked actually.  What exactly was the error? > >
> John David N. Dionisio, PhD > Associate Professor, Computer Science >
> Loyola Marymount University > > > > On Feb 22, 2011, at 1:01 AM, Richard
> Brous wrote: > > > hmm not taking parenthesis where I thought they should
> go... syntax error > > > > select count (*) from genenametype where type =
> ('ordered locus' or  'ORF') and value like 'Rv%'; > > also tried > >
> select count (*) from genenametype where (type = 'ordered locus' or type =
> 'ORF') and value like 'Rv%'; > > > > > > > > > > > > On Mon, Feb 21, 2011
> at 9:40 PM, Richard Brous <rbr...@gm...> wrote: > > ah yes... i see
> it... > > > > > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio
> <do...@lm...> wrote: > > Watch your parentheses: "and" has greater
> precedence than "or" :) > > > > > > John David N. Dionisio, PhD > >
> Associate Professor, Computer Science > > Loyola Marymount University > > >
> > > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...>
> wrote: > > > >> OK, so here are my query results from raw SQL: > >> > >>
> 1. using: like 'Rv%' > >> > >> select count (*) from genenametype where
> type = 'ordered locus' and value like 'Rv%'; > >> returns 3988 > >> > >>
> select count (*) from genenametype where type = 'ORF' and value like 'Rv%'; >
> >> returns 70 > >> > >> select count (*) from genenametype where type =
> 'ordered locus' or type = 'ORF' and value like 'Rv%'; > >> returns 7011 >
> >> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*' >
> >> > >> select count (*) from genenametype where type = 'ordered locus'
> and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 3988 > >> > >>
> select count (*) from genenametype where type = 'ordered locus' or type =
> 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 7011 > >> >
> >> select count (*) from genenametype where type = 'ORF' and value ~
> '[Rr][Vv][0-9][0-9][0-9][0-9]*'; > >> returns 70 > >> > >> Conclusions: >
> >> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69
> genes were were looking for plus one more (maybe the count for missing genes
> is off by 1?). > >> > >> 2. Combining the two types in a single query did
> not produce the results that I expected (7011? - how did that happen????) so
> this is likely not our solution... unless of course the query syntax isn't
> actually doing what I think it is... > >> > >> 3. I would think the best
> course of action is to serialy run two separate queries to capture all the
> required genes, then removing the one unneeded gene if its truly not wanted.
> > >> > >> What do you think? > >> > >> Richard > >> > >> > >> On Mon, Feb
> 21, 2011 at 5:17 PM, John David N. Dionisio <do...@lm...> wrote: > >> I
> don't recall the exact details of the missing 69, but if your query
> successfully returns them in raw SQL, then this is worth a try.  You can
> integrate into the same query as long as the same columns are returned,
> which is the case here AFAIK, so go ahead and extend the existing query. >
> >> > >> > >> John David N. Dionisio, PhD > >> Associate Professor,
> Computer Science > >> Loyola Marymount University > >> > >> On Feb 21,
> 2011, at 6:56 PM, Richard Brous <rbr...@gm...> wrote: > >> > >>> So
> here is the appropriate code snippet from
> MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> public > >>> >
> >>> TableManager getSystemTableManagerCustomizations(TableManager
> tableManager, TableManager primarySystemTableManager, Date version) throws
> SQLException, InvalidParameterException { > >>> > >>> // Build the base
> query; we only use "ordered locus" and we only want > >>> > >>> // IDs
> that begin with "Rv." > >>> PreparedStatement ps =
> ConnectionManager.getRelationalDBConnection().prepareStatement( > >>> >
> >>> "SELECT value, type " + > >>> > >>> "FROM genenametype INNER JOIN
> entrytype_genetype " + > >>> > >>> "ON (entrytype_genetype_name_hjid =
> entrytype_genetype.hjid) " + > >>> > >>> "WHERE type = 'ordered locus' and
> value like 'Rv%' and entrytype_gene_hjid = ?"); > >>> ResultSet result; >
> >>> > >>> > >>> > >>> for (Row row : primarySystemTableManager.getRows())
> { > >>> ps.setInt(1, Integer.parseInt(row.getValue( > >>> > >>> "UID"))); >
> >>> result = ps.executeQuery(); > >>> > >>> > >>> > >>> // We actually
> want to keep the case where multiple ordered locus > >>> > >>> // names
> appear. > >>> > >>> while (result.next()) { > >>> > >>> // We want this
> name to appear in the OrderedLocusNames > >>> > >>> // system table. > >>>
> > >>> for (String id : result.getString("value").split("/")) { > >>>
> tableManager.submit( > >>> > >>> "OrderedLocusNames", QueryType.insert,
> new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" },
> { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) }, {
> "UID", row.getValue("UID") } }); > >>> } > >>> > >>> } > >>> > >>> } > >>>
> > >>>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > >>> So now we want to build the base query which uses "ordered locus" and
> "orf" and we only want IDs that begin with "Rv". > >>> > >>> I know there
> are more comprehensive ways to search for gene ID's by matching gene ID
> prefix but "like Rv%" seemed to work thus far, we just need to tell it to
> search for XML tag type orf in addition to ordered locus. > >>> > >>>
> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%' and
> entrytype_gene_hjid = ? " > >>> > >>> Here is a stab at it.... This part
> of our class was right as the server went down and my submission for week 6
> assignment I can't seem to find. > >>> > >>> Is it possible to have two
> different types in the same query or should we rewrite a separate query for
> the orf tag? > >>> > >>> Richard > >>> > >>> > >>> > >>> On Sun, Feb 20,
> 2011 at 10:21 PM, Richard Brous <rbr...@gm...> wrote: > >>> > >>>
> thanks and will do as directed. > >>> > >>> My previous, last paragraph
> comment - A way for programming code in email holding its format in a mail
> message similarly to how you can post code on forum pages? > >>> > >>>
> <code> > >>> blah > >>> blah > >>> blah > >>> </code> > >>> > >>> thanks! >
> >>> > >>> Richard > >>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David
> N. Dionisio <do...@lm...> wrote: > >>> > >>> Greetings, > >>> > >>>
> Actually, gmbuilder.properties is for the TallyEngine only.  When dealing
> with .gdb exports, look *only* at the SpeciesProfile class.  So, to find
> those 69 IDs, it is the SpeciesProfile code, and *only* the SpeciesProfile
> code, that needs to be changed. > >>> > >>> Your take on how
> gmbuilder.properties is used, however, is understandable.  It makes sense to
> assume that the TallyEngine code *and* the ID export code are based on the
> same characterization of the needed IDs.  This replication is originally a
> historical artifact: SpeciesProfile was done first, and then TallyEngine was
> done later by another student. > >>> > >>> However, there are other
> factors beyond history that sort of necessitate this duplication of desired
> IDs: (skip the two bullets below if you'd rather cut to the chase of the
> work to be done, and discuss design issues later) > >>> > >>> - The actual
> XML import code is a black box: this is the "canned" JAXB library actually
> in action, and not our code at all.  Plus, the XML import code really does
> not filter (nor should it), since the goal of the XML->relational database
> step is to fully capture the XML data in the relational database.  So, XML
> count is necessarily separated from XML import. > >>> > >>> - The notion
> of a declarative mechanism for extracting IDs from the relational database
> (which is what gmbuilder.properties/TallyEngine uses) is interesting, but at
> the same time there is value in the arbitrary computation that can be done
> with Java (case in point: export two versions of an ID, with and without
> periods).  This is not to say that it is impossible to do this
> declaratively, but let's just say that the procedural approach exists here
> and now, and a declarative approach will need more thought. > >>> > >>>
> These, and other factors, are good thoughts to hold onto and would be worthy
> of a good meeting discussion sometime, but bottom line for now: modifying
> the export behavior is a matter of editing the *SpeciesProfile* Java code,
> and not the gmbuilder.properties file.  Turn your attention to that code. >
> >>> > >>> Now, as to annotating your code...I'd just put in code comments
> :)  Or did you mean something else by tagging code in e-mail? > >>> > >>>
> John David N. Dionisio, PhD > >>> Associate Professor, Computer Science >
> >>> Loyola Marymount University > >>> > >>> > >>> > >>> > >>> On Feb 21,
> 2011, at 12:38 AM, Richard Brous wrote: > >>> > >>> > also, how do I tag
> code in email so it holds its formatting? I tried a few suggestions I found
> on the web but they aren't holding formatting or i'm just doing it wrong ;-D
> > >>> > > >>> > Richard > >>> > > >>> > On Sun, Feb 20, 2011 at 9:35 PM,
> Richard Brous <rbr...@gm...> wrote: > >>> > OK, have some updates
> and some suggestions: > >>> > > >>> > On Friday Dr. Dahlquist and I sat
> down and reviewed the gene testing report. We verified that XML match does
> indeed find 4066 unique matches - 75 of which are not in the gdb and need to
> be. > >>> > > >>> > Dr. Dahlquist informed me that she was the one who
> completed the gene db testing report, not a previous student of BIO367 and
> had already verified which genes were missing and where they were to be
> found. I had (mistakenly) assumed that since a student had performed the
> gene database testing I had to redo all of the verification. > >>> > > >>>
> > So that said, of the 75 genes missing - 69 need to be included and 6
> excluded. > >>> > Per the gene db testing report: "69 of them have an "a",
> "b", or "d" suffix. They are all found in the ORF tag and need to be
> included in the gdb." > >>> > > >>> > To solve this we need to add
> additional search criteria into the M. tuberculosis section in
> gmbuilder.properties below: > >>> > # Mycobacterium tuberculosis > >>> > >
> >>> > mycobacteriumtuberculosis_level_amount= > >>> > > >>> > 1 > >>> > >
> >>> > mycobacteriumtuberculosis_element_level0= > >>> > > >>> >
> uniprot/entry/gene/name&type&ordered locus > >>> > > >>> >
> mycobacteriumtuberculosis_query_level0= > >>> > > >>> > select count(*)
> from genenametype where type = 'ordered locus' and value like 'Rv%'; > >>>
> > > >>> > mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> >
> Ordered Locus > >>> > SOLUTIONS: > >>> > > >>> > 1. So am i correct in my
> understanding that the second line is the query used by TallyEngine to read
> the XML file? If so then this is the issue we need to table for the moment
> until we get the gbd verified and re-released. We will revisit this to
> discover why it is not only reporting incorrectly but also why its added a
> second row of Ordered Locus on the TallyEngine results page. > >>> > > >>>
> > 2. The third line is the SQL query used by postgres during the export from
> XML to gdb. To find and get the ORF tagged genes could we not add the
> following lines and change the count in the first line: > >>> > > >>> > >
> >>> > # Mycobacterium tuberculosis > >>> > > >>> >
> mycobacteriumtuberculosis_level_amount=2 > >>> > > >>> > > >>> >
> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered
> locus > >>> >
> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf >
> >>> > > >>> > > >>> > mycobacteriumtuberculosis_query_level0= > >>> > >
> >>> > select count(*) from genenametype where type = 'ordered locus'; >
> >>> > mycobacteriumtuberculosis_query_level1=select count(*) from
> genenametype where type = 'orf'; > >>> > > >>> > > >>> >
> mycobacteriumtuberculosis_table_name_level0= > >>> > > >>> > Ordered Locus
> > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus > >>> > >
> >>> >
> ----------------------------------------------------------------------------------------------------------------------------
> > >>> > > >>> > Of course these queries would have be manually verified
> prior to making these changes but this seems like we are moving in the right
> direction. > >>> > > >>> > Richard > >>> > > >>> > > >>> > On Thu, Feb 17,
> 2011 at 7:47 PM, Richard Brous <rbr...@gm...> wrote: > >>> > Just
> got done reading previous email and understand the change in priority. >
> >>> > > >>> > Will work on the missing ID's for now and shelve the the
> TalleyEngine issue for the moment. > >>> > > >>> > Also great about a more
> formalized weekly meeting. I was going to suggest it myself as it has been
> slow going so far as maybe i'm a bit too independent in this independent
> study class =D > >>> > > >>> > Will dig further into the missing ID's
> later tonight and during day tomorrow and report back. > >>> > > >>> >
> Richard > >>> > > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N.
> Dionisio <do...@lm...> wrote: > >>> > Hi Rich, > >>> > > >>> > No
> problem.  The pertinent line you're referring to, for XML, is this, right
> above the line you copied: > >>> > > >>> >
> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered
> locus > >>> > > >>> > The slash-separated section is the "path" of XML
> tags leading to the element of interest; then, after the ampersand, is a
> name/value pair for the desired attribute to count.  Note that there is no
> hint of a *content*-based filter (nor is there the capability for one, as
> far as I can tell in the code).  By "content," I mean that we can't specify
> filters based on what's *between* the tags.  We can only go as far as filter
> by attribute value, e.g., type="ordered locus". > >>> > > >>> > But
> anyway, as mentioned in the earlier e-mail, let's have the missing IDs in
> the .gdb take precedence for now.  Please take a look at the tuberculosis,
> A. thaliana, and P. falciparum profiles to get an idea for how the ID output
> can be customized, then let me know if you have any questions or need to
> confirm anything. > >>> > > >>> > John David N. Dionisio, PhD > >>> >
> Associate Professor, Computer Science > >>> > Loyola Marymount University >
> >>> > > >>> > > >>> > > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous
> wrote: > >>> > > >>> > > Sorry been slammed with a programming assignment
> that kept needing continued iteration and it has been all consuming until
> last night. But I did get a chance to work with your comments and review the
> code again with a different mind set. > >>> > > > >>> > > Yes, I examined
> the gmbuilder.properties file ( the query is also in the
> MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned in a
> previous email ) but I don't think I see what you mean regarding the XML
> count. > >>> > > > >>> > > I understood that:
> mycobacteriumtuberculosis_query_level0=select count(*) from genenametype
> where type = 'ordered locus' and value like 'Rv%';  was the db query but
> don't see which is the XML count... or do they share the same query and you
> are saying that XML count doesn't recognize and therefore cannot use the
> 'Rv%' parameter? > >>> > > > >>> > > Richard > >>> > > > >>> > > > >>> > >
> > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <
> do...@lm...> wrote: > >>> > > Hi Rich, > >>> > > > >>> > > Sorry for the
> delay.  Had some distractions coming into the weekend. > >>> > > > >>> > >
> You've looked at the code; have you looked at gmbuilder.properties?  (I may
> have mentioned it a few e-mails ago, just as you were starting to dig into
> this) > >>> > > > >>> > > On the copy I have, the M. tuberculosis block
> looks like this (indentation is mine to set it apart): > >>> > > > >>> >
> >        # Mycobacterium tuberculosis > >>> > >
> mycobacteriumtuberculosis_level_amount=1 > >>> > > > >>> > >
> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered
> locus > >>> > > > >>> > >
> mycobacteriumtuberculosis_query_level0=select count(*) from genenametype
> where type = 'ordered locus' and value like 'Rv%'; > >>> > > > >>> >
> >        mycobacteriumtuberculosis_table_name_level0=Ordered Locus > >>> >
> > > >>> > > There, I think, is the rub.  Notice that the XML count does
> not filter on RV%.  The SQL query does. > >>> > > > >>> > > Unfortunately,
> I don't think the TallyEngine can include selective filtering in the XML
> counts.  If the need to do selective filtering on XML is necessary, then I
> think we're looking at a new functionality for you to implement (or, if this
> throws things off too much, this may have to be noted somewhere, that the
> XML vs. database counts may be off because the database count is doing some
> text-based filtering but the XML count does not). > >>> > > > >>> > > What
> does xmlpipedb-match say?  That will at least tell you whether the 'RV%'
> count is indeed correct. > >>> > > > >>> > > John David N. Dionisio, PhD >
> >>> > > Associate Professor, Computer Science > >>> > > Loyola Marymount
> University > >>> > > > >>> > > > >>> > > > >>> > > On Feb 11, 2011, at
> 4:52 PM, Richard Brous wrote: > >>> > > > >>> > > > OK here is what I was
> able to put together from the past few hours of code review: > >>> > > > >
> >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java: > >>> > > >
> -reveals that after the 2 System table modifications are made adding species
> name and link, a PreparedStatement is instantiated which builds and calls
> the base query. > >>> > > > > >>> > > > -The base query called is:
> ("SELECT value, type " + "FROM genenametype INNER JOIN entrytype_genetype "
> + "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE
> type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?") >
> >>> > > > > >>> > > > -So its looking in 'ordered locus' table/column for
> any tuple that starts with Rv (followed by any substring) and
> entrytype_gene_hjid = ? . > >>> > > > The 'like' comparator and % usage
> are clear with the 'type' entrytype_gene_hjid = ? > >>> > > > > >>> > > >
> -To me it seems the query makes sense so the problem is likely elsewhere. >
> >>> > > > > >>> > > > GenMappBuilder.java: > >>> > > > -I found method
> doTallies() at code line 895 which: > >>> > > > Instantiates a
> Configuration called hibernateConfiguration and assigns to it the current
> hibernate configuration > >>> > > > Validates database settings by
> analyzing hibernateConfiguration > >>> > > > Instantiates a CriterionList
> for uniprot and assigns to it TallyType.UNIPROT > >>> > > > Instantiates a
> CriterionList for go and assigns to it TallyType.GO > >>> > > > Determines
> if both xml files exist > >>> > > > Then getTallyResultsXML and
> getTallyResultsDatabase are run on both xml files and their respective
> CriterionList > >>> > > > Results are then formatted for display in a
> table. > >>> > > > > >>> > > > -So enum TallyType which means that they
> are the only valid datatypes which TallyEngine accepts... go to know ... >
> >>> > > > > >>> > > > -Based on the screen shot of Tally Engine it would
> seem that both getTallyResultsXML() and getTallyResultsDatabase() are
> incorrectly returning. Likely due to both using an incorrect query (as we
> previously supposed). But where are the queries?... the more I dig the more
> I think they are in the criterial all the work is done against. > >>> > >
> > > >>> > > > continuing the review: > >>> > > > getTallyResultsXML()
> calls Tally Engine instance method getXmlFileCounts(xmlFile) > >>> > > >
> getTallyResultsDatabase() calls Tally Engine instance method getDbcounts(new
> QueryEngine(hibernateConfiguration) > >>> > > > Both of these instanced
> methods originate from TallyEngine.java... > >>> > > > > >>> > > >
> TallyEngine.java: > >>> > > > > >>> > > > getXmlFileCounts() calls
> digestXmlFile() which instantiates a digester then processes against
> criteria... but this quickly becomes confusing and is hard to follow > >>>
> > > > > >>> > > > getDbcounts() then starts a db session and executes a
> query but then I also get a bit lost with my limited db knowledge. > >>> >
> > > > >>> > > >
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > >>> > > > > >>> > > > OVERALL I think I'm getting closer to the issues
> but I still feel as if I'm missing some understanding to proceed further.
> Can you pass along some of that Dondi insight and steer me in the right
> direction? =D > >>> > > > > >>> > > > -DB Tally - Not having taken
> databases yet certainly is limiting my ability determine where the
> "criteria" are being set and how they are followed during session
> activities. Also is the query we have been looking for this whole time in
> the criteria or someplace else? > >>> > > > > >>> > > > -XML Tally - again
> is the query contained within the criteria that digestXmlFile() uses to
> parse? > >>> > > > > >>> > > > Richard > >>> > > > > >>> > > > > >>> > > >
> On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <do...@lm...>
> wrote: > >>> > > > Right, schema issues are unlikely.  Most count
> discrepancies like this that I've seen have boiled down to forming the right
> query.  Then, knowing the right query (in both XML and SQL), it's a matter
> of making sure that TallyEngine asks that same query. > >>> > > > > >>> >
> > > John David N. Dionisio, PhD > >>> > > > Associate Professor, Computer
> Science > >>> > > > Loyola Marymount University > >>> > > > > >>> > > > >
> >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote: > >>> > > > >
> >>> > > > > OK, so based on your approach: > >>> > > > > > >>> > > > > 1.
> I'll start with reviewing the queries for xmlpipedb-match and sql queries
> needed for the respective results as you requested. > >>> > > > > > >>> >
> > > > I was also thinking I may need to review the schema from xml into
> postgres but the issue isn't likely a schema error. The error most likely
> lies in how xmlpipedbutils queries the data from xml source and writes to
> the tables what it returns? > >>> > > > > > >>> > > > > 2. I'll review the
> code: trace the entrance of tally engine in the gmbuilder code then follow
> it through the xmlpipedbutils. > >>> > > > > > >>> > > > > Richard > >>> >
> > > > > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N.
> Dionisio <do...@lm...> wrote: > >>> > > > > Just wanted to confirm
> (since I wasn't sure in the first e-mail) --- the XMLPipeDB Utilities source
> code is in trunk/xmlpipedbutils in SourceForge's Subversion repo. > >>> >
> > > > > >>> > > > > John David N. Dionisio, PhD > >>> > > > > Associate
> Professor, Computer Science > >>> > > > > Loyola Marymount University >
> >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > On Feb 5, 2011, at
> 10:02 AM, Richard Brous wrote: > >>> > > > > > >>> > > > > > Hi Dondi, >
> >>> > > > > > > >>> > > > > > So I'm at the point in working with M
> tuberculosis that I was able to exactly reproduce Dr. Dahlquist's
> problematic TallyEngine results. > >>> > > > > > > >>> > > > > > gmb2b60
> Results > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > >
> Now the proverbial question - What next to solve the Ordered Locus
> import/count issue? > >>> > > > > > > >>> > > > > >
> ********************************************** > >>> > > > > > Here is my
> thought process: > >>> > > > > > > >>> > > > > > Step 1: How does the
> import process work at the high level? (obviously correct me if I'm wrong) >
> >>> > > > > > > >>> > > > > > I believe that basically as each XML tag is
> read, it is placed in the proper Postgres table(s) based on some criteria.
> There is also likely some sort of check that each individual tag is in valid
> XML format unless we don't care at this stage (care at export) or maybe the
> parser just skips over and goes on to the next . > >>> > > > > > > >>> > >
> > > > Step 2: What could be the problem? > >>> > > > > > > >>> > > > > >
> Either - > >>> > > > > > a. XML tags are being parsed incorrectly
> (ignored/skipped)? > >>> > > > > > b. Decision criteria of which table
> they should be added to? > >>> > > > > > > >>> > > > > >
> ********************************************** > >>> > > > > > > >>> > > >
> > > I read on the sourceforge wiki: > >>> > > > > > > >>> > > > > >
> XMLPipeDB has a modular architecture with three components that may be used
> separately or together. XSD-to-DB reads an XSD (XML Schema Definition) and
> automatically generates an SQL schema, Java classes, and Hibernate mappings.
> XMLPipeDB Utilities provides functionality for configuring the database,
> importing data, and performing queries. GenMAPP Builder is based on the
> XMLPipeDB Utilities and exports GenMAPP-compatible Gene Databases based on
> data from UniProt and Gene Ontology (GO). > >>> > > > > > > >>> > > > > >
> So I should probably start with the XMLPipeDB Utilities which are where? I
> don't see any in the basic distribution or are they not standalone and
> called from the command line? > >>> > > > > > > >>> > > > > > Thanks! >
> >>> > > > > > > >>> > > > > > Richard
>
>
>
>
>
> ------------------------------------------------------------------------------
> Free Software Download: Index, Search & Analyze Logs and other IT data in
> Real-Time with Splunk. Collect, index and harness all the fast moving IT
> data
> generated by your applications, servers and devices whether physical,
> virtual
> or in the cloud. Deliver compliance at lower cost and gain new business
> insights. http://p.sf.net/sfu/splunk-dev2dev
> _______________________________________________
> xmlpipedb-developer mailing list
> xml...@li...
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>
>
>
>
>
> ------------------------------------------------------------------------------
> Free Software Download: Index, Search & Analyze Logs and other IT data in
> Real-Time with Splunk. Collect, index and harness all the fast moving IT
> data
> generated by your applications, servers and devices whether physical,
> virtual
> or in the cloud. Deliver compliance at lower cost and gain new business
> insights. http://p.sf.net/sfu/splunk-dev2dev
> _______________________________________________
> xmlpipedb-developer mailing list
> xml...@li...
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>
>
>
>
>
> ------------------------------------------------------------------------------
> Free Software Download: Index, Search & Analyze Logs and other IT data in
> Real-Time with Splunk. Collect, index and harness all the fast moving IT
> data
> generated by your applications, servers and devices whether physical,
> virtual
> or in the cloud. Deliver compliance at lower cost and gain new business
> insights. http://p.sf.net/sfu/splunk-dev2dev
> _______________________________________________
> xmlpipedb-developer mailing list
> xml...@li...
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>
>