Re: [XMLPipeDB-developer] 499 - PROBLEM - M tuberculosis xml tag importation

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Here is an export of the genes found using: select * from genenametype where
type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'; and also attached
as a csv file.

   647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2
5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4
647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2
647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2
647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2
647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2
211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3
648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2
648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2
648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2
648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2
648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2
649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2
649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2
649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2
649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2
650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2
174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2
174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3
312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2
331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2
445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2
621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1
622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1
622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1
622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2
622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2
623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2
623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2
623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1
623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1
624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1
624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1
625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2
625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2
626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1
626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1
626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1
626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1
627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1
627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1
627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1
627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1
627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1
627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1
627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1
628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1
629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1
629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1
629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1
630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1
630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2
631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2
632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1
632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1
633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1
633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1
645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2
645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2
645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2
645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2
645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2
646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2
646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2
646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2
646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2
646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2
646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2
647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2
647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2

*****The item of note I see is the gene with the slash separating gene
id's which refer to the same gene.

Richard

On Mon, Feb 21, 2011 at 11:11 PM, Richard Brous <rbr...@gm...> wrote:

> Understood.
>
> I'll check in with Dr. D in the afternoon tomorrow and discuss.
>
> Richard
>
>   On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <do...@lm...>wrote:
>
>> Hi Rich,
>>
>> Addressing the release business first, let's put it this way: if the
>> remaining loose ends can be addressed by tomorrow, we can probably wait
>> until then.  If unexpected snags are encountered, then it would be
>> worthwhile to release whatever you have.
>>
>> With that in mind, considering that you pretty much know the patterns of
>> the IDs that are needed, I think it will only take a little digital forensic
>> work now to figure out exactly which IDs are still needed.  Once you know
>> what those are, you should:
>>
>> 1. Find where they are in the XML file.
>> 2. Knowing the XML location, find the corresponding table in the
>> relational database (table names are generally derived from tag/element
>> names).
>> 3. Knowing the table in the database, write or extend the SpeciesProfile
>> query to retrieve that data.
>>
>> For the ID that must *not* be included, again it's a matter of tracking
>> down what this ID is.  Knowing this straggler, you can then consult with Dr.
>> Dahlquist if this ID is truly a unique one-off, or is representative of a
>> pattern that we'll want to exclude.  Either way, this ID can be omitted by
>> using "not" or "<>" or possibly "not like" or "not ~" (check PostgreSQL
>> where clause syntax to see where the negation can be applied).
>>
>> John David N. Dionisio, PhD
>> Associate Professor, Computer Science
>> Loyola Marymount University
>>
>>
>>
>>  On Feb 22, 2011, at 1:37 AM, Richard Brous wrote:
>>
>> > actually i had a typo (emailing from desktop system but testing on my
>> laptop... typed correctly here but wrong in pgadmin) but the results make
>> much more sense now.
>> >
>> >
>> > select count (*) from genenametype where (type = 'ordered locus' or type
>> = 'ORF') and value like 'Rv%';
>> > returns 4058
>> >
>> > select count (*) from genenametype where (type = 'ordered locus' or type
>> = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
>> > returns 4058
>> >
>> >
>> --------------------------------------------------------------------------------------------------------------------------
>> >
>> >
>> > Continuing forward -
>> >
>> > The testing report says that 4066 unique matches exist in XML but 6 of
>> them were eliminated by Dr. D leaving the desired number at 4060.
>> >
>> > So now we are only 2 genes short with the query returning 4058... which
>> is also (conveniently) the sum of the two separate queries of 'ordered
>> locus' and 'ORF' respectively.
>> >
>> >  But recall that Dr. D said that only 69 genes of the missing 75 were
>> tagged 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we expected.
>> Adding that into missing genes puts us 3 short...
>> >
>> > Should I make the changes to the code and export a gdb so that analysis
>> can be done or wait until we work this through further?
>> >
>> > Richard
>> >
>> >
>> >
>> > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio <do...@lm...>
>> wrote:
>> > Hi Rich,
>> >
>> > The second form should have worked actually.  What exactly was the
>> error?
>> >
>> > John David N. Dionisio, PhD
>> > Associate Professor, Computer Science
>> > Loyola Marymount University
>> >
>> >
>> >
>> > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote:
>> >
>> > > hmm not taking parenthesis where I thought they should go... syntax
>> error
>> > >
>> > > select count (*) from genenametype where type = ('ordered locus' or
>>  'ORF') and value like 'Rv%';
>> > > also tried
>> > > select count (*) from genenametype where (type = 'ordered locus' or
>> type = 'ORF') and value like 'Rv%';
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous <rbr...@gm...>
>> wrote:
>> > > ah yes... i see it...
>> > >
>> > >
>> > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio <
>> do...@lm...> wrote:
>> > > Watch your parentheses: "and" has greater precedence than "or" :)
>> > >
>> > >
>> > > John David N. Dionisio, PhD
>> > > Associate Professor, Computer Science
>> > > Loyola Marymount University
>> > >
>> > >
>> > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...>
>> wrote:
>> > >
>> > >> OK, so here are my query results from raw SQL:
>> > >>
>> > >> 1. using: like 'Rv%'
>> > >>
>> > >> select count (*) from genenametype where type = 'ordered locus' and
>> value like 'Rv%';
>> > >> returns 3988
>> > >>
>> > >> select count (*) from genenametype where type = 'ORF' and value like
>> 'Rv%';
>> > >> returns 70
>> > >>
>> > >> select count (*) from genenametype where type = 'ordered locus' or
>> type = 'ORF' and value like 'Rv%';
>> > >> returns 7011
>> > >>
>> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'
>> > >>
>> > >> select count (*) from genenametype where type = 'ordered locus' and
>> value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
>> > >> returns 3988
>> > >>
>> > >> select count (*) from genenametype where type = 'ordered locus' or
>> type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
>> > >> returns 7011
>> > >>
>> > >> select count (*) from genenametype where type = 'ORF' and value ~
>> '[Rr][Vv][0-9][0-9][0-9][0-9]*';
>> > >> returns 70
>> > >>
>> > >> Conclusions:
>> > >>
>> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69
>> genes were were looking for plus one more (maybe the count for missing genes
>> is off by 1?).
>> > >>
>> > >> 2. Combining the two types in a single query did not produce the
>> results that I expected (7011? - how did that happen????) so this is likely
>> not our solution... unless of course the query syntax isn't actually doing
>> what I think it is...
>> > >>
>> > >> 3. I would think the best course of action is to serialy run two
>> separate queries to capture all the required genes, then removing the one
>> unneeded gene if its truly not wanted.
>> > >>
>> > >> What do you think?
>> > >>
>> > >> Richard
>> > >>
>> > >>
>> > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio <
>> do...@lm...> wrote:
>> > >> I don't recall the exact details of the missing 69, but if your query
>> successfully returns them in raw SQL, then this is worth a try.  You can
>> integrate into the same query as long as the same columns are returned,
>> which is the case here AFAIK, so go ahead and extend the existing query.
>> > >>
>> > >>
>> > >> John David N. Dionisio, PhD
>> > >> Associate Professor, Computer Science
>> > >> Loyola Marymount University
>> > >>
>> > >> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...>
>> wrote:
>> > >>
>> > >>> So here is the appropriate code snippet from
>> MycobacteriumTuberculosisUniProtSpeciesProfile.java:
>> > >>> public
>> > >>>
>> > >>> TableManager getSystemTableManagerCustomizations(TableManager
>> tableManager, TableManager primarySystemTableManager, Date version) throws
>> SQLException, InvalidParameterException {
>> > >>>
>> > >>> // Build the base query; we only use "ordered locus" and we only
>> want
>> > >>>
>> > >>> // IDs that begin with "Rv."
>> > >>> PreparedStatement ps =
>> ConnectionManager.getRelationalDBConnection().prepareStatement(
>> > >>>
>> > >>> "SELECT value, type " +
>> > >>>
>> > >>> "FROM genenametype INNER JOIN entrytype_genetype " +
>> > >>>
>> > >>> "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " +
>> > >>>
>> > >>> "WHERE type = 'ordered locus' and value like 'Rv%' and
>> entrytype_gene_hjid = ?");
>> > >>> ResultSet result;
>> > >>>
>> > >>>
>> > >>>
>> > >>> for (Row row : primarySystemTableManager.getRows()) {
>> > >>> ps.setInt(1, Integer.parseInt(row.getValue(
>> > >>>
>> > >>> "UID")));
>> > >>> result = ps.executeQuery();
>> > >>>
>> > >>>
>> > >>>
>> > >>> // We actually want to keep the case where multiple ordered locus
>> > >>>
>> > >>> // names appear.
>> > >>>
>> > >>> while (result.next()) {
>> > >>>
>> > >>> // We want this name to appear in the OrderedLocusNames
>> > >>>
>> > >>> // system table.
>> > >>>
>> > >>> for (String id : result.getString("value").split("/")) {
>> > >>> tableManager.submit(
>> > >>>
>> > >>> "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id
>> }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"",
>> GenMAPPBuilderUtilities.getSystemsDateString(version) }, { "UID",
>> row.getValue("UID") } });
>> > >>> }
>> > >>>
>> > >>> }
>> > >>>
>> > >>> }
>> > >>>
>> > >>>
>>  -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> > >>> So now we want to build the base query which uses "ordered locus"
>> and "orf" and we only want IDs that begin with "Rv".
>> > >>>
>> > >>> I know there are more comprehensive ways to search for gene ID's by
>> matching gene ID prefix but "like Rv%" seemed to work thus far, we just need
>> to tell it to search for XML tag type orf in addition to ordered locus.
>> > >>>
>> > >>> "WHERE type = 'ordered locus' and type = 'orf' and value like 'Rv%'
>> and entrytype_gene_hjid = ? "
>> > >>>
>> > >>> Here is a stab at it.... This part of our class was right as the
>> server went down and my submission for week 6 assignment I can't seem to
>> find.
>> > >>>
>> > >>> Is it possible to have two different types in the same query or
>> should we rewrite a separate query for the orf tag?
>> > >>>
>> > >>> Richard
>> > >>>
>> > >>>
>> > >>>
>> > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <rbr...@gm...>
>> wrote:
>> > >>>
>> > >>> thanks and will do as directed.
>> > >>>
>> > >>> My previous, last paragraph comment - A way for programming code in
>> email holding its format in a mail message similarly to how you can post
>> code on forum pages?
>> > >>>
>> > >>> <code>
>> > >>> blah
>> > >>> blah
>> > >>> blah
>> > >>> </code>
>> > >>>
>> > >>> thanks!
>> > >>>
>> > >>> Richard
>> > >>>
>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <
>> do...@lm...> wrote:
>> > >>>
>> > >>> Greetings,
>> > >>>
>> > >>> Actually, gmbuilder.properties is for the TallyEngine only.  When
>> dealing with .gdb exports, look *only* at the SpeciesProfile class.  So, to
>> find those 69 IDs, it is the SpeciesProfile code, and *only* the
>> SpeciesProfile code, that needs to be changed.
>> > >>>
>> > >>> Your take on how gmbuilder.properties is used, however, is
>> understandable.  It makes sense to assume that the TallyEngine code *and*
>> the ID export code are based on the same characterization of the needed IDs.
>>  This replication is originally a historical artifact: SpeciesProfile was
>> done first, and then TallyEngine was done later by another student.
>> > >>>
>> > >>> However, there are other factors beyond history that sort of
>> necessitate this duplication of desired IDs: (skip the two bullets below if
>> you'd rather cut to the chase of the work to be done, and discuss design
>> issues later)
>> > >>>
>> > >>> - The actual XML import code is a black box: this is the "canned"
>> JAXB library actually in action, and not our code at all.  Plus, the XML
>> import code really does not filter (nor should it), since the goal of the
>> XML->relational database step is to fully capture the XML data in the
>> relational database.  So, XML count is necessarily separated from XML
>> import.
>> > >>>
>> > >>> - The notion of a declarative mechanism for extracting IDs from the
>> relational database (which is what gmbuilder.properties/TallyEngine uses) is
>> interesting, but at the same time there is value in the arbitrary
>> computation that can be done with Java (case in point: export two versions
>> of an ID, with and without periods).  This is not to say that it is
>> impossible to do this declaratively, but let's just say that the procedural
>> approach exists here and now, and a declarative approach will need more
>> thought.
>> > >>>
>> > >>> These, and other factors, are good thoughts to hold onto and would
>> be worthy of a good meeting discussion sometime, but bottom line for now:
>> modifying the export behavior is a matter of editing the *SpeciesProfile*
>> Java code, and not the gmbuilder.properties file.  Turn your attention to
>> that code.
>> > >>>
>> > >>> Now, as to annotating your code...I'd just put in code comments  :)
>>  Or did you mean something else by tagging code in e-mail?
>> > >>>
>> > >>> John David N. Dionisio, PhD
>> > >>> Associate Professor, Computer Science
>> > >>> Loyola Marymount University
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote:
>> > >>>
>> > >>> > also, how do I tag code in email so it holds its formatting? I
>> tried a few suggestions I found on the web but they aren't holding
>> formatting or i'm just doing it wrong ;-D
>> > >>> >
>> > >>> > Richard
>> > >>> >
>> > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <
>> rbr...@gm...> wrote:
>> > >>> > OK, have some updates and some suggestions:
>> > >>> >
>> > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene
>> testing report. We verified that XML match does indeed find 4066 unique
>> matches - 75 of which are not in the gdb and need to be.
>> > >>> >
>> > >>> > Dr. Dahlquist informed me that she was the one who completed the
>> gene db testing report, not a previous student of BIO367 and had already
>> verified which genes were missing and where they were to be found. I had
>> (mistakenly) assumed that since a student had performed the gene database
>> testing I had to redo all of the verification.
>> > >>> >
>> > >>> > So that said, of the 75 genes missing - 69 need to be included and
>> 6 excluded.
>> > >>> > Per the gene db testing report: "69 of them have an "a", "b", or
>> "d" suffix. They are all found in the ORF tag and need to be included in the
>> gdb."
>> > >>> >
>> > >>> > To solve this we need to add additional search criteria into the
>> M. tuberculosis section in gmbuilder.properties below:
>> > >>> > # Mycobacterium tuberculosis
>> > >>> >
>> > >>> > mycobacteriumtuberculosis_level_amount=
>> > >>> >
>> > >>> > 1
>> > >>> >
>> > >>> > mycobacteriumtuberculosis_element_level0=
>> > >>> >
>> > >>> > uniprot/entry/gene/name&type&ordered locus
>> > >>> >
>> > >>> > mycobacteriumtuberculosis_query_level0=
>> > >>> >
>> > >>> > select count(*) from genenametype where type = 'ordered locus' and
>> value like 'Rv%';
>> > >>> >
>> > >>> > mycobacteriumtuberculosis_table_name_level0=
>> > >>> >
>> > >>> > Ordered Locus
>> > >>> > SOLUTIONS:
>> > >>> >
>> > >>> > 1. So am i correct in my understanding that the second line is the
>> query used by TallyEngine to read the XML file? If so then this is the issue
>> we need to table for the moment until we get the gbd verified and
>> re-released. We will revisit this to discover why it is not only reporting
>> incorrectly but also why its added a second row of Ordered Locus on the
>> TallyEngine results page.
>> > >>> >
>> > >>> > 2. The third line is the SQL query used by postgres during the
>> export from XML to gdb. To find and get the ORF tagged genes could we not
>> add the following lines and change the count in the first line:
>> > >>> >
>> > >>> >
>> > >>> > # Mycobacterium tuberculosis
>> > >>> >
>> > >>> > mycobacteriumtuberculosis_level_amount=2
>> > >>> >
>> > >>> >
>> > >>> >
>> mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered
>> locus
>> > >>> >
>> mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf
>> > >>> >
>> > >>> >
>> > >>> > mycobacteriumtuberculosis_query_level0=
>> > >>> >
>> > >>> > select count(*) from genenametype where type = 'ordered locus';
>> > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from
>> genenametype where type = 'orf';
>> > >>> >
>> > >>> >
>> > >>> > mycobacteriumtuberculosis_table_name_level0=
>> > >>> >
>> > >>> > Ordered Locus
>> > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus
>> > >>> >
>> > >>> >
>> ----------------------------------------------------------------------------------------------------------------------------
>> > >>> >
>> > >>> > Of course these queries would have be manually verified prior to
>> making these changes but this seems like we are moving in the right
>> direction.
>> > >>> >
>> > >>> > Richard
>> > >>> >
>> > >>> >
>> > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <
>> rbr...@gm...> wrote:
>> > >>> > Just got done reading previous email and understand the change in
>> priority.
>> > >>> >
>> > >>> > Will work on the missing ID's for now and shelve the the
>> TalleyEngine issue for the moment.
>> > >>> >
>> > >>> > Also great about a more formalized weekly meeting. I was going to
>> suggest it myself as it has been slow going so far as maybe i'm a bit too
>> independent in this independent study class =D
>> > >>> >
>> > >>> > Will dig further into the missing ID's later tonight and during
>> day tomorrow and report back.
>> > >>> >
>> > >>> > Richard
>> > >>> >
>> > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <
>> do...@lm...> wrote:
>> > >>> > Hi Rich,
>> > >>> >
>> > >>> > No problem.  The pertinent line you're referring to, for XML, is
>> this, right above the line you copied:
>> > >>> >
>> > >>> >
>>  mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered
>> locus
>> > >>> >
>> > >>> > The slash-separated section is the "path" of XML tags leading to
>> the element of interest; then, after the ampersand, is a name/value pair for
>> the desired attribute to count.  Note that there is no hint of a
>> *content*-based filter (nor is there the capability for one, as far as I can
>> tell in the code).  By "content," I mean that we can't specify filters based
>> on what's *between* the tags.  We can only go as far as filter by attribute
>> value, e.g., type="ordered locus".
>> > >>> >
>> > >>> > But anyway, as mentioned in the earlier e-mail, let's have the
>> missing IDs in the .gdb take precedence for now.  Please take a look at the
>> tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for how
>> the ID output can be customized, then let me know if you have any questions
>> or need to confirm anything.
>> > >>> >
>> > >>> > John David N. Dionisio, PhD
>> > >>> > Associate Professor, Computer Science
>> > >>> > Loyola Marymount University
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote:
>> > >>> >
>> > >>> > > Sorry been slammed with a programming assignment that kept
>> needing continued iteration and it has been all consuming until last night.
>> But I did get a chance to work with your comments and review the code again
>> with a different mind set.
>> > >>> > >
>> > >>> > > Yes, I examined the gmbuilder.properties file ( the query is
>> also in the MycobacteriumTuberculosisUniProtSpeciesProfile which I mentioned
>> in a previous email ) but I don't think I see what you mean regarding the
>> XML count.
>> > >>> > >
>> > >>> > > I understood that: mycobacteriumtuberculosis_query_level0=select
>> count(*) from genenametype where type = 'ordered locus' and value like
>> 'Rv%';  was the db query but don't see which is the XML count... or do they
>> share the same query and you are saying that XML count doesn't recognize and
>> therefore cannot use the 'Rv%' parameter?
>> > >>> > >
>> > >>> > > Richard
>> > >>> > >
>> > >>> > >
>> > >>> > >
>> > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <
>> do...@lm...> wrote:
>> > >>> > > Hi Rich,
>> > >>> > >
>> > >>> > > Sorry for the delay.  Had some distractions coming into the
>> weekend.
>> > >>> > >
>> > >>> > > You've looked at the code; have you looked at
>> gmbuilder.properties?  (I may have mentioned it a few e-mails ago, just as
>> you were starting to dig into this)
>> > >>> > >
>> > >>> > > On the copy I have, the M. tuberculosis block looks like this
>> (indentation is mine to set it apart):
>> > >>> > >
>> > >>> > >        # Mycobacterium tuberculosis
>> > >>> > >        mycobacteriumtuberculosis_level_amount=1
>> > >>> > >
>> > >>> > >
>>  mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered
>> locus
>> > >>> > >
>> > >>> > >        mycobacteriumtuberculosis_query_level0=select count(*)
>> from genenametype where type = 'ordered locus' and value like 'Rv%';
>> > >>> > >
>> > >>> > >        mycobacteriumtuberculosis_table_name_level0=Ordered Locus
>> > >>> > >
>> > >>> > > There, I think, is the rub.  Notice that the XML count does not
>> filter on RV%.  The SQL query does.
>> > >>> > >
>> > >>> > > Unfortunately, I don't think the TallyEngine can include
>> selective filtering in the XML counts.  If the need to do selective
>> filtering on XML is necessary, then I think we're looking at a new
>> functionality for you to implement (or, if this throws things off too much,
>> this may have to be noted somewhere, that the XML vs. database counts may be
>> off because the database count is doing some text-based filtering but the
>> XML count does not).
>> > >>> > >
>> > >>> > > What does xmlpipedb-match say?  That will at least tell you
>> whether the 'RV%' count is indeed correct.
>> > >>> > >
>> > >>> > > John David N. Dionisio, PhD
>> > >>> > > Associate Professor, Computer Science
>> > >>> > > Loyola Marymount University
>> > >>> > >
>> > >>> > >
>> > >>> > >
>> > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote:
>> > >>> > >
>> > >>> > > > OK here is what I was able to put together from the past few
>> hours of code review:
>> > >>> > > >
>> > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java:
>> > >>> > > > -reveals that after the 2 System table modifications are made
>> adding species name and link, a PreparedStatement is instantiated which
>> builds and calls the base query.
>> > >>> > > >
>> > >>> > > > -The base query called is: ("SELECT value, type " + "FROM
>> genenametype INNER JOIN entrytype_genetype " +
>> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type
>> = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?")
>> > >>> > > >
>> > >>> > > > -So its looking in 'ordered locus' table/column for any tuple
>> that starts with Rv (followed by any substring) and entrytype_gene_hjid = ?
>> .
>> > >>> > > > The 'like' comparator and % usage are clear with the 'type'
>> entrytype_gene_hjid = ?
>> > >>> > > >
>> > >>> > > > -To me it seems the query makes sense so the problem is likely
>> elsewhere.
>> > >>> > > >
>> > >>> > > > GenMappBuilder.java:
>> > >>> > > > -I found method doTallies() at code line 895 which:
>> > >>> > > > Instantiates a Configuration called hibernateConfiguration and
>> assigns to it the current hibernate configuration
>> > >>> > > > Validates database settings by analyzing
>> hibernateConfiguration
>> > >>> > > > Instantiates a CriterionList for uniprot and assigns to it
>> TallyType.UNIPROT
>> > >>> > > > Instantiates a CriterionList for go and assigns to it
>> TallyType.GO
>> > >>> > > > Determines if both xml files exist
>> > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are run on
>> both xml files and their respective CriterionList
>> > >>> > > > Results are then formatted for display in a table.
>> > >>> > > >
>> > >>> > > > -So enum TallyType which means that they are the only valid
>> datatypes which TallyEngine accepts... go to know ...
>> > >>> > > >
>> > >>> > > > -Based on the screen shot of Tally Engine it would seem that
>> both getTallyResultsXML() and getTallyResultsDatabase() are incorrectly
>> returning. Likely due to both using an incorrect query (as we previously
>> supposed). But where are the queries?... the more I dig the more I think
>> they are in the criterial all the work is done against.
>> > >>> > > >
>> > >>> > > > continuing the review:
>> > >>> > > > getTallyResultsXML() calls Tally Engine instance method
>> getXmlFileCounts(xmlFile)
>> > >>> > > > getTallyResultsDatabase() calls Tally Engine instance method
>> getDbcounts(new QueryEngine(hibernateConfiguration)
>> > >>> > > > Both of these instanced methods originate from
>> TallyEngine.java...
>> > >>> > > >
>> > >>> > > > TallyEngine.java:
>> > >>> > > >
>> > >>> > > > getXmlFileCounts() calls digestXmlFile() which instantiates a
>> digester then processes against criteria... but this quickly becomes
>> confusing and is hard to follow
>> > >>> > > >
>> > >>> > > > getDbcounts() then starts a db session and executes a query
>> but then I also get a bit lost with my limited db knowledge.
>> > >>> > > >
>> > >>> > > >
>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> > >>> > > >
>> > >>> > > > OVERALL I think I'm getting closer to the issues but I still
>> feel as if I'm missing some understanding to proceed further. Can you pass
>> along some of that Dondi insight and steer me in the right direction? =D
>> > >>> > > >
>> > >>> > > > -DB Tally - Not having taken databases yet certainly is
>> limiting my ability determine where the "criteria" are being set and how
>> they are followed during session activities. Also is the query we have been
>> looking for this whole time in the criteria or someplace else?
>> > >>> > > >
>> > >>> > > > -XML Tally - again is the query contained within the criteria
>> that digestXmlFile() uses to parse?
>> > >>> > > >
>> > >>> > > > Richard
>> > >>> > > >
>> > >>> > > >
>> > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <
>> do...@lm...> wrote:
>> > >>> > > > Right, schema issues are unlikely.  Most count discrepancies
>> like this that I've seen have boiled down to forming the right query.  Then,
>> knowing the right query (in both XML and SQL), it's a matter of making sure
>> that TallyEngine asks that same query.
>> > >>> > > >
>> > >>> > > > John David N. Dionisio, PhD
>> > >>> > > > Associate Professor, Computer Science
>> > >>> > > > Loyola Marymount University
>> > >>> > > >
>> > >>> > > >
>> > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote:
>> > >>> > > >
>> > >>> > > > > OK, so based on your approach:
>> > >>> > > > >
>> > >>> > > > > 1. I'll start with reviewing the queries for xmlpipedb-match
>> and sql queries needed for the respective results as you requested.
>> > >>> > > > >
>> > >>> > > > > I was also thinking I may need to review the schema from xml
>> into postgres but the issue isn't likely a schema error. The error most
>> likely lies in how xmlpipedbutils queries the data from xml source and
>> writes to the tables what it returns?
>> > >>> > > > >
>> > >>> > > > > 2. I'll review the code: trace the entrance of tally engine
>> in the gmbuilder code then follow it through the xmlpipedbutils.
>> > >>> > > > >
>> > >>> > > > > Richard
>> > >>> > > > >
>> > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio <
>> do...@lm...> wrote:
>> > >>> > > > > Just wanted to confirm (since I wasn't sure in the first
>> e-mail) --- the XMLPipeDB Utilities source code is in trunk/xmlpipedbutils
>> in SourceForge's Subversion repo.
>> > >>> > > > >
>> > >>> > > > > John David N. Dionisio, PhD
>> > >>> > > > > Associate Professor, Computer Science
>> > >>> > > > > Loyola Marymount University
>> > >>> > > > >
>> > >>> > > > >
>> > >>> > > > >
>> > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote:
>> > >>> > > > >
>> > >>> > > > > > Hi Dondi,
>> > >>> > > > > >
>> > >>> > > > > > So I'm at the point in working with M tuberculosis that I
>> was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine
>> results.
>> > >>> > > > > >
>> > >>> > > > > > gmb2b60 Results
>> > >>> > > > > >
>> > >>> > > > > >
>> > >>> > > > > >
>> > >>> > > > > > Now the proverbial question - What next to solve the
>> Ordered Locus import/count issue?
>> > >>> > > > > >
>> > >>> > > > > > **********************************************
>> > >>> > > > > > Here is my thought process:
>> > >>> > > > > >
>> > >>> > > > > > Step 1: How does the import process work at the high
>> level? (obviously correct me if I'm wrong)
>> > >>> > > > > >
>> > >>> > > > > > I believe that basically as each XML tag is read, it is
>> placed in the proper Postgres table(s) based on some criteria. There is also
>> likely some sort of check that each individual tag is in valid XML format
>> unless we don't care at this stage (care at export) or maybe the parser just
>> skips over and goes on to the next .
>> > >>> > > > > >
>> > >>> > > > > > Step 2: What could be the problem?
>> > >>> > > > > >
>> > >>> > > > > > Either -
>> > >>> > > > > > a. XML tags are being parsed incorrectly
>> (ignored/skipped)?
>> > >>> > > > > > b. Decision criteria of which table they should be added
>> to?
>> > >>> > > > > >
>> > >>> > > > > > **********************************************
>> > >>> > > > > >
>> > >>> > > > > > I read on the sourceforge wiki:
>> > >>> > > > > >
>> > >>> > > > > > XMLPipeDB has a modular architecture with three components
>> that may be used separately or together. XSD-to-DB reads an XSD (XML Schema
>> Definition) and automatically generates an SQL schema, Java classes, and
>> Hibernate mappings. XMLPipeDB Utilities provides functionality for
>> configuring the database, importing data, and performing queries. GenMAPP
>> Builder is based on the XMLPipeDB Utilities and exports GenMAPP-compatible
>> Gene Databases based on data from UniProt and Gene Ontology (GO).
>> > >>> > > > > >
>> > >>> > > > > > So I should probably start with the XMLPipeDB Utilities
>> which are where? I don't see any in the basic distribution or are they not
>> standalone and called from the command line?
>> > >>> > > > > >
>> > >>> > > > > > Thanks!
>> > >>> > > > > >
>> > >>> > > > > > Richard
>> > >>> > > > >
>> > >>> > > > >
>> > >>> > > > > <ATT00001..txt><ATT00002..txt>
>> > >>> > > >
>> > >>> > > >
>> > >>> > > >
>> ------------------------------------------------------------------------------
>> > >>> > > > The ultimate all-in-one performance toolkit: Intel(R) Parallel
>> Studio XE:
>> > >>> > > > Pinpoint memory and threading errors before they happen.
>> > >>> > > > Find and fix more than 250 security defects in the development
>> cycle.
>> > >>> > > > Locate bottlenecks in serial and parallel code that limit
>> performance.
>> > >>> > > > http://p.sf.net/sfu/intel-dev2devfeb
>> > >>> > > > _______________________________________________
>> > >>> > > > xmlpipedb-developer mailing list
>> > >>> > > > xml...@li...
>> > >>> > > >
>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>> > >>> > > >
>> > >>> > > > <ATT00001..txt><ATT00002..txt>
>> > >>> > >
>> > >>> > >
>> > >>> > >
>> ------------------------------------------------------------------------------
>> > >>> > > The ultimate all-in-one performance toolkit: Intel(R) Parallel
>> Studio XE:
>> > >>> > > Pinpoint memory and threading errors before they happen.
>> > >>> > > Find and fix more than 250 security defects in the development
>> cycle.
>> > >>> > > Locate bottlenecks in serial and parallel code that limit
>> performance.
>> > >>> > > http://p.sf.net/sfu/intel-dev2devfeb
>> > >>> > > _______________________________________________
>> > >>> > > xmlpipedb-developer mailing list
>> > >>> > > xml...@li...
>> > >>> > >
>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>> > >>> > >
>> > >>> > > <ATT00001..txt><ATT00002..txt>
>> > >>> >
>> > >>> >
>> > >>> >
>> ------------------------------------------------------------------------------
>> > >>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel
>> Studio XE:
>> > >>> > Pinpoint memory and threading errors before they happen.
>> > >>> > Find and fix more than 250 security defects in the development
>> cycle.
>> > >>> > Locate bottlenecks in serial and parallel code that limit
>> performance.
>> > >>> > http://p.sf.net/sfu/intel-dev2devfeb
>> > >>> > _______________________________________________
>> > >>> > xmlpipedb-developer mailing list
>> > >>> > xml...@li...
>> > >>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> > <ATT00001..txt><ATT00002..txt>
>> > >>>
>> > >>>
>> > >>>
>> ------------------------------------------------------------------------------
>> > >>> The ultimate all-in-one performance toolkit: Intel(R) Parallel
>> Studio XE:
>> > >>> Pinpoint memory and threading errors before they happen.
>> > >>> Find and fix more than 250 security defects in the development
>> cycle.
>> > >>> Locate bottlenecks in serial and parallel code that limit
>> performance.
>> > >>> http://p.sf.net/sfu/intel-dev2devfeb
>> > >>> _______________________________________________
>> > >>> xmlpipedb-developer mailing list
>> > >>> xml...@li...
>> > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> ------------------------------------------------------------------------------
>> > >>> Index, Search & Analyze Logs and other IT data in Real-Time with
>> Splunk
>> > >>> Collect, index and harness all the fast moving IT data generated by
>> your
>> > >>> applications, servers and devices whether physical, virtual or in
>> the cloud.
>> > >>> Deliver compliance at lower cost and gain new business insights.
>> > >>> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev
>> > >>> _______________________________________________
>> > >>> xmlpipedb-developer mailing list
>> > >>> xml...@li...
>> > >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>> > >>
>> > >>
>> ------------------------------------------------------------------------------
>> > >> Index, Search & Analyze Logs and other IT data in Real-Time with
>> Splunk
>> > >> Collect, index and harness all the fast moving IT data generated by
>> your
>> > >> applications, servers and devices whether physical, virtual or in the
>> cloud.
>> > >> Deliver compliance at lower cost and gain new business insights.
>> > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev
>> > >> _______________________________________________
>> > >> xmlpipedb-developer mailing list
>> > >> xml...@li...
>> > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>> > >>
>> > >>
>> > >>
>> ------------------------------------------------------------------------------
>> > >> Index, Search & Analyze Logs and other IT data in Real-Time with
>> Splunk
>> > >> Collect, index and harness all the fast moving IT data generated by
>> your
>> > >> applications, servers and devices whether physical, virtual or in the
>> cloud.
>> > >> Deliver compliance at lower cost and gain new business insights.
>> > >> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev
>> > >> _______________________________________________
>> > >> xmlpipedb-developer mailing list
>> > >> xml...@li...
>> > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>> > >
>> > >
>> ------------------------------------------------------------------------------
>> > > Index, Search & Analyze Logs and other IT data in Real-Time with
>> Splunk
>> > > Collect, index and harness all the fast moving IT data generated by
>> your
>> > > applications, servers and devices whether physical, virtual or in the
>> cloud.
>> > > Deliver compliance at lower cost and gain new business insights.
>> > > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev
>> > > _______________________________________________
>> > > xmlpipedb-developer mailing list
>> > > xml...@li...
>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>> > >
>> > >
>> > >
>> > > <ATT00001..txt><ATT00002..txt>
>> >
>> >
>> >
>> ------------------------------------------------------------------------------
>> > Index, Search & Analyze Logs and other IT data in Real-Time with Splunk
>> > Collect, index and harness all the fast moving IT data generated by your
>> > applications, servers and devices whether physical, virtual or in the
>> cloud.
>> > Deliver compliance at lower cost and gain new business insights.
>> > Free Software Download: http://p.sf.net/sfu/splunk-dev2dev
>> > _______________________________________________
>> > xmlpipedb-developer mailing list
>> > xml...@li...
>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>> >
>> > <ATT00001..txt><ATT00002..txt>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Index, Search & Analyze Logs and other IT data in Real-Time with Splunk
>> Collect, index and harness all the fast moving IT data generated by your
>> applications, servers and devices whether physical, virtual or in the
>> cloud.
>> Deliver compliance at lower cost and gain new business insights.
>> Free Software Download: http://p.sf.net/sfu/splunk-dev2dev
>> _______________________________________________
>> xmlpipedb-developer mailing list
>> xml...@li...
>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>>
>
>