Re: [XMLPipeDB-developer] 499 - PROBLEM - M tuberculosis xml tag importation


Digging into this after my morning class. Will send out another update later
today.

Richard

On Wed, Feb 23, 2011 at 12:01 PM, John David N. Dionisio <do...@lm...> wrote:

> Greetings,
>
> Agreed on all counts.  And yes, it appears that the slash-processing (for
> cases where we want to keep both IDs) has already been taken care of as a
> default behavior, so that is great.
>
> I agree that a specific exclusion of Rv3346/55c will now be sufficient.
>  Some form of negation predicate like "not" or "<>" should do the trick.
>
> John David N. Dionisio, PhD
> Associate Professor, Computer Science
> Loyola Marymount University
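
For reference, the exclusion discussed above could be a single extra predicate on the count query, along the lines of: and value <> 'Rv3346/55c'. What follows is only a minimal Java sketch of the same filter logic, run against a few made-up rows instead of the real genenametype table. The class name, method names, and sample rows are assumptions for illustration, and PostgreSQL's unanchored "~" operator is approximated here with Matcher.find().

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class RvCountSketch {
    // Same pattern as the query in the thread. PostgreSQL's "~" matches
    // anywhere in the string, which corresponds to Matcher.find() in Java.
    private static final Pattern RV =
            Pattern.compile("[Rr][Vv][0-9][0-9][0-9][0-9]*");

    // Hypothetical sample rows as {value, type}; not real query output.
    static List<String[]> sampleRows() {
        return Arrays.asList(
                new String[] {"Rv1990A", "ORF"},
                new String[] {"Rv3346/55c", "ORF"},
                new String[] {"Rv2561/Rv2562", "ordered locus"},
                new String[] {"dnaA", "primary"});
    }

    static long count(List<String[]> rows) {
        long n = 0;
        for (String[] row : rows) {
            String value = row[0];
            String type = row[1];
            boolean wantedType = type.equals("ordered locus") || type.equals("ORF");
            boolean looksLikeRv = RV.matcher(value).find();
            boolean excluded = value.equals("Rv3346/55c"); // value <> 'Rv3346/55c'
            if (wantedType && looksLikeRv && !excluded) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count(sampleRows()));
    }
}
```

With these four sample rows, only Rv1990A and Rv2561/Rv2562 pass all three checks.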
>
>
>
>  On Feb 23, 2011, at 2:46 PM, Richard Brous wrote:
>
> > Great!
> >
> > Now the thumbs up from Dondi and we can get cracking. I'll dig into sql
> syntax for negation of a specific tuple while we wait for his response.
> >
> > Richard
> >
> > On Wed, Feb 23, 2011 at 11:37 AM, Kam Dahlquist <kda...@lm...>
> wrote:
> > Hi,
> >
> > That sounds good to me.
> >
> > Best,
> > Dr. D
> >
> >
> > At 11:16 AM 2/23/2011, you wrote:
> >> Hmm... possible that the method to split up records has been globally
> implemented for export from Postgres?
> >>
> >> Since the records are automatically splitting when exported from
> postgres to gdb file, all we need to do is exclude Rv3346/55c 'ORF' from
> "select count (*) from genenametype where (type = 'ordered locus' or type =
> 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';" and we should be set.
> >>
> >> Does that seem correct, Dondi?
> >>
> >> Richard
> >>
> >> On Wed, Feb 23, 2011 at 11:04 AM, Kam Dahlquist <kda...@lm...>
> wrote:
> >> Hi,
> >>
> >> I checked these out.  They all should be separated and kept as separate
> OrderedLocusNames records.  In fact, the gdb already has them separated into
> individual records, so it is only the tally for Postgres that is off, they
> were correctly exported into the gdb.
> >>
> >> Best,
> >> Dr. D
> >>
> >>
> >> At 07:31 PM 2/22/2011, Richard Brous wrote:
> >>> OK I found the culprits:
> >>>
> >>> There are 3 ordered locus gene ID's that have slashes. As was
> previously discussed it seems XML Match is counting each one twice which
> would inflate the count by 3.
> >>>
> >>> Rv2561/Rv2562       'ordered locus'
> >>> Rv2880c/Rv2879c   'ordered locus'
> >>> Rv3021c/Rv3022c   'ordered locus'
> >>>
> >>> And as was previously noted, Rv3346/55c   'ORF' should be excluded.
> >>>
> >>> This will bring both the XML and SQL db query counts in sync at 4057.
> >>>
> >>> By my math:
> >>>
> ------------------------------------------------------------------------------------------------------------
> >>> XML tags read by Match:
> >>> 4066 unique matches - 6 genes excluded by Dr. D - 3 duplicates caused
> by slashes = total 4057 genes
> >>>
> ------------------------------------------------------------------------------------------------------------
> >>> sql query of the db tables:
> >>> 3988 ordered locus + 69 ORF = total 4057 genes
> >>>
> ------------------------------------------------------------------------------------------------------------
> >>>
> >>> TO MOVE FORWARD:
> >>>
> >>> 1. We need to decide how to address the slash ID's (should we keep and
> split into separate tuples or should they be excluded)
> >>>
> >>> 2. Adjust the queries to reflect what is needed and test with raw sql.
> >>>
> >>> 3. Update the queries on
> MycobacteriumTuberculosisUniProtSpeciesProfile.java
> >>>
> >>> 4. Export a new Mtb gene database for testing.
> >>>
> >>>
> >>> Richard
> >>>
> >>>
> >>>
> >>> On Tue, Feb 22, 2011 at 5:12 PM, Kam Dahlquist <kda...@lm...>
> wrote:
> >>> Hi,
> >>> I don't know why the numbers are off.  I think the only way to find out
> is to get the results of the Postgres query and line it up next to the match
> results and see what is different.  If you can send me the lists of IDs from
> the Postgres query and the match results, I can check them.
> >>> Best,
> >>> Dr. D
> >>>
> >>> At 04:57 PM 2/22/2011, you wrote:
> >>>> Here is where I am on the numbers:
> >>>>
> ------------------------------------------------------------------------------------------------------------
> >>>> XML tags read by Match:
> >>>> 4066 unique matches - 6 genes excluded by Dr. D = total 4060 genes
> >>>>
> ------------------------------------------------------------------------------------------------------------
> >>>> sql query of the db tables:
> >>>> 3988 ordered locus + 69 ORF = total 4057 genes
> >>>>
> ------------------------------------------------------------------------------------------------------------
> >>>> So by my count we are off by 3 genes
> >>>>
> >>>> Maybe XML Match is counting slashed genes as 2 separate genes?
> >>>>
> >>>> IE. so if it encountered 3 slashed genes it would in effect be
> counting those 3 as 6?
> >>>>
> >>>> Richard
> >>>> On Tue, Feb 22, 2011 at 3:06 PM, Kam Dahlquist <kda...@lm...>
> wrote:
> >>>> Hi,
> >>>> I compared this list of gene IDs with the list I had on the Testing
> Report on the wiki and found the following.
> >>>> Your list has 70 IDs.  69 of them are identical to what is in my list
> of IDs in the Testing report.
> >>>> Indeed, the only difference is the one with the slash.  "Rv3346/55c"
>  I looked this ID up at UniProt.org as Rv3346.  It appears in the record for
> UniProt ID O50384.  For that UniProt record, the ID referred to in the
> OrderedLocus tag is Rv3355c which does appear in the gdb already.  Looking
> up Rv3346 on Tuberculist and the Stanford TB database, Rv3346 is not a real
> gene ID.  So this entire record with the slash in it needs to be excluded
> and all the rest of them need to be captured.
> >>>> I think then, that the numbers should add up correctly, is this true?
> >>>> Best,
> >>>> Dr. D
> >>>> At 01:09 PM 2/22/2011, Richard Brous wrote:
> >>>>> attachment now included...
> >>>>> On Tue, Feb 22, 2011 at 1:09 PM, Richard Brous <rbr...@gm...>
> wrote:
> >>>>> Here is an export of the genes found using: select * from
> genenametype where type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
> and also attached as a csv file.
> >>>>>
> >>>>>
> 647412|"org.uniprot.uniprot.GeneNameType"|0|"Rv1990A"|"ORF"|""|647409|2
> >>>>> 5297|"org.uniprot.uniprot.GeneNameType"|0|"Rv2922A"|"ORF"|""|5292|4
> >>>>>
> 647553|"org.uniprot.uniprot.GeneNameType"|0|"Rv1638A"|"ORF"|""|647550|2
> >>>>>
> 647679|"org.uniprot.uniprot.GeneNameType"|0|"Rv1507A"|"ORF"|""|647676|2
> >>>>>
> 647804|"org.uniprot.uniprot.GeneNameType"|0|"Rv1498A"|"ORF"|""|647801|2
> >>>>>
> 647944|"org.uniprot.uniprot.GeneNameType"|0|"Rv1489A"|"ORF"|""|647941|2
> >>>>>
> 211818|"org.uniprot.uniprot.GeneNameType"|0|"Rv0979A"|"ORF"|""|211814|3
> >>>>>
> 648210|"org.uniprot.uniprot.GeneNameType"|0|"Rv1473A"|"ORF"|""|648207|2
> >>>>>
> 648340|"org.uniprot.uniprot.GeneNameType"|0|"Rv1322A"|"ORF"|""|648337|2
> >>>>>
> 648488|"org.uniprot.uniprot.GeneNameType"|0|"Rv1135A"|"ORF"|""|648485|2
> >>>>>
> 648637|"org.uniprot.uniprot.GeneNameType"|0|"Rv1116A"|"ORF"|""|648634|2
> >>>>>
> 648762|"org.uniprot.uniprot.GeneNameType"|0|"Rv1087A"|"ORF"|""|648759|2
> >>>>>
> 649177|"org.uniprot.uniprot.GeneNameType"|0|"Rv0787A"|"ORF"|""|649174|2
> >>>>>
> 649334|"org.uniprot.uniprot.GeneNameType"|0|"Rv0749A"|"ORF"|""|649331|2
> >>>>>
> 649472|"org.uniprot.uniprot.GeneNameType"|0|"Rv0590A"|"ORF"|""|649469|2
> >>>>>
> 649899|"org.uniprot.uniprot.GeneNameType"|0|"Rv0470A"|"ORF"|""|649896|2
> >>>>>
> 650295|"org.uniprot.uniprot.GeneNameType"|0|"Rv0078A"|"ORF"|""|650292|2
> >>>>>
> 174122|"org.uniprot.uniprot.GeneNameType"|0|"Rv1159A"|"ORF"|""|174119|2
> >>>>>
> 174307|"org.uniprot.uniprot.GeneNameType"|0|"Rv3312A"|"ORF"|""|174303|3
> >>>>>
> 312550|"org.uniprot.uniprot.GeneNameType"|0|"Rv0236A"|"ORF"|""|312547|2
> >>>>>
> 331661|"org.uniprot.uniprot.GeneNameType"|0|"Rv3198A"|"ORF"|""|331658|2
> >>>>>
> 445836|"org.uniprot.uniprot.GeneNameType"|0|"Rv3346/55c"|"ORF"|""|445833|2
> >>>>>
> 621649|"org.uniprot.uniprot.GeneNameType"|0|"Rv3395A"|"ORF"|""|621647|1
> >>>>>
> 622466|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224B"|"ORF"|""|622464|1
> >>>>>
> 622558|"org.uniprot.uniprot.GeneNameType"|0|"Rv3224A"|"ORF"|""|622556|1
> >>>>>
> 622739|"org.uniprot.uniprot.GeneNameType"|0|"Rv3208A"|"ORF"|""|622736|2
> >>>>>
> 622824|"org.uniprot.uniprot.GeneNameType"|0|"Rv3197A"|"ORF"|""|622821|2
> >>>>>
> 623397|"org.uniprot.uniprot.GeneNameType"|0|"Rv3022A"|"ORF"|""|623394|2
> >>>>>
> 623597|"org.uniprot.uniprot.GeneNameType"|0|"Rv3018A"|"ORF"|""|623594|2
> >>>>>
> 623682|"org.uniprot.uniprot.GeneNameType"|0|"Rv2998A"|"ORF"|""|623680|1
> >>>>>
> 623787|"org.uniprot.uniprot.GeneNameType"|0|"Rv2943A"|"ORF"|""|623785|1
> >>>>>
> 624282|"org.uniprot.uniprot.GeneNameType"|0|"Rv0492A"|"ORF"|""|624280|1
> >>>>>
> 624460|"org.uniprot.uniprot.GeneNameType"|0|"Rv0456A"|"ORF"|""|624458|1
> >>>>>
> 625679|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724B"|"ORF"|""|625676|2
> >>>>>
> 625774|"org.uniprot.uniprot.GeneNameType"|0|"Rv3724A"|"ORF"|""|625771|2
> >>>>>
> 626169|"org.uniprot.uniprot.GeneNameType"|0|"Rv2737A"|"ORF"|""|626167|1
> >>>>>
> 626355|"org.uniprot.uniprot.GeneNameType"|0|"Rv2614A"|"ORF"|""|626353|1
> >>>>>
> 626652|"org.uniprot.uniprot.GeneNameType"|0|"Rv2438A"|"ORF"|""|626650|1
> >>>>>
> 626910|"org.uniprot.uniprot.GeneNameType"|0|"Rv2401A"|"ORF"|""|626908|1
> >>>>>
> 627340|"org.uniprot.uniprot.GeneNameType"|0|"Rv2331A"|"ORF"|""|627338|1
> >>>>>
> 627418|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307B"|"ORF"|""|627416|1
> >>>>>
> 627496|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306B"|"ORF"|""|627494|1
> >>>>>
> 627579|"org.uniprot.uniprot.GeneNameType"|0|"Rv2306A"|"ORF"|""|627577|1
> >>>>>
> 627657|"org.uniprot.uniprot.GeneNameType"|0|"Rv2250A"|"ORF"|""|627655|1
> >>>>>
> 627736|"org.uniprot.uniprot.GeneNameType"|0|"Rv2219A"|"ORF"|""|627734|1
> >>>>>
> 627827|"org.uniprot.uniprot.GeneNameType"|0|"Rv2160A"|"ORF"|""|627825|1
> >>>>>
> 628290|"org.uniprot.uniprot.GeneNameType"|0|"Rv1888A"|"ORF"|""|628288|1
> >>>>>
> 629063|"org.uniprot.uniprot.GeneNameType"|0|"Rv1765A"|"ORF"|""|629061|1
> >>>>>
> 629159|"org.uniprot.uniprot.GeneNameType"|0|"Rv1706A"|"ORF"|""|629157|1
> >>>>>
> 629325|"org.uniprot.uniprot.GeneNameType"|0|"Rv1508A"|"ORF"|""|629323|1
> >>>>>
> 630084|"org.uniprot.uniprot.GeneNameType"|0|"Rv1290A"|"ORF"|""|630082|1
> >>>>>
> 630597|"org.uniprot.uniprot.GeneNameType"|0|"Rv1089A"|"ORF"|""|630594|2
> >>>>>
> 631025|"org.uniprot.uniprot.GeneNameType"|0|"Rv1028A"|"ORF"|""|631022|2
> >>>>>
> 632207|"org.uniprot.uniprot.GeneNameType"|0|"Rv0755A"|"ORF"|""|632205|1
> >>>>>
> 632630|"org.uniprot.uniprot.GeneNameType"|0|"Rv0724A"|"ORF"|""|632628|1
> >>>>>
> 633088|"org.uniprot.uniprot.GeneNameType"|0|"Rv0609A"|"ORF"|""|633086|1
> >>>>>
> 633363|"org.uniprot.uniprot.GeneNameType"|0|"Rv0192A"|"ORF"|""|633361|1
> >>>>>
> 645287|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770B"|"ORF"|""|645284|2
> >>>>>
> 645415|"org.uniprot.uniprot.GeneNameType"|0|"Rv3770A"|"ORF"|""|645412|2
> >>>>>
> 645542|"org.uniprot.uniprot.GeneNameType"|0|"Rv3705A"|"ORF"|""|645539|2
> >>>>>
> 645680|"org.uniprot.uniprot.GeneNameType"|0|"Rv3678A"|"ORF"|""|645677|2
> >>>>>
> 645817|"org.uniprot.uniprot.GeneNameType"|0|"Rv3566A"|"ORF"|""|645814|2
> >>>>>
> 646080|"org.uniprot.uniprot.GeneNameType"|0|"Rv3221A"|"ORF"|""|646077|2
> >>>>>
> 646212|"org.uniprot.uniprot.GeneNameType"|0|"Rv3196A"|"ORF"|""|646209|2
> >>>>>
> 646486|"org.uniprot.uniprot.GeneNameType"|0|"Rv2601A"|"ORF"|""|646483|2
> >>>>>
> 646630|"org.uniprot.uniprot.GeneNameType"|0|"Rv2530A"|"ORF"|""|646627|2
> >>>>>
> 646767|"org.uniprot.uniprot.GeneNameType"|0|"Rv2309A"|"ORF"|""|646764|2
> >>>>>
> 646892|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307D"|"ORF"|""|646889|2
> >>>>>
> 647019|"org.uniprot.uniprot.GeneNameType"|0|"Rv2307A"|"ORF"|""|647016|2
> >>>>>
> 647144|"org.uniprot.uniprot.GeneNameType"|0|"Rv2077A"|"ORF"|""|647141|2
> >>>>> *****The item of note I see is the gene with the slash separating
> gene id's which refer to the same gene.
> >>>>>
> >>>>> Richard
> >>>>>
> >>>>> On Mon, Feb 21, 2011 at 11:11 PM, Richard Brous <rbr...@gm...>
> wrote:
> >>>>> Understood.
> >>>>>
> >>>>> I'll check in with Dr. D in the afternoon tomorrow and discuss.
> >>>>>
> >>>>> Richard
> >>>>>
> >>>>> On Mon, Feb 21, 2011 at 11:06 PM, John David N. Dionisio <
> do...@lm...> wrote:
> >>>>> Hi Rich,
> >>>>> Addressing the release business first, let's put it this way: if the
> remaining loose ends can be addressed by tomorrow, we can probably wait
> until then.  If unexpected snags are encountered, then it would be
> worthwhile to release whatever you have.
> >>>>> With that in mind, considering that you pretty much know the patterns
> of the IDs that are needed, I think it will only take a little digital
> forensic work now to figure out exactly which IDs are still needed.  Once
> you know what those are, you should:
> >>>>> 1. Find where they are in the XML file.
> >>>>> 2. Knowing the XML location, find the corresponding table in the
> relational database (table names are generally derived from tag/element
> names).
> >>>>> 3. Knowing the table in the database, write or extend the
> SpeciesProfile query to retrieve that data.
> >>>>> For the ID that must *not* be included, again it's a matter of
> tracking down what this ID is.  Knowing this straggler, you can then consult
> with Dr. Dahlquist if this ID is truly a unique one-off, or is
> representative of a pattern that we'll want to exclude.  Either way, this ID
> can be omitted by using "not" or "<>" or possibly "not like" or "not ~"
> (check PostgreSQL where clause syntax to see where the negation can be
> applied).
> >>>>> John David N. Dionisio, PhD
> >>>>> Associate Professor, Computer Science
> >>>>> Loyola Marymount University
> >>>>> On Feb 22, 2011, at 1:37 AM, Richard Brous wrote:
> >>>>> > actually i had a typo (emailing from desktop system but testing on
> my laptop... typed correctly here but wrong in pgadmin) but the results make
> much more sense now.
> >>>>> >
> >>>>> >
> >>>>> > select count (*) from genenametype where (type = 'ordered locus' or
> type = 'ORF') and value like 'Rv%';
> >>>>> > returns 4058
> >>>>> >
> >>>>> > select count (*) from genenametype where (type = 'ordered locus' or
> type = 'ORF') and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
> >>>>> > returns 4058
> >>>>> >
> >>>>> >
> --------------------------------------------------------------------------------------------------------------------------
> >>>>> >
> >>>>> >
> >>>>> > Continuing forward -
> >>>>> >
> >>>>> > The testing report says that 4066 unique matches exist in XML but 6
> of them were eliminated by Dr. D leaving the desired number at 4060.
> >>>>> >
> >>>>> > So now we are only 2 genes short with the query returning 4058...
> which is also (conveniently) the sum of the two separate queries of 'ordered
> locus' and 'ORF' respectively.
> >>>>> >
> >>>>> >  But recall that Dr. D said that only 69 genes of the missing 75
> were tagged 'ORF' but we seem to have 1 extra gene tagged 'ORF' than we
> expected. Adding that into missing genes puts us 3 short...
> >>>>> >
> >>>>> > Should I make the changes to the code and export a gdb so that
> analysis can be done or wait until we work this through further?
> >>>>> >
> >>>>> > Richard
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > On Mon, Feb 21, 2011 at 10:04 PM, John David N. Dionisio <
> do...@lm...> wrote:
> >>>>> > Hi Rich,
> >>>>> >
> >>>>> > The second form should have worked actually.  What exactly was the
> error?
> >>>>> >
> >>>>> > John David N. Dionisio, PhD
> >>>>> > Associate Professor, Computer Science
> >>>>> > Loyola Marymount University
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > On Feb 22, 2011, at 1:01 AM, Richard Brous wrote:
> >>>>> >
> >>>>> > > hmm not taking parenthesis where I thought they should go...
> syntax error
> >>>>> > >
> >>>>> > > select count (*) from genenametype where type = ('ordered locus'
> or  'ORF') and value like 'Rv%';
> >>>>> > > also tried
> >>>>> > > select count (*) from genenametype where (type = 'ordered locus'
> or type = 'ORF') and value like 'Rv%';
> >>>>> > >
> >>>>> > >
> >>>>> > >
> >>>>> > >
> >>>>> > >
> >>>>> > > On Mon, Feb 21, 2011 at 9:40 PM, Richard Brous <
> rbr...@gm...> wrote:
> >>>>> > > ah yes... i see it...
> >>>>> > >
> >>>>> > >
> >>>>> > > On Mon, Feb 21, 2011 at 9:33 PM, John David N. Dionisio <
> do...@lm...> wrote:
> >>>>> > > Watch your parentheses: "and" has greater precedence than "or" :)
> >>>>> > >
> >>>>> > >
> >>>>> > > John David N. Dionisio, PhD
> >>>>> > > Associate Professor, Computer Science
> >>>>> > > Loyola Marymount University
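
To illustrate the precedence point: without parentheses the condition groups as type = 'ordered locus' or (type = 'ORF' and value like 'Rv%'), so every ordered locus row is counted whether or not its value starts with Rv. That would be consistent with the unexpectedly large 7011 figure. Java's && and || follow the same rule, so a small sketch can show the difference. The rows below are made up (the "MT" values are hypothetical non-Rv IDs), and startsWith("Rv") stands in for like 'Rv%'.

```java
import java.util.Arrays;
import java.util.List;

final class PrecedenceSketch {
    // Made-up {value, type} rows; the "MT" values are hypothetical non-Rv IDs.
    static List<String[]> sampleRows() {
        return Arrays.asList(
                new String[] {"Rv0001", "ordered locus"},
                new String[] {"MT0001", "ordered locus"},
                new String[] {"Rv1990A", "ORF"},
                new String[] {"MT9999", "ORF"});
    }

    // Mirrors: type = 'ordered locus' or type = 'ORF' and value like 'Rv%'
    static long countUnparenthesized(List<String[]> rows) {
        long n = 0;
        for (String[] row : rows) {
            String value = row[0];
            String type = row[1];
            if (type.equals("ordered locus") || type.equals("ORF") && value.startsWith("Rv")) {
                n++;
            }
        }
        return n;
    }

    // Mirrors: (type = 'ordered locus' or type = 'ORF') and value like 'Rv%'
    static long countParenthesized(List<String[]> rows) {
        long n = 0;
        for (String[] row : rows) {
            String value = row[0];
            String type = row[1];
            if ((type.equals("ordered locus") || type.equals("ORF")) && value.startsWith("Rv")) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("unparenthesized: " + countUnparenthesized(sampleRows()));
        System.out.println("parenthesized:   " + countParenthesized(sampleRows()));
    }
}
```

On the sample rows the unparenthesized form also counts MT0001, because the Rv prefix test only applies to the ORF branch.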
> >>>>> > >
> >>>>> > >
> >>>>> > > On Feb 21, 2011, at 7:59 PM, Richard Brous <rbr...@gm...>
> wrote:
> >>>>> > >
> >>>>> > >> OK, so here are my query results from raw SQL:
> >>>>> > >>
> >>>>> > >> 1. using: like 'Rv%'
> >>>>> > >>
> >>>>> > >> select count (*) from genenametype where type = 'ordered locus'
> and value like 'Rv%';
> >>>>> > >> returns 3988
> >>>>> > >>
> >>>>> > >> select count (*) from genenametype where type = 'ORF' and value
> like 'Rv%';
> >>>>> > >> returns 70
> >>>>> > >>
> >>>>> > >> select count (*) from genenametype where type = 'ordered locus'
> or type = 'ORF' and value like 'Rv%';
> >>>>> > >> returns 7011
> >>>>> > >>
> >>>>> > >> 2. regular expression : value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*'
> >>>>> > >>
> >>>>> > >> select count (*) from genenametype where type = 'ordered locus'
> and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
> >>>>> > >> returns 3988
> >>>>> > >>
> >>>>> > >> select count (*) from genenametype where type = 'ordered locus'
> or type = 'ORF' and value ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
> >>>>> > >> returns 7011
> >>>>> > >>
> >>>>> > >> select count (*) from genenametype where type = 'ORF' and value
> ~ '[Rr][Vv][0-9][0-9][0-9][0-9]*';
> >>>>> > >> returns 70
> >>>>> > >>
> >>>>> > >> Conclusions:
> >>>>> > >>
> >>>>> > >> 1. It seems that querying for type = 'ORF' alone surfaces the 69
> genes we were looking for plus one more (maybe the count for missing genes
> is off by 1?).
> >>>>> > >>
> >>>>> > >> 2. Combining the two types in a single query did not produce the
> results that I expected (7011? - how did that happen????) so this is likely
> not our solution... unless of course the query syntax isn't actually doing
> what I think it is...
> >>>>> > >>
> >>>>> > >> 3. I would think the best course of action is to serially run two
> separate queries to capture all the required genes, then remove the one
> unneeded gene if it's truly not wanted.
> >>>>> > >>
> >>>>> > >> What do you think?
> >>>>> > >>
> >>>>> > >> Richard
> >>>>> > >>
> >>>>> > >>
> >>>>> > >> On Mon, Feb 21, 2011 at 5:17 PM, John David N. Dionisio <
> do...@lm...> wrote:
> >>>>> > >> I don't recall the exact details of the missing 69, but if your
> query successfully returns them in raw SQL, then this is worth a try.  You
> can integrate into the same query as long as the same columns are returned,
> which is the case here AFAIK, so go ahead and extend the existing query.
> >>>>> > >>
> >>>>> > >>
> >>>>> > >> John David N. Dionisio, PhD
> >>>>> > >> Associate Professor, Computer Science
> >>>>> > >> Loyola Marymount University
> >>>>> > >>
> >>>>> > >> On Feb 21, 2011, at 6:56 PM, Richard Brous <rbr...@gm...>
> wrote:
> >>>>> > >>
> >>>>> > >>> So here is the appropriate code snippet from
> MycobacteriumTuberculosisUniProtSpeciesProfile.java:
> >>>>> > >>> public TableManager getSystemTableManagerCustomizations(TableManager tableManager,
> >>>>> > >>>         TableManager primarySystemTableManager, Date version)
> >>>>> > >>>         throws SQLException, InvalidParameterException {
> >>>>> > >>>     // Build the base query; we only use "ordered locus" and we only want
> >>>>> > >>>     // IDs that begin with "Rv."
> >>>>> > >>>     PreparedStatement ps = ConnectionManager.getRelationalDBConnection().prepareStatement(
> >>>>> > >>>         "SELECT value, type " +
> >>>>> > >>>         "FROM genenametype INNER JOIN entrytype_genetype " +
> >>>>> > >>>         "ON (entrytype_genetype_name_hjid = entrytype_genetype.hjid) " +
> >>>>> > >>>         "WHERE type = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?");
> >>>>> > >>>     ResultSet result;
> >>>>> > >>>
> >>>>> > >>>     for (Row row : primarySystemTableManager.getRows()) {
> >>>>> > >>>         ps.setInt(1, Integer.parseInt(row.getValue("UID")));
> >>>>> > >>>         result = ps.executeQuery();
> >>>>> > >>>
> >>>>> > >>>         // We actually want to keep the case where multiple ordered locus
> >>>>> > >>>         // names appear.
> >>>>> > >>>         while (result.next()) {
> >>>>> > >>>             // We want this name to appear in the OrderedLocusNames
> >>>>> > >>>             // system table.
> >>>>> > >>>             for (String id : result.getString("value").split("/")) {
> >>>>> > >>>                 tableManager.submit(
> >>>>> > >>>                     "OrderedLocusNames", QueryType.insert, new String[][] {
> >>>>> > >>>                         { "ID", id },
> >>>>> > >>>                         { "Species", "|" + getSpeciesName() + "|" },
> >>>>> > >>>                         { "\"Date\"", GenMAPPBuilderUtilities.getSystemsDateString(version) },
> >>>>> > >>>                         { "UID", row.getValue("UID") } });
> >>>>> > >>>             }
> >>>>> > >>>         }
> >>>>> > >>>     }
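
A note on the split("/") call in the loop above: it is what turns one slashed value into several OrderedLocusNames rows. The sketch below isolates just that step, using the slashed IDs that come up elsewhere in this thread. The class and method names are assumptions for illustration and are not part of the SpeciesProfile code.

```java
import java.util.Arrays;

final class SlashSplitSketch {
    // Same operation the export loop applies to each "value" column.
    static String[] splitIds(String value) {
        return value.split("/");
    }

    public static void main(String[] args) {
        String[] values = {
            "Rv2561/Rv2562",
            "Rv2880c/Rv2879c",
            "Rv3021c/Rv3022c",
            "Rv3346/55c",
            "Rv1990A"
        };
        for (String value : values) {
            // Each piece would be submitted as its own ID row.
            System.out.println(value + " -> " + Arrays.toString(splitIds(value)));
        }
    }
}
```

Note that "Rv3346/55c" splits into "Rv3346" and "55c", neither of which is a usable gene ID per the thread, which is one reason to drop that record in the query's WHERE clause instead of relying on the split.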
> >>>>> > >>>
> >>>>> > >>>
>  -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>> > >>> So now we want to build the base query which uses "ordered
> locus" and "orf" and we only want IDs that begin with "Rv".
> >>>>> > >>>
> >>>>> > >>> I know there are more comprehensive ways to search for gene
> ID's by matching gene ID prefix but "like Rv%" seemed to work thus far, we
> just need to tell it to search for XML tag type orf in addition to ordered
> locus.
> >>>>> > >>>
> >>>>> > >>> "WHERE type = 'ordered locus' and type = 'orf' and value like
> 'Rv%' and entrytype_gene_hjid = ? "
> >>>>> > >>>
> >>>>> > >>> Here is a stab at it.... This part of our class came right as
> the server went down, and I can't seem to find my submission for the week 6
> assignment.
> >>>>> > >>>
> >>>>> > >>> Is it possible to have two different types in the same query or
> should we rewrite a separate query for the orf tag?
> >>>>> > >>>
> >>>>> > >>> Richard
> >>>>> > >>>
> >>>>> > >>>
> >>>>> > >>>
> >>>>> > >>> On Sun, Feb 20, 2011 at 10:21 PM, Richard Brous <
> rbr...@gm...> wrote:
> >>>>> > >>>
> >>>>> > >>> thanks and will do as directed.
> >>>>> > >>>
> >>>>> > >>> On the last paragraph of my previous message: is there a way for
> programming code to hold its formatting in a mail message, similar to how you
> can post code on forum pages?
> >>>>> > >>>
> >>>>> > >>> <code>
> >>>>> > >>> blah
> >>>>> > >>> blah
> >>>>> > >>> blah
> >>>>> > >>> </code>
> >>>>> > >>>
> >>>>> > >>> thanks!
> >>>>> > >>>
> >>>>> > >>> Richard
> >>>>> > >>>
> >>>>> > >>> On Sun, Feb 20, 2011 at 10:05 PM, John David N. Dionisio <
> do...@lm...> wrote:
> >>>>> > >>>
> >>>>> > >>> Greetings,
> >>>>> > >>>
> >>>>> > >>> Actually, gmbuilder.properties is for the TallyEngine only.
>  When dealing with .gdb exports, look *only* at the SpeciesProfile class.
>  So, to find those 69 IDs, it is the SpeciesProfile code, and *only* the
> SpeciesProfile code, that needs to be changed.
> >>>>> > >>>
> >>>>> > >>> Your take on how gmbuilder.properties is used, however, is
> understandable.  It makes sense to assume that the TallyEngine code *and*
> the ID export code are based on the same characterization of the needed IDs.
>  This replication is originally a historical artifact: SpeciesProfile was
> done first, and then TallyEngine was done later by another student.
> >>>>> > >>>
> >>>>> > >>> However, there are other factors beyond history that sort of
> necessitate this duplication of desired IDs: (skip the two bullets below if
> you'd rather cut to the chase of the work to be done, and discuss design
> issues later)
> >>>>> > >>>
> >>>>> > >>> - The actual XML import code is a black box: this is the
> "canned" JAXB library actually in action, and not our code at all.  Plus,
> the XML import code really does not filter (nor should it), since the goal
> of the XML->relational database step is to fully capture the XML data in the
> relational database.  So, XML count is necessarily separated from XML
> import.
> >>>>> > >>>
> >>>>> > >>> - The notion of a declarative mechanism for extracting IDs from
> the relational database (which is what gmbuilder.properties/TallyEngine
> uses) is interesting, but at the same time there is value in the arbitrary
> computation that can be done with Java (case in point: export two versions
> of an ID, with and without periods).  This is not to say that it is
> impossible to do this declaratively, but let's just say that the procedural
> approach exists here and now, and a declarative approach will need more
> thought.
> >>>>> > >>>
> >>>>> > >>> These, and other factors, are good thoughts to hold onto and
> would be worthy of a good meeting discussion sometime, but bottom line for
> now: modifying the export behavior is a matter of editing the
> *SpeciesProfile* Java code, and not the gmbuilder.properties file.  Turn
> your attention to that code.
> >>>>> > >>>
> >>>>> > >>> Now, as to annotating your code...I'd just put in code comments
>  :)  Or did you mean something else by tagging code in e-mail?
> >>>>> > >>>
> >>>>> > >>> John David N. Dionisio, PhD
> >>>>> > >>> Associate Professor, Computer Science
> >>>>> > >>> Loyola Marymount University
> >>>>> > >>>
> >>>>> > >>>
> >>>>> > >>>
> >>>>> > >>>
> >>>>> > >>> On Feb 21, 2011, at 12:38 AM, Richard Brous wrote:
> >>>>> > >>>
> >>>>> > >>> > also, how do I tag code in email so it holds its formatting?
> I tried a few suggestions I found on the web but they aren't holding
> formatting or i'm just doing it wrong ;-D
> >>>>> > >>> >
> >>>>> > >>> > Richard
> >>>>> > >>> >
> >>>>> > >>> > On Sun, Feb 20, 2011 at 9:35 PM, Richard Brous <
> rbr...@gm...> wrote:
> >>>>> > >>> > OK, have some updates and some suggestions:
> >>>>> > >>> >
> >>>>> > >>> > On Friday Dr. Dahlquist and I sat down and reviewed the gene
> testing report. We verified that XML match does indeed find 4066 unique
> matches - 75 of which are not in the gdb and need to be.
> >>>>> > >>> >
> >>>>> > >>> > Dr. Dahlquist informed me that she was the one who completed
> the gene db testing report, not a previous student of BIO367 and had already
> verified which genes were missing and where they were to be found. I had
> (mistakenly) assumed that since a student had performed the gene database
> testing I had to redo all of the verification.
> >>>>> > >>> >
> >>>>> > >>> > So that said, of the 75 genes missing - 69 need to be
> included and 6 excluded.
> >>>>> > >>> > Per the gene db testing report: "69 of them have an "a", "b",
> or "d" suffix. They are all found in the ORF tag and need to be included in
> the gdb."
> >>>>> > >>> >
> >>>>> > >>> > To solve this we need to add additional search criteria into
> the M. tuberculosis section in gmbuilder.properties below:
> >>>>> > >>> > # Mycobacterium tuberculosis
> >>>>> > >>> > mycobacteriumtuberculosis_level_amount=1
> >>>>> > >>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
> >>>>> > >>> > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
> >>>>> > >>> > mycobacteriumtuberculosis_table_name_level0=Ordered Locus
> >>>>> > >>> >
> >>>>> > >>> > SOLUTIONS:
> >>>>> > >>> >
> >>>>> > >>> > 1. So am I correct in my understanding that the second line
> is the query used by TallyEngine to read the XML file? If so then this is
> the issue we need to table for the moment until we get the gdb verified and
> re-released. We will revisit this to discover why it is not only reporting
> incorrectly but also why it's added a second row of Ordered Locus on the
> TallyEngine results page.
> >>>>> > >>> >
> >>>>> > >>> > 2. The third line is the SQL query used by postgres during
> the export from XML to gdb. To find and get the ORF tagged genes could we
> not add the following lines and change the count in the first line:
> >>>>> > >>> >
> >>>>> > >>> >
> >>>>> > >>> > # Mycobacterium tuberculosis
> >>>>> > >>> > mycobacteriumtuberculosis_level_amount=2
> >>>>> > >>> >
> >>>>> > >>> > mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
> >>>>> > >>> > mycobacteriumtuberculosis_element_level1=uniprot/entry/gene/name&type&orf
> >>>>> > >>> >
> >>>>> > >>> > mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus';
> >>>>> > >>> > mycobacteriumtuberculosis_query_level1=select count(*) from genenametype where type = 'orf';
> >>>>> > >>> >
> >>>>> > >>> > mycobacteriumtuberculosis_table_name_level0=Ordered Locus
> >>>>> > >>> > mycobacteriumtuberculosis_table_name_level1=Ordered Locus
> >>>>> > >>> >
> >>>>> > >>> >
> ----------------------------------------------------------------------------------------------------------------------------
> >>>>> > >>> >
> >>>>> > >>> > Of course these queries would have to be manually verified prior
> to making these changes but this seems like we are moving in the right
> direction.
> >>>>> > >>> >
> >>>>> > >>> > Richard
> >>>>> > >>> >
> >>>>> > >>> >
> >>>>> > >>> > On Thu, Feb 17, 2011 at 7:47 PM, Richard Brous <
> rbr...@gm...> wrote:
> >>>>> > >>> > Just got done reading previous email and understand the
> change in priority.
> >>>>> > >>> >
> >>>>> > >>> > Will work on the missing ID's for now and shelve the
> TallyEngine issue for the moment.
> >>>>> > >>> >
> >>>>> > >>> > Also great about a more formalized weekly meeting. I was
> going to suggest it myself as it has been slow going so far as maybe i'm a
> bit too independent in this independent study class =D
> >>>>> > >>> >
> >>>>> > >>> > Will dig further into the missing ID's later tonight and
> during day tomorrow and report back.
> >>>>> > >>> >
> >>>>> > >>> > Richard
> >>>>> > >>> >
> >>>>> > >>> > On Thu, Feb 17, 2011 at 4:34 PM, John David N. Dionisio <
> do...@lm...> wrote:
> >>>>> > >>> > Hi Rich,
> >>>>> > >>> >
> >>>>> > >>> > No problem.  The pertinent line you're referring to, for XML,
> is this, right above the line you copied:
> >>>>> > >>> >
> >>>>> > >>> >
>  mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered
> locus
> >>>>> > >>> >
> >>>>> > >>> > The slash-separated section is the "path" of XML tags leading
> to the element of interest; then, after the ampersand, is a name/value pair
> for the desired attribute to count.  Note that there is no hint of a
> *content*-based filter (nor is there the capability for one, as far as I can
> tell in the code).  By "content," I mean that we can't specify filters based
> on what's *between* the tags.  We can only go as far as filter by attribute
> value, e.g., type="ordered locus".
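
To make the element_level0 format described above concrete, here is a minimal Java sketch that splits such a value into its tag path and attribute name/value pair. This is not the TallyEngine's actual parsing code, which is not shown in this thread; the class, field, and method names are hypothetical, and the three-part "path&name&value" layout is assumed from the description above.

```java
import java.util.Arrays;
import java.util.List;

final class ElementSpecSketch {
    final List<String> path;  // XML tags leading to the element of interest
    final String attrName;    // attribute to filter on, e.g. "type"
    final String attrValue;   // required attribute value, e.g. "ordered locus"

    private ElementSpecSketch(List<String> path, String attrName, String attrValue) {
        this.path = path;
        this.attrName = attrName;
        this.attrValue = attrValue;
    }

    // Assumed layout: slash-separated path, then "&name&value".
    static ElementSpecSketch parse(String spec) {
        String[] parts = spec.split("&", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected path&name&value: " + spec);
        }
        return new ElementSpecSketch(Arrays.asList(parts[0].split("/")), parts[1], parts[2]);
    }

    public static void main(String[] args) {
        ElementSpecSketch spec = parse("uniprot/entry/gene/name&type&ordered locus");
        System.out.println("path  = " + spec.path);
        System.out.println("attr  = " + spec.attrName);
        System.out.println("value = " + spec.attrValue);
    }
}
```

Nothing in this format can express a condition on the text between the tags, which matches the point above that only attribute-based filtering is available on the XML side.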
> >>>>> > >>> >
> >>>>> > >>> > But anyway, as mentioned in the earlier e-mail, let's have
> the missing IDs in the .gdb take precedence for now.  Please take a look at
> the tuberculosis, A. thaliana, and P. falciparum profiles to get an idea for
> how the ID output can be customized, then let me know if you have any
> questions or need to confirm anything.
> >>>>> > >>> >
> >>>>> > >>> > John David N. Dionisio, PhD
> >>>>> > >>> > Associate Professor, Computer Science
> >>>>> > >>> > Loyola Marymount University
> >>>>> > >>> >
> >>>>> > >>> >
> >>>>> > >>> >
> >>>>> > >>> > On Feb 17, 2011, at 3:04 PM, Richard Brous wrote:
> >>>>> > >>> >
> >>>>> > >>> > > Sorry, I've been slammed with a programming assignment that kept
> needing continued iteration, and it was all-consuming until last night.
> But I did get a chance to work with your comments and review the code again
> with a different mindset.
> >>>>> > >>> > >
> >>>>> > >>> > > Yes, I examined the gmbuilder.properties file ( the query
> is also in the MycobacteriumTuberculosisUniProtSpeciesProfile which I
> mentioned in a previous email ) but I don't think I see what you mean
> regarding the XML count.
> >>>>> > >>> > >
> >>>>> > >>> > > I understood that:
> mycobacteriumtuberculosis_query_level0=select count(*) from genenametype
> where type = 'ordered locus' and value like 'Rv%';  was the db query but
> don't see which is the XML count... or do they share the same query and you
> are saying that XML count doesn't recognize and therefore cannot use the
> 'Rv%' parameter?
> >>>>> > >>> > >
> >>>>> > >>> > > Richard
> >>>>> > >>> > >
> >>>>> > >>> > >
> >>>>> > >>> > >
> >>>>> > >>> > > On Sat, Feb 12, 2011 at 11:46 PM, John David N. Dionisio <
> do...@lm...> wrote:
> >>>>> > >>> > > Hi Rich,
> >>>>> > >>> > >
> >>>>> > >>> > > Sorry for the delay.  Had some distractions coming into the
> weekend.
> >>>>> > >>> > >
> >>>>> > >>> > > You've looked at the code; have you looked at
> gmbuilder.properties?  (I may have mentioned it a few e-mails ago, just as
> you were starting to dig into this)
> >>>>> > >>> > >
> >>>>> > >>> > > On the copy I have, the M. tuberculosis block looks like
> this (indentation is mine to set it apart):
> >>>>> > >>> > >
> >>>>> > >>> > >        # Mycobacterium tuberculosis
> >>>>> > >>> > >        mycobacteriumtuberculosis_level_amount=1
> >>>>> > >>> > >
> >>>>> > >>> > >
> >>>>> > >>> > >        mycobacteriumtuberculosis_element_level0=uniprot/entry/gene/name&type&ordered locus
> >>>>> > >>> > >
> >>>>> > >>> > >        mycobacteriumtuberculosis_query_level0=select count(*) from genenametype where type = 'ordered locus' and value like 'Rv%';
> >>>>> > >>> > >
> >>>>> > >>> > >        mycobacteriumtuberculosis_table_name_level0=Ordered Locus
> >>>>> > >>> > >
> >>>>> > >>> > > There, I think, is the rub.  Notice that the XML count does
> not filter on 'Rv%'.  The SQL query does.
> >>>>> > >>> > >
> >>>>> > >>> > > Unfortunately, I don't think the TallyEngine can include
> selective filtering in the XML counts.  If selective filtering on XML is
> necessary, then I think we're looking at new functionality for you to
> implement (or, if this throws things off too much, it may have to be noted
> somewhere that the XML vs. database counts may be off because the database
> count is doing some text-based filtering but the XML count is not).
> >>>>> > >>> > >
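To illustrate how the two counts can drift apart, here is a small self-contained sketch. It assumes made-up sample XML and uses the JDK's DOM parser rather than TallyEngine's own digester; the CountMismatchDemo class is hypothetical. One count filters on the type attribute only (as the XML tally does), the other also requires the element text to start with "Rv" (as the SQL query does).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class; the sample data below is invented for illustration.
class CountMismatchDemo {
    static final String SAMPLE =
        "<uniprot><entry><gene>"
        + "<name type=\"ordered locus\">Rv0001</name>"
        + "<name type=\"ordered locus\">Rv0002</name>"
        + "<name type=\"ordered locus\">MT0003</name>"
        + "<name type=\"ORF\">Rv0004</name>"
        + "</gene></entry></uniprot>";

    // Returns {count by attribute only, count by attribute plus "Rv" prefix}.
    static int[] count(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList names = doc.getElementsByTagName("name");
            int byAttribute = 0;
            int byAttributeAndContent = 0;
            for (int i = 0; i < names.getLength(); i++) {
                Element name = (Element) names.item(i);
                if ("ordered locus".equals(name.getAttribute("type"))) {
                    byAttribute++; // attribute-only filter
                    if (name.getTextContent().startsWith("Rv")) {
                        byAttributeAndContent++; // additional content filter
                    }
                }
            }
            return new int[] {byAttribute, byAttributeAndContent};
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        int[] counts = count(SAMPLE);
        // For the sample above: 3 by attribute only, 2 with the content filter.
        System.out.println("attribute only: " + counts[0]
                + ", attribute + Rv prefix: " + counts[1]);
    }
}
```

Any "ordered locus" name that does not start with "Rv" shows up in the first count but not the second, which is the kind of discrepancy described above.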
> >>>>> > >>> > > What does xmlpipedb-match say?  That will at least tell you
> whether the 'Rv%' count is indeed correct.
> >>>>> > >>> > >
> >>>>> > >>> > > John David N. Dionisio, PhD
> >>>>> > >>> > > Associate Professor, Computer Science
> >>>>> > >>> > > Loyola Marymount University
> >>>>> > >>> > >
> >>>>> > >>> > >
> >>>>> > >>> > >
> >>>>> > >>> > > On Feb 11, 2011, at 4:52 PM, Richard Brous wrote:
> >>>>> > >>> > >
> >>>>> > >>> > > > OK here is what I was able to put together from the past
> few hours of code review:
> >>>>> > >>> > > >
> >>>>> > >>> > > > MycobacteriumTuberculosisUniProtSpeciesProfile.java:
> >>>>> > >>> > > > -reveals that after the 2 System table modifications are
> made adding species name and link, a PreparedStatement is instantiated which
> builds and calls the base query.
> >>>>> > >>> > > >
> >>>>> > >>> > > > -The base query called is: ("SELECT value, type " + "FROM
> genenametype INNER JOIN entrytype_genetype " +
> "ON(entrytype_genetype_name_hjid = entrytype_genetype.hjid) " + "WHERE type
> = 'ordered locus' and value like 'Rv%' and entrytype_gene_hjid = ?")
> >>>>> > >>> > > >
> >>>>> > >>> > > > -So it's looking in genenametype for any row of type
> 'ordered locus' whose value starts with Rv (followed by any substring) and
> whose entrytype_gene_hjid = ? .
> >>>>> > >>> > > > The 'like' comparator and % usage are clear, as is the
> entrytype_gene_hjid = ? parameter.
> >>>>> > >>> > > >
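For reference, a rough sketch of what the value like 'Rv%' condition accepts. The LikeDemo class and its methods are hypothetical helpers that translate a LIKE pattern into a Java regular expression; escape characters in LIKE patterns are deliberately ignored. LIKE in PostgreSQL is case-sensitive, so an ID written with an uppercase "RV" prefix would not match.

```java
import java.util.regex.Pattern;

// Hypothetical helper: approximates SQL LIKE matching with a Java regex.
class LikeDemo {
    // '%' matches any run of characters, '_' matches exactly one character;
    // every other character is matched literally.
    static Pattern likeToRegex(String like) {
        StringBuilder regex = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%') {
                regex.append(".*");
            } else if (c == '_') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    static boolean like(String value, String pattern) {
        // matches() requires the whole value to match, as LIKE does.
        return likeToRegex(pattern).matcher(value).matches();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"Rv0001", "Rv0001c", "RV0001", "MT0001"}) {
            System.out.println(id + " like 'Rv%': " + like(id, "Rv%"));
        }
    }
}
```

So the query itself does what it says; whether a case-sensitive "Rv" prefix is the right filter for the data is a separate question.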
> >>>>> > >>> > > > -To me it seems the query makes sense so the problem is
> likely elsewhere.
> >>>>> > >>> > > >
> >>>>> > >>> > > > GenMappBuilder.java:
> >>>>> > >>> > > > -I found method doTallies() at code line 895 which:
> >>>>> > >>> > > > Instantiates a Configuration called
> hibernateConfiguration and assigns to it the current hibernate configuration
> >>>>> > >>> > > > Validates database settings by analyzing
> hibernateConfiguration
> >>>>> > >>> > > > Instantiates a CriterionList for uniprot and assigns to
> it TallyType.UNIPROT
> >>>>> > >>> > > > Instantiates a CriterionList for go and assigns to it
> TallyType.GO
> >>>>> > >>> > > > Determines if both xml files exist
> >>>>> > >>> > > > Then getTallyResultsXML and getTallyResultsDatabase are
> run on both xml files and their respective CriterionList
> >>>>> > >>> > > > Results are then formatted for display in a table.
> >>>>> > >>> > > >
> >>>>> > >>> > > > -So TallyType is an enum, which means that these are the
> only valid types which TallyEngine accepts... good to know ...
> >>>>> > >>> > > >
> >>>>> > >>> > > > -Based on the screen shot of Tally Engine it would seem
> that both getTallyResultsXML() and getTallyResultsDatabase() are returning
> incorrect results, likely due to both using an incorrect query (as we
> previously supposed). But where are the queries?... the more I dig, the more
> I think they are in the criteria that all the work is done against.
> >>>>> > >>> > > >
> >>>>> > >>> > > > continuing the review:
> >>>>> > >>> > > > getTallyResultsXML() calls Tally Engine instance method
> getXmlFileCounts(xmlFile)
> >>>>> > >>> > > > getTallyResultsDatabase() calls Tally Engine instance
> method getDbcounts(new QueryEngine(hibernateConfiguration))
> >>>>> > >>> > > > Both of these instance methods originate from
> TallyEngine.java...
> >>>>> > >>> > > >
> >>>>> > >>> > > > TallyEngine.java:
> >>>>> > >>> > > >
> >>>>> > >>> > > > getXmlFileCounts() calls digestXmlFile() which
> instantiates a digester then processes against criteria... but this quickly
> becomes confusing and is hard to follow
> >>>>> > >>> > > >
> >>>>> > >>> > > > getDbcounts() then starts a db session and executes a
> query but then I also get a bit lost with my limited db knowledge.
> >>>>> > >>> > > >
> >>>>> > >>> > > >
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>> > >>> > > >
> >>>>> > >>> > > > OVERALL I think I'm getting closer to the issues but I
> still feel as if I'm missing some understanding to proceed further. Can you
> pass along some of that Dondi insight and steer me in the right direction?
> =D
> >>>>> > >>> > > >
> >>>>> > >>> > > > -DB Tally - Not having taken databases yet certainly is
> limiting my ability to determine where the "criteria" are being set and how
> they are followed during session activities. Also, is the query we have been
> looking for this whole time in the criteria or someplace else?
> >>>>> > >>> > > >
> >>>>> > >>> > > > -XML Tally - again is the query contained within the
> criteria that digestXmlFile() uses to parse?
> >>>>> > >>> > > >
> >>>>> > >>> > > > Richard
> >>>>> > >>> > > >
> >>>>> > >>> > > >
> >>>>> > >>> > > > On Mon, Feb 7, 2011 at 5:50 PM, John David N. Dionisio <
> do...@lm...> wrote:
> >>>>> > >>> > > > Right, schema issues are unlikely.  Most count
> discrepancies like this that I've seen have boiled down to forming the right
> query.  Then, knowing the right query (in both XML and SQL), it's a matter
> of making sure that TallyEngine asks that same query.
> >>>>> > >>> > > >
> >>>>> > >>> > > > John David N. Dionisio, PhD
> >>>>> > >>> > > > Associate Professor, Computer Science
> >>>>> > >>> > > > Loyola Marymount University
> >>>>> > >>> > > >
> >>>>> > >>> > > >
> >>>>> > >>> > > > On Feb 7, 2011, at 5:48 PM, Richard Brous wrote:
> >>>>> > >>> > > >
> >>>>> > >>> > > > > OK, so based on your approach:
> >>>>> > >>> > > > >
> >>>>> > >>> > > > > 1. I'll start with reviewing the queries for
> xmlpipedb-match and sql queries needed for the respective results as you
> requested.
> >>>>> > >>> > > > >
> >>>>> > >>> > > > > I was also thinking I may need to review the schema
> from xml into postgres but the issue isn't likely a schema error. The error
> most likely lies in how xmlpipedbutils queries the data from xml source and
> writes to the tables what it returns?
> >>>>> > >>> > > > >
> >>>>> > >>> > > > > 2. I'll review the code: trace the entrance of tally
> engine in the gmbuilder code then follow it through the xmlpipedbutils.
> >>>>> > >>> > > > >
> >>>>> > >>> > > > > Richard
> >>>>> > >>> > > > >
> >>>>> > >>> > > > > On Sat, Feb 5, 2011 at 10:28 AM, John David N. Dionisio
> <do...@lm...> wrote:
> >>>>> > >>> > > > > Just wanted to confirm (since I wasn't sure in the
> first e-mail) --- the XMLPipeDB Utilities source code is in
> trunk/xmlpipedbutils in SourceForge's Subversion repo.
> >>>>> > >>> > > > >
> >>>>> > >>> > > > > John David N. Dionisio, PhD
> >>>>> > >>> > > > > Associate Professor, Computer Science
> >>>>> > >>> > > > > Loyola Marymount University
> >>>>> > >>> > > > >
> >>>>> > >>> > > > >
> >>>>> > >>> > > > >
> >>>>> > >>> > > > > On Feb 5, 2011, at 10:02 AM, Richard Brous wrote:
> >>>>> > >>> > > > >
> >>>>> > >>> > > > > > Hi Dondi,
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > So I'm at the point in working with M tuberculosis
> that I was able to exactly reproduce Dr. Dahlquist's problematic TallyEngine
> results.
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > gmb2b60 Results
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > Now the proverbial question - What next to solve the
> Ordered Locus import/count issue?
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > **********************************************
> >>>>> > >>> > > > > > Here is my thought process:
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > Step 1: How does the import process work at the high
> level? (obviously correct me if I'm wrong)
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > I believe that, basically, as each XML tag is read, it
> is placed in the proper Postgres table(s) based on some criteria. There is
> also likely some sort of check that each individual tag is in valid XML
> format, unless we don't care at this stage (and care at export instead), or
> maybe the parser just skips over it and goes on to the next one.
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > Step 2: What could be the problem?
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > Either -
> >>>>> > >>> > > > > > a. XML tags are being parsed incorrectly
> (ignored/skipped)?
> >>>>> > >>> > > > > > b. Decision criteria of which table they should be
> added to?
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > **********************************************
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > I read on the sourceforge wiki:
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > XMLPipeDB has a modular architecture with three
> components that may be used separately or together. XSD-to-DB reads an XSD
> (XML Schema Definition) and automatically generates an SQL schema, Java
> classes, and Hibernate mappings. XMLPipeDB Utilities provides functionality
> for configuring the database, importing data, and performing queries.
> GenMAPP Builder is based on the XMLPipeDB Utilities and exports
> GenMAPP-compatible Gene Databases based on data from UniProt and Gene
> Ontology (GO).
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > So I should probably start with the XMLPipeDB
> Utilities which are where? I don't see any in the basic distribution or are
> they not standalone and called from the command line?
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > Thanks!
> >>>>> > >>> > > > > >
> >>>>> > >>> > > > > > Richard
> >
> >
> >
> >
> ------------------------------------------------------------------------------
> > Free Software Download: Index, Search & Analyze Logs and other IT data in
> > Real-Time with Splunk. Collect, index and harness all the fast moving IT
> data
> > generated by your applications, servers and devices whether physical,
> virtual
> > or in the cloud. Deliver compliance at lower cost and gain new business
> > insights. http://p.sf.net/sfu/splunk-dev2dev
> > _______________________________________________
> > xmlpipedb-developer mailing list
> > xml...@li...
> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >