Re: [XMLPipeDB-developer] Plasmodium bug/task list

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

The pairs were returned by the raw sql where type = 'ORF'

The exported gdb doesn't contain the duplicate id's which is why I think its
ok. Am I missing something?

Richard

On Sun, Mar 20, 2011 at 4:00 PM, John David N. Dionisio <do...@lm...>wrote:

> Thanks.  Try reinstating that last group by clause; it may remove some
> duplicate hjid/value pairs.
>
> John David N. Dionisio, PhD
> Associate Professor, Computer Science
> Loyola Marymount University
>
>
>  On Mar 20, 2011, at 1:54 PM, Richard Brous wrote:
>
> > Sure thing. Can send now that gmail working again...
> >
> > /**
> >
> > *
> >
> > @see
> edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager,
> edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager,
> java.util.Date)
> > */
> >
> >
> > @Override
> >
> > public TableManager getSystemTableManagerCustomizations(TableManager
> tableManager, TableManager primarySystemTableManager, Date version) throws
> SQLException, InvalidParameterException {
> >
> > // Start with the default OrderedLocusNames behavior.
> > TableManager result =
> >
> > super.getSystemTableManagerCustomizations(tableManager,
> primarySystemTableManager, version);
> >
> >
> > // Next, we add IDs from the other gene/name tags, but ONLY if they match
> >
> > // the pattern PF[A-Z][0-9]{4}[a-z].
> >
> > //final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]";
> >
> > //final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]";
> >
> > //final String pfID3 = "MAL[0-9]*P1.[0-9]*";
> > String sqlQuery =
> >
> > "select d.entrytype_gene_hjid as hjid, c.value " +
> >
> > "from genenametype c inner join entrytype_genetype d " +
> >
> > "on (c.entrytype_genetype_name_hjid = d.hjid) " +
> >
> > //"where (c.value similar to ? " +
> >
> > "where c.type = 'ORF'";
> >
> > //"or c.value similar to ? " +
> >
> > //"or c.value similar to ?) " +
> >
> > //"and type <> 'ordered locus names' " +
> >
> > //"and type <> 'ORF' " +
> >
> > //"group by d.entrytype_gene_hjid, c.value";
> >
> > String dateToday = GenMAPPBuilderUtilities.getSystemsDateString(version);
> >
> > Connection c = ConnectionManager.getRelationalDBConnection();
> >
> > PreparedStatement ps;
> >
> > ResultSet rs;
> >
> >
> > try {
> >
> > // Query, iterate, add to table manager.
> > ps = c.prepareStatement(sqlQuery);
> >
> >
> > //ps.setString(1, pfID);
> >
> > //ps.setString(2, pfID2);
> >
> > //ps.setString(3, pfID3);
> > rs = ps.executeQuery();
> >
> >
> > while (rs.next()) {
> > String hjid = Long.valueOf(rs.getLong(
> >
> > "hjid")).toString();
> > String id = rs.getString(
> >
> > "value");
> >
> > _Log.debug("Processing raw ID: " + id + " for surrogate " + hjid);
> > tableManager.submit(
> >
> > "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, {
> "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", dateToday }, {
> "UID", hjid } });
> > }
> >
> > }
> >
> > catch(SQLException sqlexc) {
> > logSQLException(sqlexc, sqlQuery);
> >
> > }
> >
> >
> >
> > return result;
> > }
> >
> >
> >
> > On Sun, Mar 20, 2011 at 1:28 PM, John David N. Dionisio <do...@lm...>
> wrote:
> > Hi Rich,
> >
> > Thanks for the updates.  Did you post the new SQL query that you used
> somewhere?  Let me know where I can take a look at it.  Or, if you haven't
> put it out there yet, send it my way just for a quick look-see.  Thanks!
> >
> > John David N. Dionisio, PhD
> > Associate Professor, Computer Science
> > Loyola Marymount University
> >
> >
> > On Mar 20, 2011, at 12:15 PM, Richard Brous wrote:
> >
> > > Having weird probs with gmail in browser where the reply button isn't
> working so have to send from my phone... Odd
> > >
> > > So I here is my analysis on the ORF only export gdb:
> > >
> > > Initially the raw SQL query returned 5345 records of which I posted an
> excel doc link to the plasmodium page on the biodb wiki.
> > >
> > > The gdb contained only 5338 genes id's so I started looking for
> duplicates or exclusions.
> > >
> > > What I found were 7 id's that were duplicated in the raw SQL query of
> postgres but were not exported into the gdb.
> > >
> > > The id's are:
> > > PF10_0168
> > > PF11_0361 (duplicated 2x for total of 3 entries)
> > > PF11_0377
> > > PF11_0405
> > > PFB0305c
> > > PFB0391c
> > >
> > > Surprisingly (to me anyway) I noticed that the duplicates all had
> unique hjid's, which may or may not mean anything...
> > >
> > > In conclusion, it seems to me that the 5338 id's in the gdb are likely
> correct.
> > >
> > > Dr. D, does that make sense?
> > >
> > > Richard
> > >
> > > Sent from my iPhone
> > >
> > > On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote:
> > >
> > >> Yes I understand that you only want the 'ORF' but was trying to get
> what we need by modifying what we have and not rewite the whole query.
> > >>
> > >> I may in fact have to rewrite the whole thing anyway as my query
> didn't return what was expected =/
> > >>
> > >> Also I'm glad you mentioned a possible issue with MAL pattern, I'll
> keep an eye out for missing id's here.
> > >>
> > >> Richard
> > >>
> > >> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...>
> wrote:
> > >> Hi,
> > >>
> > >> I think we need to capture all of the IDs in the ORF tag and *NOT* do
> the pattern match at all.  As far as I can tell with my analysis of the IDs
> in the query you posted, we need to keep them all, so we don't actually need
> to specify the patterns at this point.  I would rather do that thinking
> towards the future when the Plasmodium people might add new patterns to the
> ID system.
> > >>
> > >> I believe that the
> > >>
> > >> "MAL[0-9]*P1.[0-9]*"
> > >>
> > >> pattern is also not pulling out everything it needs to.  But instead
> > >> of including more patterns, I would rather just loosen up the criteria
> to
> > >> include all things in the ORF tag.
> > >>
> > >> Also, I just want to be clear about the underscore issue.  That only
> > >> affects IDs that begin with PFA, not the other IDs that begin with
> PF##_
> > >>
> > >> Thanks,
> > >> Kam
> > >>
> > >>
> > >>
> > >>
> > >> At 12:47 PM 3/18/2011, Richard Brous wrote:
> > >>> I'm down in the bio lab at the moment looking at this.
> > >>>
> > >>> I understand what needs to be done in regards to 1) keeping all ORF
> id's and then 2) querying the id's with underscores to then remove the
> underscores but maintaining the original underscore id's.
> > >>>
> > >>> I performed an export with the pattern match as-is and commented out
> the exclude 'ordered locus' and 'ORF'. The export completed but with only
> 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML
> file as seen from the raw sql query.
> > >>>
> > >>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went
> ahead and added it into the pattern match string and called it as the others
> are called in the query. Once the export completes I'll confirm that in fact
> we have captured all the id's. Once confirmed, I will move onto the query to
> find id's with underscores and handle them as mentioned above.
> > >>>
> > >>> Richard
> > >>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...>
> wrote:
> > >>> Thanks for info. I will dig into this after my 10 am exam tomorrow.
> > >>>
> > >>> Richard
> > >>>
> > >>> Sent from my iPhone
> > >>>
> > >>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...>
> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> More information on the underscore issue:
> > >>>>
> > >>>> There is an ID with the pattern
> > >>>>
> > >>>> PFA_[0-9][0-9][0-9][0-9][wc]
> > >>>>
> > >>>> that needs to have the underscore removed so that reads instead
> > >>>>
> > >>>> PFA[0-9][0-9][0-9][0-9][wc]
> > >>>>
> > >>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they
> are there without the underscore and won't be recognized with it.  I think
> we should leave the underscore ones there, but also have a set without the
> underscore.  There are 134 records that have this issue.
> > >>>>
> > >>>> If Rich can make these two fixes (capturing the ORFs and dealing
> with the underscore), then I think we will be good to go with Plasmodium.
>  There may be code in the Vibrio or Helicobacter profiles to help with the
> underscores, but I'm not sure.
> > >>>>
> > >>>> Best,
> > >>>> Kam
> > >>>>
> > >>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote:
> > >>>>> Hi,
> > >>>>>
> > >>>>> I've taken a look at the list of IDs and did a quick comparison
> with both the older released gdb and also a list I downloaded from the Broad
> Institute Plasmodium database.  I think we can safely go with the query on
> the ORF tag for our export--all of those different ID forms are valid.
>  There are about 400 IDs that are different in the older released gdb than
> in the new query; I'm going to further investigate those.  I suspect that
> the difference is mainly due to a +/- underscore issue that we might need to
> solve.  However, we should go forward with capturing all the IDs from the
> ORF tag, I don't see a need to restrict to a particular pattern there.
> > >>>>>
> > >>>>> Best,
> > >>>>> Kam
> > >>>>>
> > >>>>> At 09:48 PM 3/14/2011, you wrote:
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>> So I went ahead and did raw sql queries of the Postgres data and
> turned up the following:
> > >>>>>>
> > >>>>>> select * from genenametype where type = 'ordered locus'
> > >>>>>> Returned zero gene ids
> > >>>>>>
> > >>>>>> select * from genenametype where type = 'ORF'
> > >>>>>> Returned 5345 gene ids
> > >>>>>> The type = 'ORF' query was exported into excel and posted to the
> biodb wiki on the Spring 2011 Plasmodium page.
> > >>>>>>
> > >>>>>> There are many many patterns in regards to gene ids, here the the
> prefixes from my cursory look:
> > >>>>>> MAL
> > >>>>>> PF##_
> > >>>>>> PFA
> > >>>>>> PFB
> > >>>>>> PFC
> > >>>>>> PFD
> > >>>>>> PFE
> > >>>>>> PFF
> > >>>>>> PFI
> > >>>>>> PFL
> > >>>>>>
> > >>>>>> Richard
> > >>>>>>
> > >>>>>>
> > >>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <
> kda...@lm...> wrote:
> > >>>>>> Hi,
> > >>>>>> I looked up an assortment of IDs in UniProt and I can confirm that
> it appears that the IDs are found in the ORF tag, not the OrderedLocus tag
> (except for the one that got captured in the export).
> > >>>>>> Best,
> > >>>>>> Kam
> > >>>>>> At 08:09 AM 3/14/2011, you wrote:
> > >>>>>>> Thanks Dondi,
> > >>>>>>>
> > >>>>>>> Will review this after our call today. I have been a little
> worried as the DEBUG export has been going for 2.5 days with progress at 65%
> and 6.5 Gb of log files so far... /yikes
> > >>>>>>>
> > >>>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be
> working from home afterwards instead of in the bio lab.
> > >>>>>>>
> > >>>>>>> Richard
> > >>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <
> do...@lm...> wrote:
> > >>>>>>> Thanks for the updates, Rich.
> > >>>>>>> I gave things a once-over and may have a lead.  Here is what I
> found:
> > >>>>>>> - First, the TallyEngine customization for P. falciparum states
> the following:
> > >>>>>>> # Plasmodium falciparum
> > >>>>>>> plasmodiumfalciparum_level_amount=2
> > >>>>>>>
> plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF
> > >>>>>>>
> plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene
> > >>>>>>> plasmodiumfalciparum_query_level0=select count(*) from
> genenametype where type = 'ORF';
> > >>>>>>> plasmodiumfalciparum_query_level1=select count(*) from
> genenametype where type = 'UniGene';
> > >>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus
> > >>>>>>> plasmodiumfalciparum_table_name_level1=UniGene
> > >>>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are
> the gene names whose type is 'ORF' ("level0" properties).
> > >>>>>>> - Now, this is what the P. falciparum species profile does when
> harvesting IDs
> (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations):
> > >>>>>>>        String sqlQuery = "select d.entrytype_gene_hjid as hjid,
> c.value " +
> > >>>>>>>            "from genenametype c inner join entrytype_genetype d "
> +
> > >>>>>>>            "on (c.entrytype_genetype_name_hjid = d.hjid) " +
> > >>>>>>>            "where (c.value similar to ? " +
> > >>>>>>>            "or c.value similar to ? " +
> > >>>>>>>            "or c.value similar to ?) " +
> > >>>>>>>            "and type <> 'ordered locus names' " +
> > >>>>>>>            "and type <> 'ORF' " +
> > >>>>>>>            "group by d.entrytype_gene_hjid, c.value";
> > >>>>>>> Note the condition on the second-to-last line --- the query
> actually *omits* gene names whose type is 'ORF'!  So the question is...which
> is right?  (I'm inclined to believe the Tally Engine here, since, the export
> puts only one record in OrderedLocusNames)
> > >>>>>>> Still, comparing these two queries directly against the
> PostgreSQL database would be educational, I think.  Then, knowing which
> criteria are correct, the appropriate action can then be taken, I think.
> > >>>>>>> Hope this helps...
> > >>>>>>> John David N. Dionisio, PhD
> > >>>>>>> Associate Professor, Computer Science
> > >>>>>>> Loyola Marymount University
> > >>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote:
> > >>>>>>> > Debug export is still going... 2.5GB of log files so far with
> progress at 65%...
> > >>>>>>> >
> > >>>>>>> > I posted the link of the WARN log on the plasmodium page here:
> https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum .
> > >>>>>>> > Richard
> > >>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <
> rbr...@gm...> wrote:
> > >>>>>>> > Hi all,
> > >>>>>>> >
> > >>>>>>> > Have been working through several Plasmodium gdb exports in an
> attempt to source why only one gene id makes it into the Ordered Locus
> table.
> > >>>>>>> >
> > >>>>>>> > I have reviewed the logger file while set to "WARN" and wasn't
> able to determine anything which would suggest an error. I will post this
> log file to the wiki later today when I get home.
> > >>>>>>> >
> > >>>>>>> > I then upped the logger verbosity to "DEBUG" and file size to
> 100MB with hopes that more detail will surface the issue, but my export is
> on hour 20 and still going (although its nearly complete). What I didn't
> expect was the size of the log files and that it seems only the last 3 are
> kept with earlier logs being overwritten =( I fear that the information I
> need it in one of the earlier files which are now lost.
> > >>>>>>> >
> > >>>>>>> > Unless a better suggestion is offered I'm going to rerun an
> export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each
> and hope that 3 GB total will be enough to hold the complete export log.
> > >>>>>>> >
> > >>>>>>> > More info as it comes...
> > >>>>>>> >
> > >>>>>>> > Richard
> > >>>>>>> >
> > >>>>>>> >
> > >>>>>>> >
> > >>>>>>> >
> > >>>>>>> >
> > >>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <
> kda...@lm...> wrote:
> > >>>>>>> > Hi,
> > >>>>>>> >
> > >>>>>>> > I've completed testing the Plasmodium gdb I exported last
> November and updated the SourceForge wiki.
> > >>>>>>> >
> > >>>>>>> > Plasmodium has it's own task list page, which I've updated
> here:
> https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List
> > >>>>>>> >
> > >>>>>>> > The testing report can be found here:
> https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115
> > >>>>>>> >
> > >>>>>>> > The source files and gdb are on a new Plasmodium falciparum
> page on the Fall 2010 BiolDB wiki:
> https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum
> > >>>>>>> >
> > >>>>>>> > Here is the list of bugs/action items that I've listed:
> > >>>>>>> >
> > >>>>>>> > 1.  The OrderedLocusNames table in the gdb only has 1 ID out of
> 5345 repored by the TallyEngine. This also affects all other tables related
> to OrderedLocusNames.
> > >>>>>>> >
> > >>>>>>> > 2.  The GeneId table in the database has 6 fewer IDs than
> reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium
> tuberculosis also have mysterious GeneId issues with the TallyEngine).
> > >>>>>>> >
> > >>>>>>> > 3.  The count for EMBL IDs in the gdb also seems low, it's
> lower than the 2009 version of the gdb. There's no way to tell at this point
> whether this is due to a change in annotation by UniProt or is a bug with
> GenMAPP Builder.
> > >>>>>>> >
> > >>>>>>> > Thanks,
> > >>>>>>> > Kam
> > >>>>>>> >
> > >>>>>>> >
> > >>>>>>> >
> ------------------------------------------------------------------------------
> > >>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You
> > >>>>>>> > This paper provides an overview of data connectivity, details
> > >>>>>>> > its effect on application quality, and explores various
> alternative
> > >>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d
> > >>>>>>> > _______________________________________________
> > >>>>>>> > xmlpipedb-developer mailing list
> > >>>>>>> > xml...@li...
> > >>>>>>> >
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> > >>>>>>> >
> > >>>>>>> >
> > >>>>>>> >
> > >>>>>>> > <ATT00001..txt><ATT00002..txt>
> > >>>>>>>
> ------------------------------------------------------------------------------
> > >>>>>>> Colocation vs. Managed Hosting
> > >>>>>>> A question and answer guide to determining the best fit
> > >>>>>>> for your organization - today and in the future.
> > >>>>>>> http://p.sf.net/sfu/internap-sfd2d
> > >>>>>>> _______________________________________________
> > >>>>>>> xmlpipedb-developer mailing list
> > >>>>>>> xml...@li...
> > >>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> > >>
> ------------------------------------------------------------------------------
> > >> Colocation vs. Managed Hosting
> > >> A question and answer guide to determining the best fit
> > >> for your organization - today and in the future.
> > >> http://p.sf.net/sfu/internap-sfd2d
> > >> _______________________________________________
> > >> xmlpipedb-developer mailing list
> > >> xml...@li...
> > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> > >>
> > >>
> ------------------------------------------------------------------------------
> > >> Colocation vs. Managed Hosting
> > >> A question and answer guide to determining the best fit
> > >> for your organization - today and in the future.
> > >> http://p.sf.net/sfu/internap-sfd2d
> > >> _______________________________________________
> > >> xmlpipedb-developer mailing list
> > >> xml...@li...
> > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> > >>
> > >>
> > >>
> > >>
> ------------------------------------------------------------------------------
> > >> Colocation vs. Managed Hosting
> > >> A question and answer guide to determining the best fit
> > >> for your organization - today and in the future.
> > >> http://p.sf.net/sfu/internap-sfd2d
> > >> _______________________________________________
> > >> xmlpipedb-developer mailing list
> > >> xml...@li...
> > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> > >>
> > >>
> > > <ATT00001..txt><ATT00002..txt>
> >
> >
> >
> ------------------------------------------------------------------------------
> > Colocation vs. Managed Hosting
> > A question and answer guide to determining the best fit
> > for your organization - today and in the future.
> > http://p.sf.net/sfu/internap-sfd2d
> > _______________________________________________
> > xmlpipedb-developer mailing list
> > xml...@li...
> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >
> > <ATT00001..txt><ATT00002..txt>
>
>
>
> ------------------------------------------------------------------------------
> Colocation vs. Managed Hosting
> A question and answer guide to determining the best fit
> for your organization - today and in the future.
> http://p.sf.net/sfu/internap-sfd2d
> _______________________________________________
> xmlpipedb-developer mailing list
> xml...@li...
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>