Re: [XMLPipeDB-developer] Plasmodium bug/task list

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Yes I understand that you only want the 'ORF' but was trying to get what we
need by modifying what we have and not rewite the whole query.

I may in fact have to rewrite the whole thing anyway as my query didn't
return what was expected =/

Also I'm glad you mentioned a possible issue with MAL pattern, I'll keep an
eye out for missing id's here.

Richard

On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> wrote:

> Hi,
>
> I think we need to capture all of the IDs in the ORF tag and *NOT* do the
> pattern match at all.  As far as I can tell with my analysis of the IDs in
> the query you posted, we need to keep them all, so we don't actually need to
> specify the patterns at this point.  I would rather do that thinking towards
> the future when the Plasmodium people might add new patterns to the ID
> system.
>
> I believe that the
>
> "MAL[0-9]*P1.[0-9]*"
>
> pattern is also not pulling out everything it needs to.  But instead
> of including more patterns, I would rather just loosen up the criteria to
> include all things in the ORF tag.
>
> Also, I just want to be clear about the underscore issue.  That only
> affects IDs that begin with PFA, not the other IDs that begin with PF##_
>
> Thanks,
> Kam
>
>
>
>
>  At 12:47 PM 3/18/2011, Richard Brous wrote:
>
> I'm down in the bio lab at the moment looking at this.
>
> I understand what needs to be done in regards to 1) keeping all ORF id's
> and then 2) querying the id's with underscores to then remove the
> underscores but maintaining the original underscore id's.
>
> I performed an export with the pattern match as-is and commented out the
> exclude 'ordered locus' and 'ORF'. The export completed but with only 5110
> gene id's. So it seems we are missing 235 gene id's that are in the XML file
> as seen from the raw sql query.
>
> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went ahead
> and added it into the pattern match string and called it as the others are
> called in the query. Once the export completes I'll confirm that in fact we
> have captured all the id's. Once confirmed, I will move onto the query to
> find id's with underscores and handle them as mentioned above.
>
> Richard
> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> wrote:
>  Thanks for info. I will dig into this after my 10 am exam tomorrow.
>
> Richard
>
> Sent from my iPhone
>
> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote:
>
>  Hi,
>
> More information on the underscore issue:
>
> There is an ID with the pattern
>
> PFA_[0-9][0-9][0-9][0-9][wc]
>
> that needs to have the underscore removed so that reads instead
>
> PFA[0-9][0-9][0-9][0-9][wc]
>
> I don't know why these IDs exist in UniProt, but in PlasmoDB, they are
> there without the underscore and won't be recognized with it.  I think we
> should leave the underscore ones there, but also have a set without the
> underscore.  There are 134 records that have this issue.
>
> If Rich can make these two fixes (capturing the ORFs and dealing with the
> underscore), then I think we will be good to go with Plasmodium.  There may
> be code in the Vibrio or Helicobacter profiles to help with the underscores,
> but I'm not sure.
>
> Best,
> Kam
>
> At 02:59 PM 3/16/2011, Kam Dahlquist wrote:
>
> Hi,
>
> I've taken a look at the list of IDs and did a quick comparison with both
> the older released gdb and also a list I downloaded from the Broad Institute
> Plasmodium database.  I think we can safely go with the query on the ORF tag
> for our export--all of those different ID forms are valid.  There are about
> 400 IDs that are different in the older released gdb than in the new query;
> I'm going to further investigate those.  I suspect that the difference is
> mainly due to a +/- underscore issue that we might need to solve.  However,
> we should go forward with capturing all the IDs from the ORF tag, I don't
> see a need to restrict to a particular pattern there.
>
> Best,
> Kam
>
> At 09:48 PM 3/14/2011, you wrote:
>
> Hi all,
>
> So I went ahead and did raw sql queries of the Postgres data and turned up
> the following:
>
> select * from genenametype where type = 'ordered locus'
> Returned zero gene ids
>
> select * from genenametype where type = 'ORF'
> Returned 5345 gene ids
> The type = 'ORF' query was exported into excel and posted to the biodb wiki
> on the Spring 2011 Plasmodium page.
>
> There are many many patterns in regards to gene ids, here the the prefixes
> from my cursory look:
> MAL
> PF##_
> PFA
> PFB
> PFC
> PFD
> PFE
> PFF
> PFI
> PFL
>
> Richard
>
>
> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...>
> wrote: Hi, I looked up an assortment of IDs in UniProt and I can confirm
> that it appears that the IDs are found in the ORF tag, not the OrderedLocus
> tag (except for the one that got captured in the export). Best, Kam
> At 08:09 AM 3/14/2011, you wrote:
>
> Thanks Dondi,   Will review this after our call today. I have been a
> little worried as the DEBUG export has been going for 2.5 days with progress
> at 65% and 6.5 Gb of log files so far... /yikes   Btw I have a work lunch
> meeting in Beverly Hills today so will be working from home afterwards
> instead of in the bio lab.   Richard On Sun, Mar 13, 2011 at 9:55 PM, John
> David N. Dionisio <do...@lm...> wrote: Thanks for the updates, Rich. I
> gave things a once-over and may have a lead.  Here is what I found: -
> First, the TallyEngine customization for P. falciparum states the following:
> # Plasmodium falciparum plasmodiumfalciparum_level_amount=2 plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF
> plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene plasmodiumfalciparum_query_level0=select
> count(*) from genenametype where type = 'ORF'; plasmodiumfalciparum_query_level1=select
> count(*) from genenametype where type = 'UniGene'; plasmodiumfalciparum_table_name_level0=Ordered
> Locus plasmodiumfalciparum_table_name_level1=UniGene Thus, what is being
> counted by TallyEngine as "Ordered Locus" are the gene names whose type is
> 'ORF' ("level0" properties). - Now, this is what the P. falciparum species
> profile does when harvesting IDs
> (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations):
>        String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " +
>            "from genenametype c inner join entrytype_genetype d " +
> "on (c.entrytype_genetype_name_hjid = d.hjid) " +            "where
> (c.value similar to ? " +            "or c.value similar to ? " +
> "or c.value similar to ?) " +            "and type <> 'ordered locus
> names' " +            "and type <> 'ORF' " +            "group by
> d.entrytype_gene_hjid, c.value"; Note the condition on the second-to-last
> line --- the query actually *omits* gene names whose type is 'ORF'!  So the
> question is...which is right?  (I'm inclined to believe the Tally Engine
> here, since, the export puts only one record in OrderedLocusNames) Still,
> comparing these two queries directly against the PostgreSQL database would
> be educational, I think.  Then, knowing which criteria are correct, the
> appropriate action can then be taken, I think. Hope this helps... John
> David N. Dionisio, PhD Associate Professor, Computer Science Loyola
> Marymount University
> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > Debug export is still
> going... 2.5GB of log files so far with progress at 65%... > > I posted
> the link of the WARN log on the plasmodium page here:
> https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . >
> Richard > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <
> rbr...@gm...> wrote: > Hi all, > > Have been working through several
> Plasmodium gdb exports in an attempt to source why only one gene id makes it
> into the Ordered Locus table. > > I have reviewed the logger file while
> set to "WARN" and wasn't able to determine anything which would suggest an
> error. I will post this log file to the wiki later today when I get home. >
> > I then upped the logger verbosity to "DEBUG" and file size to 100MB with
> hopes that more detail will surface the issue, but my export is on hour 20
> and still going (although its nearly complete). What I didn't expect was the
> size of the log files and that it seems only the last 3 are kept with
> earlier logs being overwritten =( I fear that the information I need it in
> one of the earlier files which are now lost. > > Unless a better
> suggestion is offered I'm going to rerun an export again with 'DEBUG"
> verbosity and up the file sizes to near 1 GB each and hope that 3 GB total
> will be enough to hold the complete export log. > > More info as it
> comes... > > Richard > > > > > > On Fri, Mar 4, 2011 at 3:17 PM, Kam
> Dahlquist <kda...@lm...> wrote: > Hi, > > I've completed testing the
> Plasmodium gdb I exported last November and updated the SourceForge wiki. >
> > Plasmodium has it's own task list page, which I've updated here:
> https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List >
> > The testing report can be found here:
> https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 >
> > The source files and gdb are on a new Plasmodium falciparum page on the
> Fall 2010 BiolDB wiki:
> https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > >
> Here is the list of bugs/action items that I've listed: > > 1.  The
> OrderedLocusNames table in the gdb only has 1 ID out of 5345 repored by the
> TallyEngine. This also affects all other tables related to
> OrderedLocusNames. > > 2.  The GeneId table in the database has 6 fewer
> IDs than reported by the TallyEngine (Mycobacterium smegmatis and
> Mycobacterium tuberculosis also have mysterious GeneId issues with the
> TallyEngine). > > 3.  The count for EMBL IDs in the gdb also seems low,
> it's lower than the 2009 version of the gdb. There's no way to tell at this
> point whether this is due to a change in annotation by UniProt or is a bug
> with GenMAPP Builder. > > Thanks, > Kam > > >
> ------------------------------------------------------------------------------
> > What You Don't Know About Data Connectivity CAN Hurt You > This paper
> provides an overview of data connectivity, details > its effect on
> application quality, and explores various alternative > solutions.
> http://p.sf.net/sfu/progress-d2d >
> _______________________________________________ > xmlpipedb-developer
> mailing list > xml...@li... >
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > >
> <ATT00001..txt><ATT00002..txt> ------------------------------------------------------------------------------
> Colocation vs. Managed Hosting A question and answer guide to determining
> the best fit for your organization - today and in the future.
> http://p.sf.net/sfu/internap-sfd2d _______________________________________________
> xmlpipedb-developer mailing list xml...@li...
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>
>    ------------------------------------------------------------------------------
>
> Colocation vs. Managed Hosting
> A question and answer guide to determining the best fit
> for your organization - today and in the future.
> http://p.sf.net/sfu/internap-sfd2d
> _______________________________________________
> xmlpipedb-developer mailing list
> xml...@li...
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>
>
> ------------------------------------------------------------------------------
> Colocation vs. Managed Hosting
> A question and answer guide to determining the best fit
> for your organization - today and in the future.
> http://p.sf.net/sfu/internap-sfd2d
> _______________________________________________
> xmlpipedb-developer mailing list
> xml...@li...
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>
>
>
>
> ------------------------------------------------------------------------------
> Colocation vs. Managed Hosting
> A question and answer guide to determining the best fit
> for your organization - today and in the future.
> http://p.sf.net/sfu/internap-sfd2d
> _______________________________________________
> xmlpipedb-developer mailing list
> xml...@li...
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>
>