Re: [XMLPipeDB-developer] Plasmodium bug/task list

Hi,

I think we need to capture all of the IDs in the ORF tag and *NOT* do 
the pattern match at all.  As far as I can tell from my analysis of 
the IDs in the query you posted, we need to keep them all, so we 
don't actually need to specify the patterns at this point.  I would 
rather do that with an eye toward the future, when the Plasmodium 
people might add new patterns to the ID system.

I believe that the

"MAL[0-9]*P1.[0-9]*"

pattern is also not pulling out everything it needs to.  But instead 
of including more patterns, I would rather just loosen up the 
criteria to include all things in the ORF tag.
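
A minimal sketch of what the loosened harvesting query could look like. It assumes the table and column names from the query Dondi quoted earlier in this thread (genenametype, entrytype_genetype, entrytype_gene_hjid); the class and method names are hypothetical, and this is not the actual species-profile code:

```java
// Hypothetical helper; only the table and column names come from the thread.
final class OrfHarvestQuerySketch {

    // Keeps every gene name whose type is 'ORF', with no "similar to" filter,
    // so new ID forms would not require new patterns.
    static String allOrfIdsQuery() {
        return "select d.entrytype_gene_hjid as hjid, c.value "
            + "from genenametype c inner join entrytype_genetype d "
            + "on (c.entrytype_genetype_name_hjid = d.hjid) "
            + "where c.type = 'ORF' "
            + "group by d.entrytype_gene_hjid, c.value";
    }

    public static void main(String[] args) {
        // Print the query so it can be pasted into psql for a manual check.
        System.out.println(allOrfIdsQuery());
    }
}
```

Since no patterns are bound as parameters, the trade-off is that anything UniProt files under the ORF type comes through unfiltered.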

Also, I just want to be clear about the underscore issue: it only 
affects IDs that begin with PFA, not the other IDs that begin with PF##_.
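
To illustrate the underscore handling, here is a rough sketch that keeps every original ID and adds an underscore-free copy only for IDs of the form PFA_[0-9][0-9][0-9][0-9][wc]. The class and method names are hypothetical, the sample IDs in main are made up to fit the patterns, and how this would plug into the species profile is left open:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of GenMAPP Builder.
final class PfaUnderscoreSketch {

    // Matches only PFA_ followed by four digits and a trailing w or c.
    private static final Pattern PFA_WITH_UNDERSCORE =
        Pattern.compile("PFA_([0-9]{4}[wc])");

    // Returns the underscore-free form, or null if the ID is not a PFA_ ID.
    static String withoutUnderscore(String id) {
        Matcher m = PFA_WITH_UNDERSCORE.matcher(id);
        return m.matches() ? "PFA" + m.group(1) : null;
    }

    // Keeps all original IDs (in order) and appends the underscore-free
    // variants; IDs such as PF##_... are left untouched.
    static List<String> expand(List<String> ids) {
        Set<String> result = new LinkedHashSet<>(ids);
        for (String id : ids) {
            String stripped = withoutUnderscore(id);
            if (stripped != null) {
                result.add(stripped);
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        // Made-up sample IDs for illustration only.
        System.out.println(expand(Arrays.asList("PFA_0005w", "PF10_0123")));
    }
}
```

Using a set means an underscore-free ID that is already present is not added a second time.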

Thanks,
Kam

At 12:47 PM 3/18/2011, Richard Brous wrote:
>I'm down in the bio lab at the moment looking at this.
>
>I understand what needs to be done: 1) keep all of the ORF IDs, and 
>2) query for the IDs with underscores and add underscore-free copies 
>while keeping the original underscore IDs.
>
>I performed an export with the pattern match as-is and commented out 
>the conditions that exclude 'ordered locus' and 'ORF'. The export 
>completed, but with only 5110 gene IDs. So it seems we are missing 
>235 gene IDs that are in the XML file, as seen from the raw SQL query.
>
>Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I 
>went ahead and added it into the pattern match string and called it 
>as the others are called in the query. Once the export completes 
>I'll confirm that in fact we have captured all the id's. Once 
>confirmed, I will move onto the query to find id's with underscores 
>and handle them as mentioned above.
>
>Richard
>On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> wrote:
>Thanks for info. I will dig into this after my 10 am exam tomorrow.
>
>Richard
>
>Sent from my iPhone
>
>On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote:
>
>>Hi,
>>
>>More information on the underscore issue:
>>
>>There is an ID with the pattern
>>
>>PFA_[0-9][0-9][0-9][0-9][wc]
>>
>>that needs to have the underscore removed so that it reads instead
>>
>>PFA[0-9][0-9][0-9][0-9][wc]
>>
>>I don't know why these IDs exist in UniProt, but in PlasmoDB, they 
>>are there without the underscore and won't be recognized with 
>>it.  I think we should leave the underscore ones there, but also 
>>have a set without the underscore.  There are 134 records that have this issue.
>>
>>If Rich can make these two fixes (capturing the ORFs and dealing 
>>with the underscore), then I think we will be good to go with 
>>Plasmodium.  There may be code in the Vibrio or Helicobacter 
>>profiles to help with the underscores, but I'm not sure.
>>
>>Best,
>>Kam
>>
>>At 02:59 PM 3/16/2011, Kam Dahlquist wrote:
>>>Hi,
>>>
>>>I've taken a look at the list of IDs and did a quick comparison 
>>>with both the older released gdb and also a list I downloaded from 
>>>the Broad Institute Plasmodium database.  I think we can safely go 
>>>with the query on the ORF tag for our export--all of those 
>>>different ID forms are valid.  There are about 400 IDs that are 
>>>different in the older released gdb than in the new query; I'm 
>>>going to further investigate those.  I suspect that the difference 
>>>is mainly due to a +/- underscore issue that we might need to 
>>>solve.  However, we should go forward with capturing all the IDs 
>>>from the ORF tag; I don't see a need to restrict to a particular pattern there.
>>>
>>>Best,
>>>Kam
>>>
>>>At 09:48 PM 3/14/2011, you wrote:
>>>>Hi all,
>>>>
>>>>So I went ahead and did raw sql queries of the Postgres data and 
>>>>turned up the following:
>>>>
>>>>select * from genenametype where type = 'ordered locus'
>>>>Returned zero gene ids
>>>>
>>>>select * from genenametype where type = 'ORF'
>>>>Returned 5345 gene ids
>>>>The type = 'ORF' query was exported into excel and posted to the 
>>>>biodb wiki on the Spring 2011 Plasmodium page.
>>>>
>>>>There are many, many patterns in the gene IDs; here are the 
>>>>prefixes from my cursory look:
>>>>MAL
>>>>PF##_
>>>>PFA
>>>>PFB
>>>>PFC
>>>>PFD
>>>>PFE
>>>>PFF
>>>>PFI
>>>>PFL
>>>>
>>>>Richard
>>>>
>>>>
>>>>On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> wrote:
>>>>Hi,
>>>>I looked up an assortment of IDs in UniProt and I can confirm 
>>>>that it appears that the IDs are found in the ORF tag, not the 
>>>>OrderedLocus tag (except for the one that got captured in the export).
>>>>Best,
>>>>Kam
>>>>At 08:09 AM 3/14/2011, you wrote:
>>>>>Thanks Dondi,
>>>>>
>>>>>Will review this after our call today. I have been a little 
>>>>>worried as the DEBUG export has been going for 2.5 days with 
>>>>>progress at 65% and 6.5 Gb of log files so far... /yikes
>>>>>
>>>>>Btw I have a work lunch meeting in Beverly Hills today so will 
>>>>>be working from home afterwards instead of in the bio lab.
>>>>>
>>>>>Richard
>>>>>On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...> wrote:
>>>>>Thanks for the updates, Rich.
>>>>>I gave things a once-over and may have a lead.  Here is what I found:
>>>>>- First, the TallyEngine customization for P. falciparum states 
>>>>>the following:
>>>>># Plasmodium falciparum
>>>>>plasmodiumfalciparum_level_amount=2
>>>>>plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF
>>>>>plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene
>>>>>plasmodiumfalciparum_query_level0=select count(*) from genenametype where type = 'ORF';
>>>>>plasmodiumfalciparum_query_level1=select count(*) from genenametype where type = 'UniGene';
>>>>>plasmodiumfalciparum_table_name_level0=Ordered Locus
>>>>>plasmodiumfalciparum_table_name_level1=UniGene
>>>>>Thus, what is being counted by TallyEngine as "Ordered Locus" 
>>>>>are the gene names whose type is 'ORF' ("level0" properties).
>>>>>- Now, this is what the P. falciparum species profile does when 
>>>>>harvesting IDs 
>>>>>(PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): 
>>>>>
>>>>>        String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " +
>>>>>            "from genenametype c inner join entrytype_genetype d " +
>>>>>            "on (c.entrytype_genetype_name_hjid = d.hjid) " +
>>>>>            "where (c.value similar to ? " +
>>>>>            "or c.value similar to ? " +
>>>>>            "or c.value similar to ?) " +
>>>>>            "and type <> 'ordered locus names' " +
>>>>>            "and type <> 'ORF' " +
>>>>>            "group by d.entrytype_gene_hjid, c.value";
>>>>>Note the condition on the second-to-last line --- the query 
>>>>>actually *omits* gene names whose type is 'ORF'!  So the 
>>>>>question is...which is right?  (I'm inclined to believe the 
>>>>>TallyEngine here, since the export puts only one record in 
>>>>>OrderedLocusNames.)
>>>>>Still, comparing these two queries directly against the 
>>>>>PostgreSQL database would be educational, I think.  Then, 
>>>>>knowing which criteria are correct, we can take the 
>>>>>appropriate action.
>>>>>Hope this helps...
>>>>>John David N. Dionisio, PhD
>>>>>Associate Professor, Computer Science
>>>>>Loyola Marymount University
>>>>>On Mar 12, 2011, at 9:48 AM, Richard Brous wrote:
>>>>> > Debug export is still going... 2.5GB of log files so far with 
>>>>> > progress at 65%...
>>>>> >
>>>>> > I posted the link of the WARN log on the plasmodium page here: 
>>>>> > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum
>>>>> > Richard
>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > Have been working through several Plasmodium gdb exports in 
>>>>> > an attempt to find out why only one gene ID makes it into the 
>>>>> > Ordered Locus table.
>>>>> >
>>>>> > I have reviewed the logger file while set to "WARN" and 
>>>>> > wasn't able to find anything that would suggest an error. 
>>>>> > I will post this log file to the wiki later today when I get home.
>>>>> >
>>>>> > I then upped the logger verbosity to "DEBUG" and the file size to 
>>>>> > 100MB in the hope that more detail would surface the issue, but 
>>>>> > my export is on hour 20 and still going (although it's nearly 
>>>>> > complete). What I didn't expect was the size of the log files, 
>>>>> > and it seems only the last 3 are kept, with earlier logs 
>>>>> > being overwritten =( I fear that the information I need is in 
>>>>> > one of the earlier files, which are now lost.
>>>>> >
>>>>> > Unless a better suggestion is offered, I'm going to rerun the 
>>>>> > export with "DEBUG" verbosity and up the file sizes to 
>>>>> > near 1 GB each and hope that 3 GB total will be enough to hold 
>>>>> > the complete export log.
>>>>> >
>>>>> > More info as it comes...
>>>>> >
>>>>> > Richard
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I've completed testing the Plasmodium gdb I exported last 
>>>>> > November and updated the SourceForge wiki.
>>>>> >
>>>>> > Plasmodium has its own task list page, which I've updated here: 
>>>>> > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List
>>>>> >
>>>>> > The testing report can be found here: 
>>>>> > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115
>>>>> >
>>>>> > The source files and gdb are on a new Plasmodium falciparum 
>>>>> > page on the Fall 2010 BioDB wiki: 
>>>>> > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum
>>>>> >
>>>>> > Here is the list of bugs/action items that I've identified:
>>>>> >
>>>>> > 1.  The OrderedLocusNames table in the gdb only has 1 ID out 
>>>>> > of 5345 reported by the TallyEngine. This also affects all other 
>>>>> > tables related to OrderedLocusNames.
>>>>> >
>>>>> > 2.  The GeneId table in the database has 6 fewer IDs than 
>>>>> > reported by the TallyEngine (Mycobacterium smegmatis and 
>>>>> > Mycobacterium tuberculosis also have mysterious GeneId issues 
>>>>> > with the TallyEngine).
>>>>> >
>>>>> > 3.  The count for EMBL IDs in the gdb also seems low; it's 
>>>>> > lower than in the 2009 version of the gdb. There's no way to tell 
>>>>> > at this point whether this is due to a change in annotation by 
>>>>> > UniProt or is a bug in GenMAPP Builder.
>>>>> >
>>>>> > Thanks,
>>>>> > Kam
>>>>> >
>>>>> >
>>>>> > _______________________________________________
>>>>> > xmlpipedb-developer mailing list
>>>>> > xml...@li...
>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>
>
>