Re: [XMLPipeDB-developer] Plasmodium bug/task list

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi,

That's OK with me.

Best,
Kam

At 09:22 PM 3/20/2011, you wrote:
>I vote that, since the duplication lies at the raw data, we can 
>leave things as is, and just state as much in the README.
>
>For the underscores, you can look at the V. cholerae species profile 
>to get an idea for how to do it, since that species profile has to 
>do the same thing with its IDs.
>
>John David N. Dionisio, PhD
>Associate Professor, Computer Science
>Loyola Marymount University
>
>
>
>On Mar 20, 2011, at 9:19 PM, Richard Brous wrote:
>
> > I did another export after re-enabling the group by and the 
> results are the same at 5338 gene id's. At least we know either way now.
> >
> > Moving forward are we just going to leave the duplicate id 
> situation as is or Dondi can you think of an option here?
> >
> > Then the last item will be to keep the original underscore id's 
> but also remove the underscores and add to the 5338.
> >
> > Richard
> >
> > On Sun, Mar 20, 2011 at 6:51 PM, Kam Dahlquist <kda...@lm...> wrote:
> > Hi,
> >
> > I looked up those 7 IDs in UniProt, and they each are found in 
> 2-3 separate UniProt entries, which would explain why they had separate hjids.
> >
> > Probably what GenMAPP Builder is doing is taking the first one 
> and relating it to its UniProt ID and then discarding the second 
> (or third) relationship.  I don't know that there's anything we can 
> do about this since it is a problem with the raw data.
> >
> > Best,
> > Dr. D
> >
> >
> >
> > At 12:15 PM 3/20/2011, Richard Brous wrote:
> >> Having weird probs with gmail in browser where the reply button 
> isn't working so have to send from my phone... Odd
> >>
> >> So I here is my analysis on the ORF only export gdb:
> >>
> >> Initially the raw SQL query returned 5345 records of which I 
> posted an excel doc link to the plasmodium page on the biodb wiki.
> >>
> >> The gdb contained only 5338 genes id's so I started looking for 
> duplicates or exclusions.
> >>
> >> What I found were 7 id's that were duplicated in the raw SQL 
> query of postgres but were not exported into the gdb.
> >>
> >> The id's are:
> >> PF10_0168
> >> PF11_0361 (duplicated 2x for total of 3 entries)
> >> PF11_0377
> >> PF11_0405
> >> PFB0305c
> >> PFB0391c
> >>
> >> Surprisingly (to me anyway) I noticed that the duplicates all 
> had unique hjid's, which may or may not mean anything...
> >>
> >> In conclusion, it seems to me that the 5338 id's in the gdb are 
> likely correct.
> >>
> >> Dr. D, does that make sense?
> >>
> >> Richard
> >>
> >> Sent from my iPhone
> >>
> >> On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote:
> >>
> >>> Yes I understand that you only want the 'ORF' but was trying to 
> get what we need by modifying what we have and not rewite the whole query.
> >>>
> >>> I may in fact have to rewrite the whole thing anyway as my 
> query didn't return what was expected =/
> >>>
> >>> Also I'm glad you mentioned a possible issue with MAL pattern, 
> I'll keep an eye out for missing id's here.
> >>>
> >>> Richard
> >>>
> >>> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist 
> <kda...@lm...> wrote:
> >>> Hi,
> >>>
> >>> I think we need to capture all of the IDs in the ORF tag and 
> *NOT* do the pattern match at all.  As far as I can tell with my 
> analysis of the IDs in the query you posted, we need to keep them 
> all, so we don't actually need to specify the patterns at this 
> point.  I would rather do that thinking towards the future when the 
> Plasmodium people might add new patterns to the ID system.
> >>>
> >>> I believe that the
> >>>
> >>>
> >>> "MAL[0-9]*P1.[0-9]*"
> >>>
> >>>
> >>>
> >>> pattern is also not pulling out everything it needs to.  But
> >>> instead
> >>>
> >>>
> >>> of including more patterns, I would rather just loosen up the
> >>> criteria to
> >>>
> >>>
> >>> include all things in the ORF tag.
> >>>
> >>>
> >>>
> >>> Also, I just want to be clear about the underscore issue.  That
> >>> only
> >>>
> >>>
> >>> affects IDs that begin with PFA, not the other IDs that begin with
> >>> PF##_
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>>
> >>> Kam
> >>>
> >>>
> >>>
> >>>
> >>> At 12:47 PM 3/18/2011, Richard Brous wrote:
> >>>> I'm down in the bio lab at the moment looking at this.
> >>>>
> >>>> I understand what needs to be done in regards to 1) keeping 
> all ORF id's and then 2) querying the id's with underscores to then 
> remove the underscores but maintaining the original underscore id's.
> >>>>
> >>>> I performed an export with the pattern match as-is and 
> commented out the exclude 'ordered locus' and 'ORF'. The export 
> completed but with only 5110 gene id's. So it seems we are missing 
> 235 gene id's that are in the XML file as seen from the raw sql query.
> >>>>
> >>>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] 
> I went ahead and added it into the pattern match string and called 
> it as the others are called in the query. Once the export completes 
> I'll confirm that in fact we have captured all the id's. Once 
> confirmed, I will move onto the query to find id's with underscores 
> and handle them as mentioned above.
> >>>>
> >>>> Richard
> >>>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous 
> <rbr...@gm...> wrote:
> >>>> Thanks for info. I will dig into this after my 10 am exam tomorrow.
> >>>> Richard
> >>>> Sent from my iPhone
> >>>>
> >>>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote:
> >>>>
> >>>>> Hi,
> >>>>> More information on the underscore issue:
> >>>>> There is an ID with the pattern
> >>>>> PFA_[0-9][0-9][0-9][0-9][wc]
> >>>>> that needs to have the underscore removed so that reads instead
> >>>>> PFA[0-9][0-9][0-9][0-9][wc]
> >>>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, 
> they are there without the underscore and won't be recognized with 
> it.  I think we should leave the underscore ones there, but also 
> have a set without the underscore.  There are 134 records that have this issue.
> >>>>> If Rich can make these two fixes (capturing the ORFs and 
> dealing with the underscore), then I think we will be good to go 
> with Plasmodium.  There may be code in the Vibrio or Helicobacter 
> profiles to help with the underscores, but I'm not sure.
> >>>>> Best,
> >>>>> Kam
> >>>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote:
> >>>>>> Hi,
> >>>>>> I've taken a look at the list of IDs and did a quick 
> comparison with both the older released gdb and also a list I 
> downloaded from the Broad Institute Plasmodium database.  I think 
> we can safely go with the query on the ORF tag for our export--all 
> of those different ID forms are valid.  There are about 400 IDs 
> that are different in the older released gdb than in the new query; 
> I'm going to further investigate those.  I suspect that the 
> difference is mainly due to a +/- underscore issue that we might 
> need to solve.  However, we should go forward with capturing all 
> the IDs from the ORF tag, I don't see a need to restrict to a 
> particular pattern there.
> >>>>>> Best,
> >>>>>> Kam
> >>>>>> At 09:48 PM 3/14/2011, you wrote:
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> So I went ahead and did raw sql queries of the Postgres 
> data and turned up the following:
> >>>>>>>
> >>>>>>> select * from genenametype where type = 'ordered locus'
> >>>>>>> Returned zero gene ids
> >>>>>>>
> >>>>>>> select * from genenametype where type = 'ORF'
> >>>>>>> Returned 5345 gene ids
> >>>>>>> The type = 'ORF' query was exported into excel and posted 
> to the biodb wiki on the Spring 2011 Plasmodium page.
> >>>>>>>
> >>>>>>> There are many many patterns in regards to gene ids, here 
> the the prefixes from my cursory look:
> >>>>>>> MAL
> >>>>>>> PF##_
> >>>>>>> PFA
> >>>>>>> PFB
> >>>>>>> PFC
> >>>>>>> PFD
> >>>>>>> PFE
> >>>>>>> PFF
> >>>>>>> PFI
> >>>>>>> PFL
> >>>>>>> Richard
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist 
> <kda...@lm...> wrote:
> >>>>>>> Hi,
> >>>>>>> I looked up an assortment of IDs in UniProt and I can 
> confirm that it appears that the IDs are found in the ORF tag, not 
> the OrderedLocus tag (except for the one that got captured in the export).
> >>>>>>> Best,
> >>>>>>> Kam
> >>>>>>> At 08:09 AM 3/14/2011, you wrote:
> >>>>>>>> Thanks Dondi,
> >>>>>>>>
> >>>>>>>> Will review this after our call today. I have been a 
> little worried as the DEBUG export has been going for 2.5 days with 
> progress at 65% and 6.5 Gb of log files so far... /yikes
> >>>>>>>>
> >>>>>>>> Btw I have a work lunch meeting in Beverly Hills today so 
> will be working from home afterwards instead of in the bio lab.
> >>>>>>>>
> >>>>>>>> Richard
> >>>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio 
> <do...@lm...> wrote:
> >>>>>>>> Thanks for the updates, Rich.
> >>>>>>>> I gave things a once-over and may have a lead.  Here is 
> what I found:
> >>>>>>>> - First, the TallyEngine customization for P. falciparum 
> states the following:
> >>>>>>>> # Plasmodium falciparum
> >>>>>>>> plasmodiumfalciparum_level_amount=2
> >>>>>>>> plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF
> >>>>>>>> 
> plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene
> >>>>>>>> plasmodiumfalciparum_query_level0=select count(*) from 
> genenametype where type = 'ORF';
> >>>>>>>> plasmodiumfalciparum_query_level1=select count(*) from 
> genenametype where type = 'UniGene';
> >>>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus
> >>>>>>>> plasmodiumfalciparum_table_name_level1=UniGene
> >>>>>>>> Thus, what is being counted by TallyEngine as "Ordered 
> Locus" are the gene names whose type is 'ORF' ("level0" properties).
> >>>>>>>> - Now, this is what the P. falciparum species profile does 
> when harvesting IDs 
> (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations):
> >>>>>>>>        String sqlQuery = "select d.entrytype_gene_hjid as 
> hjid, c.value " +
> >>>>>>>>            "from genenametype c inner join entrytype_genetype d " +
> >>>>>>>>            "on (c.entrytype_genetype_name_hjid = d.hjid) " +
> >>>>>>>>            "where (c.value similar to ? " +
> >>>>>>>>            "or c.value similar to ? " +
> >>>>>>>>            "or c.value similar to ?) " +
> >>>>>>>>            "and type <> 'ordered locus names' " +
> >>>>>>>>            "and type <> 'ORF' " +
> >>>>>>>>            "group by d.entrytype_gene_hjid, c.value";
> >>>>>>>> Note the condition on the second-to-last line --- the 
> query actually *omits* gene names whose type is 'ORF'!  So the 
> question is...which is right?  (I'm inclined to believe the Tally 
> Engine here, since, the export puts only one record in OrderedLocusNames)
> >>>>>>>> Still, comparing these two queries directly against the 
> PostgreSQL database would be educational, I think.  Then, knowing 
> which criteria are correct, the appropriate action can then be taken, I think.
> >>>>>>>> Hope this helps...
> >>>>>>>> John David N. Dionisio, PhD
> >>>>>>>> Associate Professor, Computer Science
> >>>>>>>> Loyola Marymount University
> >>>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote:
> >>>>>>>> > Debug export is still going... 2.5GB of log files so far 
> with progress at 65%...
> >>>>>>>> >
> >>>>>>>> > I posted the link of the WARN log on the plasmodium page 
> here: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum .
> >>>>>>>> > Richard
> >>>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous 
> <rbr...@gm...> wrote:
> >>>>>>>> > Hi all,
> >>>>>>>> >
> >>>>>>>> > Have been working through several Plasmodium gdb exports 
> in an attempt to source why only one gene id makes it into the 
> Ordered Locus table.
> >>>>>>>> >
> >>>>>>>> > I have reviewed the logger file while set to "WARN" and 
> wasn't able to determine anything which would suggest an error. I 
> will post this log file to the wiki later today when I get home.
> >>>>>>>> >
> >>>>>>>> > I then upped the logger verbosity to "DEBUG" and file 
> size to 100MB with hopes that more detail will surface the issue, 
> but my export is on hour 20 and still going (although its nearly 
> complete). What I didn't expect was the size of the log files and 
> that it seems only the last 3 are kept with earlier logs being 
> overwritten =( I fear that the information I need it in one of the 
> earlier files which are now lost.
> >>>>>>>> >
> >>>>>>>> > Unless a better suggestion is offered I'm going to rerun 
> an export again with 'DEBUG" verbosity and up the file sizes to 
> near 1 GB each and hope that 3 GB total will be enough to hold the 
> complete export log.
> >>>>>>>> >
> >>>>>>>> > More info as it comes...
> >>>>>>>> >
> >>>>>>>> > Richard
> >>>>>>>> >
> >>>>>>>> >
> >>>>>>>> >
> >>>>>>>> >
> >>>>>>>> >
> >>>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist 
> <kda...@lm...> wrote:
> >>>>>>>> > Hi,
> >>>>>>>> >
> >>>>>>>> > I've completed testing the Plasmodium gdb I exported 
> last November and updated the SourceForge wiki.
> >>>>>>>> >
> >>>>>>>> > Plasmodium has it's own task list page, which I've 
> updated 
> here: 
> https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List
> >>>>>>>> >
> >>>>>>>> > The testing report can be found 
> here: 
> https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115
> >>>>>>>> >
> >>>>>>>> > The source files and gdb are on a new Plasmodium 
> falciparum page on the Fall 2010 BiolDB 
> wiki:  https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum
> >>>>>>>> >
> >>>>>>>> > Here is the list of bugs/action items that I've listed:
> >>>>>>>> >
> >>>>>>>> > 1.  The OrderedLocusNames table in the gdb only has 1 ID 
> out of 5345 repored by the TallyEngine. This also affects all other 
> tables related to OrderedLocusNames.
> >>>>>>>> >
> >>>>>>>> > 2.  The GeneId table in the database has 6 fewer IDs 
> than reported by the TallyEngine (Mycobacterium smegmatis and 
> Mycobacterium tuberculosis also have mysterious GeneId issues with 
> the TallyEngine).
> >>>>>>>> >
> >>>>>>>> > 3.  The count for EMBL IDs in the gdb also seems low, 
> it's lower than the 2009 version of the gdb. There's no way to tell 
> at this point whether this is due to a change in annotation by 
> UniProt or is a bug with GenMAPP Builder.
> >>>>>>>> >
> >>>>>>>> > Thanks,
> >>>>>>>> > Kam
> >>>>>>>> >
> >>>>>>>> >
> >>>>>>>> > 
> ------------------------------------------------------------------------------
> >>>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You
> >>>>>>>> > This paper provides an overview of data connectivity, details
> >>>>>>>> > its effect on application quality, and explores various 
> alternative
> >>>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d
> >>>>>>>> > _______________________________________________
> >>>>>>>> > xmlpipedb-developer mailing list
> >>>>>>>> > xml...@li...
> >>>>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >>>>>>>> >
> >>>>>>>> >
> >>>>>>>> >
> >>>>>>>> > <ATT00001..txt><ATT00002..txt>
> >>>>>>>> 
> ------------------------------------------------------------------------------
> >>>>>>>> Colocation vs. Managed Hosting
> >>>>>>>> A question and answer guide to determining the best fit
> >>>>>>>> for your organization - today and in the future.
> >>>>>>>> http://p.sf.net/sfu/internap-sfd2d
> >>>>>>>> _______________________________________________
> >>>>>>>> xmlpipedb-developer mailing list
> >>>>>>>> xml...@li...
> >>>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >
> > 
> ------------------------------------------------------------------------------
> > Colocation vs. Managed Hosting
> > A question and answer guide to determining the best fit
> > for your organization - today and in the future.
> > http://p.sf.net/sfu/internap-sfd2d
> > _______________________________________________
> > xmlpipedb-developer mailing list
> > xml...@li...
> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >
> > 
> ------------------------------------------------------------------------------
> > Colocation vs. Managed Hosting
> > A question and answer guide to determining the best fit
> > for your organization - today and in the future.
> > http://p.sf.net/sfu/internap-sfd2d
> > _______________________________________________
> > xmlpipedb-developer mailing list
> > xml...@li...
> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >
> >
> >
> > 
> ------------------------------------------------------------------------------
> > Colocation vs. Managed Hosting
> > A question and answer guide to determining the best fit
> > for your organization - today and in the future.
> > http://p.sf.net/sfu/internap-sfd2d
> > _______________________________________________
> > xmlpipedb-developer mailing list
> > xml...@li...
> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >
> >
> >
> > 
> ------------------------------------------------------------------------------
> > Colocation vs. Managed Hosting
> > A question and answer guide to determining the best fit
> > for your organization - today and in the future.
> > http://p.sf.net/sfu/internap-sfd2d
> > _______________________________________________
> > xmlpipedb-developer mailing list
> > xml...@li...
> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >
> >
> > <ATT00001..txt><ATT00002..txt>
>
>
>------------------------------------------------------------------------------
>Colocation vs. Managed Hosting
>A question and answer guide to determining the best fit
>for your organization - today and in the future.
>http://p.sf.net/sfu/internap-sfd2d
>_______________________________________________
>xmlpipedb-developer mailing list
>xml...@li...
>https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer