Re: [XMLPipeDB-developer] Plasmodium bug/task list
From: Richard B. <rbr...@gm...> - 2011-03-18 20:16:16
Yes, I understand that you only want the 'ORF' IDs, but I was trying to get what we need by modifying what we have rather than rewriting the whole query. I may in fact have to rewrite the whole thing anyway, as my query didn't return what was expected =/

Also, I'm glad you mentioned a possible issue with the MAL pattern; I'll keep an eye out for missing IDs there.

Richard

On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> wrote:
> Hi,
>
> I think we need to capture all of the IDs in the ORF tag and *NOT* do the pattern match at all. As far as I can tell from my analysis of the IDs in the query you posted, we need to keep them all, so we don't actually need to specify the patterns at this point. I would rather do that with an eye toward the future, when the Plasmodium people might add new patterns to the ID system.
>
> I believe that the
>
>     "MAL[0-9]*P1.[0-9]*"
>
> pattern is also not pulling out everything it needs to. But instead of including more patterns, I would rather just loosen up the criteria to include everything in the ORF tag.
>
> Also, I just want to be clear about the underscore issue. That only affects IDs that begin with PFA, not the other IDs that begin with PF##_
>
> Thanks,
> Kam
>
> At 12:47 PM 3/18/2011, Richard Brous wrote:
>
> I'm down in the bio lab at the moment looking at this.
>
> I understand what needs to be done in regards to 1) keeping all ORF IDs and then 2) querying the IDs with underscores to then remove the underscores while maintaining the original underscore IDs.
>
> I performed an export with the pattern match as-is and commented out the conditions that exclude 'ordered locus' and 'ORF'. The export completed, but with only 5110 gene IDs. So it seems we are missing 235 gene IDs that are in the XML file, as seen from the raw SQL query.
>
> Based on Dr. D's analysis of a missing pattern of PFA_####[aw], I went ahead and added it into the pattern match string and called it as the others are called in the query.
> Once the export completes, I'll confirm that we have in fact captured all the IDs. Once confirmed, I will move on to the query to find IDs with underscores and handle them as mentioned above.
>
> Richard
>
> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> wrote:
>
> Thanks for info. I will dig into this after my 10 am exam tomorrow.
>
> Richard
>
> Sent from my iPhone
>
> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote:
>
> Hi,
>
> More information on the underscore issue:
>
> There is an ID with the pattern
>
>     PFA_[0-9][0-9][0-9][0-9][wc]
>
> that needs to have the underscore removed so that it reads instead
>
>     PFA[0-9][0-9][0-9][0-9][wc]
>
> I don't know why these IDs exist in UniProt, but in PlasmoDB, they are there without the underscore and won't be recognized with it. I think we should leave the underscore ones there, but also have a set without the underscore. There are 134 records that have this issue.
>
> If Rich can make these two fixes (capturing the ORFs and dealing with the underscore), then I think we will be good to go with Plasmodium. There may be code in the Vibrio or Helicobacter profiles to help with the underscores, but I'm not sure.
>
> Best,
> Kam
>
> At 02:59 PM 3/16/2011, Kam Dahlquist wrote:
>
> Hi,
>
> I've taken a look at the list of IDs and did a quick comparison with both the older released gdb and also a list I downloaded from the Broad Institute Plasmodium database. I think we can safely go with the query on the ORF tag for our export--all of those different ID forms are valid. There are about 400 IDs that are different in the older released gdb than in the new query; I'm going to further investigate those. I suspect that the difference is mainly due to a +/- underscore issue that we might need to solve. However, we should go forward with capturing all the IDs from the ORF tag; I don't see a need to restrict to a particular pattern there.
>
> Best,
> Kam
>
> At 09:48 PM 3/14/2011, you wrote:
>
> Hi all,
>
> So I went ahead and did raw SQL queries of the Postgres data and turned up the following:
>
>     select * from genenametype where type = 'ordered locus'
>     Returned zero gene IDs
>
>     select * from genenametype where type = 'ORF'
>     Returned 5345 gene IDs
>
> The type = 'ORF' query was exported into Excel and posted to the biodb wiki on the Spring 2011 Plasmodium page.
>
> There are many, many patterns in regards to gene IDs; here are the prefixes from my cursory look:
>
>     MAL
>     PF##_
>     PFA
>     PFB
>     PFC
>     PFD
>     PFE
>     PFF
>     PFI
>     PFL
>
> Richard
>
> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> wrote:
>
> Hi,
>
> I looked up an assortment of IDs in UniProt and I can confirm that it appears that the IDs are found in the ORF tag, not the OrderedLocus tag (except for the one that got captured in the export).
>
> Best,
> Kam
>
> At 08:09 AM 3/14/2011, you wrote:
>
> Thanks Dondi,
>
> Will review this after our call today. I have been a little worried as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5 GB of log files so far... /yikes
>
> Btw I have a work lunch meeting in Beverly Hills today so will be working from home afterwards instead of in the bio lab.
>
> Richard
>
> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...> wrote:
>
> Thanks for the updates, Rich. I gave things a once-over and may have a lead. Here is what I found:
>
> - First, the TallyEngine customization for P.
> falciparum states the following:
>
>     # Plasmodium falciparum
>     plasmodiumfalciparum_level_amount=2
>     plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF
>     plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene
>     plasmodiumfalciparum_query_level0=select count(*) from genenametype where type = 'ORF';
>     plasmodiumfalciparum_query_level1=select count(*) from genenametype where type = 'UniGene';
>     plasmodiumfalciparum_table_name_level0=Ordered Locus
>     plasmodiumfalciparum_table_name_level1=UniGene
>
> Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene names whose type is 'ORF' ("level0" properties).
>
> - Now, this is what the P. falciparum species profile does when harvesting IDs (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations):
>
>     String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " +
>         "from genenametype c inner join entrytype_genetype d " +
>         "on (c.entrytype_genetype_name_hjid = d.hjid) " +
>         "where (c.value similar to ? " +
>         "or c.value similar to ? " +
>         "or c.value similar to ?) " +
>         "and type <> 'ordered locus names' " +
>         "and type <> 'ORF' " +
>         "group by d.entrytype_gene_hjid, c.value";
>
> Note the condition on the second-to-last line --- the query actually *omits* gene names whose type is 'ORF'!
>
> So the question is...which is right? (I'm inclined to believe the Tally Engine here, since the export puts only one record in OrderedLocusNames.) Still, comparing these two queries directly against the PostgreSQL database would be educational, I think. Then, knowing which criteria are correct, the appropriate action can be taken.
>
> Hope this helps...
>
> John David N. Dionisio, PhD
> Associate Professor, Computer Science
> Loyola Marymount University
>
> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote:
>
> Debug export is still going... 2.5GB of log files so far with progress at 65%...
>
> I posted the link of the WARN log on the plasmodium page here: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum .
>
> Richard
>
> On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> wrote:
>
> Hi all,
>
> Have been working through several Plasmodium gdb exports in an attempt to find out why only one gene ID makes it into the Ordered Locus table.
>
> I have reviewed the logger file while set to "WARN" and wasn't able to find anything that would suggest an error. I will post this log file to the wiki later today when I get home.
>
> I then upped the logger verbosity to "DEBUG" and file size to 100MB in the hope that more detail would surface the issue, but my export is on hour 20 and still going (although it's nearly complete). What I didn't expect was the size of the log files and that it seems only the last 3 are kept, with earlier logs being overwritten =( I fear that the information I need is in one of the earlier files, which are now lost.
>
> Unless a better suggestion is offered, I'm going to rerun the export with "DEBUG" verbosity and up the file sizes to near 1 GB each and hope that 3 GB total will be enough to hold the complete export log.
>
> More info as it comes...
>
> Richard
>
> On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> wrote:
>
> Hi,
>
> I've completed testing the Plasmodium gdb I exported last November and updated the SourceForge wiki.
>
> Plasmodium has its own task list page, which I've updated here:
> https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List
>
> The testing report can be found here:
> https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115
>
> The source files and gdb are on a new Plasmodium falciparum page on the Fall 2010 BioDB wiki:
> https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum
>
> Here is the list of bugs/action items that I've listed:
>
> 1. The OrderedLocusNames table in the gdb only has 1 ID out of 5345 reported by the TallyEngine. This also affects all other tables related to OrderedLocusNames.
>
> 2. The GeneId table in the database has 6 fewer IDs than reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium tuberculosis also have mysterious GeneId issues with the TallyEngine).
>
> 3. The count for EMBL IDs in the gdb also seems low; it's lower than in the 2009 version of the gdb. There's no way to tell at this point whether this is due to a change in annotation by UniProt or is a bug with GenMAPP Builder.
>
> Thanks,
> Kam
>
> _______________________________________________
> xmlpipedb-developer mailing list
> xml...@li...
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
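For reference, the underscore handling discussed in the thread (keep every ORF ID as-is, and additionally emit an underscore-free copy of each ID matching PFA_[0-9][0-9][0-9][0-9][wc]) could be sketched roughly as below. This is only an illustration under stated assumptions: the class name, method names, and sample IDs are hypothetical and are not taken from the GenMAPP Builder species profile, and the real fix may well be done in SQL rather than in Java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of GenMAPP Builder.
public class PfaUnderscoreSketch {
    // Pattern quoted in the thread: PFA_ + four digits + trailing w or c.
    private static final Pattern PFA_UNDERSCORE =
            Pattern.compile("PFA_[0-9]{4}[wc]");

    /** Returns the ID with its underscore removed if it matches the PFA_ pattern, else null. */
    static String underscoreFreeVariant(String id) {
        if (id != null && PFA_UNDERSCORE.matcher(id).matches()) {
            return id.replace("_", "");
        }
        return null;
    }

    /** Keeps every original ID and appends an underscore-free copy of each PFA_ ID. */
    static List<String> withUnderscoreFreeVariants(List<String> orfIds) {
        List<String> result = new ArrayList<>(orfIds);
        for (String id : orfIds) {
            String variant = underscoreFreeVariant(id);
            if (variant != null && !result.contains(variant)) {
                result.add(variant);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample IDs that only follow the prefixes listed in the thread.
        List<String> sample = Arrays.asList("PFA_0005w", "PF11_0344", "MAL13P1.1");
        System.out.println(withUnderscoreFreeVariants(sample));
    }
}
```

IDs with the PF##_ prefix are deliberately left untouched, since the thread states that the underscore issue only affects IDs beginning with PFA.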