Re: [XMLPipeDB-developer] Plasmodium bug/task list
From: Kam D. <kda...@lm...> - 2011-03-21 04:37:16
Hi,

That's OK with me.

Best,
Kam

At 09:22 PM 3/20/2011, you wrote:
> I vote that, since the duplication lies in the raw data, we can leave things as is, and just state as much in the README.
>
> For the underscores, you can look at the V. cholerae species profile to get an idea for how to do it, since that species profile has to do the same thing with its IDs.
>
> John David N. Dionisio, PhD
> Associate Professor, Computer Science
> Loyola Marymount University
>
> On Mar 20, 2011, at 9:19 PM, Richard Brous wrote:
>
> > I did another export after re-enabling the group by and the results are the same at 5338 gene IDs. At least we know either way now.
> >
> > Moving forward, are we just going to leave the duplicate ID situation as is, or Dondi, can you think of an option here?
> >
> > Then the last item will be to keep the original underscore IDs but also remove the underscores and add them to the 5338.
> >
> > Richard
> >
> > On Sun, Mar 20, 2011 at 6:51 PM, Kam Dahlquist <kda...@lm...> wrote:
> > Hi,
> >
> > I looked up those 7 IDs in UniProt, and they are each found in 2-3 separate UniProt entries, which would explain why they had separate hjids.
> >
> > Probably what GenMAPP Builder is doing is taking the first one and relating it to its UniProt ID and then discarding the second (or third) relationship. I don't know that there's anything we can do about this since it is a problem with the raw data.
> >
> > Best,
> > Dr. D
> >
> > At 12:15 PM 3/20/2011, Richard Brous wrote:
> >> Having weird problems with Gmail in the browser where the reply button isn't working, so I have to send from my phone... Odd
> >>
> >> So here is my analysis of the ORF-only export gdb:
> >>
> >> Initially the raw SQL query returned 5345 records, for which I posted an Excel doc link on the Plasmodium page of the biodb wiki.
> >>
> >> The gdb contained only 5338 gene IDs, so I started looking for duplicates or exclusions.
> >>
> >> What I found were 7 IDs that were duplicated in the raw SQL query of Postgres but were not exported into the gdb.
> >>
> >> The IDs are:
> >> PF10_0168
> >> PF11_0361 (duplicated 2x for total of 3 entries)
> >> PF11_0377
> >> PF11_0405
> >> PFB0305c
> >> PFB0391c
> >>
> >> Surprisingly (to me anyway) I noticed that the duplicates all had unique hjids, which may or may not mean anything...
> >>
> >> In conclusion, it seems to me that the 5338 IDs in the gdb are likely correct.
> >>
> >> Dr. D, does that make sense?
> >>
> >> Richard
> >>
> >> Sent from my iPhone
> >>
> >> On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote:
> >>
> >>> Yes, I understand that you only want the 'ORF', but I was trying to get what we need by modifying what we have and not rewrite the whole query.
> >>>
> >>> I may in fact have to rewrite the whole thing anyway, as my query didn't return what was expected =/
> >>>
> >>> Also, I'm glad you mentioned a possible issue with the MAL pattern; I'll keep an eye out for missing IDs here.
> >>>
> >>> Richard
> >>>
> >>> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> wrote:
> >>> Hi,
> >>>
> >>> I think we need to capture all of the IDs in the ORF tag and *NOT* do the pattern match at all. As far as I can tell with my analysis of the IDs in the query you posted, we need to keep them all, so we don't actually need to specify the patterns at this point. I would rather do that thinking towards the future when the Plasmodium people might add new patterns to the ID system.
> >>>
> >>> I believe that the
> >>>
> >>> "MAL[0-9]*P1.[0-9]*"
> >>>
> >>> pattern is also not pulling out everything it needs to. But instead of including more patterns, I would rather just loosen up the criteria to include all things in the ORF tag.
> >>>
> >>> Also, I just want to be clear about the underscore issue. That only affects IDs that begin with PFA, not the other IDs that begin with PF##_
> >>>
> >>> Thanks,
> >>> Kam
> >>>
> >>> At 12:47 PM 3/18/2011, Richard Brous wrote:
> >>>> I'm down in the bio lab at the moment looking at this.
> >>>>
> >>>> I understand what needs to be done in regards to 1) keeping all ORF IDs and then 2) querying the IDs with underscores to then remove the underscores while maintaining the original underscore IDs.
> >>>>
> >>>> I performed an export with the pattern match as-is and commented out the exclude 'ordered locus' and 'ORF'. The export completed but with only 5110 gene IDs. So it seems we are missing 235 gene IDs that are in the XML file as seen from the raw SQL query.
> >>>>
> >>>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw], I went ahead and added it into the pattern match string and called it as the others are called in the query. Once the export completes I'll confirm that we have in fact captured all the IDs. Once confirmed, I will move on to the query to find IDs with underscores and handle them as mentioned above.
> >>>>
> >>>> Richard
> >>>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> wrote:
> >>>> Thanks for the info. I will dig into this after my 10 am exam tomorrow.
> >>>> Richard
> >>>> Sent from my iPhone
> >>>>
> >>>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote:
> >>>>
> >>>>> Hi,
> >>>>> More information on the underscore issue:
> >>>>> There is an ID with the pattern
> >>>>> PFA_[0-9][0-9][0-9][0-9][wc]
> >>>>> that needs to have the underscore removed so that it reads instead
> >>>>> PFA[0-9][0-9][0-9][0-9][wc]
> >>>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they are there without the underscore and won't be recognized with it. I think we should leave the underscore ones there, but also have a set without the underscore. There are 134 records that have this issue.
> >>>>> If Rich can make these two fixes (capturing the ORFs and dealing with the underscore), then I think we will be good to go with Plasmodium. There may be code in the Vibrio or Helicobacter profiles to help with the underscores, but I'm not sure.
> >>>>> Best,
> >>>>> Kam
> >>>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote:
> >>>>>> Hi,
> >>>>>> I've taken a look at the list of IDs and did a quick comparison with both the older released gdb and also a list I downloaded from the Broad Institute Plasmodium database. I think we can safely go with the query on the ORF tag for our export; all of those different ID forms are valid. There are about 400 IDs that are different in the older released gdb than in the new query; I'm going to further investigate those. I suspect that the difference is mainly due to a +/- underscore issue that we might need to solve. However, we should go forward with capturing all the IDs from the ORF tag; I don't see a need to restrict to a particular pattern there.
> >>>>>> Best,
> >>>>>> Kam
> >>>>>> At 09:48 PM 3/14/2011, you wrote:
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> So I went ahead and did raw SQL queries of the Postgres data and turned up the following:
> >>>>>>>
> >>>>>>> select * from genenametype where type = 'ordered locus'
> >>>>>>> Returned zero gene ids
> >>>>>>>
> >>>>>>> select * from genenametype where type = 'ORF'
> >>>>>>> Returned 5345 gene ids
> >>>>>>> The type = 'ORF' query was exported into Excel and posted to the biodb wiki on the Spring 2011 Plasmodium page.
> >>>>>>>
> >>>>>>> There are many, many patterns in regard to gene IDs; here are the prefixes from my cursory look:
> >>>>>>> MAL
> >>>>>>> PF##_
> >>>>>>> PFA
> >>>>>>> PFB
> >>>>>>> PFC
> >>>>>>> PFD
> >>>>>>> PFE
> >>>>>>> PFF
> >>>>>>> PFI
> >>>>>>> PFL
> >>>>>>> Richard
> >>>>>>>
> >>>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> wrote:
> >>>>>>> Hi,
> >>>>>>> I looked up an assortment of IDs in UniProt and I can confirm that the IDs appear to be found in the ORF tag, not the OrderedLocus tag (except for the one that got captured in the export).
> >>>>>>> Best,
> >>>>>>> Kam
> >>>>>>> At 08:09 AM 3/14/2011, you wrote:
> >>>>>>>> Thanks Dondi,
> >>>>>>>>
> >>>>>>>> Will review this after our call today. I have been a little worried as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5 GB of log files so far... /yikes
> >>>>>>>>
> >>>>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be working from home afterwards instead of in the bio lab.
> >>>>>>>>
> >>>>>>>> Richard
> >>>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...> wrote:
> >>>>>>>> Thanks for the updates, Rich.
> >>>>>>>> I gave things a once-over and may have a lead. Here is what I found:
> >>>>>>>> - First, the TallyEngine customization for P. falciparum states the following:
> >>>>>>>> # Plasmodium falciparum
> >>>>>>>> plasmodiumfalciparum_level_amount=2
> >>>>>>>> plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF
> >>>>>>>> plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene
> >>>>>>>> plasmodiumfalciparum_query_level0=select count(*) from genenametype where type = 'ORF';
> >>>>>>>> plasmodiumfalciparum_query_level1=select count(*) from genenametype where type = 'UniGene';
> >>>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus
> >>>>>>>> plasmodiumfalciparum_table_name_level1=UniGene
> >>>>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene names whose type is 'ORF' ("level0" properties).
> >>>>>>>> - Now, this is what the P. falciparum species profile does when harvesting IDs (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations):
> >>>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " +
> >>>>>>>>     "from genenametype c inner join entrytype_genetype d " +
> >>>>>>>>     "on (c.entrytype_genetype_name_hjid = d.hjid) " +
> >>>>>>>>     "where (c.value similar to ? " +
> >>>>>>>>     "or c.value similar to ? " +
> >>>>>>>>     "or c.value similar to ?) " +
> >>>>>>>>     "and type <> 'ordered locus names' " +
> >>>>>>>>     "and type <> 'ORF' " +
> >>>>>>>>     "group by d.entrytype_gene_hjid, c.value";
> >>>>>>>> Note the condition on the second-to-last line --- the query actually *omits* gene names whose type is 'ORF'! So the question is...which is right? (I'm inclined to believe the TallyEngine here, since the export puts only one record in OrderedLocusNames.)
> >>>>>>>> Still, comparing these two queries directly against the PostgreSQL database would be educational, I think. Then, knowing which criteria are correct, the appropriate action can be taken.
> >>>>>>>> Hope this helps...
> >>>>>>>> John David N. Dionisio, PhD
> >>>>>>>> Associate Professor, Computer Science
> >>>>>>>> Loyola Marymount University
> >>>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote:
> >>>>>>>> > Debug export is still going... 2.5GB of log files so far with progress at 65%...
> >>>>>>>> >
> >>>>>>>> > I posted the link to the WARN log on the Plasmodium page here: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum .
> >>>>>>>> > Richard
> >>>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> wrote:
> >>>>>>>> > Hi all,
> >>>>>>>> >
> >>>>>>>> > I have been working through several Plasmodium gdb exports in an attempt to find out why only one gene ID makes it into the Ordered Locus table.
> >>>>>>>> >
> >>>>>>>> > I have reviewed the logger file while set to "WARN" and wasn't able to determine anything which would suggest an error. I will post this log file to the wiki later today when I get home.
> >>>>>>>> >
> >>>>>>>> > I then upped the logger verbosity to "DEBUG" and the file size to 100MB in the hope that more detail would surface the issue, but my export is on hour 20 and still going (although it's nearly complete). What I didn't expect was the size of the log files, and that it seems only the last 3 are kept, with earlier logs being overwritten =( I fear that the information I need is in one of the earlier files, which are now lost.
> >>>>>>>> >
> >>>>>>>> > Unless a better suggestion is offered, I'm going to rerun an export with "DEBUG" verbosity and up the file sizes to near 1 GB each, and hope that 3 GB total will be enough to hold the complete export log.
> >>>>>>>> >
> >>>>>>>> > More info as it comes...
> >>>>>>>> >
> >>>>>>>> > Richard
> >>>>>>>> >
> >>>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> wrote:
> >>>>>>>> > Hi,
> >>>>>>>> >
> >>>>>>>> > I've completed testing the Plasmodium gdb I exported last November and updated the SourceForge wiki.
> >>>>>>>> >
> >>>>>>>> > Plasmodium has its own task list page, which I've updated here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List
> >>>>>>>> >
> >>>>>>>> > The testing report can be found here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115
> >>>>>>>> >
> >>>>>>>> > The source files and gdb are on a new Plasmodium falciparum page on the Fall 2010 BiolDB wiki: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum
> >>>>>>>> >
> >>>>>>>> > Here is the list of bugs/action items that I've listed:
> >>>>>>>> >
> >>>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of 5345 reported by the TallyEngine. This also affects all other tables related to OrderedLocusNames.
> >>>>>>>> >
> >>>>>>>> > 2. The GeneId table in the database has 6 fewer IDs than reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium tuberculosis also have mysterious GeneId issues with the TallyEngine).
> >>>>>>>> >
> >>>>>>>> > 3. The count for EMBL IDs in the gdb also seems low; it's lower than in the 2009 version of the gdb. There's no way to tell at this point whether this is due to a change in annotation by UniProt or is a bug with GenMAPP Builder.
> >>>>>>>> >
> >>>>>>>> > Thanks,
> >>>>>>>> > Kam
> >>>>>>>> >
> >>>>>>>> > ------------------------------------------------------------------------------
> >>>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You
> >>>>>>>> > This paper provides an overview of data connectivity, details
> >>>>>>>> > its effect on application quality, and explores various alternative
> >>>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d
> >>>>>>>> > _______________________________________________
> >>>>>>>> > xmlpipedb-developer mailing list
> >>>>>>>> > xml...@li...
> >>>>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
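
For reference, the underscore handling agreed on in this thread (keep each PFA_####[wc] ID and also add a copy without the underscore, leaving PF##_ IDs alone) could be sketched roughly as below. This is only an illustration under stated assumptions: the class and method names are hypothetical, it is not the code in the V. cholerae or P. falciparum species profiles, and the sample ID PFA_0005w is made up to fit the pattern Kam describes.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of GenMAPP Builder.
final class PlasmodiumIdVariants {
    // PFA_ followed by four digits and a w or c, per the pattern given in the thread.
    private static final Pattern UNDERSCORE_ID = Pattern.compile("PFA_[0-9]{4}[wc]");

    /** True only for IDs of the PFA_####[wc] form; PF##_ IDs do not match. */
    static boolean needsVariant(String id) {
        return UNDERSCORE_ID.matcher(id).matches();
    }

    /** Drops the underscore from a PFA_####[wc] ID; any other ID is returned unchanged. */
    static String withoutUnderscore(String id) {
        return needsVariant(id) ? id.replace("_", "") : id;
    }

    /** Keeps every original ID and adds an underscore-free variant where one applies. */
    static Set<String> expand(List<String> ids) {
        // A LinkedHashSet preserves order and collapses IDs that appear more than once.
        Set<String> result = new LinkedHashSet<>(ids);
        for (String id : ids) {
            result.add(withoutUnderscore(id));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative input only: one made-up PFA_ ID plus IDs quoted in the thread.
        List<String> ids = Arrays.asList("PFA_0005w", "PF10_0168", "PFB0305c", "PF10_0168");
        System.out.println(expand(ids));
    }
}
```

Matching the whole ID against the pattern, instead of stripping every underscore, is what keeps IDs such as PF10_0168 intact, which is the distinction Kam draws between PFA IDs and PF##_ IDs.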