Re: [XMLPipeDB-developer] Plasmodium bug/task list
Brought to you by:
kdahlquist,
zugzugglug
From: Richard B. <rbr...@gm...> - 2011-03-20 20:54:24
|
Sure thing. Can send now that gmail working again... /** * *@see*edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, java.util.Date) */ @Override *public* TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) *throws*SQLException, InvalidParameterException { // Start with the default OrderedLocusNames behavior. TableManager result = *super*.getSystemTableManagerCustomizations(tableManager, primarySystemTableManager, version); // Next, we add IDs from the other gene/name tags, but ONLY if they match // the pattern PF[A-Z][0-9]{4}[a-z]. //final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]"; //final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]"; //final String pfID3 = "MAL[0-9]*P1.[0-9]*"; String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + "from genenametype c inner join entrytype_genetype d " + "on (c.entrytype_genetype_name_hjid = d.hjid) " + //"where (c.value similar to ? " + "where c.type = 'ORF'"; //"or c.value similar to ? " + //"or c.value similar to ?) " + //"and type <> 'ordered locus names' " + //"and type <> 'ORF' " + //"group by d.entrytype_gene_hjid, c.value"; String dateToday = GenMAPPBuilderUtilities.*getSystemsDateString*(version); Connection c = ConnectionManager.*getRelationalDBConnection*(); PreparedStatement ps; ResultSet rs; *try* { // Query, iterate, add to table manager. ps = c.prepareStatement(sqlQuery); //ps.setString(1, pfID); //ps.setString(2, pfID2); //ps.setString(3, pfID3); rs = ps.executeQuery(); *while* (rs.next()) { String hjid = Long.*valueOf*(rs.getLong("hjid")).toString(); String id = rs.getString("value"); *_Log*.debug("Processing raw ID: " + id + " for surrogate " + hjid); tableManager.submit("OrderedLocusNames", QueryType.*insert*, *new*String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", dateToday }, { "UID", hjid } }); } } *catch*(SQLException sqlexc) { logSQLException(sqlexc, sqlQuery); } *return* result; } On Sun, Mar 20, 2011 at 1:28 PM, John David N. Dionisio <do...@lm...>wrote: > Hi Rich, > > Thanks for the updates. Did you post the new SQL query that you used > somewhere? Let me know where I can take a look at it. Or, if you haven't > put it out there yet, send it my way just for a quick look-see. Thanks! > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > On Mar 20, 2011, at 12:15 PM, Richard Brous wrote: > > > Having weird probs with gmail in browser where the reply button isn't > working so have to send from my phone... Odd > > > > So I here is my analysis on the ORF only export gdb: > > > > Initially the raw SQL query returned 5345 records of which I posted an > excel doc link to the plasmodium page on the biodb wiki. > > > > The gdb contained only 5338 genes id's so I started looking for > duplicates or exclusions. > > > > What I found were 7 id's that were duplicated in the raw SQL query of > postgres but were not exported into the gdb. > > > > The id's are: > > PF10_0168 > > PF11_0361 (duplicated 2x for total of 3 entries) > > PF11_0377 > > PF11_0405 > > PFB0305c > > PFB0391c > > > > Surprisingly (to me anyway) I noticed that the duplicates all had unique > hjid's, which may or may not mean anything... > > > > In conclusion, it seems to me that the 5338 id's in the gdb are likely > correct. > > > > Dr. D, does that make sense? > > > > Richard > > > > Sent from my iPhone > > > > On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote: > > > >> Yes I understand that you only want the 'ORF' but was trying to get what > we need by modifying what we have and not rewite the whole query. > >> > >> I may in fact have to rewrite the whole thing anyway as my query didn't > return what was expected =/ > >> > >> Also I'm glad you mentioned a possible issue with MAL pattern, I'll keep > an eye out for missing id's here. > >> > >> Richard > >> > >> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> > wrote: > >> Hi, > >> > >> I think we need to capture all of the IDs in the ORF tag and *NOT* do > the pattern match at all. As far as I can tell with my analysis of the IDs > in the query you posted, we need to keep them all, so we don't actually need > to specify the patterns at this point. I would rather do that thinking > towards the future when the Plasmodium people might add new patterns to the > ID system. > >> > >> I believe that the > >> > >> "MAL[0-9]*P1.[0-9]*" > >> > >> pattern is also not pulling out everything it needs to. But instead > >> of including more patterns, I would rather just loosen up the criteria > to > >> include all things in the ORF tag. > >> > >> Also, I just want to be clear about the underscore issue. That only > >> affects IDs that begin with PFA, not the other IDs that begin with PF##_ > >> > >> Thanks, > >> Kam > >> > >> > >> > >> > >> At 12:47 PM 3/18/2011, Richard Brous wrote: > >>> I'm down in the bio lab at the moment looking at this. > >>> > >>> I understand what needs to be done in regards to 1) keeping all ORF > id's and then 2) querying the id's with underscores to then remove the > underscores but maintaining the original underscore id's. > >>> > >>> I performed an export with the pattern match as-is and commented out > the exclude 'ordered locus' and 'ORF'. The export completed but with only > 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML > file as seen from the raw sql query. > >>> > >>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went > ahead and added it into the pattern match string and called it as the others > are called in the query. Once the export completes I'll confirm that in fact > we have captured all the id's. Once confirmed, I will move onto the query to > find id's with underscores and handle them as mentioned above. > >>> > >>> Richard > >>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> > wrote: > >>> Thanks for info. I will dig into this after my 10 am exam tomorrow. > >>> > >>> Richard > >>> > >>> Sent from my iPhone > >>> > >>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote: > >>> > >>>> Hi, > >>>> > >>>> More information on the underscore issue: > >>>> > >>>> There is an ID with the pattern > >>>> > >>>> PFA_[0-9][0-9][0-9][0-9][wc] > >>>> > >>>> that needs to have the underscore removed so that reads instead > >>>> > >>>> PFA[0-9][0-9][0-9][0-9][wc] > >>>> > >>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they are > there without the underscore and won't be recognized with it. I think we > should leave the underscore ones there, but also have a set without the > underscore. There are 134 records that have this issue. > >>>> > >>>> If Rich can make these two fixes (capturing the ORFs and dealing with > the underscore), then I think we will be good to go with Plasmodium. There > may be code in the Vibrio or Helicobacter profiles to help with the > underscores, but I'm not sure. > >>>> > >>>> Best, > >>>> Kam > >>>> > >>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote: > >>>>> Hi, > >>>>> > >>>>> I've taken a look at the list of IDs and did a quick comparison with > both the older released gdb and also a list I downloaded from the Broad > Institute Plasmodium database. I think we can safely go with the query on > the ORF tag for our export--all of those different ID forms are valid. > There are about 400 IDs that are different in the older released gdb than > in the new query; I'm going to further investigate those. I suspect that > the difference is mainly due to a +/- underscore issue that we might need to > solve. However, we should go forward with capturing all the IDs from the > ORF tag, I don't see a need to restrict to a particular pattern there. > >>>>> > >>>>> Best, > >>>>> Kam > >>>>> > >>>>> At 09:48 PM 3/14/2011, you wrote: > >>>>>> Hi all, > >>>>>> > >>>>>> So I went ahead and did raw sql queries of the Postgres data and > turned up the following: > >>>>>> > >>>>>> select * from genenametype where type = 'ordered locus' > >>>>>> Returned zero gene ids > >>>>>> > >>>>>> select * from genenametype where type = 'ORF' > >>>>>> Returned 5345 gene ids > >>>>>> The type = 'ORF' query was exported into excel and posted to the > biodb wiki on the Spring 2011 Plasmodium page. > >>>>>> > >>>>>> There are many many patterns in regards to gene ids, here the the > prefixes from my cursory look: > >>>>>> MAL > >>>>>> PF##_ > >>>>>> PFA > >>>>>> PFB > >>>>>> PFC > >>>>>> PFD > >>>>>> PFE > >>>>>> PFF > >>>>>> PFI > >>>>>> PFL > >>>>>> > >>>>>> Richard > >>>>>> > >>>>>> > >>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> > wrote: > >>>>>> Hi, > >>>>>> I looked up an assortment of IDs in UniProt and I can confirm that > it appears that the IDs are found in the ORF tag, not the OrderedLocus tag > (except for the one that got captured in the export). > >>>>>> Best, > >>>>>> Kam > >>>>>> At 08:09 AM 3/14/2011, you wrote: > >>>>>>> Thanks Dondi, > >>>>>>> > >>>>>>> Will review this after our call today. I have been a little worried > as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5 > Gb of log files so far... /yikes > >>>>>>> > >>>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be > working from home afterwards instead of in the bio lab. > >>>>>>> > >>>>>>> Richard > >>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio < > do...@lm...> wrote: > >>>>>>> Thanks for the updates, Rich. > >>>>>>> I gave things a once-over and may have a lead. Here is what I > found: > >>>>>>> - First, the TallyEngine customization for P. falciparum states the > following: > >>>>>>> # Plasmodium falciparum > >>>>>>> plasmodiumfalciparum_level_amount=2 > >>>>>>> > plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF > >>>>>>> > plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene > >>>>>>> plasmodiumfalciparum_query_level0=select count(*) from genenametype > where type = 'ORF'; > >>>>>>> plasmodiumfalciparum_query_level1=select count(*) from genenametype > where type = 'UniGene'; > >>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus > >>>>>>> plasmodiumfalciparum_table_name_level1=UniGene > >>>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are > the gene names whose type is 'ORF' ("level0" properties). > >>>>>>> - Now, this is what the P. falciparum species profile does when > harvesting IDs > (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > >>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, > c.value " + > >>>>>>> "from genenametype c inner join entrytype_genetype d " + > >>>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + > >>>>>>> "where (c.value similar to ? " + > >>>>>>> "or c.value similar to ? " + > >>>>>>> "or c.value similar to ?) " + > >>>>>>> "and type <> 'ordered locus names' " + > >>>>>>> "and type <> 'ORF' " + > >>>>>>> "group by d.entrytype_gene_hjid, c.value"; > >>>>>>> Note the condition on the second-to-last line --- the query > actually *omits* gene names whose type is 'ORF'! So the question is...which > is right? (I'm inclined to believe the Tally Engine here, since, the export > puts only one record in OrderedLocusNames) > >>>>>>> Still, comparing these two queries directly against the PostgreSQL > database would be educational, I think. Then, knowing which criteria are > correct, the appropriate action can then be taken, I think. > >>>>>>> Hope this helps... > >>>>>>> John David N. Dionisio, PhD > >>>>>>> Associate Professor, Computer Science > >>>>>>> Loyola Marymount University > >>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > >>>>>>> > Debug export is still going... 2.5GB of log files so far with > progress at 65%... > >>>>>>> > > >>>>>>> > I posted the link of the WARN log on the plasmodium page here: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . > >>>>>>> > Richard > >>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous < > rbr...@gm...> wrote: > >>>>>>> > Hi all, > >>>>>>> > > >>>>>>> > Have been working through several Plasmodium gdb exports in an > attempt to source why only one gene id makes it into the Ordered Locus > table. > >>>>>>> > > >>>>>>> > I have reviewed the logger file while set to "WARN" and wasn't > able to determine anything which would suggest an error. I will post this > log file to the wiki later today when I get home. > >>>>>>> > > >>>>>>> > I then upped the logger verbosity to "DEBUG" and file size to > 100MB with hopes that more detail will surface the issue, but my export is > on hour 20 and still going (although its nearly complete). What I didn't > expect was the size of the log files and that it seems only the last 3 are > kept with earlier logs being overwritten =( I fear that the information I > need it in one of the earlier files which are now lost. > >>>>>>> > > >>>>>>> > Unless a better suggestion is offered I'm going to rerun an > export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each > and hope that 3 GB total will be enough to hold the complete export log. > >>>>>>> > > >>>>>>> > More info as it comes... > >>>>>>> > > >>>>>>> > Richard > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist < > kda...@lm...> wrote: > >>>>>>> > Hi, > >>>>>>> > > >>>>>>> > I've completed testing the Plasmodium gdb I exported last > November and updated the SourceForge wiki. > >>>>>>> > > >>>>>>> > Plasmodium has it's own task list page, which I've updated here: > > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > >>>>>>> > > >>>>>>> > The testing report can be found here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > >>>>>>> > > >>>>>>> > The source files and gdb are on a new Plasmodium falciparum page > on the Fall 2010 BiolDB wiki: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > >>>>>>> > > >>>>>>> > Here is the list of bugs/action items that I've listed: > >>>>>>> > > >>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of > 5345 repored by the TallyEngine. This also affects all other tables related > to OrderedLocusNames. > >>>>>>> > > >>>>>>> > 2. The GeneId table in the database has 6 fewer IDs than > reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium > tuberculosis also have mysterious GeneId issues with the TallyEngine). > >>>>>>> > > >>>>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's lower > than the 2009 version of the gdb. There's no way to tell at this point > whether this is due to a change in annotation by UniProt or is a bug with > GenMAPP Builder. > >>>>>>> > > >>>>>>> > Thanks, > >>>>>>> > Kam > >>>>>>> > > >>>>>>> > > >>>>>>> > > ------------------------------------------------------------------------------ > >>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You > >>>>>>> > This paper provides an overview of data connectivity, details > >>>>>>> > its effect on application quality, and explores various > alternative > >>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d > >>>>>>> > _______________________________________________ > >>>>>>> > xmlpipedb-developer mailing list > >>>>>>> > xml...@li... > >>>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > <ATT00001..txt><ATT00002..txt> > >>>>>>> > ------------------------------------------------------------------------------ > >>>>>>> Colocation vs. Managed Hosting > >>>>>>> A question and answer guide to determining the best fit > >>>>>>> for your organization - today and in the future. > >>>>>>> http://p.sf.net/sfu/internap-sfd2d > >>>>>>> _______________________________________________ > >>>>>>> xmlpipedb-developer mailing list > >>>>>>> xml...@li... > >>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > ------------------------------------------------------------------------------ > >> Colocation vs. Managed Hosting > >> A question and answer guide to determining the best fit > >> for your organization - today and in the future. > >> http://p.sf.net/sfu/internap-sfd2d > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> > ------------------------------------------------------------------------------ > >> Colocation vs. Managed Hosting > >> A question and answer guide to determining the best fit > >> for your organization - today and in the future. > >> http://p.sf.net/sfu/internap-sfd2d > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> > >> > >> > ------------------------------------------------------------------------------ > >> Colocation vs. Managed Hosting > >> A question and answer guide to determining the best fit > >> for your organization - today and in the future. > >> http://p.sf.net/sfu/internap-sfd2d > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> > > <ATT00001..txt><ATT00002..txt> > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > |