Re: [XMLPipeDB-developer] Plasmodium bug/task list
Brought to you by:
kdahlquist,
zugzugglug
From: Richard B. <rbr...@gm...> - 2011-03-21 00:18:01
|
The pairs were returned by the raw sql where type = 'ORF' The exported gdb doesn't contain the duplicate id's which is why I think its ok. Am I missing something? Richard On Sun, Mar 20, 2011 at 4:00 PM, John David N. Dionisio <do...@lm...>wrote: > Thanks. Try reinstating that last group by clause; it may remove some > duplicate hjid/value pairs. > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > On Mar 20, 2011, at 1:54 PM, Richard Brous wrote: > > > Sure thing. Can send now that gmail working again... > > > > /** > > > > * > > > > @see > edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, > edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, > java.util.Date) > > */ > > > > > > @Override > > > > public TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) throws > SQLException, InvalidParameterException { > > > > // Start with the default OrderedLocusNames behavior. > > TableManager result = > > > > super.getSystemTableManagerCustomizations(tableManager, > primarySystemTableManager, version); > > > > > > // Next, we add IDs from the other gene/name tags, but ONLY if they match > > > > // the pattern PF[A-Z][0-9]{4}[a-z]. > > > > //final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]"; > > > > //final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]"; > > > > //final String pfID3 = "MAL[0-9]*P1.[0-9]*"; > > String sqlQuery = > > > > "select d.entrytype_gene_hjid as hjid, c.value " + > > > > "from genenametype c inner join entrytype_genetype d " + > > > > "on (c.entrytype_genetype_name_hjid = d.hjid) " + > > > > //"where (c.value similar to ? " + > > > > "where c.type = 'ORF'"; > > > > //"or c.value similar to ? " + > > > > //"or c.value similar to ?) " + > > > > //"and type <> 'ordered locus names' " + > > > > //"and type <> 'ORF' " + > > > > //"group by d.entrytype_gene_hjid, c.value"; > > > > String dateToday = GenMAPPBuilderUtilities.getSystemsDateString(version); > > > > Connection c = ConnectionManager.getRelationalDBConnection(); > > > > PreparedStatement ps; > > > > ResultSet rs; > > > > > > try { > > > > // Query, iterate, add to table manager. > > ps = c.prepareStatement(sqlQuery); > > > > > > //ps.setString(1, pfID); > > > > //ps.setString(2, pfID2); > > > > //ps.setString(3, pfID3); > > rs = ps.executeQuery(); > > > > > > while (rs.next()) { > > String hjid = Long.valueOf(rs.getLong( > > > > "hjid")).toString(); > > String id = rs.getString( > > > > "value"); > > > > _Log.debug("Processing raw ID: " + id + " for surrogate " + hjid); > > tableManager.submit( > > > > "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, { > "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", dateToday }, { > "UID", hjid } }); > > } > > > > } > > > > catch(SQLException sqlexc) { > > logSQLException(sqlexc, sqlQuery); > > > > } > > > > > > > > return result; > > } > > > > > > > > On Sun, Mar 20, 2011 at 1:28 PM, John David N. Dionisio <do...@lm...> > wrote: > > Hi Rich, > > > > Thanks for the updates. Did you post the new SQL query that you used > somewhere? Let me know where I can take a look at it. Or, if you haven't > put it out there yet, send it my way just for a quick look-see. Thanks! > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > On Mar 20, 2011, at 12:15 PM, Richard Brous wrote: > > > > > Having weird probs with gmail in browser where the reply button isn't > working so have to send from my phone... Odd > > > > > > So I here is my analysis on the ORF only export gdb: > > > > > > Initially the raw SQL query returned 5345 records of which I posted an > excel doc link to the plasmodium page on the biodb wiki. > > > > > > The gdb contained only 5338 genes id's so I started looking for > duplicates or exclusions. > > > > > > What I found were 7 id's that were duplicated in the raw SQL query of > postgres but were not exported into the gdb. > > > > > > The id's are: > > > PF10_0168 > > > PF11_0361 (duplicated 2x for total of 3 entries) > > > PF11_0377 > > > PF11_0405 > > > PFB0305c > > > PFB0391c > > > > > > Surprisingly (to me anyway) I noticed that the duplicates all had > unique hjid's, which may or may not mean anything... > > > > > > In conclusion, it seems to me that the 5338 id's in the gdb are likely > correct. > > > > > > Dr. D, does that make sense? > > > > > > Richard > > > > > > Sent from my iPhone > > > > > > On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote: > > > > > >> Yes I understand that you only want the 'ORF' but was trying to get > what we need by modifying what we have and not rewite the whole query. > > >> > > >> I may in fact have to rewrite the whole thing anyway as my query > didn't return what was expected =/ > > >> > > >> Also I'm glad you mentioned a possible issue with MAL pattern, I'll > keep an eye out for missing id's here. > > >> > > >> Richard > > >> > > >> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> > wrote: > > >> Hi, > > >> > > >> I think we need to capture all of the IDs in the ORF tag and *NOT* do > the pattern match at all. As far as I can tell with my analysis of the IDs > in the query you posted, we need to keep them all, so we don't actually need > to specify the patterns at this point. I would rather do that thinking > towards the future when the Plasmodium people might add new patterns to the > ID system. > > >> > > >> I believe that the > > >> > > >> "MAL[0-9]*P1.[0-9]*" > > >> > > >> pattern is also not pulling out everything it needs to. But instead > > >> of including more patterns, I would rather just loosen up the criteria > to > > >> include all things in the ORF tag. > > >> > > >> Also, I just want to be clear about the underscore issue. That only > > >> affects IDs that begin with PFA, not the other IDs that begin with > PF##_ > > >> > > >> Thanks, > > >> Kam > > >> > > >> > > >> > > >> > > >> At 12:47 PM 3/18/2011, Richard Brous wrote: > > >>> I'm down in the bio lab at the moment looking at this. > > >>> > > >>> I understand what needs to be done in regards to 1) keeping all ORF > id's and then 2) querying the id's with underscores to then remove the > underscores but maintaining the original underscore id's. > > >>> > > >>> I performed an export with the pattern match as-is and commented out > the exclude 'ordered locus' and 'ORF'. The export completed but with only > 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML > file as seen from the raw sql query. > > >>> > > >>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went > ahead and added it into the pattern match string and called it as the others > are called in the query. Once the export completes I'll confirm that in fact > we have captured all the id's. Once confirmed, I will move onto the query to > find id's with underscores and handle them as mentioned above. > > >>> > > >>> Richard > > >>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> > wrote: > > >>> Thanks for info. I will dig into this after my 10 am exam tomorrow. > > >>> > > >>> Richard > > >>> > > >>> Sent from my iPhone > > >>> > > >>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> > wrote: > > >>> > > >>>> Hi, > > >>>> > > >>>> More information on the underscore issue: > > >>>> > > >>>> There is an ID with the pattern > > >>>> > > >>>> PFA_[0-9][0-9][0-9][0-9][wc] > > >>>> > > >>>> that needs to have the underscore removed so that reads instead > > >>>> > > >>>> PFA[0-9][0-9][0-9][0-9][wc] > > >>>> > > >>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they > are there without the underscore and won't be recognized with it. I think > we should leave the underscore ones there, but also have a set without the > underscore. There are 134 records that have this issue. > > >>>> > > >>>> If Rich can make these two fixes (capturing the ORFs and dealing > with the underscore), then I think we will be good to go with Plasmodium. > There may be code in the Vibrio or Helicobacter profiles to help with the > underscores, but I'm not sure. > > >>>> > > >>>> Best, > > >>>> Kam > > >>>> > > >>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote: > > >>>>> Hi, > > >>>>> > > >>>>> I've taken a look at the list of IDs and did a quick comparison > with both the older released gdb and also a list I downloaded from the Broad > Institute Plasmodium database. I think we can safely go with the query on > the ORF tag for our export--all of those different ID forms are valid. > There are about 400 IDs that are different in the older released gdb than > in the new query; I'm going to further investigate those. I suspect that > the difference is mainly due to a +/- underscore issue that we might need to > solve. However, we should go forward with capturing all the IDs from the > ORF tag, I don't see a need to restrict to a particular pattern there. > > >>>>> > > >>>>> Best, > > >>>>> Kam > > >>>>> > > >>>>> At 09:48 PM 3/14/2011, you wrote: > > >>>>>> Hi all, > > >>>>>> > > >>>>>> So I went ahead and did raw sql queries of the Postgres data and > turned up the following: > > >>>>>> > > >>>>>> select * from genenametype where type = 'ordered locus' > > >>>>>> Returned zero gene ids > > >>>>>> > > >>>>>> select * from genenametype where type = 'ORF' > > >>>>>> Returned 5345 gene ids > > >>>>>> The type = 'ORF' query was exported into excel and posted to the > biodb wiki on the Spring 2011 Plasmodium page. > > >>>>>> > > >>>>>> There are many many patterns in regards to gene ids, here the the > prefixes from my cursory look: > > >>>>>> MAL > > >>>>>> PF##_ > > >>>>>> PFA > > >>>>>> PFB > > >>>>>> PFC > > >>>>>> PFD > > >>>>>> PFE > > >>>>>> PFF > > >>>>>> PFI > > >>>>>> PFL > > >>>>>> > > >>>>>> Richard > > >>>>>> > > >>>>>> > > >>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist < > kda...@lm...> wrote: > > >>>>>> Hi, > > >>>>>> I looked up an assortment of IDs in UniProt and I can confirm that > it appears that the IDs are found in the ORF tag, not the OrderedLocus tag > (except for the one that got captured in the export). > > >>>>>> Best, > > >>>>>> Kam > > >>>>>> At 08:09 AM 3/14/2011, you wrote: > > >>>>>>> Thanks Dondi, > > >>>>>>> > > >>>>>>> Will review this after our call today. I have been a little > worried as the DEBUG export has been going for 2.5 days with progress at 65% > and 6.5 Gb of log files so far... /yikes > > >>>>>>> > > >>>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be > working from home afterwards instead of in the bio lab. > > >>>>>>> > > >>>>>>> Richard > > >>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio < > do...@lm...> wrote: > > >>>>>>> Thanks for the updates, Rich. > > >>>>>>> I gave things a once-over and may have a lead. Here is what I > found: > > >>>>>>> - First, the TallyEngine customization for P. falciparum states > the following: > > >>>>>>> # Plasmodium falciparum > > >>>>>>> plasmodiumfalciparum_level_amount=2 > > >>>>>>> > plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF > > >>>>>>> > plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene > > >>>>>>> plasmodiumfalciparum_query_level0=select count(*) from > genenametype where type = 'ORF'; > > >>>>>>> plasmodiumfalciparum_query_level1=select count(*) from > genenametype where type = 'UniGene'; > > >>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus > > >>>>>>> plasmodiumfalciparum_table_name_level1=UniGene > > >>>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are > the gene names whose type is 'ORF' ("level0" properties). > > >>>>>>> - Now, this is what the P. falciparum species profile does when > harvesting IDs > (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > > >>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, > c.value " + > > >>>>>>> "from genenametype c inner join entrytype_genetype d " > + > > >>>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + > > >>>>>>> "where (c.value similar to ? " + > > >>>>>>> "or c.value similar to ? " + > > >>>>>>> "or c.value similar to ?) " + > > >>>>>>> "and type <> 'ordered locus names' " + > > >>>>>>> "and type <> 'ORF' " + > > >>>>>>> "group by d.entrytype_gene_hjid, c.value"; > > >>>>>>> Note the condition on the second-to-last line --- the query > actually *omits* gene names whose type is 'ORF'! So the question is...which > is right? (I'm inclined to believe the Tally Engine here, since, the export > puts only one record in OrderedLocusNames) > > >>>>>>> Still, comparing these two queries directly against the > PostgreSQL database would be educational, I think. Then, knowing which > criteria are correct, the appropriate action can then be taken, I think. > > >>>>>>> Hope this helps... > > >>>>>>> John David N. Dionisio, PhD > > >>>>>>> Associate Professor, Computer Science > > >>>>>>> Loyola Marymount University > > >>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > > >>>>>>> > Debug export is still going... 2.5GB of log files so far with > progress at 65%... > > >>>>>>> > > > >>>>>>> > I posted the link of the WARN log on the plasmodium page here: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . > > >>>>>>> > Richard > > >>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous < > rbr...@gm...> wrote: > > >>>>>>> > Hi all, > > >>>>>>> > > > >>>>>>> > Have been working through several Plasmodium gdb exports in an > attempt to source why only one gene id makes it into the Ordered Locus > table. > > >>>>>>> > > > >>>>>>> > I have reviewed the logger file while set to "WARN" and wasn't > able to determine anything which would suggest an error. I will post this > log file to the wiki later today when I get home. > > >>>>>>> > > > >>>>>>> > I then upped the logger verbosity to "DEBUG" and file size to > 100MB with hopes that more detail will surface the issue, but my export is > on hour 20 and still going (although its nearly complete). What I didn't > expect was the size of the log files and that it seems only the last 3 are > kept with earlier logs being overwritten =( I fear that the information I > need it in one of the earlier files which are now lost. > > >>>>>>> > > > >>>>>>> > Unless a better suggestion is offered I'm going to rerun an > export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each > and hope that 3 GB total will be enough to hold the complete export log. > > >>>>>>> > > > >>>>>>> > More info as it comes... > > >>>>>>> > > > >>>>>>> > Richard > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist < > kda...@lm...> wrote: > > >>>>>>> > Hi, > > >>>>>>> > > > >>>>>>> > I've completed testing the Plasmodium gdb I exported last > November and updated the SourceForge wiki. > > >>>>>>> > > > >>>>>>> > Plasmodium has it's own task list page, which I've updated > here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > > >>>>>>> > > > >>>>>>> > The testing report can be found here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > > >>>>>>> > > > >>>>>>> > The source files and gdb are on a new Plasmodium falciparum > page on the Fall 2010 BiolDB wiki: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > > >>>>>>> > > > >>>>>>> > Here is the list of bugs/action items that I've listed: > > >>>>>>> > > > >>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of > 5345 repored by the TallyEngine. This also affects all other tables related > to OrderedLocusNames. > > >>>>>>> > > > >>>>>>> > 2. The GeneId table in the database has 6 fewer IDs than > reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium > tuberculosis also have mysterious GeneId issues with the TallyEngine). > > >>>>>>> > > > >>>>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's > lower than the 2009 version of the gdb. There's no way to tell at this point > whether this is due to a change in annotation by UniProt or is a bug with > GenMAPP Builder. > > >>>>>>> > > > >>>>>>> > Thanks, > > >>>>>>> > Kam > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > ------------------------------------------------------------------------------ > > >>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You > > >>>>>>> > This paper provides an overview of data connectivity, details > > >>>>>>> > its effect on application quality, and explores various > alternative > > >>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d > > >>>>>>> > _______________________________________________ > > >>>>>>> > xmlpipedb-developer mailing list > > >>>>>>> > xml...@li... > > >>>>>>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > <ATT00001..txt><ATT00002..txt> > > >>>>>>> > ------------------------------------------------------------------------------ > > >>>>>>> Colocation vs. Managed Hosting > > >>>>>>> A question and answer guide to determining the best fit > > >>>>>>> for your organization - today and in the future. > > >>>>>>> http://p.sf.net/sfu/internap-sfd2d > > >>>>>>> _______________________________________________ > > >>>>>>> xmlpipedb-developer mailing list > > >>>>>>> xml...@li... > > >>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > ------------------------------------------------------------------------------ > > >> Colocation vs. Managed Hosting > > >> A question and answer guide to determining the best fit > > >> for your organization - today and in the future. > > >> http://p.sf.net/sfu/internap-sfd2d > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > ------------------------------------------------------------------------------ > > >> Colocation vs. Managed Hosting > > >> A question and answer guide to determining the best fit > > >> for your organization - today and in the future. > > >> http://p.sf.net/sfu/internap-sfd2d > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > > >> > > >> > ------------------------------------------------------------------------------ > > >> Colocation vs. Managed Hosting > > >> A question and answer guide to determining the best fit > > >> for your organization - today and in the future. > > >> http://p.sf.net/sfu/internap-sfd2d > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > > > <ATT00001..txt><ATT00002..txt> > > > > > > > ------------------------------------------------------------------------------ > > Colocation vs. Managed Hosting > > A question and answer guide to determining the best fit > > for your organization - today and in the future. > > http://p.sf.net/sfu/internap-sfd2d > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > <ATT00001..txt><ATT00002..txt> > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > |