Re: [XMLPipeDB-developer] Plasmodium bug/task list
Brought to you by:
kdahlquist,
zugzugglug
|
From: Richard B. <rbr...@gm...> - 2011-03-20 20:54:24
|
Sure thing. Can send now that gmail working again...
/**
* *@see*edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager,
edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager,
java.util.Date)
*/
@Override
*public* TableManager getSystemTableManagerCustomizations(TableManager
tableManager, TableManager primarySystemTableManager, Date version)
*throws*SQLException, InvalidParameterException {
// Start with the default OrderedLocusNames behavior.
TableManager result = *super*.getSystemTableManagerCustomizations(tableManager,
primarySystemTableManager, version);
// Next, we add IDs from the other gene/name tags, but ONLY if they match
// the pattern PF[A-Z][0-9]{4}[a-z].
//final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]";
//final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]";
//final String pfID3 = "MAL[0-9]*P1.[0-9]*";
String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " +
"from genenametype c inner join entrytype_genetype d " +
"on (c.entrytype_genetype_name_hjid = d.hjid) " +
//"where (c.value similar to ? " +
"where c.type = 'ORF'";
//"or c.value similar to ? " +
//"or c.value similar to ?) " +
//"and type <> 'ordered locus names' " +
//"and type <> 'ORF' " +
//"group by d.entrytype_gene_hjid, c.value";
String dateToday = GenMAPPBuilderUtilities.*getSystemsDateString*(version);
Connection c = ConnectionManager.*getRelationalDBConnection*();
PreparedStatement ps;
ResultSet rs;
*try* {
// Query, iterate, add to table manager.
ps = c.prepareStatement(sqlQuery);
//ps.setString(1, pfID);
//ps.setString(2, pfID2);
//ps.setString(3, pfID3);
rs = ps.executeQuery();
*while* (rs.next()) {
String hjid = Long.*valueOf*(rs.getLong("hjid")).toString();
String id = rs.getString("value");
*_Log*.debug("Processing raw ID: " + id + " for surrogate " + hjid);
tableManager.submit("OrderedLocusNames", QueryType.*insert*,
*new*String[][] { {
"ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"",
dateToday }, { "UID", hjid } });
}
} *catch*(SQLException sqlexc) {
logSQLException(sqlexc, sqlQuery);
}
*return* result;
}
On Sun, Mar 20, 2011 at 1:28 PM, John David N. Dionisio <do...@lm...>wrote:
> Hi Rich,
>
> Thanks for the updates. Did you post the new SQL query that you used
> somewhere? Let me know where I can take a look at it. Or, if you haven't
> put it out there yet, send it my way just for a quick look-see. Thanks!
>
> John David N. Dionisio, PhD
> Associate Professor, Computer Science
> Loyola Marymount University
>
>
> On Mar 20, 2011, at 12:15 PM, Richard Brous wrote:
>
> > Having weird probs with gmail in browser where the reply button isn't
> working so have to send from my phone... Odd
> >
> > So I here is my analysis on the ORF only export gdb:
> >
> > Initially the raw SQL query returned 5345 records of which I posted an
> excel doc link to the plasmodium page on the biodb wiki.
> >
> > The gdb contained only 5338 genes id's so I started looking for
> duplicates or exclusions.
> >
> > What I found were 7 id's that were duplicated in the raw SQL query of
> postgres but were not exported into the gdb.
> >
> > The id's are:
> > PF10_0168
> > PF11_0361 (duplicated 2x for total of 3 entries)
> > PF11_0377
> > PF11_0405
> > PFB0305c
> > PFB0391c
> >
> > Surprisingly (to me anyway) I noticed that the duplicates all had unique
> hjid's, which may or may not mean anything...
> >
> > In conclusion, it seems to me that the 5338 id's in the gdb are likely
> correct.
> >
> > Dr. D, does that make sense?
> >
> > Richard
> >
> > Sent from my iPhone
> >
> > On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote:
> >
> >> Yes I understand that you only want the 'ORF' but was trying to get what
> we need by modifying what we have and not rewite the whole query.
> >>
> >> I may in fact have to rewrite the whole thing anyway as my query didn't
> return what was expected =/
> >>
> >> Also I'm glad you mentioned a possible issue with MAL pattern, I'll keep
> an eye out for missing id's here.
> >>
> >> Richard
> >>
> >> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...>
> wrote:
> >> Hi,
> >>
> >> I think we need to capture all of the IDs in the ORF tag and *NOT* do
> the pattern match at all. As far as I can tell with my analysis of the IDs
> in the query you posted, we need to keep them all, so we don't actually need
> to specify the patterns at this point. I would rather do that thinking
> towards the future when the Plasmodium people might add new patterns to the
> ID system.
> >>
> >> I believe that the
> >>
> >> "MAL[0-9]*P1.[0-9]*"
> >>
> >> pattern is also not pulling out everything it needs to. But instead
> >> of including more patterns, I would rather just loosen up the criteria
> to
> >> include all things in the ORF tag.
> >>
> >> Also, I just want to be clear about the underscore issue. That only
> >> affects IDs that begin with PFA, not the other IDs that begin with PF##_
> >>
> >> Thanks,
> >> Kam
> >>
> >>
> >>
> >>
> >> At 12:47 PM 3/18/2011, Richard Brous wrote:
> >>> I'm down in the bio lab at the moment looking at this.
> >>>
> >>> I understand what needs to be done in regards to 1) keeping all ORF
> id's and then 2) querying the id's with underscores to then remove the
> underscores but maintaining the original underscore id's.
> >>>
> >>> I performed an export with the pattern match as-is and commented out
> the exclude 'ordered locus' and 'ORF'. The export completed but with only
> 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML
> file as seen from the raw sql query.
> >>>
> >>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went
> ahead and added it into the pattern match string and called it as the others
> are called in the query. Once the export completes I'll confirm that in fact
> we have captured all the id's. Once confirmed, I will move onto the query to
> find id's with underscores and handle them as mentioned above.
> >>>
> >>> Richard
> >>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...>
> wrote:
> >>> Thanks for info. I will dig into this after my 10 am exam tomorrow.
> >>>
> >>> Richard
> >>>
> >>> Sent from my iPhone
> >>>
> >>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> More information on the underscore issue:
> >>>>
> >>>> There is an ID with the pattern
> >>>>
> >>>> PFA_[0-9][0-9][0-9][0-9][wc]
> >>>>
> >>>> that needs to have the underscore removed so that reads instead
> >>>>
> >>>> PFA[0-9][0-9][0-9][0-9][wc]
> >>>>
> >>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they are
> there without the underscore and won't be recognized with it. I think we
> should leave the underscore ones there, but also have a set without the
> underscore. There are 134 records that have this issue.
> >>>>
> >>>> If Rich can make these two fixes (capturing the ORFs and dealing with
> the underscore), then I think we will be good to go with Plasmodium. There
> may be code in the Vibrio or Helicobacter profiles to help with the
> underscores, but I'm not sure.
> >>>>
> >>>> Best,
> >>>> Kam
> >>>>
> >>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I've taken a look at the list of IDs and did a quick comparison with
> both the older released gdb and also a list I downloaded from the Broad
> Institute Plasmodium database. I think we can safely go with the query on
> the ORF tag for our export--all of those different ID forms are valid.
> There are about 400 IDs that are different in the older released gdb than
> in the new query; I'm going to further investigate those. I suspect that
> the difference is mainly due to a +/- underscore issue that we might need to
> solve. However, we should go forward with capturing all the IDs from the
> ORF tag, I don't see a need to restrict to a particular pattern there.
> >>>>>
> >>>>> Best,
> >>>>> Kam
> >>>>>
> >>>>> At 09:48 PM 3/14/2011, you wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> So I went ahead and did raw sql queries of the Postgres data and
> turned up the following:
> >>>>>>
> >>>>>> select * from genenametype where type = 'ordered locus'
> >>>>>> Returned zero gene ids
> >>>>>>
> >>>>>> select * from genenametype where type = 'ORF'
> >>>>>> Returned 5345 gene ids
> >>>>>> The type = 'ORF' query was exported into excel and posted to the
> biodb wiki on the Spring 2011 Plasmodium page.
> >>>>>>
> >>>>>> There are many many patterns in regards to gene ids, here the the
> prefixes from my cursory look:
> >>>>>> MAL
> >>>>>> PF##_
> >>>>>> PFA
> >>>>>> PFB
> >>>>>> PFC
> >>>>>> PFD
> >>>>>> PFE
> >>>>>> PFF
> >>>>>> PFI
> >>>>>> PFL
> >>>>>>
> >>>>>> Richard
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...>
> wrote:
> >>>>>> Hi,
> >>>>>> I looked up an assortment of IDs in UniProt and I can confirm that
> it appears that the IDs are found in the ORF tag, not the OrderedLocus tag
> (except for the one that got captured in the export).
> >>>>>> Best,
> >>>>>> Kam
> >>>>>> At 08:09 AM 3/14/2011, you wrote:
> >>>>>>> Thanks Dondi,
> >>>>>>>
> >>>>>>> Will review this after our call today. I have been a little worried
> as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5
> Gb of log files so far... /yikes
> >>>>>>>
> >>>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be
> working from home afterwards instead of in the bio lab.
> >>>>>>>
> >>>>>>> Richard
> >>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <
> do...@lm...> wrote:
> >>>>>>> Thanks for the updates, Rich.
> >>>>>>> I gave things a once-over and may have a lead. Here is what I
> found:
> >>>>>>> - First, the TallyEngine customization for P. falciparum states the
> following:
> >>>>>>> # Plasmodium falciparum
> >>>>>>> plasmodiumfalciparum_level_amount=2
> >>>>>>>
> plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF
> >>>>>>>
> plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene
> >>>>>>> plasmodiumfalciparum_query_level0=select count(*) from genenametype
> where type = 'ORF';
> >>>>>>> plasmodiumfalciparum_query_level1=select count(*) from genenametype
> where type = 'UniGene';
> >>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus
> >>>>>>> plasmodiumfalciparum_table_name_level1=UniGene
> >>>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are
> the gene names whose type is 'ORF' ("level0" properties).
> >>>>>>> - Now, this is what the P. falciparum species profile does when
> harvesting IDs
> (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations):
> >>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid,
> c.value " +
> >>>>>>> "from genenametype c inner join entrytype_genetype d " +
> >>>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " +
> >>>>>>> "where (c.value similar to ? " +
> >>>>>>> "or c.value similar to ? " +
> >>>>>>> "or c.value similar to ?) " +
> >>>>>>> "and type <> 'ordered locus names' " +
> >>>>>>> "and type <> 'ORF' " +
> >>>>>>> "group by d.entrytype_gene_hjid, c.value";
> >>>>>>> Note the condition on the second-to-last line --- the query
> actually *omits* gene names whose type is 'ORF'! So the question is...which
> is right? (I'm inclined to believe the Tally Engine here, since, the export
> puts only one record in OrderedLocusNames)
> >>>>>>> Still, comparing these two queries directly against the PostgreSQL
> database would be educational, I think. Then, knowing which criteria are
> correct, the appropriate action can then be taken, I think.
> >>>>>>> Hope this helps...
> >>>>>>> John David N. Dionisio, PhD
> >>>>>>> Associate Professor, Computer Science
> >>>>>>> Loyola Marymount University
> >>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote:
> >>>>>>> > Debug export is still going... 2.5GB of log files so far with
> progress at 65%...
> >>>>>>> >
> >>>>>>> > I posted the link of the WARN log on the plasmodium page here:
> https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum .
> >>>>>>> > Richard
> >>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <
> rbr...@gm...> wrote:
> >>>>>>> > Hi all,
> >>>>>>> >
> >>>>>>> > Have been working through several Plasmodium gdb exports in an
> attempt to source why only one gene id makes it into the Ordered Locus
> table.
> >>>>>>> >
> >>>>>>> > I have reviewed the logger file while set to "WARN" and wasn't
> able to determine anything which would suggest an error. I will post this
> log file to the wiki later today when I get home.
> >>>>>>> >
> >>>>>>> > I then upped the logger verbosity to "DEBUG" and file size to
> 100MB with hopes that more detail will surface the issue, but my export is
> on hour 20 and still going (although its nearly complete). What I didn't
> expect was the size of the log files and that it seems only the last 3 are
> kept with earlier logs being overwritten =( I fear that the information I
> need it in one of the earlier files which are now lost.
> >>>>>>> >
> >>>>>>> > Unless a better suggestion is offered I'm going to rerun an
> export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each
> and hope that 3 GB total will be enough to hold the complete export log.
> >>>>>>> >
> >>>>>>> > More info as it comes...
> >>>>>>> >
> >>>>>>> > Richard
> >>>>>>> >
> >>>>>>> >
> >>>>>>> >
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <
> kda...@lm...> wrote:
> >>>>>>> > Hi,
> >>>>>>> >
> >>>>>>> > I've completed testing the Plasmodium gdb I exported last
> November and updated the SourceForge wiki.
> >>>>>>> >
> >>>>>>> > Plasmodium has it's own task list page, which I've updated here:
>
> https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List
> >>>>>>> >
> >>>>>>> > The testing report can be found here:
> https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115
> >>>>>>> >
> >>>>>>> > The source files and gdb are on a new Plasmodium falciparum page
> on the Fall 2010 BiolDB wiki:
> https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum
> >>>>>>> >
> >>>>>>> > Here is the list of bugs/action items that I've listed:
> >>>>>>> >
> >>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of
> 5345 repored by the TallyEngine. This also affects all other tables related
> to OrderedLocusNames.
> >>>>>>> >
> >>>>>>> > 2. The GeneId table in the database has 6 fewer IDs than
> reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium
> tuberculosis also have mysterious GeneId issues with the TallyEngine).
> >>>>>>> >
> >>>>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's lower
> than the 2009 version of the gdb. There's no way to tell at this point
> whether this is due to a change in annotation by UniProt or is a bug with
> GenMAPP Builder.
> >>>>>>> >
> >>>>>>> > Thanks,
> >>>>>>> > Kam
> >>>>>>> >
> >>>>>>> >
> >>>>>>> >
> ------------------------------------------------------------------------------
> >>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You
> >>>>>>> > This paper provides an overview of data connectivity, details
> >>>>>>> > its effect on application quality, and explores various
> alternative
> >>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d
> >>>>>>> > _______________________________________________
> >>>>>>> > xmlpipedb-developer mailing list
> >>>>>>> > xml...@li...
> >>>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >>>>>>> >
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > <ATT00001..txt><ATT00002..txt>
> >>>>>>>
> ------------------------------------------------------------------------------
> >>>>>>> Colocation vs. Managed Hosting
> >>>>>>> A question and answer guide to determining the best fit
> >>>>>>> for your organization - today and in the future.
> >>>>>>> http://p.sf.net/sfu/internap-sfd2d
> >>>>>>> _______________________________________________
> >>>>>>> xmlpipedb-developer mailing list
> >>>>>>> xml...@li...
> >>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >>
> ------------------------------------------------------------------------------
> >> Colocation vs. Managed Hosting
> >> A question and answer guide to determining the best fit
> >> for your organization - today and in the future.
> >> http://p.sf.net/sfu/internap-sfd2d
> >> _______________________________________________
> >> xmlpipedb-developer mailing list
> >> xml...@li...
> >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >>
> >>
> ------------------------------------------------------------------------------
> >> Colocation vs. Managed Hosting
> >> A question and answer guide to determining the best fit
> >> for your organization - today and in the future.
> >> http://p.sf.net/sfu/internap-sfd2d
> >> _______________________________________________
> >> xmlpipedb-developer mailing list
> >> xml...@li...
> >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >>
> >>
> >>
> >>
> ------------------------------------------------------------------------------
> >> Colocation vs. Managed Hosting
> >> A question and answer guide to determining the best fit
> >> for your organization - today and in the future.
> >> http://p.sf.net/sfu/internap-sfd2d
> >> _______________________________________________
> >> xmlpipedb-developer mailing list
> >> xml...@li...
> >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
> >>
> >>
> > <ATT00001..txt><ATT00002..txt>
>
>
>
> ------------------------------------------------------------------------------
> Colocation vs. Managed Hosting
> A question and answer guide to determining the best fit
> for your organization - today and in the future.
> http://p.sf.net/sfu/internap-sfd2d
> _______________________________________________
> xmlpipedb-developer mailing list
> xml...@li...
> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer
>
|