xmlpipedb-developer Mailing List for XMLPipeDB (Page 12)
Brought to you by:
kdahlquist,
zugzugglug
You can subscribe to this list here.
2009 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
(16) |
Sep
|
Oct
(9) |
Nov
(3) |
Dec
(6) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2010 |
Jan
(2) |
Feb
(8) |
Mar
|
Apr
(22) |
May
(1) |
Jun
|
Jul
|
Aug
(3) |
Sep
(32) |
Oct
(2) |
Nov
|
Dec
|
2011 |
Jan
|
Feb
(60) |
Mar
(42) |
Apr
(35) |
May
(17) |
Jun
(2) |
Jul
(23) |
Aug
(72) |
Sep
(15) |
Oct
(10) |
Nov
(14) |
Dec
(4) |
2012 |
Jan
(6) |
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
2013 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
(11) |
Dec
|
2014 |
Jan
(1) |
Feb
(12) |
Mar
(14) |
Apr
(8) |
May
|
Jun
(14) |
Jul
(2) |
Aug
|
Sep
(5) |
Oct
(6) |
Nov
|
Dec
|
2015 |
Jan
|
Feb
(5) |
Mar
(2) |
Apr
|
May
|
Jun
(2) |
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
From: John D. N. D. <do...@lm...> - 2011-03-29 19:38:58
|
Hi Rich, The unknown with this error is whether this is a limitation on the server side (SourceForge) or client side (your copy of Subclipse/Subversion). If the former, that will need some investigation. Let's try with the latter case first: - Make sure you have the latest Subclipse/Subversion. According to Subclipse, they are at 1.6.17. - You might need to re-check out the projects if you updated Subclipse versions. - Try again. If the error persists, then this may be a server side issue, in which the action may be to *downgrade* the Subversion version. But we'll cross that bridge if/when we get to it. Hope this helps... John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Mar 29, 2011, at 10:13 AM, Richard Brous wrote: > Hi Dondi, > > I'm working on the merge and have located the merge functionality under the team menu in Eclipse. > > The first menu requests a choice regarding the type of merge required. > I would expect the correct selection to be reintegrate branch. > > I then select the branch to merge and ask for conflicts of all types be noted for later review. > > Click finish and it attempts to merge but I receive an error message of the following: > org.tigris.subversion.javahl.ClientException: Trying to use an unsupported feature > svn: Querying mergeinfo requires version 3 of the FSFS filesystem schema; filesystem '/nfs/sf-svn-symlinks/xmlpipedb/db' uses only version 2 > org.tigris.subversion.javahl.ClientException: Trying to use an unsupported feature > svn: Querying mergeinfo requires version 3 of the FSFS filesystem schema; filesystem '/nfs/sf-svn-symlinks/xmlpipedb/db' uses only version 2 > Help please! > > Thanks. > > Richard > On Fri, Mar 25, 2011 at 3:50 PM, Richard Brous <rbr...@gm...> wrote: > ok, beta 63 is up on sourceforge. > > Richard > > On Fri, Mar 25, 2011 at 2:55 PM, Richard Brous <rbr...@gm...> wrote: > will do. i'll send out an email when completed. > > On Fri, Mar 25, 2011 at 11:30 AM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I've checked the gdb and it looks OK. Richard, will you release a build of GenMAPP Builder with these changes so that I can test the export on one of my machines? I am going to work on the ReadMe for this gdb so we can release it. > > Once I've validated that the new version of GenMAPP Builder works, we'll be ready to move on to the code merge. > > Best, > Dr. D > > > At 09:15 PM 3/24/2011, you wrote: >> Sorry for the late reply... been laid up with a cold and barely functioning on cold medicine... >> >> I believe I have the Pfalciparum solved by using the following code: >> >> >> /** >> >> * >> >> @see edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, java.util.Date) >> >> */ >> >> @Override >> >> public TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) throws SQLException, InvalidParameterException { >> >> // Start with the default OrderedLocusNames behavior. >> >> TableManager result = >> super .getSystemTableManagerCustomizations(tableManager, primarySystemTableManager, version); >> >> // Next, we add IDs from the other gene/name tags, but ONLY if they match >> >> // the pattern PF[A-Z][0-9]{4}[a-z]. >> >> //final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]"; >> >> //final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]"; >> >> //final String pfID3 = "MAL[0-9]*P1.[0-9]*"; >> >> String sqlQuery = >> "select d.entrytype_gene_hjid as hjid, c.value " + >> >> "from genenametype c inner join entrytype_genetype d " + >> >> "on (c.entrytype_genetype_name_hjid = d.hjid) " + >> >> //"where (c.value similar to ? " + >> >> "where c.type = 'ORF'" + >> >> //"or c.value similar to ? " + >> >> //"or c.value similar to ?) " + >> >> //"and type <> 'ordered locus names' " + >> >> //"and type <> 'ORF' " + >> >> "group by d.entrytype_gene_hjid, c.value"; >> >> String dateToday = GenMAPPBuilderUtilities.getSystemsDateString(version); >> >> Connection c = ConnectionManager.getRelationalDBConnection(); >> >> PreparedStatement ps; >> >> ResultSet rs; >> >> try { >> >> // Query, iterate, add to table manager. >> >> ps = c.prepareStatement(sqlQuery); >> >> //ps.setString(1, pfID); >> >> //ps.setString(2, pfID2); >> >> //ps.setString(3, pfID3); >> >> rs = ps.executeQuery(); >> >> while (rs.next()) { >> >> String hjid = Long.valueOf(rs.getLong( >> "hjid" )).toString(); >> >> // capture unmodified value in string variable id >> >> // We want to remove the '_' here but also add id's with the '_' >> >> String id = rs.getString( >> "value" ); >> >> // condition if id matched pattern "PFA_" >> >> if (id.matches("PFA_.*" )) { >> >> String[] substrings = id.split("/" ); >> >> String new_id = null; >> >> String old_id = id; >> >> for (int i = 0; i < substrings.length ; i++) { >> >> new_id = substrings[i].replace("_" , ""); >> >> _Log.debug( "Remove '_' from " + id + " to create: " + new_id + " for surrogate " + hjid); >> >> result.submit( "OrderedLocusNames", QueryType.insert , new String[][] { { "ID", new_id }, { "Species" , "|" + getSpeciesName() + "|" }, { "\"Date\"" , dateToday }, { "UID", hjid } }); >> >> _Log.debug( "Keep '_' from " + id + " to create: " + old_id + " for surrogate " + hjid); >> >> result.submit( "OrderedLocusNames", QueryType.insert , new String[][] { { "ID", old_id }, { "Species" , "|" + getSpeciesName() + "|" }, { "\"Date\"" , dateToday }, { "UID", hjid } }); >> >> } >> >> } >> >> // otherwise process as normal >> >> else { >> >> _Log .debug("Processing raw ID: " + id + " for surrogate " + hjid); >> >> tableManager.submit( >> "OrderedLocusNames" , QueryType.insert , new String[][] { { "ID", id }, { "Species" , "|" + getSpeciesName() + "|" }, { "\"Date\"" , dateToday }, { "UID", hjid } }); >> >> } >> >> } >> >> } >> catch (SQLException sqlexc) { >> >> logSQLException(sqlexc, sqlQuery); >> >> } >> >> return result; >> >> } >> >> I have exported a gdb which contains 5472 gene id's as compared to the previous 5338. The 134 gene id's with the prefix PFA_ are included plus their underscore removed form. >> >> Zipped up and attached here and will later post to the wiki. >> >> Richard > > ------------------------------------------------------------------------------ > Enable your software for Intel(R) Active Management Technology to meet the > growing manageability and security demands of your customers. Businesses > are taking advantage of Intel(R) vPro (TM) technology - will your software > be a part of the solution? Download the Intel(R) Manageability Checker > today! http://p.sf.net/sfu/intel-dev2devmar > > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > <ATT00001..txt><ATT00002..txt> |
From: Richard B. <rbr...@gm...> - 2011-03-29 17:13:55
|
Hi Dondi, I'm working on the merge and have located the merge functionality under the team menu in Eclipse. The first menu requests a choice regarding the type of merge required. I would expect the correct selection to be reintegrate branch. I then select the branch to merge and ask for conflicts of all types be noted for later review. Click finish and it attempts to merge but I receive an error message of the following: org.tigris.subversion.javahl.ClientException: Trying to use an unsupported feature svn: Querying mergeinfo requires version 3 of the FSFS filesystem schema; filesystem '/nfs/sf-svn-symlinks/xmlpipedb/db' uses only version 2 org.tigris.subversion.javahl.ClientException: Trying to use an unsupported feature svn: Querying mergeinfo requires version 3 of the FSFS filesystem schema; filesystem '/nfs/sf-svn-symlinks/xmlpipedb/db' uses only version 2 Help please! Thanks. Richard On Fri, Mar 25, 2011 at 3:50 PM, Richard Brous <rbr...@gm...> wrote: > ok, beta 63 is up on sourceforge. > > Richard > > On Fri, Mar 25, 2011 at 2:55 PM, Richard Brous <rbr...@gm...>wrote: > >> will do. i'll send out an email when completed. >> >> On Fri, Mar 25, 2011 at 11:30 AM, Kam Dahlquist <kda...@lm...>wrote: >> >>> Hi, >>> >>> I've checked the gdb and it looks OK. Richard, will you release a build >>> of GenMAPP Builder with these changes so that I can test the export on one >>> of my machines? I am going to work on the ReadMe for this gdb so we can >>> release it. >>> >>> Once I've validated that the new version of GenMAPP Builder works, we'll >>> be ready to move on to the code merge. >>> >>> Best, >>> Dr. D >>> >>> >>> At 09:15 PM 3/24/2011, you wrote: >>> >>> Sorry for the late reply... been laid up with a cold and barely >>> functioning on cold medicine... >>> >>> I believe I have the Pfalciparum solved by using the following code: >>> >>> >>> /** >>> >>> * >>> >>> *@see* edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, >>> edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, >>> java.util.Date) >>> >>> */ >>> >>> @Override >>> >>> *public* TableManager getSystemTableManagerCustomizations(TableManager >>> tableManager, TableManager primarySystemTableManager, Date version) * >>> throws* SQLException, InvalidParameterException { >>> >>> // Start with the default OrderedLocusNames behavior. >>> >>> TableManager result = >>> *super* .getSystemTableManagerCustomizations(tableManager, >>> primarySystemTableManager, version); >>> >>> // Next, we add IDs from the other gene/name tags, but ONLY if they match >>> >>> // the pattern PF[A-Z][0-9]{4}[a-z]. >>> >>> //final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]"; >>> >>> //final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]"; >>> >>> //final String pfID3 = "MAL[0-9]*P1.[0-9]*"; >>> >>> String sqlQuery = >>> "select d.entrytype_gene_hjid as hjid, c.value " + >>> >>> "from genenametype c inner join entrytype_genetype d " + >>> >>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + >>> >>> //"where (c.value similar to ? " + >>> >>> "where c.type = 'ORF'" + >>> >>> //"or c.value similar to ? " + >>> >>> //"or c.value similar to ?) " + >>> >>> //"and type <> 'ordered locus names' " + >>> >>> //"and type <> 'ORF' " + >>> >>> "group by d.entrytype_gene_hjid, c.value"; >>> >>> String dateToday = GenMAPPBuilderUtilities.*getSystemsDateString* >>> (version); >>> >>> Connection c = ConnectionManager.*getRelationalDBConnection*(); >>> >>> PreparedStatement ps; >>> >>> ResultSet rs; >>> >>> *try* { >>> >>> // Query, iterate, add to table manager. >>> >>> ps = c.prepareStatement(sqlQuery); >>> >>> //ps.setString(1, pfID); >>> >>> //ps.setString(2, pfID2); >>> >>> //ps.setString(3, pfID3); >>> >>> rs = ps.executeQuery(); >>> >>> *while* (rs.next()) { >>> >>> String hjid = Long.*valueOf*(rs.getLong( >>> "hjid" )).toString(); >>> >>> // capture unmodified value in string variable id >>> >>> // We want to remove the '_' here but also add id's with the '_' >>> >>> String id = rs.getString( >>> "value" ); >>> >>> // condition if id matched pattern "PFA_" >>> >>> *if* (id.matches("PFA_.*" )) { >>> >>> String[] substrings = id.split("/" ); >>> >>> String new_id = *null*; >>> >>> String old_id = id; >>> >>> *for* (*int* i = 0; i < substrings.length ; i++) { >>> >>> new_id = substrings[i].replace("_" , ""); >>> >>> *_Log*.debug( "Remove '_' from " + id + " to create: " + new_id + " for >>> surrogate " + hjid); >>> >>> result.submit( "OrderedLocusNames", QueryType.*insert* , *new*String[][] { { >>> "ID", new_id }, { "Species" , "|" + getSpeciesName() + "|" }, { >>> "\"Date\"" , dateToday }, { "UID", hjid } }); >>> >>> *_Log*.debug( "Keep '_' from " + id + " to create: " + old_id + " for >>> surrogate " + hjid); >>> >>> result.submit( "OrderedLocusNames", QueryType.*insert* , *new*String[][] { { >>> "ID", old_id }, { "Species" , "|" + getSpeciesName() + "|" }, { >>> "\"Date\"" , dateToday }, { "UID", hjid } }); >>> >>> } >>> >>> } >>> >>> // otherwise process as normal >>> >>> *else* { >>> >>> *_Log* .debug("Processing raw ID: " + id + " for surrogate " + hjid); >>> >>> tableManager.submit( >>> "OrderedLocusNames" , QueryType.*insert* , *new* String[][] { { "ID", id >>> }, { "Species" , "|" + getSpeciesName() + "|" }, { "\"Date\"" , >>> dateToday }, { "UID", hjid } }); >>> >>> } >>> >>> } >>> >>> } >>> *catch* (SQLException sqlexc) { >>> >>> logSQLException(sqlexc, sqlQuery); >>> >>> } >>> >>> *return* result; >>> >>> } >>> >>> I have exported a gdb which contains 5472 gene id's as compared to the >>> previous 5338. The 134 gene id's with the prefix PFA_ are included plus >>> their underscore removed form. >>> >>> Zipped up and attached here and will later post to the wiki. >>> >>> Richard >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Enable your software for Intel(R) Active Management Technology to meet >>> the >>> growing manageability and security demands of your customers. Businesses >>> are taking advantage of Intel(R) vPro (TM) technology - will your >>> software >>> be a part of the solution? Download the Intel(R) Manageability Checker >>> today! http://p.sf.net/sfu/intel-dev2devmar >>> >>> _______________________________________________ >>> xmlpipedb-developer mailing list >>> xml...@li... >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> >>> >> > |
From: Richard B. <rbr...@gm...> - 2011-03-25 22:50:57
|
ok, beta 63 is up on sourceforge. Richard On Fri, Mar 25, 2011 at 2:55 PM, Richard Brous <rbr...@gm...> wrote: > will do. i'll send out an email when completed. > > On Fri, Mar 25, 2011 at 11:30 AM, Kam Dahlquist <kda...@lm...>wrote: > >> Hi, >> >> I've checked the gdb and it looks OK. Richard, will you release a build >> of GenMAPP Builder with these changes so that I can test the export on one >> of my machines? I am going to work on the ReadMe for this gdb so we can >> release it. >> >> Once I've validated that the new version of GenMAPP Builder works, we'll >> be ready to move on to the code merge. >> >> Best, >> Dr. D >> >> >> At 09:15 PM 3/24/2011, you wrote: >> >> Sorry for the late reply... been laid up with a cold and barely >> functioning on cold medicine... >> >> I believe I have the Pfalciparum solved by using the following code: >> >> >> /** >> >> * >> >> *@see* edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, >> edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, >> java.util.Date) >> >> */ >> >> @Override >> >> *public* TableManager getSystemTableManagerCustomizations(TableManager >> tableManager, TableManager primarySystemTableManager, Date version) * >> throws* SQLException, InvalidParameterException { >> >> // Start with the default OrderedLocusNames behavior. >> >> TableManager result = >> *super* .getSystemTableManagerCustomizations(tableManager, >> primarySystemTableManager, version); >> >> // Next, we add IDs from the other gene/name tags, but ONLY if they match >> >> // the pattern PF[A-Z][0-9]{4}[a-z]. >> >> //final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]"; >> >> //final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]"; >> >> //final String pfID3 = "MAL[0-9]*P1.[0-9]*"; >> >> String sqlQuery = >> "select d.entrytype_gene_hjid as hjid, c.value " + >> >> "from genenametype c inner join entrytype_genetype d " + >> >> "on (c.entrytype_genetype_name_hjid = d.hjid) " + >> >> //"where (c.value similar to ? " + >> >> "where c.type = 'ORF'" + >> >> //"or c.value similar to ? " + >> >> //"or c.value similar to ?) " + >> >> //"and type <> 'ordered locus names' " + >> >> //"and type <> 'ORF' " + >> >> "group by d.entrytype_gene_hjid, c.value"; >> >> String dateToday = GenMAPPBuilderUtilities.*getSystemsDateString* >> (version); >> >> Connection c = ConnectionManager.*getRelationalDBConnection*(); >> >> PreparedStatement ps; >> >> ResultSet rs; >> >> *try* { >> >> // Query, iterate, add to table manager. >> >> ps = c.prepareStatement(sqlQuery); >> >> //ps.setString(1, pfID); >> >> //ps.setString(2, pfID2); >> >> //ps.setString(3, pfID3); >> >> rs = ps.executeQuery(); >> >> *while* (rs.next()) { >> >> String hjid = Long.*valueOf*(rs.getLong( >> "hjid" )).toString(); >> >> // capture unmodified value in string variable id >> >> // We want to remove the '_' here but also add id's with the '_' >> >> String id = rs.getString( >> "value" ); >> >> // condition if id matched pattern "PFA_" >> >> *if* (id.matches("PFA_.*" )) { >> >> String[] substrings = id.split("/" ); >> >> String new_id = *null*; >> >> String old_id = id; >> >> *for* (*int* i = 0; i < substrings.length ; i++) { >> >> new_id = substrings[i].replace("_" , ""); >> >> *_Log*.debug( "Remove '_' from " + id + " to create: " + new_id + " for >> surrogate " + hjid); >> >> result.submit( "OrderedLocusNames", QueryType.*insert* , *new* String[][] >> { { "ID", new_id }, { "Species" , "|" + getSpeciesName() + "|" }, { >> "\"Date\"" , dateToday }, { "UID", hjid } }); >> >> *_Log*.debug( "Keep '_' from " + id + " to create: " + old_id + " for >> surrogate " + hjid); >> >> result.submit( "OrderedLocusNames", QueryType.*insert* , *new* String[][] >> { { "ID", old_id }, { "Species" , "|" + getSpeciesName() + "|" }, { >> "\"Date\"" , dateToday }, { "UID", hjid } }); >> >> } >> >> } >> >> // otherwise process as normal >> >> *else* { >> >> *_Log* .debug("Processing raw ID: " + id + " for surrogate " + hjid); >> >> tableManager.submit( >> "OrderedLocusNames" , QueryType.*insert* , *new* String[][] { { "ID", id >> }, { "Species" , "|" + getSpeciesName() + "|" }, { "\"Date\"" , dateToday >> }, { "UID", hjid } }); >> >> } >> >> } >> >> } >> *catch* (SQLException sqlexc) { >> >> logSQLException(sqlexc, sqlQuery); >> >> } >> >> *return* result; >> >> } >> >> I have exported a gdb which contains 5472 gene id's as compared to the >> previous 5338. The 134 gene id's with the prefix PFA_ are included plus >> their underscore removed form. >> >> Zipped up and attached here and will later post to the wiki. >> >> Richard >> >> >> >> ------------------------------------------------------------------------------ >> Enable your software for Intel(R) Active Management Technology to meet the >> growing manageability and security demands of your customers. Businesses >> are taking advantage of Intel(R) vPro (TM) technology - will your software >> be a part of the solution? Download the Intel(R) Manageability Checker >> today! http://p.sf.net/sfu/intel-dev2devmar >> >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> > |
From: Richard B. <rbr...@gm...> - 2011-03-25 21:55:48
|
will do. i'll send out an email when completed. On Fri, Mar 25, 2011 at 11:30 AM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I've checked the gdb and it looks OK. Richard, will you release a build of > GenMAPP Builder with these changes so that I can test the export on one of > my machines? I am going to work on the ReadMe for this gdb so we can > release it. > > Once I've validated that the new version of GenMAPP Builder works, we'll be > ready to move on to the code merge. > > Best, > Dr. D > > > At 09:15 PM 3/24/2011, you wrote: > > Sorry for the late reply... been laid up with a cold and barely functioning > on cold medicine... > > I believe I have the Pfalciparum solved by using the following code: > > > /** > > * > > *@see* edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, > edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, > java.util.Date) > > */ > > @Override > > *public* TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) * > throws* SQLException, InvalidParameterException { > > // Start with the default OrderedLocusNames behavior. > > TableManager result = > *super* .getSystemTableManagerCustomizations(tableManager, > primarySystemTableManager, version); > > // Next, we add IDs from the other gene/name tags, but ONLY if they match > > // the pattern PF[A-Z][0-9]{4}[a-z]. > > //final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]"; > > //final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]"; > > //final String pfID3 = "MAL[0-9]*P1.[0-9]*"; > > String sqlQuery = > "select d.entrytype_gene_hjid as hjid, c.value " + > > "from genenametype c inner join entrytype_genetype d " + > > "on (c.entrytype_genetype_name_hjid = d.hjid) " + > > //"where (c.value similar to ? " + > > "where c.type = 'ORF'" + > > //"or c.value similar to ? " + > > //"or c.value similar to ?) " + > > //"and type <> 'ordered locus names' " + > > //"and type <> 'ORF' " + > > "group by d.entrytype_gene_hjid, c.value"; > > String dateToday = GenMAPPBuilderUtilities.*getSystemsDateString* > (version); > > Connection c = ConnectionManager.*getRelationalDBConnection*(); > > PreparedStatement ps; > > ResultSet rs; > > *try* { > > // Query, iterate, add to table manager. > > ps = c.prepareStatement(sqlQuery); > > //ps.setString(1, pfID); > > //ps.setString(2, pfID2); > > //ps.setString(3, pfID3); > > rs = ps.executeQuery(); > > *while* (rs.next()) { > > String hjid = Long.*valueOf*(rs.getLong( > "hjid" )).toString(); > > // capture unmodified value in string variable id > > // We want to remove the '_' here but also add id's with the '_' > > String id = rs.getString( > "value" ); > > // condition if id matched pattern "PFA_" > > *if* (id.matches("PFA_.*" )) { > > String[] substrings = id.split("/" ); > > String new_id = *null*; > > String old_id = id; > > *for* (*int* i = 0; i < substrings.length ; i++) { > > new_id = substrings[i].replace("_" , ""); > > *_Log*.debug( "Remove '_' from " + id + " to create: " + new_id + " for > surrogate " + hjid); > > result.submit( "OrderedLocusNames", QueryType.*insert* , *new* String[][] > { { "ID", new_id }, { "Species" , "|" + getSpeciesName() + "|" }, { > "\"Date\"" , dateToday }, { "UID", hjid } }); > > *_Log*.debug( "Keep '_' from " + id + " to create: " + old_id + " for > surrogate " + hjid); > > result.submit( "OrderedLocusNames", QueryType.*insert* , *new* String[][] > { { "ID", old_id }, { "Species" , "|" + getSpeciesName() + "|" }, { > "\"Date\"" , dateToday }, { "UID", hjid } }); > > } > > } > > // otherwise process as normal > > *else* { > > *_Log* .debug("Processing raw ID: " + id + " for surrogate " + hjid); > > tableManager.submit( > "OrderedLocusNames" , QueryType.*insert* , *new* String[][] { { "ID", id > }, { "Species" , "|" + getSpeciesName() + "|" }, { "\"Date\"" , dateToday > }, { "UID", hjid } }); > > } > > } > > } > *catch* (SQLException sqlexc) { > > logSQLException(sqlexc, sqlQuery); > > } > > *return* result; > > } > > I have exported a gdb which contains 5472 gene id's as compared to the > previous 5338. The 134 gene id's with the prefix PFA_ are included plus > their underscore removed form. > > Zipped up and attached here and will later post to the wiki. > > Richard > > > > ------------------------------------------------------------------------------ > Enable your software for Intel(R) Active Management Technology to meet the > growing manageability and security demands of your customers. Businesses > are taking advantage of Intel(R) vPro (TM) technology - will your software > be a part of the solution? Download the Intel(R) Manageability Checker > today! http://p.sf.net/sfu/intel-dev2devmar > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: Kam D. <kda...@lm...> - 2011-03-25 18:30:44
|
Hi, I've checked the gdb and it looks OK. Richard, will you release a build of GenMAPP Builder with these changes so that I can test the export on one of my machines? I am going to work on the ReadMe for this gdb so we can release it. Once I've validated that the new version of GenMAPP Builder works, we'll be ready to move on to the code merge. Best, Dr. D At 09:15 PM 3/24/2011, you wrote: >Sorry for the late reply... been laid up with a cold and barely >functioning on cold medicine... > >I believe I have the Pfalciparum solved by using the following code: > > >/** > >* > >@see >edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, >edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, >java.util.Date) > >*/ > >@Override > >public TableManager getSystemTableManagerCustomizations(TableManager >tableManager, TableManager primarySystemTableManager, Date version) >throws SQLException, InvalidParameterException { > >// Start with the default OrderedLocusNames behavior. > >TableManager result = >super.getSystemTableManagerCustomizations(tableManager, >primarySystemTableManager, version); > >// Next, we add IDs from the other gene/name tags, but ONLY if they match > >// the pattern PF[A-Z][0-9]{4}[a-z]. > >//final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]"; > >//final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]"; > >//final String pfID3 = "MAL[0-9]*P1.[0-9]*"; > >String sqlQuery = >"select d.entrytype_gene_hjid as hjid, c.value " + > >"from genenametype c inner join entrytype_genetype d " + > >"on (c.entrytype_genetype_name_hjid = d.hjid) " + > >//"where (c.value similar to ? " + > >"where c.type = 'ORF'" + > >//"or c.value similar to ? " + > >//"or c.value similar to ?) " + > >//"and type <> 'ordered locus names' " + > >//"and type <> 'ORF' " + > >"group by d.entrytype_gene_hjid, c.value"; > >String dateToday = GenMAPPBuilderUtilities.getSystemsDateString(version); > >Connection c = ConnectionManager.getRelationalDBConnection(); > >PreparedStatement ps; > >ResultSet rs; > >try { > >// Query, iterate, add to table manager. > >ps = c.prepareStatement(sqlQuery); > >//ps.setString(1, pfID); > >//ps.setString(2, pfID2); > >//ps.setString(3, pfID3); > >rs = ps.executeQuery(); > >while (rs.next()) { > >String hjid = Long.valueOf(rs.getLong( >"hjid")).toString(); > >// capture unmodified value in string variable id > >// We want to remove the '_' here but also add id's with the '_' > >String id = rs.getString( >"value"); > >// condition if id matched pattern "PFA_" > >if (id.matches("PFA_.*")) { > >String[] substrings = id.split("/"); > >String new_id = null; > >String old_id = id; > >for (int i = 0; i < substrings.length; i++) { > >new_id = substrings[i].replace("_", ""); > >_Log.debug("Remove '_' from " + id + " to create: " + new_id + " for >surrogate " + hjid); > >result.submit("OrderedLocusNames", QueryType.insert, new String[][] >{ { "ID", new_id }, { "Species", "|" + getSpeciesName() + "|" }, { >"\"Date\"", dateToday }, { "UID", hjid } }); > >_Log.debug("Keep '_' from " + id + " to create: " + old_id + " for >surrogate " + hjid); > >result.submit("OrderedLocusNames", QueryType.insert, new String[][] >{ { "ID", old_id }, { "Species", "|" + getSpeciesName() + "|" }, { >"\"Date\"", dateToday }, { "UID", hjid } }); > >} > >} > >// otherwise process as normal > >else { > >_Log.debug("Processing raw ID: " + id + " for surrogate " + hjid); > >tableManager.submit( >"OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id >}, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", >dateToday }, { "UID", hjid } }); > >} > >} > >} >catch(SQLException sqlexc) { > >logSQLException(sqlexc, sqlQuery); > >} > >return result; > >} > >I have exported a gdb which contains 5472 gene id's as compared to >the previous 5338. The 134 gene id's with the prefix PFA_ are >included plus their underscore removed form. > >Zipped up and attached here and will later post to the wiki. > >Richard |
From: Kam D. <kda...@lm...> - 2011-03-21 04:37:16
|
Hi, That's OK with me. Best, Kam At 09:22 PM 3/20/2011, you wrote: >I vote that, since the duplication lies at the raw data, we can >leave things as is, and just state as much in the README. > >For the underscores, you can look at the V. cholerae species profile >to get an idea for how to do it, since that species profile has to >do the same thing with its IDs. > >John David N. Dionisio, PhD >Associate Professor, Computer Science >Loyola Marymount University > > > >On Mar 20, 2011, at 9:19 PM, Richard Brous wrote: > > > I did another export after re-enabling the group by and the > results are the same at 5338 gene id's. At least we know either way now. > > > > Moving forward are we just going to leave the duplicate id > situation as is or Dondi can you think of an option here? > > > > Then the last item will be to keep the original underscore id's > but also remove the underscores and add to the 5338. > > > > Richard > > > > On Sun, Mar 20, 2011 at 6:51 PM, Kam Dahlquist <kda...@lm...> wrote: > > Hi, > > > > I looked up those 7 IDs in UniProt, and they each are found in > 2-3 separate UniProt entries, which would explain why they had separate hjids. > > > > Probably what GenMAPP Builder is doing is taking the first one > and relating it to its UniProt ID and then discarding the second > (or third) relationship. I don't know that there's anything we can > do about this since it is a problem with the raw data. > > > > Best, > > Dr. D > > > > > > > > At 12:15 PM 3/20/2011, Richard Brous wrote: > >> Having weird probs with gmail in browser where the reply button > isn't working so have to send from my phone... Odd > >> > >> So I here is my analysis on the ORF only export gdb: > >> > >> Initially the raw SQL query returned 5345 records of which I > posted an excel doc link to the plasmodium page on the biodb wiki. > >> > >> The gdb contained only 5338 genes id's so I started looking for > duplicates or exclusions. > >> > >> What I found were 7 id's that were duplicated in the raw SQL > query of postgres but were not exported into the gdb. > >> > >> The id's are: > >> PF10_0168 > >> PF11_0361 (duplicated 2x for total of 3 entries) > >> PF11_0377 > >> PF11_0405 > >> PFB0305c > >> PFB0391c > >> > >> Surprisingly (to me anyway) I noticed that the duplicates all > had unique hjid's, which may or may not mean anything... > >> > >> In conclusion, it seems to me that the 5338 id's in the gdb are > likely correct. > >> > >> Dr. D, does that make sense? > >> > >> Richard > >> > >> Sent from my iPhone > >> > >> On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote: > >> > >>> Yes I understand that you only want the 'ORF' but was trying to > get what we need by modifying what we have and not rewite the whole query. > >>> > >>> I may in fact have to rewrite the whole thing anyway as my > query didn't return what was expected =/ > >>> > >>> Also I'm glad you mentioned a possible issue with MAL pattern, > I'll keep an eye out for missing id's here. > >>> > >>> Richard > >>> > >>> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist > <kda...@lm...> wrote: > >>> Hi, > >>> > >>> I think we need to capture all of the IDs in the ORF tag and > *NOT* do the pattern match at all. As far as I can tell with my > analysis of the IDs in the query you posted, we need to keep them > all, so we don't actually need to specify the patterns at this > point. I would rather do that thinking towards the future when the > Plasmodium people might add new patterns to the ID system. > >>> > >>> I believe that the > >>> > >>> > >>> "MAL[0-9]*P1.[0-9]*" > >>> > >>> > >>> > >>> pattern is also not pulling out everything it needs to. But > >>> instead > >>> > >>> > >>> of including more patterns, I would rather just loosen up the > >>> criteria to > >>> > >>> > >>> include all things in the ORF tag. > >>> > >>> > >>> > >>> Also, I just want to be clear about the underscore issue. That > >>> only > >>> > >>> > >>> affects IDs that begin with PFA, not the other IDs that begin with > >>> PF##_ > >>> > >>> > >>> > >>> Thanks, > >>> > >>> > >>> Kam > >>> > >>> > >>> > >>> > >>> At 12:47 PM 3/18/2011, Richard Brous wrote: > >>>> I'm down in the bio lab at the moment looking at this. > >>>> > >>>> I understand what needs to be done in regards to 1) keeping > all ORF id's and then 2) querying the id's with underscores to then > remove the underscores but maintaining the original underscore id's. > >>>> > >>>> I performed an export with the pattern match as-is and > commented out the exclude 'ordered locus' and 'ORF'. The export > completed but with only 5110 gene id's. So it seems we are missing > 235 gene id's that are in the XML file as seen from the raw sql query. > >>>> > >>>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] > I went ahead and added it into the pattern match string and called > it as the others are called in the query. Once the export completes > I'll confirm that in fact we have captured all the id's. Once > confirmed, I will move onto the query to find id's with underscores > and handle them as mentioned above. > >>>> > >>>> Richard > >>>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous > <rbr...@gm...> wrote: > >>>> Thanks for info. I will dig into this after my 10 am exam tomorrow. > >>>> Richard > >>>> Sent from my iPhone > >>>> > >>>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote: > >>>> > >>>>> Hi, > >>>>> More information on the underscore issue: > >>>>> There is an ID with the pattern > >>>>> PFA_[0-9][0-9][0-9][0-9][wc] > >>>>> that needs to have the underscore removed so that reads instead > >>>>> PFA[0-9][0-9][0-9][0-9][wc] > >>>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, > they are there without the underscore and won't be recognized with > it. I think we should leave the underscore ones there, but also > have a set without the underscore. There are 134 records that have this issue. > >>>>> If Rich can make these two fixes (capturing the ORFs and > dealing with the underscore), then I think we will be good to go > with Plasmodium. There may be code in the Vibrio or Helicobacter > profiles to help with the underscores, but I'm not sure. > >>>>> Best, > >>>>> Kam > >>>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote: > >>>>>> Hi, > >>>>>> I've taken a look at the list of IDs and did a quick > comparison with both the older released gdb and also a list I > downloaded from the Broad Institute Plasmodium database. I think > we can safely go with the query on the ORF tag for our export--all > of those different ID forms are valid. There are about 400 IDs > that are different in the older released gdb than in the new query; > I'm going to further investigate those. I suspect that the > difference is mainly due to a +/- underscore issue that we might > need to solve. However, we should go forward with capturing all > the IDs from the ORF tag, I don't see a need to restrict to a > particular pattern there. > >>>>>> Best, > >>>>>> Kam > >>>>>> At 09:48 PM 3/14/2011, you wrote: > >>>>>>> Hi all, > >>>>>>> > >>>>>>> So I went ahead and did raw sql queries of the Postgres > data and turned up the following: > >>>>>>> > >>>>>>> select * from genenametype where type = 'ordered locus' > >>>>>>> Returned zero gene ids > >>>>>>> > >>>>>>> select * from genenametype where type = 'ORF' > >>>>>>> Returned 5345 gene ids > >>>>>>> The type = 'ORF' query was exported into excel and posted > to the biodb wiki on the Spring 2011 Plasmodium page. > >>>>>>> > >>>>>>> There are many many patterns in regards to gene ids, here > the the prefixes from my cursory look: > >>>>>>> MAL > >>>>>>> PF##_ > >>>>>>> PFA > >>>>>>> PFB > >>>>>>> PFC > >>>>>>> PFD > >>>>>>> PFE > >>>>>>> PFF > >>>>>>> PFI > >>>>>>> PFL > >>>>>>> Richard > >>>>>>> > >>>>>>> > >>>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist > <kda...@lm...> wrote: > >>>>>>> Hi, > >>>>>>> I looked up an assortment of IDs in UniProt and I can > confirm that it appears that the IDs are found in the ORF tag, not > the OrderedLocus tag (except for the one that got captured in the export). > >>>>>>> Best, > >>>>>>> Kam > >>>>>>> At 08:09 AM 3/14/2011, you wrote: > >>>>>>>> Thanks Dondi, > >>>>>>>> > >>>>>>>> Will review this after our call today. I have been a > little worried as the DEBUG export has been going for 2.5 days with > progress at 65% and 6.5 Gb of log files so far... /yikes > >>>>>>>> > >>>>>>>> Btw I have a work lunch meeting in Beverly Hills today so > will be working from home afterwards instead of in the bio lab. > >>>>>>>> > >>>>>>>> Richard > >>>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio > <do...@lm...> wrote: > >>>>>>>> Thanks for the updates, Rich. > >>>>>>>> I gave things a once-over and may have a lead. Here is > what I found: > >>>>>>>> - First, the TallyEngine customization for P. falciparum > states the following: > >>>>>>>> # Plasmodium falciparum > >>>>>>>> plasmodiumfalciparum_level_amount=2 > >>>>>>>> plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF > >>>>>>>> > plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene > >>>>>>>> plasmodiumfalciparum_query_level0=select count(*) from > genenametype where type = 'ORF'; > >>>>>>>> plasmodiumfalciparum_query_level1=select count(*) from > genenametype where type = 'UniGene'; > >>>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus > >>>>>>>> plasmodiumfalciparum_table_name_level1=UniGene > >>>>>>>> Thus, what is being counted by TallyEngine as "Ordered > Locus" are the gene names whose type is 'ORF' ("level0" properties). > >>>>>>>> - Now, this is what the P. falciparum species profile does > when harvesting IDs > (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > >>>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as > hjid, c.value " + > >>>>>>>> "from genenametype c inner join entrytype_genetype d " + > >>>>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + > >>>>>>>> "where (c.value similar to ? " + > >>>>>>>> "or c.value similar to ? " + > >>>>>>>> "or c.value similar to ?) " + > >>>>>>>> "and type <> 'ordered locus names' " + > >>>>>>>> "and type <> 'ORF' " + > >>>>>>>> "group by d.entrytype_gene_hjid, c.value"; > >>>>>>>> Note the condition on the second-to-last line --- the > query actually *omits* gene names whose type is 'ORF'! So the > question is...which is right? (I'm inclined to believe the Tally > Engine here, since, the export puts only one record in OrderedLocusNames) > >>>>>>>> Still, comparing these two queries directly against the > PostgreSQL database would be educational, I think. Then, knowing > which criteria are correct, the appropriate action can then be taken, I think. > >>>>>>>> Hope this helps... > >>>>>>>> John David N. Dionisio, PhD > >>>>>>>> Associate Professor, Computer Science > >>>>>>>> Loyola Marymount University > >>>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > >>>>>>>> > Debug export is still going... 2.5GB of log files so far > with progress at 65%... > >>>>>>>> > > >>>>>>>> > I posted the link of the WARN log on the plasmodium page > here: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . > >>>>>>>> > Richard > >>>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous > <rbr...@gm...> wrote: > >>>>>>>> > Hi all, > >>>>>>>> > > >>>>>>>> > Have been working through several Plasmodium gdb exports > in an attempt to source why only one gene id makes it into the > Ordered Locus table. > >>>>>>>> > > >>>>>>>> > I have reviewed the logger file while set to "WARN" and > wasn't able to determine anything which would suggest an error. I > will post this log file to the wiki later today when I get home. > >>>>>>>> > > >>>>>>>> > I then upped the logger verbosity to "DEBUG" and file > size to 100MB with hopes that more detail will surface the issue, > but my export is on hour 20 and still going (although its nearly > complete). What I didn't expect was the size of the log files and > that it seems only the last 3 are kept with earlier logs being > overwritten =( I fear that the information I need it in one of the > earlier files which are now lost. > >>>>>>>> > > >>>>>>>> > Unless a better suggestion is offered I'm going to rerun > an export again with 'DEBUG" verbosity and up the file sizes to > near 1 GB each and hope that 3 GB total will be enough to hold the > complete export log. > >>>>>>>> > > >>>>>>>> > More info as it comes... > >>>>>>>> > > >>>>>>>> > Richard > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist > <kda...@lm...> wrote: > >>>>>>>> > Hi, > >>>>>>>> > > >>>>>>>> > I've completed testing the Plasmodium gdb I exported > last November and updated the SourceForge wiki. > >>>>>>>> > > >>>>>>>> > Plasmodium has it's own task list page, which I've > updated > here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > >>>>>>>> > > >>>>>>>> > The testing report can be found > here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > >>>>>>>> > > >>>>>>>> > The source files and gdb are on a new Plasmodium > falciparum page on the Fall 2010 BiolDB > wiki: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > >>>>>>>> > > >>>>>>>> > Here is the list of bugs/action items that I've listed: > >>>>>>>> > > >>>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID > out of 5345 repored by the TallyEngine. This also affects all other > tables related to OrderedLocusNames. > >>>>>>>> > > >>>>>>>> > 2. The GeneId table in the database has 6 fewer IDs > than reported by the TallyEngine (Mycobacterium smegmatis and > Mycobacterium tuberculosis also have mysterious GeneId issues with > the TallyEngine). > >>>>>>>> > > >>>>>>>> > 3. The count for EMBL IDs in the gdb also seems low, > it's lower than the 2009 version of the gdb. There's no way to tell > at this point whether this is due to a change in annotation by > UniProt or is a bug with GenMAPP Builder. > >>>>>>>> > > >>>>>>>> > Thanks, > >>>>>>>> > Kam > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > ------------------------------------------------------------------------------ > >>>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You > >>>>>>>> > This paper provides an overview of data connectivity, details > >>>>>>>> > its effect on application quality, and explores various > alternative > >>>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d > >>>>>>>> > _______________________________________________ > >>>>>>>> > xmlpipedb-developer mailing list > >>>>>>>> > xml...@li... > >>>>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > <ATT00001..txt><ATT00002..txt> > >>>>>>>> > ------------------------------------------------------------------------------ > >>>>>>>> Colocation vs. Managed Hosting > >>>>>>>> A question and answer guide to determining the best fit > >>>>>>>> for your organization - today and in the future. > >>>>>>>> http://p.sf.net/sfu/internap-sfd2d > >>>>>>>> _______________________________________________ > >>>>>>>> xmlpipedb-developer mailing list > >>>>>>>> xml...@li... > >>>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > ------------------------------------------------------------------------------ > > Colocation vs. Managed Hosting > > A question and answer guide to determining the best fit > > for your organization - today and in the future. > > http://p.sf.net/sfu/internap-sfd2d > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > ------------------------------------------------------------------------------ > > Colocation vs. Managed Hosting > > A question and answer guide to determining the best fit > > for your organization - today and in the future. > > http://p.sf.net/sfu/internap-sfd2d > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > > ------------------------------------------------------------------------------ > > Colocation vs. Managed Hosting > > A question and answer guide to determining the best fit > > for your organization - today and in the future. > > http://p.sf.net/sfu/internap-sfd2d > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > > ------------------------------------------------------------------------------ > > Colocation vs. Managed Hosting > > A question and answer guide to determining the best fit > > for your organization - today and in the future. > > http://p.sf.net/sfu/internap-sfd2d > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > <ATT00001..txt><ATT00002..txt> > > >------------------------------------------------------------------------------ >Colocation vs. Managed Hosting >A question and answer guide to determining the best fit >for your organization - today and in the future. >http://p.sf.net/sfu/internap-sfd2d >_______________________________________________ >xmlpipedb-developer mailing list >xml...@li... >https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer |
From: John D. N. D. <do...@lm...> - 2011-03-21 04:22:35
|
I vote that, since the duplication lies at the raw data, we can leave things as is, and just state as much in the README. For the underscores, you can look at the V. cholerae species profile to get an idea for how to do it, since that species profile has to do the same thing with its IDs. John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Mar 20, 2011, at 9:19 PM, Richard Brous wrote: > I did another export after re-enabling the group by and the results are the same at 5338 gene id's. At least we know either way now. > > Moving forward are we just going to leave the duplicate id situation as is or Dondi can you think of an option here? > > Then the last item will be to keep the original underscore id's but also remove the underscores and add to the 5338. > > Richard > > On Sun, Mar 20, 2011 at 6:51 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I looked up those 7 IDs in UniProt, and they each are found in 2-3 separate UniProt entries, which would explain why they had separate hjids. > > Probably what GenMAPP Builder is doing is taking the first one and relating it to its UniProt ID and then discarding the second (or third) relationship. I don't know that there's anything we can do about this since it is a problem with the raw data. > > Best, > Dr. D > > > > At 12:15 PM 3/20/2011, Richard Brous wrote: >> Having weird probs with gmail in browser where the reply button isn't working so have to send from my phone... Odd >> >> So I here is my analysis on the ORF only export gdb: >> >> Initially the raw SQL query returned 5345 records of which I posted an excel doc link to the plasmodium page on the biodb wiki. >> >> The gdb contained only 5338 genes id's so I started looking for duplicates or exclusions. >> >> What I found were 7 id's that were duplicated in the raw SQL query of postgres but were not exported into the gdb. >> >> The id's are: >> PF10_0168 >> PF11_0361 (duplicated 2x for total of 3 entries) >> PF11_0377 >> PF11_0405 >> PFB0305c >> PFB0391c >> >> Surprisingly (to me anyway) I noticed that the duplicates all had unique hjid's, which may or may not mean anything... >> >> In conclusion, it seems to me that the 5338 id's in the gdb are likely correct. >> >> Dr. D, does that make sense? >> >> Richard >> >> Sent from my iPhone >> >> On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote: >> >>> Yes I understand that you only want the 'ORF' but was trying to get what we need by modifying what we have and not rewite the whole query. >>> >>> I may in fact have to rewrite the whole thing anyway as my query didn't return what was expected =/ >>> >>> Also I'm glad you mentioned a possible issue with MAL pattern, I'll keep an eye out for missing id's here. >>> >>> Richard >>> >>> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> wrote: >>> Hi, >>> >>> I think we need to capture all of the IDs in the ORF tag and *NOT* do the pattern match at all. As far as I can tell with my analysis of the IDs in the query you posted, we need to keep them all, so we don't actually need to specify the patterns at this point. I would rather do that thinking towards the future when the Plasmodium people might add new patterns to the ID system. >>> >>> I believe that the >>> >>> >>> "MAL[0-9]*P1.[0-9]*" >>> >>> >>> >>> pattern is also not pulling out everything it needs to. But >>> instead >>> >>> >>> of including more patterns, I would rather just loosen up the >>> criteria to >>> >>> >>> include all things in the ORF tag. >>> >>> >>> >>> Also, I just want to be clear about the underscore issue. That >>> only >>> >>> >>> affects IDs that begin with PFA, not the other IDs that begin with >>> PF##_ >>> >>> >>> >>> Thanks, >>> >>> >>> Kam >>> >>> >>> >>> >>> At 12:47 PM 3/18/2011, Richard Brous wrote: >>>> I'm down in the bio lab at the moment looking at this. >>>> >>>> I understand what needs to be done in regards to 1) keeping all ORF id's and then 2) querying the id's with underscores to then remove the underscores but maintaining the original underscore id's. >>>> >>>> I performed an export with the pattern match as-is and commented out the exclude 'ordered locus' and 'ORF'. The export completed but with only 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML file as seen from the raw sql query. >>>> >>>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went ahead and added it into the pattern match string and called it as the others are called in the query. Once the export completes I'll confirm that in fact we have captured all the id's. Once confirmed, I will move onto the query to find id's with underscores and handle them as mentioned above. >>>> >>>> Richard >>>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> wrote: >>>> Thanks for info. I will dig into this after my 10 am exam tomorrow. >>>> Richard >>>> Sent from my iPhone >>>> >>>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote: >>>> >>>>> Hi, >>>>> More information on the underscore issue: >>>>> There is an ID with the pattern >>>>> PFA_[0-9][0-9][0-9][0-9][wc] >>>>> that needs to have the underscore removed so that reads instead >>>>> PFA[0-9][0-9][0-9][0-9][wc] >>>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they are there without the underscore and won't be recognized with it. I think we should leave the underscore ones there, but also have a set without the underscore. There are 134 records that have this issue. >>>>> If Rich can make these two fixes (capturing the ORFs and dealing with the underscore), then I think we will be good to go with Plasmodium. There may be code in the Vibrio or Helicobacter profiles to help with the underscores, but I'm not sure. >>>>> Best, >>>>> Kam >>>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote: >>>>>> Hi, >>>>>> I've taken a look at the list of IDs and did a quick comparison with both the older released gdb and also a list I downloaded from the Broad Institute Plasmodium database. I think we can safely go with the query on the ORF tag for our export--all of those different ID forms are valid. There are about 400 IDs that are different in the older released gdb than in the new query; I'm going to further investigate those. I suspect that the difference is mainly due to a +/- underscore issue that we might need to solve. However, we should go forward with capturing all the IDs from the ORF tag, I don't see a need to restrict to a particular pattern there. >>>>>> Best, >>>>>> Kam >>>>>> At 09:48 PM 3/14/2011, you wrote: >>>>>>> Hi all, >>>>>>> >>>>>>> So I went ahead and did raw sql queries of the Postgres data and turned up the following: >>>>>>> >>>>>>> select * from genenametype where type = 'ordered locus' >>>>>>> Returned zero gene ids >>>>>>> >>>>>>> select * from genenametype where type = 'ORF' >>>>>>> Returned 5345 gene ids >>>>>>> The type = 'ORF' query was exported into excel and posted to the biodb wiki on the Spring 2011 Plasmodium page. >>>>>>> >>>>>>> There are many many patterns in regards to gene ids, here the the prefixes from my cursory look: >>>>>>> MAL >>>>>>> PF##_ >>>>>>> PFA >>>>>>> PFB >>>>>>> PFC >>>>>>> PFD >>>>>>> PFE >>>>>>> PFF >>>>>>> PFI >>>>>>> PFL >>>>>>> Richard >>>>>>> >>>>>>> >>>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> wrote: >>>>>>> Hi, >>>>>>> I looked up an assortment of IDs in UniProt and I can confirm that it appears that the IDs are found in the ORF tag, not the OrderedLocus tag (except for the one that got captured in the export). >>>>>>> Best, >>>>>>> Kam >>>>>>> At 08:09 AM 3/14/2011, you wrote: >>>>>>>> Thanks Dondi, >>>>>>>> >>>>>>>> Will review this after our call today. I have been a little worried as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5 Gb of log files so far... /yikes >>>>>>>> >>>>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be working from home afterwards instead of in the bio lab. >>>>>>>> >>>>>>>> Richard >>>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...> wrote: >>>>>>>> Thanks for the updates, Rich. >>>>>>>> I gave things a once-over and may have a lead. Here is what I found: >>>>>>>> - First, the TallyEngine customization for P. falciparum states the following: >>>>>>>> # Plasmodium falciparum >>>>>>>> plasmodiumfalciparum_level_amount=2 >>>>>>>> plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF >>>>>>>> plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene >>>>>>>> plasmodiumfalciparum_query_level0=select count(*) from genenametype where type = 'ORF'; >>>>>>>> plasmodiumfalciparum_query_level1=select count(*) from genenametype where type = 'UniGene'; >>>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus >>>>>>>> plasmodiumfalciparum_table_name_level1=UniGene >>>>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene names whose type is 'ORF' ("level0" properties). >>>>>>>> - Now, this is what the P. falciparum species profile does when harvesting IDs (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): >>>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + >>>>>>>> "from genenametype c inner join entrytype_genetype d " + >>>>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + >>>>>>>> "where (c.value similar to ? " + >>>>>>>> "or c.value similar to ? " + >>>>>>>> "or c.value similar to ?) " + >>>>>>>> "and type <> 'ordered locus names' " + >>>>>>>> "and type <> 'ORF' " + >>>>>>>> "group by d.entrytype_gene_hjid, c.value"; >>>>>>>> Note the condition on the second-to-last line --- the query actually *omits* gene names whose type is 'ORF'! So the question is...which is right? (I'm inclined to believe the Tally Engine here, since, the export puts only one record in OrderedLocusNames) >>>>>>>> Still, comparing these two queries directly against the PostgreSQL database would be educational, I think. Then, knowing which criteria are correct, the appropriate action can then be taken, I think. >>>>>>>> Hope this helps... >>>>>>>> John David N. Dionisio, PhD >>>>>>>> Associate Professor, Computer Science >>>>>>>> Loyola Marymount University >>>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: >>>>>>>> > Debug export is still going... 2.5GB of log files so far with progress at 65%... >>>>>>>> > >>>>>>>> > I posted the link of the WARN log on the plasmodium page here: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . >>>>>>>> > Richard >>>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> wrote: >>>>>>>> > Hi all, >>>>>>>> > >>>>>>>> > Have been working through several Plasmodium gdb exports in an attempt to source why only one gene id makes it into the Ordered Locus table. >>>>>>>> > >>>>>>>> > I have reviewed the logger file while set to "WARN" and wasn't able to determine anything which would suggest an error. I will post this log file to the wiki later today when I get home. >>>>>>>> > >>>>>>>> > I then upped the logger verbosity to "DEBUG" and file size to 100MB with hopes that more detail will surface the issue, but my export is on hour 20 and still going (although its nearly complete). What I didn't expect was the size of the log files and that it seems only the last 3 are kept with earlier logs being overwritten =( I fear that the information I need it in one of the earlier files which are now lost. >>>>>>>> > >>>>>>>> > Unless a better suggestion is offered I'm going to rerun an export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each and hope that 3 GB total will be enough to hold the complete export log. >>>>>>>> > >>>>>>>> > More info as it comes... >>>>>>>> > >>>>>>>> > Richard >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> wrote: >>>>>>>> > Hi, >>>>>>>> > >>>>>>>> > I've completed testing the Plasmodium gdb I exported last November and updated the SourceForge wiki. >>>>>>>> > >>>>>>>> > Plasmodium has it's own task list page, which I've updated here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List >>>>>>>> > >>>>>>>> > The testing report can be found here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 >>>>>>>> > >>>>>>>> > The source files and gdb are on a new Plasmodium falciparum page on the Fall 2010 BiolDB wiki: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >>>>>>>> > >>>>>>>> > Here is the list of bugs/action items that I've listed: >>>>>>>> > >>>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of 5345 repored by the TallyEngine. This also affects all other tables related to OrderedLocusNames. >>>>>>>> > >>>>>>>> > 2. The GeneId table in the database has 6 fewer IDs than reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium tuberculosis also have mysterious GeneId issues with the TallyEngine). >>>>>>>> > >>>>>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's lower than the 2009 version of the gdb. There's no way to tell at this point whether this is due to a change in annotation by UniProt or is a bug with GenMAPP Builder. >>>>>>>> > >>>>>>>> > Thanks, >>>>>>>> > Kam >>>>>>>> > >>>>>>>> > >>>>>>>> > ------------------------------------------------------------------------------ >>>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You >>>>>>>> > This paper provides an overview of data connectivity, details >>>>>>>> > its effect on application quality, and explores various alternative >>>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d >>>>>>>> > _______________________________________________ >>>>>>>> > xmlpipedb-developer mailing list >>>>>>>> > xml...@li... >>>>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > <ATT00001..txt><ATT00002..txt> >>>>>>>> ------------------------------------------------------------------------------ >>>>>>>> Colocation vs. Managed Hosting >>>>>>>> A question and answer guide to determining the best fit >>>>>>>> for your organization - today and in the future. >>>>>>>> http://p.sf.net/sfu/internap-sfd2d >>>>>>>> _______________________________________________ >>>>>>>> xmlpipedb-developer mailing list >>>>>>>> xml...@li... >>>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > <ATT00001..txt><ATT00002..txt> |
From: Richard B. <rbr...@gm...> - 2011-03-21 04:19:27
|
I did another export after re-enabling the group by and the results are the same at 5338 gene id's. At least we know either way now. Moving forward are we just going to leave the duplicate id situation as is or Dondi can you think of an option here? Then the last item will be to keep the original underscore id's but also remove the underscores and add to the 5338. Richard On Sun, Mar 20, 2011 at 6:51 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I looked up those 7 IDs in UniProt, and they each are found in 2-3 separate > UniProt entries, which would explain why they had separate hjids. > > Probably what GenMAPP Builder is doing is taking the first one and relating > it to its UniProt ID and then discarding the second (or third) > relationship. I don't know that there's anything we can do about this since > it is a problem with the raw data. > > Best, > Dr. D > > > > At 12:15 PM 3/20/2011, Richard Brous wrote: > > Having weird probs with gmail in browser where the reply button isn't > working so have to send from my phone... Odd > > So I here is my analysis on the ORF only export gdb: > > Initially the raw SQL query returned 5345 records of which I posted an > excel doc link to the plasmodium page on the biodb wiki. > > The gdb contained only 5338 genes id's so I started looking for duplicates > or exclusions. > > What I found were 7 id's that were duplicated in the raw SQL query of > postgres but were not exported into the gdb. > > The id's are: > PF10_0168 > PF11_0361 (duplicated 2x for total of 3 entries) > PF11_0377 > PF11_0405 > PFB0305c > PFB0391c > > Surprisingly (to me anyway) I noticed that the duplicates all had unique > hjid's, which may or may not mean anything... > > In conclusion, it seems to me that the 5338 id's in the gdb are likely > correct. > > Dr. D, does that make sense? > > Richard > > Sent from my iPhone > > On Mar 18, 2011, at 1:16 PM, Richard Brous < <rbr...@gm...> > rbr...@gm...> wrote: > > Yes I understand that you only want the 'ORF' but was trying to get what we > need by modifying what we have and not rewite the whole query. > > I may in fact have to rewrite the whole thing anyway as my query didn't > return what was expected =/ > > Also I'm glad you mentioned a possible issue with MAL pattern, I'll keep an > eye out for missing id's here. > > Richard > > On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist < <kda...@lm...> > kda...@lm...> wrote: > Hi, > > I think we need to capture all of the IDs in the ORF tag and *NOT* do the > pattern match at all. As far as I can tell with my analysis of the IDs in > the query you posted, we need to keep them all, so we don't actually need to > specify the patterns at this point. I would rather do that thinking towards > the future when the Plasmodium people might add new patterns to the ID > system. > > I believe that the > > > "MAL[0-9]*P1.[0-9]*" > > pattern is also not pulling out everything it needs to. But > instead > of including more patterns, I would rather just loosen up the > criteria to > include all things in the ORF tag. > > Also, I just want to be clear about the underscore issue. That > only > affects IDs that begin with PFA, not the other IDs that begin with > PF##_ > > Thanks, > Kam > > > > > At 12:47 PM 3/18/2011, Richard Brous wrote: > > I'm down in the bio lab at the moment looking at this. > > I understand what needs to be done in regards to 1) keeping all ORF id's > and then 2) querying the id's with underscores to then remove the > underscores but maintaining the original underscore id's. > > I performed an export with the pattern match as-is and commented out the > exclude 'ordered locus' and 'ORF'. The export completed but with only 5110 > gene id's. So it seems we are missing 235 gene id's that are in the XML file > as seen from the raw sql query. > > Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went ahead > and added it into the pattern match string and called it as the others are > called in the query. Once the export completes I'll confirm that in fact we > have captured all the id's. Once confirmed, I will move onto the query to > find id's with underscores and handle them as mentioned above. > > Richard > On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous < <rbr...@gm...> > rbr...@gm...> wrote: Thanks for info. I will dig into this after my > 10 am exam tomorrow. > Richard > Sent from my iPhone > > On Mar 16, 2011, at 4:18 PM, Kam Dahlquist < <kda...@lm...> > kda...@lm...> wrote: > > Hi, > More information on the underscore issue: > There is an ID with the pattern > PFA_[0-9][0-9][0-9][0-9][wc] > that needs to have the underscore removed so that reads instead > PFA[0-9][0-9][0-9][0-9][wc] > I don't know why these IDs exist in UniProt, but in PlasmoDB, they are > there without the underscore and won't be recognized with it. I think we > should leave the underscore ones there, but also have a set without the > underscore. There are 134 records that have this issue. > If Rich can make these two fixes (capturing the ORFs and dealing with the > underscore), then I think we will be good to go with Plasmodium. There may > be code in the Vibrio or Helicobacter profiles to help with the underscores, > but I'm not sure. > Best, Kam > At 02:59 PM 3/16/2011, Kam Dahlquist wrote: > > Hi, > I've taken a look at the list of IDs and did a quick comparison with both > the older released gdb and also a list I downloaded from the Broad Institute > Plasmodium database. I think we can safely go with the query on the ORF tag > for our export--all of those different ID forms are valid. There are about > 400 IDs that are different in the older released gdb than in the new query; > I'm going to further investigate those. I suspect that the difference is > mainly due to a +/- underscore issue that we might need to solve. However, > we should go forward with capturing all the IDs from the ORF tag, I don't > see a need to restrict to a particular pattern there. > Best, Kam > At 09:48 PM 3/14/2011, you wrote: > > Hi all, So I went ahead and did raw sql queries of the Postgres data and > turned up the following: select * from genenametype where type = > 'ordered locus' Returned zero gene ids select * from genenametype where > type = 'ORF' Returned 5345 gene ids The type = 'ORF' query was exported > into excel and posted to the biodb wiki on the Spring 2011 Plasmodium page. > There are many many patterns in regards to gene ids, here the the prefixes > from my cursory look: MAL PF##_ PFA PFB PFC PFD PFE PFF PFI PFL > Richard On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <<kda...@lm...> > kda...@lm...> wrote: Hi, I looked up an assortment of IDs in UniProt > and I can confirm that it appears that the IDs are found in the ORF tag, not > the OrderedLocus tag (except for the one that got captured in the export). Best, > Kam At 08:09 AM 3/14/2011, you wrote: > > Thanks Dondi, Will review this after our call today. I have been a > little worried as the DEBUG export has been going for 2.5 days with progress > at 65% and 6.5 Gb of log files so far... /yikes Btw I have a work lunch > meeting in Beverly Hills today so will be working from home afterwards > instead of in the bio lab. Richard On Sun, Mar 13, 2011 at 9:55 PM, John > David N. Dionisio < <do...@lm...>do...@lm...> wrote: Thanks for the > updates, Rich. I gave things a once-over and may have a lead. Here is > what I found: - First, the TallyEngine customization for P. falciparum > states the following: # Plasmodium falciparum plasmodiumfalciparum_level_amount=2 > plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene > plasmodiumfalciparum_query_level0=select count(*) from genenametype where > type = 'ORF'; plasmodiumfalciparum_query_level1=select count(*) from > genenametype where type = 'UniGene'; plasmodiumfalciparum_table_name_level0=Ordered > Locus plasmodiumfalciparum_table_name_level1=UniGene Thus, what is being > counted by TallyEngine as "Ordered Locus" are the gene names whose type is > 'ORF' ("level0" properties). - Now, this is what the P. falciparum species > profile does when harvesting IDs > (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + > "from genenametype c inner join entrytype_genetype d " + > "on (c.entrytype_genetype_name_hjid = d.hjid) " + "where > (c.value similar to ? " + "or c.value similar to ? " + > "or c.value similar to ?) " + "and type <> 'ordered locus > names' " + "and type <> 'ORF' " + "group by > d.entrytype_gene_hjid, c.value"; Note the condition on the second-to-last > line --- the query actually *omits* gene names whose type is 'ORF'! So the > question is...which is right? (I'm inclined to believe the Tally Engine > here, since, the export puts only one record in OrderedLocusNames) Still, > comparing these two queries directly against the PostgreSQL database would > be educational, I think. Then, knowing which criteria are correct, the > appropriate action can then be taken, I think. Hope this helps... John > David N. Dionisio, PhD Associate Professor, Computer Science Loyola > Marymount University On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > > Debug export is still going... 2.5GB of log files so far with progress at > 65%... > > I posted the link of the WARN log on the plasmodium page here: > <https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum> > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . > > Richard > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <<rbr...@gm...> > rbr...@gm...> wrote: > Hi all, > > Have been working through several > Plasmodium gdb exports in an attempt to source why only one gene id makes it > into the Ordered Locus table. > > I have reviewed the logger file while > set to "WARN" and wasn't able to determine anything which would suggest an > error. I will post this log file to the wiki later today when I get home. > > > I then upped the logger verbosity to "DEBUG" and file size to 100MB with > hopes that more detail will surface the issue, but my export is on hour 20 > and still going (although its nearly complete). What I didn't expect was the > size of the log files and that it seems only the last 3 are kept with > earlier logs being overwritten =( I fear that the information I need it in > one of the earlier files which are now lost. > > Unless a better > suggestion is offered I'm going to rerun an export again with 'DEBUG" > verbosity and up the file sizes to near 1 GB each and hope that 3 GB total > will be enough to hold the complete export log. > > More info as it > comes... > > Richard > > > > > > On Fri, Mar 4, 2011 at 3:17 PM, Kam > Dahlquist < <kda...@lm...>kda...@lm...> wrote: > Hi, > > I've > completed testing the Plasmodium gdb I exported last November and updated > the SourceForge wiki. > > Plasmodium has it's own task list page, which > I've updated here: > <https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List> > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > > > The testing report can be found here: > <https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115> > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > > > The source files and gdb are on a new Plasmodium falciparum page on the > Fall 2010 BiolDB wiki: > <https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum> > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > > > Here is the list of bugs/action items that I've listed: > > 1. The > OrderedLocusNames table in the gdb only has 1 ID out of 5345 repored by the > TallyEngine. This also affects all other tables related to > OrderedLocusNames. > > 2. The GeneId table in the database has 6 fewer > IDs than reported by the TallyEngine (Mycobacterium smegmatis and > Mycobacterium tuberculosis also have mysterious GeneId issues with the > TallyEngine). > > 3. The count for EMBL IDs in the gdb also seems low, > it's lower than the 2009 version of the gdb. There's no way to tell at this > point whether this is due to a change in annotation by UniProt or is a bug > with GenMAPP Builder. > > Thanks, > Kam > > > > ------------------------------------------------------------------------------ > > What You Don't Know About Data Connectivity CAN Hurt You > This paper > provides an overview of data connectivity, details > its effect on > application quality, and explores various alternative > solutions. > <http://p.sf.net/sfu/progress-d2d>http://p.sf.net/sfu/progress-d2d > > _______________________________________________ > xmlpipedb-developer > mailing list > <xml...@li...> > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > <ATT00001..txt><ATT00002..txt> ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting A question and answer guide to determining > the best fit for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d _______________________________________________ > xmlpipedb-developer mailing list xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > ------------------------------------------------------------------------------ > > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: Kam D. <kda...@lm...> - 2011-03-21 01:51:28
|
Hi, I looked up those 7 IDs in UniProt, and they each are found in 2-3 separate UniProt entries, which would explain why they had separate hjids. Probably what GenMAPP Builder is doing is taking the first one and relating it to its UniProt ID and then discarding the second (or third) relationship. I don't know that there's anything we can do about this since it is a problem with the raw data. Best, Dr. D At 12:15 PM 3/20/2011, Richard Brous wrote: >Having weird probs with gmail in browser where the reply button >isn't working so have to send from my phone... Odd > >So I here is my analysis on the ORF only export gdb: > >Initially the raw SQL query returned 5345 records of which I posted >an excel doc link to the plasmodium page on the biodb wiki. > >The gdb contained only 5338 genes id's so I started looking for >duplicates or exclusions. > >What I found were 7 id's that were duplicated in the raw SQL query >of postgres but were not exported into the gdb. > >The id's are: >PF10_0168 >PF11_0361 (duplicated 2x for total of 3 entries) >PF11_0377 >PF11_0405 >PFB0305c >PFB0391c > >Surprisingly (to me anyway) I noticed that the duplicates all had >unique hjid's, which may or may not mean anything... > >In conclusion, it seems to me that the 5338 id's in the gdb are >likely correct. > >Dr. D, does that make sense? > >Richard > >Sent from my iPhone > ><mailto:rbr...@gm...>On Mar 18, 2011, at 1:16 PM, Richard >Brous <<mailto:rbr...@gm...>rbr...@gm...> wrote: > >>Yes I understand that you only want the 'ORF' but was trying to get >>what we need by modifying what we have and not rewite the whole query. >> >>I may in fact have to rewrite the whole thing anyway as my query >>didn't return what was expected =/ >> >>Also I'm glad you mentioned a possible issue with MAL pattern, I'll >>keep an eye out for missing id's here. >> >>Richard >> >><mailto:kda...@lm...>On Fri, Mar 18, 2011 at 12:04 PM, Kam >>Dahlquist <<mailto:kda...@lm...>kda...@lm...> wrote: >>Hi, >> >>I think we need to capture all of the IDs in the ORF tag and *NOT* >>do the pattern match at all. As far as I can tell with my analysis >>of the IDs in the query you posted, we need to keep them all, so we >>don't actually need to specify the patterns at this point. I would >>rather do that thinking towards the future when the Plasmodium >>people might add new patterns to the ID system. >> >>I believe that the >> >> >>"MAL[0-9]*P1.[0-9]*" >> >> >>pattern is also not pulling out everything it needs to. But instead >> >>of including more patterns, I would rather just loosen up the criteria to >> >>include all things in the ORF tag. >> >> >>Also, I just want to be clear about the underscore issue. That only >> >>affects IDs that begin with PFA, not the other IDs that begin with PF##_ >> >> >>Thanks, >> >>Kam >> >> >> >>At 12:47 PM 3/18/2011, Richard Brous wrote: >>>I'm down in the bio lab at the moment looking at this. >>> >>>I understand what needs to be done in regards to 1) keeping all >>>ORF id's and then 2) querying the id's with underscores to then >>>remove the underscores but maintaining the original underscore id's. >>> >>>I performed an export with the pattern match as-is and commented >>>out the exclude 'ordered locus' and 'ORF'. The export completed >>>but with only 5110 gene id's. So it seems we are missing 235 gene >>>id's that are in the XML file as seen from the raw sql query. >>> >>>Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I >>>went ahead and added it into the pattern match string and called >>>it as the others are called in the query. Once the export >>>completes I'll confirm that in fact we have captured all the id's. >>>Once confirmed, I will move onto the query to find id's with >>>underscores and handle them as mentioned above. >>> >>>Richard >>><mailto:rbr...@gm...>On Thu, Mar 17, 2011 at 7:34 AM, >>>Richard Brous <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>Thanks for info. I will dig into this after my 10 am exam tomorrow. >>>Richard >>>Sent from my iPhone >>> >>><mailto:kda...@lm...>On Mar 16, 2011, at 4:18 PM, Kam >>>Dahlquist <<mailto:kda...@lm...>kda...@lm...> wrote: >>> >>>>Hi, >>>>More information on the underscore issue: >>>>There is an ID with the pattern >>>>PFA_[0-9][0-9][0-9][0-9][wc] >>>>that needs to have the underscore removed so that reads instead >>>>PFA[0-9][0-9][0-9][0-9][wc] >>>>I don't know why these IDs exist in UniProt, but in PlasmoDB, >>>>they are there without the underscore and won't be recognized >>>>with it. I think we should leave the underscore ones there, but >>>>also have a set without the underscore. There are 134 records >>>>that have this issue. >>>>If Rich can make these two fixes (capturing the ORFs and dealing >>>>with the underscore), then I think we will be good to go with >>>>Plasmodium. There may be code in the Vibrio or Helicobacter >>>>profiles to help with the underscores, but I'm not sure. >>>>Best, >>>>Kam >>>>At 02:59 PM 3/16/2011, Kam Dahlquist wrote: >>>>>Hi, >>>>>I've taken a look at the list of IDs and did a quick comparison >>>>>with both the older released gdb and also a list I downloaded >>>>>from the Broad Institute Plasmodium database. I think we can >>>>>safely go with the query on the ORF tag for our export--all of >>>>>those different ID forms are valid. There are about 400 IDs >>>>>that are different in the older released gdb than in the new >>>>>query; I'm going to further investigate those. I suspect that >>>>>the difference is mainly due to a +/- underscore issue that we >>>>>might need to solve. However, we should go forward with >>>>>capturing all the IDs from the ORF tag, I don't see a need to >>>>>restrict to a particular pattern there. >>>>>Best, >>>>>Kam >>>>>At 09:48 PM 3/14/2011, you wrote: >>>>>>Hi all, >>>>>> >>>>>>So I went ahead and did raw sql queries of the Postgres data >>>>>>and turned up the following: >>>>>> >>>>>>select * from genenametype where type = 'ordered locus' >>>>>>Returned zero gene ids >>>>>> >>>>>>select * from genenametype where type = 'ORF' >>>>>>Returned 5345 gene ids >>>>>>The type = 'ORF' query was exported into excel and posted to >>>>>>the biodb wiki on the Spring 2011 Plasmodium page. >>>>>> >>>>>>There are many many patterns in regards to gene ids, here the >>>>>>the prefixes from my cursory look: >>>>>>MAL >>>>>>PF##_ >>>>>>PFA >>>>>>PFB >>>>>>PFC >>>>>>PFD >>>>>>PFE >>>>>>PFF >>>>>>PFI >>>>>>PFL >>>>>>Richard >>>>>> >>>>>> >>>>>><mailto:kda...@lm...>On Mon, Mar 14, 2011 at 10:32 AM, >>>>>>Kam Dahlquist <<mailto:kda...@lm...>kda...@lm...> wrote: >>>>>>Hi, >>>>>>I looked up an assortment of IDs in UniProt and I can confirm >>>>>>that it appears that the IDs are found in the ORF tag, not the >>>>>>OrderedLocus tag (except for the one that got captured in the export). >>>>>>Best, >>>>>>Kam >>>>>>At 08:09 AM 3/14/2011, you wrote: >>>>>>>Thanks Dondi, >>>>>>> >>>>>>>Will review this after our call today. I have been a little >>>>>>>worried as the DEBUG export has been going for 2.5 days with >>>>>>>progress at 65% and 6.5 Gb of log files so far... /yikes >>>>>>> >>>>>>>Btw I have a work lunch meeting in Beverly Hills today so will >>>>>>>be working from home afterwards instead of in the bio lab. >>>>>>> >>>>>>>Richard >>>>>>><mailto:do...@lm...>On Sun, Mar 13, 2011 at 9:55 PM, John >>>>>>>David N. Dionisio <<mailto:do...@lm...>do...@lm...> wrote: >>>>>>>Thanks for the updates, Rich. >>>>>>>I gave things a once-over and may have a lead. Here is what I found: >>>>>>>- First, the TallyEngine customization for P. falciparum >>>>>>>states the following: >>>>>>># Plasmodium falciparum >>>>>>>plasmodiumfalciparum_level_amount=2 >>>>>>>plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF >>>>>>>plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene >>>>>>> >>>>>>>plasmodiumfalciparum_query_level0=select count(*) from >>>>>>>genenametype where type = 'ORF'; >>>>>>>plasmodiumfalciparum_query_level1=select count(*) from >>>>>>>genenametype where type = 'UniGene'; >>>>>>>plasmodiumfalciparum_table_name_level0=Ordered Locus >>>>>>>plasmodiumfalciparum_table_name_level1=UniGene >>>>>>>Thus, what is being counted by TallyEngine as "Ordered Locus" >>>>>>>are the gene names whose type is 'ORF' ("level0" properties). >>>>>>>- Now, this is what the P. falciparum species profile does >>>>>>>when harvesting IDs >>>>>>>(PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): >>>>>>> >>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as >>>>>>> hjid, c.value " + >>>>>>> "from genenametype c inner join entrytype_genetype d " + >>>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + >>>>>>> "where (c.value similar to ? " + >>>>>>> "or c.value similar to ? " + >>>>>>> "or c.value similar to ?) " + >>>>>>> "and type <> 'ordered locus names' " + >>>>>>> "and type <> 'ORF' " + >>>>>>> "group by d.entrytype_gene_hjid, c.value"; >>>>>>>Note the condition on the second-to-last line --- the query >>>>>>>actually *omits* gene names whose type is 'ORF'! So the >>>>>>>question is...which is right? (I'm inclined to believe the >>>>>>>Tally Engine here, since, the export puts only one record in >>>>>>>OrderedLocusNames) >>>>>>>Still, comparing these two queries directly against the >>>>>>>PostgreSQL database would be educational, I think. Then, >>>>>>>knowing which criteria are correct, the appropriate action can >>>>>>>then be taken, I think. >>>>>>>Hope this helps... >>>>>>>John David N. Dionisio, PhD >>>>>>>Associate Professor, Computer Science >>>>>>>Loyola Marymount University >>>>>>>On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: >>>>>>> > Debug export is still going... 2.5GB of log files so far >>>>>>> with progress at 65%... >>>>>>> > >>>>>>><https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>> >>>>>>>I posted the link of the WARN log on the plasmodium page here: >>>>>>><https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >>>>>>>. >>>>>>> > Richard >>>>>>><mailto:rbr...@gm...>> On Fri, Mar 11, 2011 at 1:06 PM, >>>>>>>Richard Brous <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>>>>> > Hi all, >>>>>>> > >>>>>>> > Have been working through several Plasmodium gdb exports in >>>>>>> an attempt to source why only one gene id makes it into the >>>>>>> Ordered Locus table. >>>>>>> > >>>>>>> > I have reviewed the logger file while set to "WARN" and >>>>>>> wasn't able to determine anything which would suggest an >>>>>>> error. I will post this log file to the wiki later today when I get home. >>>>>>> > >>>>>>> > I then upped the logger verbosity to "DEBUG" and file size >>>>>>> to 100MB with hopes that more detail will surface the issue, >>>>>>> but my export is on hour 20 and still going (although its >>>>>>> nearly complete). What I didn't expect was the size of the >>>>>>> log files and that it seems only the last 3 are kept with >>>>>>> earlier logs being overwritten =( I fear that the information >>>>>>> I need it in one of the earlier files which are now lost. >>>>>>> > >>>>>>> > Unless a better suggestion is offered I'm going to rerun an >>>>>>> export again with 'DEBUG" verbosity and up the file sizes to >>>>>>> near 1 GB each and hope that 3 GB total will be enough to >>>>>>> hold the complete export log. >>>>>>> > >>>>>>> > More info as it comes... >>>>>>> > >>>>>>> > Richard >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>><mailto:kda...@lm...>> On Fri, Mar 4, 2011 at 3:17 PM, >>>>>>>Kam Dahlquist <<mailto:kda...@lm...>kda...@lm...> wrote: >>>>>>> > Hi, >>>>>>> > >>>>>>> > I've completed testing the Plasmodium gdb I exported last >>>>>>> November and updated the SourceForge wiki. >>>>>>> > >>>>>>><https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List>> >>>>>>>Plasmodium has it's own task list page, which I've updated >>>>>>>here: >>>>>>><https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List>https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List >>>>>>> >>>>>>> > >>>>>>><https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115>> >>>>>>>The testing report can be found >>>>>>>here: >>>>>>><https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115>https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 >>>>>>> >>>>>>> > >>>>>>><https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>> >>>>>>>The source files and gdb are on a new Plasmodium falciparum >>>>>>>page on the Fall 2010 BiolDB >>>>>>>wiki: >>>>>>><https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >>>>>>> >>>>>>> > >>>>>>> > Here is the list of bugs/action items that I've listed: >>>>>>> > >>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID >>>>>>> out of 5345 repored by the TallyEngine. This also affects all >>>>>>> other tables related to OrderedLocusNames. >>>>>>> > >>>>>>> > 2. The GeneId table in the database has 6 fewer IDs than >>>>>>> reported by the TallyEngine (Mycobacterium smegmatis and >>>>>>> Mycobacterium tuberculosis also have mysterious GeneId issues >>>>>>> with the TallyEngine). >>>>>>> > >>>>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's >>>>>>> lower than the 2009 version of the gdb. There's no way to >>>>>>> tell at this point whether this is due to a change in >>>>>>> annotation by UniProt or is a bug with GenMAPP Builder. >>>>>>> > >>>>>>> > Thanks, >>>>>>> > Kam >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> ------------------------------------------------------------------------------ >>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You >>>>>>> > This paper provides an overview of data connectivity, details >>>>>>> > its effect on application quality, and explores various alternative >>>>>>><http://p.sf.net/sfu/progress-d2d>> solutions. >>>>>>><http://p.sf.net/sfu/progress-d2d>http://p.sf.net/sfu/progress-d2d >>>>>>> > _______________________________________________ >>>>>>> > xmlpipedb-developer mailing list >>>>>>><mailto:xml...@li...>> >>>>>>><mailto:xml...@li...>xml...@li... >>>>>>> >>>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > <ATT00001..txt><ATT00002..txt> >>>>>>>------------------------------------------------------------------------------ >>>>>>> >>>>>>>Colocation vs. Managed Hosting >>>>>>>A question and answer guide to determining the best fit >>>>>>>for your organization - today and in the future. >>>>>>><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >>>>>>>_______________________________________________ >>>>>>>xmlpipedb-developer mailing list >>>>>>><mailto:xml...@li...>xml...@li... >>>>>>> >>>>>>>https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >>------------------------------------------------------------------------------ >> >>Colocation vs. Managed Hosting >>A question and answer guide to determining the best fit >>for your organization - today and in the future. >><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >>_______________________________________________ >>xmlpipedb-developer mailing list >><mailto:xml...@li...>xml...@li... >> >>https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >>------------------------------------------------------------------------------ >>Colocation vs. Managed Hosting >>A question and answer guide to determining the best fit >>for your organization - today and in the future. >><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >>_______________________________________________ >>xmlpipedb-developer mailing list >><mailto:xml...@li...>xml...@li... >>https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> >> >>------------------------------------------------------------------------------ >>Colocation vs. Managed Hosting >>A question and answer guide to determining the best fit >>for your organization - today and in the future. >><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >>_______________________________________________ >>xmlpipedb-developer mailing list >><mailto:xml...@li...>xml...@li... >>https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > |
From: John D. N. D. <do...@lm...> - 2011-03-21 00:22:55
|
Hi Rich, No, nothing huge missed. That just means that there is some "anti-duplicate" code in the export process. Adding in the "group by" takes the conservative approach that, *if* there were no safeguards against duplicates, at least your original query minimizes the chances that they will lie around. More of an err-on-the-side-of-caution move, really. Your call if you are confident that it is safe to omit the "group by." John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Mar 20, 2011, at 5:17 PM, Richard Brous wrote: > The pairs were returned by the raw sql where type = 'ORF' > > The exported gdb doesn't contain the duplicate id's which is why I think its ok. Am I missing something? > > Richard > > On Sun, Mar 20, 2011 at 4:00 PM, John David N. Dionisio <do...@lm...> wrote: > Thanks. Try reinstating that last group by clause; it may remove some duplicate hjid/value pairs. > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > On Mar 20, 2011, at 1:54 PM, Richard Brous wrote: > > > Sure thing. Can send now that gmail working again... > > > > /** > > > > * > > > > @see edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, java.util.Date) > > */ > > > > > > @Override > > > > public TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) throws SQLException, InvalidParameterException { > > > > // Start with the default OrderedLocusNames behavior. > > TableManager result = > > > > super.getSystemTableManagerCustomizations(tableManager, primarySystemTableManager, version); > > > > > > // Next, we add IDs from the other gene/name tags, but ONLY if they match > > > > // the pattern PF[A-Z][0-9]{4}[a-z]. > > > > //final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]"; > > > > //final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]"; > > > > //final String pfID3 = "MAL[0-9]*P1.[0-9]*"; > > String sqlQuery = > > > > "select d.entrytype_gene_hjid as hjid, c.value " + > > > > "from genenametype c inner join entrytype_genetype d " + > > > > "on (c.entrytype_genetype_name_hjid = d.hjid) " + > > > > //"where (c.value similar to ? " + > > > > "where c.type = 'ORF'"; > > > > //"or c.value similar to ? " + > > > > //"or c.value similar to ?) " + > > > > //"and type <> 'ordered locus names' " + > > > > //"and type <> 'ORF' " + > > > > //"group by d.entrytype_gene_hjid, c.value"; > > > > String dateToday = GenMAPPBuilderUtilities.getSystemsDateString(version); > > > > Connection c = ConnectionManager.getRelationalDBConnection(); > > > > PreparedStatement ps; > > > > ResultSet rs; > > > > > > try { > > > > // Query, iterate, add to table manager. > > ps = c.prepareStatement(sqlQuery); > > > > > > //ps.setString(1, pfID); > > > > //ps.setString(2, pfID2); > > > > //ps.setString(3, pfID3); > > rs = ps.executeQuery(); > > > > > > while (rs.next()) { > > String hjid = Long.valueOf(rs.getLong( > > > > "hjid")).toString(); > > String id = rs.getString( > > > > "value"); > > > > _Log.debug("Processing raw ID: " + id + " for surrogate " + hjid); > > tableManager.submit( > > > > "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", dateToday }, { "UID", hjid } }); > > } > > > > } > > > > catch(SQLException sqlexc) { > > logSQLException(sqlexc, sqlQuery); > > > > } > > > > > > > > return result; > > } > > > > > > > > On Sun, Mar 20, 2011 at 1:28 PM, John David N. Dionisio <do...@lm...> wrote: > > Hi Rich, > > > > Thanks for the updates. Did you post the new SQL query that you used somewhere? Let me know where I can take a look at it. Or, if you haven't put it out there yet, send it my way just for a quick look-see. Thanks! > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > On Mar 20, 2011, at 12:15 PM, Richard Brous wrote: > > > > > Having weird probs with gmail in browser where the reply button isn't working so have to send from my phone... Odd > > > > > > So I here is my analysis on the ORF only export gdb: > > > > > > Initially the raw SQL query returned 5345 records of which I posted an excel doc link to the plasmodium page on the biodb wiki. > > > > > > The gdb contained only 5338 genes id's so I started looking for duplicates or exclusions. > > > > > > What I found were 7 id's that were duplicated in the raw SQL query of postgres but were not exported into the gdb. > > > > > > The id's are: > > > PF10_0168 > > > PF11_0361 (duplicated 2x for total of 3 entries) > > > PF11_0377 > > > PF11_0405 > > > PFB0305c > > > PFB0391c > > > > > > Surprisingly (to me anyway) I noticed that the duplicates all had unique hjid's, which may or may not mean anything... > > > > > > In conclusion, it seems to me that the 5338 id's in the gdb are likely correct. > > > > > > Dr. D, does that make sense? > > > > > > Richard > > > > > > Sent from my iPhone > > > > > > On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote: > > > > > >> Yes I understand that you only want the 'ORF' but was trying to get what we need by modifying what we have and not rewite the whole query. > > >> > > >> I may in fact have to rewrite the whole thing anyway as my query didn't return what was expected =/ > > >> > > >> Also I'm glad you mentioned a possible issue with MAL pattern, I'll keep an eye out for missing id's here. > > >> > > >> Richard > > >> > > >> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> wrote: > > >> Hi, > > >> > > >> I think we need to capture all of the IDs in the ORF tag and *NOT* do the pattern match at all. As far as I can tell with my analysis of the IDs in the query you posted, we need to keep them all, so we don't actually need to specify the patterns at this point. I would rather do that thinking towards the future when the Plasmodium people might add new patterns to the ID system. > > >> > > >> I believe that the > > >> > > >> "MAL[0-9]*P1.[0-9]*" > > >> > > >> pattern is also not pulling out everything it needs to. But instead > > >> of including more patterns, I would rather just loosen up the criteria to > > >> include all things in the ORF tag. > > >> > > >> Also, I just want to be clear about the underscore issue. That only > > >> affects IDs that begin with PFA, not the other IDs that begin with PF##_ > > >> > > >> Thanks, > > >> Kam > > >> > > >> > > >> > > >> > > >> At 12:47 PM 3/18/2011, Richard Brous wrote: > > >>> I'm down in the bio lab at the moment looking at this. > > >>> > > >>> I understand what needs to be done in regards to 1) keeping all ORF id's and then 2) querying the id's with underscores to then remove the underscores but maintaining the original underscore id's. > > >>> > > >>> I performed an export with the pattern match as-is and commented out the exclude 'ordered locus' and 'ORF'. The export completed but with only 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML file as seen from the raw sql query. > > >>> > > >>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went ahead and added it into the pattern match string and called it as the others are called in the query. Once the export completes I'll confirm that in fact we have captured all the id's. Once confirmed, I will move onto the query to find id's with underscores and handle them as mentioned above. > > >>> > > >>> Richard > > >>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> wrote: > > >>> Thanks for info. I will dig into this after my 10 am exam tomorrow. > > >>> > > >>> Richard > > >>> > > >>> Sent from my iPhone > > >>> > > >>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote: > > >>> > > >>>> Hi, > > >>>> > > >>>> More information on the underscore issue: > > >>>> > > >>>> There is an ID with the pattern > > >>>> > > >>>> PFA_[0-9][0-9][0-9][0-9][wc] > > >>>> > > >>>> that needs to have the underscore removed so that reads instead > > >>>> > > >>>> PFA[0-9][0-9][0-9][0-9][wc] > > >>>> > > >>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they are there without the underscore and won't be recognized with it. I think we should leave the underscore ones there, but also have a set without the underscore. There are 134 records that have this issue. > > >>>> > > >>>> If Rich can make these two fixes (capturing the ORFs and dealing with the underscore), then I think we will be good to go with Plasmodium. There may be code in the Vibrio or Helicobacter profiles to help with the underscores, but I'm not sure. > > >>>> > > >>>> Best, > > >>>> Kam > > >>>> > > >>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote: > > >>>>> Hi, > > >>>>> > > >>>>> I've taken a look at the list of IDs and did a quick comparison with both the older released gdb and also a list I downloaded from the Broad Institute Plasmodium database. I think we can safely go with the query on the ORF tag for our export--all of those different ID forms are valid. There are about 400 IDs that are different in the older released gdb than in the new query; I'm going to further investigate those. I suspect that the difference is mainly due to a +/- underscore issue that we might need to solve. However, we should go forward with capturing all the IDs from the ORF tag, I don't see a need to restrict to a particular pattern there. > > >>>>> > > >>>>> Best, > > >>>>> Kam > > >>>>> > > >>>>> At 09:48 PM 3/14/2011, you wrote: > > >>>>>> Hi all, > > >>>>>> > > >>>>>> So I went ahead and did raw sql queries of the Postgres data and turned up the following: > > >>>>>> > > >>>>>> select * from genenametype where type = 'ordered locus' > > >>>>>> Returned zero gene ids > > >>>>>> > > >>>>>> select * from genenametype where type = 'ORF' > > >>>>>> Returned 5345 gene ids > > >>>>>> The type = 'ORF' query was exported into excel and posted to the biodb wiki on the Spring 2011 Plasmodium page. > > >>>>>> > > >>>>>> There are many many patterns in regards to gene ids, here the the prefixes from my cursory look: > > >>>>>> MAL > > >>>>>> PF##_ > > >>>>>> PFA > > >>>>>> PFB > > >>>>>> PFC > > >>>>>> PFD > > >>>>>> PFE > > >>>>>> PFF > > >>>>>> PFI > > >>>>>> PFL > > >>>>>> > > >>>>>> Richard > > >>>>>> > > >>>>>> > > >>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> wrote: > > >>>>>> Hi, > > >>>>>> I looked up an assortment of IDs in UniProt and I can confirm that it appears that the IDs are found in the ORF tag, not the OrderedLocus tag (except for the one that got captured in the export). > > >>>>>> Best, > > >>>>>> Kam > > >>>>>> At 08:09 AM 3/14/2011, you wrote: > > >>>>>>> Thanks Dondi, > > >>>>>>> > > >>>>>>> Will review this after our call today. I have been a little worried as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5 Gb of log files so far... /yikes > > >>>>>>> > > >>>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be working from home afterwards instead of in the bio lab. > > >>>>>>> > > >>>>>>> Richard > > >>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...> wrote: > > >>>>>>> Thanks for the updates, Rich. > > >>>>>>> I gave things a once-over and may have a lead. Here is what I found: > > >>>>>>> - First, the TallyEngine customization for P. falciparum states the following: > > >>>>>>> # Plasmodium falciparum > > >>>>>>> plasmodiumfalciparum_level_amount=2 > > >>>>>>> plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF > > >>>>>>> plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene > > >>>>>>> plasmodiumfalciparum_query_level0=select count(*) from genenametype where type = 'ORF'; > > >>>>>>> plasmodiumfalciparum_query_level1=select count(*) from genenametype where type = 'UniGene'; > > >>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus > > >>>>>>> plasmodiumfalciparum_table_name_level1=UniGene > > >>>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene names whose type is 'ORF' ("level0" properties). > > >>>>>>> - Now, this is what the P. falciparum species profile does when harvesting IDs (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > > >>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + > > >>>>>>> "from genenametype c inner join entrytype_genetype d " + > > >>>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + > > >>>>>>> "where (c.value similar to ? " + > > >>>>>>> "or c.value similar to ? " + > > >>>>>>> "or c.value similar to ?) " + > > >>>>>>> "and type <> 'ordered locus names' " + > > >>>>>>> "and type <> 'ORF' " + > > >>>>>>> "group by d.entrytype_gene_hjid, c.value"; > > >>>>>>> Note the condition on the second-to-last line --- the query actually *omits* gene names whose type is 'ORF'! So the question is...which is right? (I'm inclined to believe the Tally Engine here, since, the export puts only one record in OrderedLocusNames) > > >>>>>>> Still, comparing these two queries directly against the PostgreSQL database would be educational, I think. Then, knowing which criteria are correct, the appropriate action can then be taken, I think. > > >>>>>>> Hope this helps... > > >>>>>>> John David N. Dionisio, PhD > > >>>>>>> Associate Professor, Computer Science > > >>>>>>> Loyola Marymount University > > >>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > > >>>>>>> > Debug export is still going... 2.5GB of log files so far with progress at 65%... > > >>>>>>> > > > >>>>>>> > I posted the link of the WARN log on the plasmodium page here: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . > > >>>>>>> > Richard > > >>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> wrote: > > >>>>>>> > Hi all, > > >>>>>>> > > > >>>>>>> > Have been working through several Plasmodium gdb exports in an attempt to source why only one gene id makes it into the Ordered Locus table. > > >>>>>>> > > > >>>>>>> > I have reviewed the logger file while set to "WARN" and wasn't able to determine anything which would suggest an error. I will post this log file to the wiki later today when I get home. > > >>>>>>> > > > >>>>>>> > I then upped the logger verbosity to "DEBUG" and file size to 100MB with hopes that more detail will surface the issue, but my export is on hour 20 and still going (although its nearly complete). What I didn't expect was the size of the log files and that it seems only the last 3 are kept with earlier logs being overwritten =( I fear that the information I need it in one of the earlier files which are now lost. > > >>>>>>> > > > >>>>>>> > Unless a better suggestion is offered I'm going to rerun an export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each and hope that 3 GB total will be enough to hold the complete export log. > > >>>>>>> > > > >>>>>>> > More info as it comes... > > >>>>>>> > > > >>>>>>> > Richard > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> wrote: > > >>>>>>> > Hi, > > >>>>>>> > > > >>>>>>> > I've completed testing the Plasmodium gdb I exported last November and updated the SourceForge wiki. > > >>>>>>> > > > >>>>>>> > Plasmodium has it's own task list page, which I've updated here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > > >>>>>>> > > > >>>>>>> > The testing report can be found here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > > >>>>>>> > > > >>>>>>> > The source files and gdb are on a new Plasmodium falciparum page on the Fall 2010 BiolDB wiki: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > > >>>>>>> > > > >>>>>>> > Here is the list of bugs/action items that I've listed: > > >>>>>>> > > > >>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of 5345 repored by the TallyEngine. This also affects all other tables related to OrderedLocusNames. > > >>>>>>> > > > >>>>>>> > 2. The GeneId table in the database has 6 fewer IDs than reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium tuberculosis also have mysterious GeneId issues with the TallyEngine). > > >>>>>>> > > > >>>>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's lower than the 2009 version of the gdb. There's no way to tell at this point whether this is due to a change in annotation by UniProt or is a bug with GenMAPP Builder. > > >>>>>>> > > > >>>>>>> > Thanks, > > >>>>>>> > Kam > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > ------------------------------------------------------------------------------ > > >>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You > > >>>>>>> > This paper provides an overview of data connectivity, details > > >>>>>>> > its effect on application quality, and explores various alternative > > >>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d > > >>>>>>> > _______________________________________________ > > >>>>>>> > xmlpipedb-developer mailing list > > >>>>>>> > xml...@li... > > >>>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > <ATT00001..txt><ATT00002..txt> > > >>>>>>> ------------------------------------------------------------------------------ > > >>>>>>> Colocation vs. Managed Hosting > > >>>>>>> A question and answer guide to determining the best fit > > >>>>>>> for your organization - today and in the future. > > >>>>>>> http://p.sf.net/sfu/internap-sfd2d > > >>>>>>> _______________________________________________ > > >>>>>>> xmlpipedb-developer mailing list > > >>>>>>> xml...@li... > > >>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> ------------------------------------------------------------------------------ > > >> Colocation vs. Managed Hosting > > >> A question and answer guide to determining the best fit > > >> for your organization - today and in the future. > > >> http://p.sf.net/sfu/internap-sfd2d > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> ------------------------------------------------------------------------------ > > >> Colocation vs. Managed Hosting > > >> A question and answer guide to determining the best fit > > >> for your organization - today and in the future. > > >> http://p.sf.net/sfu/internap-sfd2d > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > > >> > > >> ------------------------------------------------------------------------------ > > >> Colocation vs. Managed Hosting > > >> A question and answer guide to determining the best fit > > >> for your organization - today and in the future. > > >> http://p.sf.net/sfu/internap-sfd2d > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > > > <ATT00001..txt><ATT00002..txt> > > > > > > ------------------------------------------------------------------------------ > > Colocation vs. Managed Hosting > > A question and answer guide to determining the best fit > > for your organization - today and in the future. > > http://p.sf.net/sfu/internap-sfd2d > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > <ATT00001..txt><ATT00002..txt> > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > <ATT00001..txt><ATT00002..txt> |
From: Richard B. <rbr...@gm...> - 2011-03-21 00:18:01
|
The pairs were returned by the raw sql where type = 'ORF' The exported gdb doesn't contain the duplicate id's which is why I think its ok. Am I missing something? Richard On Sun, Mar 20, 2011 at 4:00 PM, John David N. Dionisio <do...@lm...>wrote: > Thanks. Try reinstating that last group by clause; it may remove some > duplicate hjid/value pairs. > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > On Mar 20, 2011, at 1:54 PM, Richard Brous wrote: > > > Sure thing. Can send now that gmail working again... > > > > /** > > > > * > > > > @see > edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, > edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, > java.util.Date) > > */ > > > > > > @Override > > > > public TableManager getSystemTableManagerCustomizations(TableManager > tableManager, TableManager primarySystemTableManager, Date version) throws > SQLException, InvalidParameterException { > > > > // Start with the default OrderedLocusNames behavior. > > TableManager result = > > > > super.getSystemTableManagerCustomizations(tableManager, > primarySystemTableManager, version); > > > > > > // Next, we add IDs from the other gene/name tags, but ONLY if they match > > > > // the pattern PF[A-Z][0-9]{4}[a-z]. > > > > //final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]"; > > > > //final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]"; > > > > //final String pfID3 = "MAL[0-9]*P1.[0-9]*"; > > String sqlQuery = > > > > "select d.entrytype_gene_hjid as hjid, c.value " + > > > > "from genenametype c inner join entrytype_genetype d " + > > > > "on (c.entrytype_genetype_name_hjid = d.hjid) " + > > > > //"where (c.value similar to ? " + > > > > "where c.type = 'ORF'"; > > > > //"or c.value similar to ? " + > > > > //"or c.value similar to ?) " + > > > > //"and type <> 'ordered locus names' " + > > > > //"and type <> 'ORF' " + > > > > //"group by d.entrytype_gene_hjid, c.value"; > > > > String dateToday = GenMAPPBuilderUtilities.getSystemsDateString(version); > > > > Connection c = ConnectionManager.getRelationalDBConnection(); > > > > PreparedStatement ps; > > > > ResultSet rs; > > > > > > try { > > > > // Query, iterate, add to table manager. > > ps = c.prepareStatement(sqlQuery); > > > > > > //ps.setString(1, pfID); > > > > //ps.setString(2, pfID2); > > > > //ps.setString(3, pfID3); > > rs = ps.executeQuery(); > > > > > > while (rs.next()) { > > String hjid = Long.valueOf(rs.getLong( > > > > "hjid")).toString(); > > String id = rs.getString( > > > > "value"); > > > > _Log.debug("Processing raw ID: " + id + " for surrogate " + hjid); > > tableManager.submit( > > > > "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, { > "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", dateToday }, { > "UID", hjid } }); > > } > > > > } > > > > catch(SQLException sqlexc) { > > logSQLException(sqlexc, sqlQuery); > > > > } > > > > > > > > return result; > > } > > > > > > > > On Sun, Mar 20, 2011 at 1:28 PM, John David N. Dionisio <do...@lm...> > wrote: > > Hi Rich, > > > > Thanks for the updates. Did you post the new SQL query that you used > somewhere? Let me know where I can take a look at it. Or, if you haven't > put it out there yet, send it my way just for a quick look-see. Thanks! > > > > John David N. Dionisio, PhD > > Associate Professor, Computer Science > > Loyola Marymount University > > > > > > On Mar 20, 2011, at 12:15 PM, Richard Brous wrote: > > > > > Having weird probs with gmail in browser where the reply button isn't > working so have to send from my phone... Odd > > > > > > So I here is my analysis on the ORF only export gdb: > > > > > > Initially the raw SQL query returned 5345 records of which I posted an > excel doc link to the plasmodium page on the biodb wiki. > > > > > > The gdb contained only 5338 genes id's so I started looking for > duplicates or exclusions. > > > > > > What I found were 7 id's that were duplicated in the raw SQL query of > postgres but were not exported into the gdb. > > > > > > The id's are: > > > PF10_0168 > > > PF11_0361 (duplicated 2x for total of 3 entries) > > > PF11_0377 > > > PF11_0405 > > > PFB0305c > > > PFB0391c > > > > > > Surprisingly (to me anyway) I noticed that the duplicates all had > unique hjid's, which may or may not mean anything... > > > > > > In conclusion, it seems to me that the 5338 id's in the gdb are likely > correct. > > > > > > Dr. D, does that make sense? > > > > > > Richard > > > > > > Sent from my iPhone > > > > > > On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote: > > > > > >> Yes I understand that you only want the 'ORF' but was trying to get > what we need by modifying what we have and not rewite the whole query. > > >> > > >> I may in fact have to rewrite the whole thing anyway as my query > didn't return what was expected =/ > > >> > > >> Also I'm glad you mentioned a possible issue with MAL pattern, I'll > keep an eye out for missing id's here. > > >> > > >> Richard > > >> > > >> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> > wrote: > > >> Hi, > > >> > > >> I think we need to capture all of the IDs in the ORF tag and *NOT* do > the pattern match at all. As far as I can tell with my analysis of the IDs > in the query you posted, we need to keep them all, so we don't actually need > to specify the patterns at this point. I would rather do that thinking > towards the future when the Plasmodium people might add new patterns to the > ID system. > > >> > > >> I believe that the > > >> > > >> "MAL[0-9]*P1.[0-9]*" > > >> > > >> pattern is also not pulling out everything it needs to. But instead > > >> of including more patterns, I would rather just loosen up the criteria > to > > >> include all things in the ORF tag. > > >> > > >> Also, I just want to be clear about the underscore issue. That only > > >> affects IDs that begin with PFA, not the other IDs that begin with > PF##_ > > >> > > >> Thanks, > > >> Kam > > >> > > >> > > >> > > >> > > >> At 12:47 PM 3/18/2011, Richard Brous wrote: > > >>> I'm down in the bio lab at the moment looking at this. > > >>> > > >>> I understand what needs to be done in regards to 1) keeping all ORF > id's and then 2) querying the id's with underscores to then remove the > underscores but maintaining the original underscore id's. > > >>> > > >>> I performed an export with the pattern match as-is and commented out > the exclude 'ordered locus' and 'ORF'. The export completed but with only > 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML > file as seen from the raw sql query. > > >>> > > >>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went > ahead and added it into the pattern match string and called it as the others > are called in the query. Once the export completes I'll confirm that in fact > we have captured all the id's. Once confirmed, I will move onto the query to > find id's with underscores and handle them as mentioned above. > > >>> > > >>> Richard > > >>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> > wrote: > > >>> Thanks for info. I will dig into this after my 10 am exam tomorrow. > > >>> > > >>> Richard > > >>> > > >>> Sent from my iPhone > > >>> > > >>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> > wrote: > > >>> > > >>>> Hi, > > >>>> > > >>>> More information on the underscore issue: > > >>>> > > >>>> There is an ID with the pattern > > >>>> > > >>>> PFA_[0-9][0-9][0-9][0-9][wc] > > >>>> > > >>>> that needs to have the underscore removed so that reads instead > > >>>> > > >>>> PFA[0-9][0-9][0-9][0-9][wc] > > >>>> > > >>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they > are there without the underscore and won't be recognized with it. I think > we should leave the underscore ones there, but also have a set without the > underscore. There are 134 records that have this issue. > > >>>> > > >>>> If Rich can make these two fixes (capturing the ORFs and dealing > with the underscore), then I think we will be good to go with Plasmodium. > There may be code in the Vibrio or Helicobacter profiles to help with the > underscores, but I'm not sure. > > >>>> > > >>>> Best, > > >>>> Kam > > >>>> > > >>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote: > > >>>>> Hi, > > >>>>> > > >>>>> I've taken a look at the list of IDs and did a quick comparison > with both the older released gdb and also a list I downloaded from the Broad > Institute Plasmodium database. I think we can safely go with the query on > the ORF tag for our export--all of those different ID forms are valid. > There are about 400 IDs that are different in the older released gdb than > in the new query; I'm going to further investigate those. I suspect that > the difference is mainly due to a +/- underscore issue that we might need to > solve. However, we should go forward with capturing all the IDs from the > ORF tag, I don't see a need to restrict to a particular pattern there. > > >>>>> > > >>>>> Best, > > >>>>> Kam > > >>>>> > > >>>>> At 09:48 PM 3/14/2011, you wrote: > > >>>>>> Hi all, > > >>>>>> > > >>>>>> So I went ahead and did raw sql queries of the Postgres data and > turned up the following: > > >>>>>> > > >>>>>> select * from genenametype where type = 'ordered locus' > > >>>>>> Returned zero gene ids > > >>>>>> > > >>>>>> select * from genenametype where type = 'ORF' > > >>>>>> Returned 5345 gene ids > > >>>>>> The type = 'ORF' query was exported into excel and posted to the > biodb wiki on the Spring 2011 Plasmodium page. > > >>>>>> > > >>>>>> There are many many patterns in regards to gene ids, here the the > prefixes from my cursory look: > > >>>>>> MAL > > >>>>>> PF##_ > > >>>>>> PFA > > >>>>>> PFB > > >>>>>> PFC > > >>>>>> PFD > > >>>>>> PFE > > >>>>>> PFF > > >>>>>> PFI > > >>>>>> PFL > > >>>>>> > > >>>>>> Richard > > >>>>>> > > >>>>>> > > >>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist < > kda...@lm...> wrote: > > >>>>>> Hi, > > >>>>>> I looked up an assortment of IDs in UniProt and I can confirm that > it appears that the IDs are found in the ORF tag, not the OrderedLocus tag > (except for the one that got captured in the export). > > >>>>>> Best, > > >>>>>> Kam > > >>>>>> At 08:09 AM 3/14/2011, you wrote: > > >>>>>>> Thanks Dondi, > > >>>>>>> > > >>>>>>> Will review this after our call today. I have been a little > worried as the DEBUG export has been going for 2.5 days with progress at 65% > and 6.5 Gb of log files so far... /yikes > > >>>>>>> > > >>>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be > working from home afterwards instead of in the bio lab. > > >>>>>>> > > >>>>>>> Richard > > >>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio < > do...@lm...> wrote: > > >>>>>>> Thanks for the updates, Rich. > > >>>>>>> I gave things a once-over and may have a lead. Here is what I > found: > > >>>>>>> - First, the TallyEngine customization for P. falciparum states > the following: > > >>>>>>> # Plasmodium falciparum > > >>>>>>> plasmodiumfalciparum_level_amount=2 > > >>>>>>> > plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF > > >>>>>>> > plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene > > >>>>>>> plasmodiumfalciparum_query_level0=select count(*) from > genenametype where type = 'ORF'; > > >>>>>>> plasmodiumfalciparum_query_level1=select count(*) from > genenametype where type = 'UniGene'; > > >>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus > > >>>>>>> plasmodiumfalciparum_table_name_level1=UniGene > > >>>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are > the gene names whose type is 'ORF' ("level0" properties). > > >>>>>>> - Now, this is what the P. falciparum species profile does when > harvesting IDs > (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > > >>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, > c.value " + > > >>>>>>> "from genenametype c inner join entrytype_genetype d " > + > > >>>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + > > >>>>>>> "where (c.value similar to ? " + > > >>>>>>> "or c.value similar to ? " + > > >>>>>>> "or c.value similar to ?) " + > > >>>>>>> "and type <> 'ordered locus names' " + > > >>>>>>> "and type <> 'ORF' " + > > >>>>>>> "group by d.entrytype_gene_hjid, c.value"; > > >>>>>>> Note the condition on the second-to-last line --- the query > actually *omits* gene names whose type is 'ORF'! So the question is...which > is right? (I'm inclined to believe the Tally Engine here, since, the export > puts only one record in OrderedLocusNames) > > >>>>>>> Still, comparing these two queries directly against the > PostgreSQL database would be educational, I think. Then, knowing which > criteria are correct, the appropriate action can then be taken, I think. > > >>>>>>> Hope this helps... > > >>>>>>> John David N. Dionisio, PhD > > >>>>>>> Associate Professor, Computer Science > > >>>>>>> Loyola Marymount University > > >>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > > >>>>>>> > Debug export is still going... 2.5GB of log files so far with > progress at 65%... > > >>>>>>> > > > >>>>>>> > I posted the link of the WARN log on the plasmodium page here: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . > > >>>>>>> > Richard > > >>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous < > rbr...@gm...> wrote: > > >>>>>>> > Hi all, > > >>>>>>> > > > >>>>>>> > Have been working through several Plasmodium gdb exports in an > attempt to source why only one gene id makes it into the Ordered Locus > table. > > >>>>>>> > > > >>>>>>> > I have reviewed the logger file while set to "WARN" and wasn't > able to determine anything which would suggest an error. I will post this > log file to the wiki later today when I get home. > > >>>>>>> > > > >>>>>>> > I then upped the logger verbosity to "DEBUG" and file size to > 100MB with hopes that more detail will surface the issue, but my export is > on hour 20 and still going (although its nearly complete). What I didn't > expect was the size of the log files and that it seems only the last 3 are > kept with earlier logs being overwritten =( I fear that the information I > need it in one of the earlier files which are now lost. > > >>>>>>> > > > >>>>>>> > Unless a better suggestion is offered I'm going to rerun an > export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each > and hope that 3 GB total will be enough to hold the complete export log. > > >>>>>>> > > > >>>>>>> > More info as it comes... > > >>>>>>> > > > >>>>>>> > Richard > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist < > kda...@lm...> wrote: > > >>>>>>> > Hi, > > >>>>>>> > > > >>>>>>> > I've completed testing the Plasmodium gdb I exported last > November and updated the SourceForge wiki. > > >>>>>>> > > > >>>>>>> > Plasmodium has it's own task list page, which I've updated > here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > > >>>>>>> > > > >>>>>>> > The testing report can be found here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > > >>>>>>> > > > >>>>>>> > The source files and gdb are on a new Plasmodium falciparum > page on the Fall 2010 BiolDB wiki: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > > >>>>>>> > > > >>>>>>> > Here is the list of bugs/action items that I've listed: > > >>>>>>> > > > >>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of > 5345 repored by the TallyEngine. This also affects all other tables related > to OrderedLocusNames. > > >>>>>>> > > > >>>>>>> > 2. The GeneId table in the database has 6 fewer IDs than > reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium > tuberculosis also have mysterious GeneId issues with the TallyEngine). > > >>>>>>> > > > >>>>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's > lower than the 2009 version of the gdb. There's no way to tell at this point > whether this is due to a change in annotation by UniProt or is a bug with > GenMAPP Builder. > > >>>>>>> > > > >>>>>>> > Thanks, > > >>>>>>> > Kam > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > ------------------------------------------------------------------------------ > > >>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You > > >>>>>>> > This paper provides an overview of data connectivity, details > > >>>>>>> > its effect on application quality, and explores various > alternative > > >>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d > > >>>>>>> > _______________________________________________ > > >>>>>>> > xmlpipedb-developer mailing list > > >>>>>>> > xml...@li... > > >>>>>>> > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > <ATT00001..txt><ATT00002..txt> > > >>>>>>> > ------------------------------------------------------------------------------ > > >>>>>>> Colocation vs. Managed Hosting > > >>>>>>> A question and answer guide to determining the best fit > > >>>>>>> for your organization - today and in the future. > > >>>>>>> http://p.sf.net/sfu/internap-sfd2d > > >>>>>>> _______________________________________________ > > >>>>>>> xmlpipedb-developer mailing list > > >>>>>>> xml...@li... > > >>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > ------------------------------------------------------------------------------ > > >> Colocation vs. Managed Hosting > > >> A question and answer guide to determining the best fit > > >> for your organization - today and in the future. > > >> http://p.sf.net/sfu/internap-sfd2d > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > ------------------------------------------------------------------------------ > > >> Colocation vs. Managed Hosting > > >> A question and answer guide to determining the best fit > > >> for your organization - today and in the future. > > >> http://p.sf.net/sfu/internap-sfd2d > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > > >> > > >> > ------------------------------------------------------------------------------ > > >> Colocation vs. Managed Hosting > > >> A question and answer guide to determining the best fit > > >> for your organization - today and in the future. > > >> http://p.sf.net/sfu/internap-sfd2d > > >> _______________________________________________ > > >> xmlpipedb-developer mailing list > > >> xml...@li... > > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > >> > > >> > > > <ATT00001..txt><ATT00002..txt> > > > > > > > ------------------------------------------------------------------------------ > > Colocation vs. Managed Hosting > > A question and answer guide to determining the best fit > > for your organization - today and in the future. > > http://p.sf.net/sfu/internap-sfd2d > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > <ATT00001..txt><ATT00002..txt> > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > |
From: John D. N. D. <do...@lm...> - 2011-03-20 23:01:05
|
Thanks. Try reinstating that last group by clause; it may remove some duplicate hjid/value pairs. John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Mar 20, 2011, at 1:54 PM, Richard Brous wrote: > Sure thing. Can send now that gmail working again... > > /** > > * > > @see edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, java.util.Date) > */ > > > @Override > > public TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) throws SQLException, InvalidParameterException { > > // Start with the default OrderedLocusNames behavior. > TableManager result = > > super.getSystemTableManagerCustomizations(tableManager, primarySystemTableManager, version); > > > // Next, we add IDs from the other gene/name tags, but ONLY if they match > > // the pattern PF[A-Z][0-9]{4}[a-z]. > > //final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]"; > > //final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]"; > > //final String pfID3 = "MAL[0-9]*P1.[0-9]*"; > String sqlQuery = > > "select d.entrytype_gene_hjid as hjid, c.value " + > > "from genenametype c inner join entrytype_genetype d " + > > "on (c.entrytype_genetype_name_hjid = d.hjid) " + > > //"where (c.value similar to ? " + > > "where c.type = 'ORF'"; > > //"or c.value similar to ? " + > > //"or c.value similar to ?) " + > > //"and type <> 'ordered locus names' " + > > //"and type <> 'ORF' " + > > //"group by d.entrytype_gene_hjid, c.value"; > > String dateToday = GenMAPPBuilderUtilities.getSystemsDateString(version); > > Connection c = ConnectionManager.getRelationalDBConnection(); > > PreparedStatement ps; > > ResultSet rs; > > > try { > > // Query, iterate, add to table manager. > ps = c.prepareStatement(sqlQuery); > > > //ps.setString(1, pfID); > > //ps.setString(2, pfID2); > > //ps.setString(3, pfID3); > rs = ps.executeQuery(); > > > while (rs.next()) { > String hjid = Long.valueOf(rs.getLong( > > "hjid")).toString(); > String id = rs.getString( > > "value"); > > _Log.debug("Processing raw ID: " + id + " for surrogate " + hjid); > tableManager.submit( > > "OrderedLocusNames", QueryType.insert, new String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", dateToday }, { "UID", hjid } }); > } > > } > > catch(SQLException sqlexc) { > logSQLException(sqlexc, sqlQuery); > > } > > > > return result; > } > > > > On Sun, Mar 20, 2011 at 1:28 PM, John David N. Dionisio <do...@lm...> wrote: > Hi Rich, > > Thanks for the updates. Did you post the new SQL query that you used somewhere? Let me know where I can take a look at it. Or, if you haven't put it out there yet, send it my way just for a quick look-see. Thanks! > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > On Mar 20, 2011, at 12:15 PM, Richard Brous wrote: > > > Having weird probs with gmail in browser where the reply button isn't working so have to send from my phone... Odd > > > > So I here is my analysis on the ORF only export gdb: > > > > Initially the raw SQL query returned 5345 records of which I posted an excel doc link to the plasmodium page on the biodb wiki. > > > > The gdb contained only 5338 genes id's so I started looking for duplicates or exclusions. > > > > What I found were 7 id's that were duplicated in the raw SQL query of postgres but were not exported into the gdb. > > > > The id's are: > > PF10_0168 > > PF11_0361 (duplicated 2x for total of 3 entries) > > PF11_0377 > > PF11_0405 > > PFB0305c > > PFB0391c > > > > Surprisingly (to me anyway) I noticed that the duplicates all had unique hjid's, which may or may not mean anything... > > > > In conclusion, it seems to me that the 5338 id's in the gdb are likely correct. > > > > Dr. D, does that make sense? > > > > Richard > > > > Sent from my iPhone > > > > On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote: > > > >> Yes I understand that you only want the 'ORF' but was trying to get what we need by modifying what we have and not rewite the whole query. > >> > >> I may in fact have to rewrite the whole thing anyway as my query didn't return what was expected =/ > >> > >> Also I'm glad you mentioned a possible issue with MAL pattern, I'll keep an eye out for missing id's here. > >> > >> Richard > >> > >> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> wrote: > >> Hi, > >> > >> I think we need to capture all of the IDs in the ORF tag and *NOT* do the pattern match at all. As far as I can tell with my analysis of the IDs in the query you posted, we need to keep them all, so we don't actually need to specify the patterns at this point. I would rather do that thinking towards the future when the Plasmodium people might add new patterns to the ID system. > >> > >> I believe that the > >> > >> "MAL[0-9]*P1.[0-9]*" > >> > >> pattern is also not pulling out everything it needs to. But instead > >> of including more patterns, I would rather just loosen up the criteria to > >> include all things in the ORF tag. > >> > >> Also, I just want to be clear about the underscore issue. That only > >> affects IDs that begin with PFA, not the other IDs that begin with PF##_ > >> > >> Thanks, > >> Kam > >> > >> > >> > >> > >> At 12:47 PM 3/18/2011, Richard Brous wrote: > >>> I'm down in the bio lab at the moment looking at this. > >>> > >>> I understand what needs to be done in regards to 1) keeping all ORF id's and then 2) querying the id's with underscores to then remove the underscores but maintaining the original underscore id's. > >>> > >>> I performed an export with the pattern match as-is and commented out the exclude 'ordered locus' and 'ORF'. The export completed but with only 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML file as seen from the raw sql query. > >>> > >>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went ahead and added it into the pattern match string and called it as the others are called in the query. Once the export completes I'll confirm that in fact we have captured all the id's. Once confirmed, I will move onto the query to find id's with underscores and handle them as mentioned above. > >>> > >>> Richard > >>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> wrote: > >>> Thanks for info. I will dig into this after my 10 am exam tomorrow. > >>> > >>> Richard > >>> > >>> Sent from my iPhone > >>> > >>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote: > >>> > >>>> Hi, > >>>> > >>>> More information on the underscore issue: > >>>> > >>>> There is an ID with the pattern > >>>> > >>>> PFA_[0-9][0-9][0-9][0-9][wc] > >>>> > >>>> that needs to have the underscore removed so that reads instead > >>>> > >>>> PFA[0-9][0-9][0-9][0-9][wc] > >>>> > >>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they are there without the underscore and won't be recognized with it. I think we should leave the underscore ones there, but also have a set without the underscore. There are 134 records that have this issue. > >>>> > >>>> If Rich can make these two fixes (capturing the ORFs and dealing with the underscore), then I think we will be good to go with Plasmodium. There may be code in the Vibrio or Helicobacter profiles to help with the underscores, but I'm not sure. > >>>> > >>>> Best, > >>>> Kam > >>>> > >>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote: > >>>>> Hi, > >>>>> > >>>>> I've taken a look at the list of IDs and did a quick comparison with both the older released gdb and also a list I downloaded from the Broad Institute Plasmodium database. I think we can safely go with the query on the ORF tag for our export--all of those different ID forms are valid. There are about 400 IDs that are different in the older released gdb than in the new query; I'm going to further investigate those. I suspect that the difference is mainly due to a +/- underscore issue that we might need to solve. However, we should go forward with capturing all the IDs from the ORF tag, I don't see a need to restrict to a particular pattern there. > >>>>> > >>>>> Best, > >>>>> Kam > >>>>> > >>>>> At 09:48 PM 3/14/2011, you wrote: > >>>>>> Hi all, > >>>>>> > >>>>>> So I went ahead and did raw sql queries of the Postgres data and turned up the following: > >>>>>> > >>>>>> select * from genenametype where type = 'ordered locus' > >>>>>> Returned zero gene ids > >>>>>> > >>>>>> select * from genenametype where type = 'ORF' > >>>>>> Returned 5345 gene ids > >>>>>> The type = 'ORF' query was exported into excel and posted to the biodb wiki on the Spring 2011 Plasmodium page. > >>>>>> > >>>>>> There are many many patterns in regards to gene ids, here the the prefixes from my cursory look: > >>>>>> MAL > >>>>>> PF##_ > >>>>>> PFA > >>>>>> PFB > >>>>>> PFC > >>>>>> PFD > >>>>>> PFE > >>>>>> PFF > >>>>>> PFI > >>>>>> PFL > >>>>>> > >>>>>> Richard > >>>>>> > >>>>>> > >>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> wrote: > >>>>>> Hi, > >>>>>> I looked up an assortment of IDs in UniProt and I can confirm that it appears that the IDs are found in the ORF tag, not the OrderedLocus tag (except for the one that got captured in the export). > >>>>>> Best, > >>>>>> Kam > >>>>>> At 08:09 AM 3/14/2011, you wrote: > >>>>>>> Thanks Dondi, > >>>>>>> > >>>>>>> Will review this after our call today. I have been a little worried as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5 Gb of log files so far... /yikes > >>>>>>> > >>>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be working from home afterwards instead of in the bio lab. > >>>>>>> > >>>>>>> Richard > >>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...> wrote: > >>>>>>> Thanks for the updates, Rich. > >>>>>>> I gave things a once-over and may have a lead. Here is what I found: > >>>>>>> - First, the TallyEngine customization for P. falciparum states the following: > >>>>>>> # Plasmodium falciparum > >>>>>>> plasmodiumfalciparum_level_amount=2 > >>>>>>> plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF > >>>>>>> plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene > >>>>>>> plasmodiumfalciparum_query_level0=select count(*) from genenametype where type = 'ORF'; > >>>>>>> plasmodiumfalciparum_query_level1=select count(*) from genenametype where type = 'UniGene'; > >>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus > >>>>>>> plasmodiumfalciparum_table_name_level1=UniGene > >>>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene names whose type is 'ORF' ("level0" properties). > >>>>>>> - Now, this is what the P. falciparum species profile does when harvesting IDs (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > >>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + > >>>>>>> "from genenametype c inner join entrytype_genetype d " + > >>>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + > >>>>>>> "where (c.value similar to ? " + > >>>>>>> "or c.value similar to ? " + > >>>>>>> "or c.value similar to ?) " + > >>>>>>> "and type <> 'ordered locus names' " + > >>>>>>> "and type <> 'ORF' " + > >>>>>>> "group by d.entrytype_gene_hjid, c.value"; > >>>>>>> Note the condition on the second-to-last line --- the query actually *omits* gene names whose type is 'ORF'! So the question is...which is right? (I'm inclined to believe the Tally Engine here, since, the export puts only one record in OrderedLocusNames) > >>>>>>> Still, comparing these two queries directly against the PostgreSQL database would be educational, I think. Then, knowing which criteria are correct, the appropriate action can then be taken, I think. > >>>>>>> Hope this helps... > >>>>>>> John David N. Dionisio, PhD > >>>>>>> Associate Professor, Computer Science > >>>>>>> Loyola Marymount University > >>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > >>>>>>> > Debug export is still going... 2.5GB of log files so far with progress at 65%... > >>>>>>> > > >>>>>>> > I posted the link of the WARN log on the plasmodium page here: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . > >>>>>>> > Richard > >>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> wrote: > >>>>>>> > Hi all, > >>>>>>> > > >>>>>>> > Have been working through several Plasmodium gdb exports in an attempt to source why only one gene id makes it into the Ordered Locus table. > >>>>>>> > > >>>>>>> > I have reviewed the logger file while set to "WARN" and wasn't able to determine anything which would suggest an error. I will post this log file to the wiki later today when I get home. > >>>>>>> > > >>>>>>> > I then upped the logger verbosity to "DEBUG" and file size to 100MB with hopes that more detail will surface the issue, but my export is on hour 20 and still going (although its nearly complete). What I didn't expect was the size of the log files and that it seems only the last 3 are kept with earlier logs being overwritten =( I fear that the information I need it in one of the earlier files which are now lost. > >>>>>>> > > >>>>>>> > Unless a better suggestion is offered I'm going to rerun an export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each and hope that 3 GB total will be enough to hold the complete export log. > >>>>>>> > > >>>>>>> > More info as it comes... > >>>>>>> > > >>>>>>> > Richard > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> wrote: > >>>>>>> > Hi, > >>>>>>> > > >>>>>>> > I've completed testing the Plasmodium gdb I exported last November and updated the SourceForge wiki. > >>>>>>> > > >>>>>>> > Plasmodium has it's own task list page, which I've updated here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > >>>>>>> > > >>>>>>> > The testing report can be found here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > >>>>>>> > > >>>>>>> > The source files and gdb are on a new Plasmodium falciparum page on the Fall 2010 BiolDB wiki: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > >>>>>>> > > >>>>>>> > Here is the list of bugs/action items that I've listed: > >>>>>>> > > >>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of 5345 repored by the TallyEngine. This also affects all other tables related to OrderedLocusNames. > >>>>>>> > > >>>>>>> > 2. The GeneId table in the database has 6 fewer IDs than reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium tuberculosis also have mysterious GeneId issues with the TallyEngine). > >>>>>>> > > >>>>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's lower than the 2009 version of the gdb. There's no way to tell at this point whether this is due to a change in annotation by UniProt or is a bug with GenMAPP Builder. > >>>>>>> > > >>>>>>> > Thanks, > >>>>>>> > Kam > >>>>>>> > > >>>>>>> > > >>>>>>> > ------------------------------------------------------------------------------ > >>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You > >>>>>>> > This paper provides an overview of data connectivity, details > >>>>>>> > its effect on application quality, and explores various alternative > >>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d > >>>>>>> > _______________________________________________ > >>>>>>> > xmlpipedb-developer mailing list > >>>>>>> > xml...@li... > >>>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > <ATT00001..txt><ATT00002..txt> > >>>>>>> ------------------------------------------------------------------------------ > >>>>>>> Colocation vs. Managed Hosting > >>>>>>> A question and answer guide to determining the best fit > >>>>>>> for your organization - today and in the future. > >>>>>>> http://p.sf.net/sfu/internap-sfd2d > >>>>>>> _______________________________________________ > >>>>>>> xmlpipedb-developer mailing list > >>>>>>> xml...@li... > >>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> ------------------------------------------------------------------------------ > >> Colocation vs. Managed Hosting > >> A question and answer guide to determining the best fit > >> for your organization - today and in the future. > >> http://p.sf.net/sfu/internap-sfd2d > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> ------------------------------------------------------------------------------ > >> Colocation vs. Managed Hosting > >> A question and answer guide to determining the best fit > >> for your organization - today and in the future. > >> http://p.sf.net/sfu/internap-sfd2d > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> > >> > >> ------------------------------------------------------------------------------ > >> Colocation vs. Managed Hosting > >> A question and answer guide to determining the best fit > >> for your organization - today and in the future. > >> http://p.sf.net/sfu/internap-sfd2d > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> > > <ATT00001..txt><ATT00002..txt> > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > <ATT00001..txt><ATT00002..txt> |
From: Richard B. <rbr...@gm...> - 2011-03-20 20:54:24
|
Sure thing. Can send now that gmail working again... /** * *@see*edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.profiles.UniProtSpeciesProfile#getSystemTableManagerCustomizations(edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, edu.lmu.xmlpipedb.gmbuilder.databasetoolkit.tables.TableManager, java.util.Date) */ @Override *public* TableManager getSystemTableManagerCustomizations(TableManager tableManager, TableManager primarySystemTableManager, Date version) *throws*SQLException, InvalidParameterException { // Start with the default OrderedLocusNames behavior. TableManager result = *super*.getSystemTableManagerCustomizations(tableManager, primarySystemTableManager, version); // Next, we add IDs from the other gene/name tags, but ONLY if they match // the pattern PF[A-Z][0-9]{4}[a-z]. //final String pfID = "PF[A-Z][0-9][0-9][0-9][0-9][a-z]"; //final String pfID2 = "PF[0-9][0-9]_[0-9][0-9][0-9][0-9]"; //final String pfID3 = "MAL[0-9]*P1.[0-9]*"; String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + "from genenametype c inner join entrytype_genetype d " + "on (c.entrytype_genetype_name_hjid = d.hjid) " + //"where (c.value similar to ? " + "where c.type = 'ORF'"; //"or c.value similar to ? " + //"or c.value similar to ?) " + //"and type <> 'ordered locus names' " + //"and type <> 'ORF' " + //"group by d.entrytype_gene_hjid, c.value"; String dateToday = GenMAPPBuilderUtilities.*getSystemsDateString*(version); Connection c = ConnectionManager.*getRelationalDBConnection*(); PreparedStatement ps; ResultSet rs; *try* { // Query, iterate, add to table manager. ps = c.prepareStatement(sqlQuery); //ps.setString(1, pfID); //ps.setString(2, pfID2); //ps.setString(3, pfID3); rs = ps.executeQuery(); *while* (rs.next()) { String hjid = Long.*valueOf*(rs.getLong("hjid")).toString(); String id = rs.getString("value"); *_Log*.debug("Processing raw ID: " + id + " for surrogate " + hjid); tableManager.submit("OrderedLocusNames", QueryType.*insert*, *new*String[][] { { "ID", id }, { "Species", "|" + getSpeciesName() + "|" }, { "\"Date\"", dateToday }, { "UID", hjid } }); } } *catch*(SQLException sqlexc) { logSQLException(sqlexc, sqlQuery); } *return* result; } On Sun, Mar 20, 2011 at 1:28 PM, John David N. Dionisio <do...@lm...>wrote: > Hi Rich, > > Thanks for the updates. Did you post the new SQL query that you used > somewhere? Let me know where I can take a look at it. Or, if you haven't > put it out there yet, send it my way just for a quick look-see. Thanks! > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > On Mar 20, 2011, at 12:15 PM, Richard Brous wrote: > > > Having weird probs with gmail in browser where the reply button isn't > working so have to send from my phone... Odd > > > > So I here is my analysis on the ORF only export gdb: > > > > Initially the raw SQL query returned 5345 records of which I posted an > excel doc link to the plasmodium page on the biodb wiki. > > > > The gdb contained only 5338 genes id's so I started looking for > duplicates or exclusions. > > > > What I found were 7 id's that were duplicated in the raw SQL query of > postgres but were not exported into the gdb. > > > > The id's are: > > PF10_0168 > > PF11_0361 (duplicated 2x for total of 3 entries) > > PF11_0377 > > PF11_0405 > > PFB0305c > > PFB0391c > > > > Surprisingly (to me anyway) I noticed that the duplicates all had unique > hjid's, which may or may not mean anything... > > > > In conclusion, it seems to me that the 5338 id's in the gdb are likely > correct. > > > > Dr. D, does that make sense? > > > > Richard > > > > Sent from my iPhone > > > > On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote: > > > >> Yes I understand that you only want the 'ORF' but was trying to get what > we need by modifying what we have and not rewite the whole query. > >> > >> I may in fact have to rewrite the whole thing anyway as my query didn't > return what was expected =/ > >> > >> Also I'm glad you mentioned a possible issue with MAL pattern, I'll keep > an eye out for missing id's here. > >> > >> Richard > >> > >> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> > wrote: > >> Hi, > >> > >> I think we need to capture all of the IDs in the ORF tag and *NOT* do > the pattern match at all. As far as I can tell with my analysis of the IDs > in the query you posted, we need to keep them all, so we don't actually need > to specify the patterns at this point. I would rather do that thinking > towards the future when the Plasmodium people might add new patterns to the > ID system. > >> > >> I believe that the > >> > >> "MAL[0-9]*P1.[0-9]*" > >> > >> pattern is also not pulling out everything it needs to. But instead > >> of including more patterns, I would rather just loosen up the criteria > to > >> include all things in the ORF tag. > >> > >> Also, I just want to be clear about the underscore issue. That only > >> affects IDs that begin with PFA, not the other IDs that begin with PF##_ > >> > >> Thanks, > >> Kam > >> > >> > >> > >> > >> At 12:47 PM 3/18/2011, Richard Brous wrote: > >>> I'm down in the bio lab at the moment looking at this. > >>> > >>> I understand what needs to be done in regards to 1) keeping all ORF > id's and then 2) querying the id's with underscores to then remove the > underscores but maintaining the original underscore id's. > >>> > >>> I performed an export with the pattern match as-is and commented out > the exclude 'ordered locus' and 'ORF'. The export completed but with only > 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML > file as seen from the raw sql query. > >>> > >>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went > ahead and added it into the pattern match string and called it as the others > are called in the query. Once the export completes I'll confirm that in fact > we have captured all the id's. Once confirmed, I will move onto the query to > find id's with underscores and handle them as mentioned above. > >>> > >>> Richard > >>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> > wrote: > >>> Thanks for info. I will dig into this after my 10 am exam tomorrow. > >>> > >>> Richard > >>> > >>> Sent from my iPhone > >>> > >>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote: > >>> > >>>> Hi, > >>>> > >>>> More information on the underscore issue: > >>>> > >>>> There is an ID with the pattern > >>>> > >>>> PFA_[0-9][0-9][0-9][0-9][wc] > >>>> > >>>> that needs to have the underscore removed so that reads instead > >>>> > >>>> PFA[0-9][0-9][0-9][0-9][wc] > >>>> > >>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they are > there without the underscore and won't be recognized with it. I think we > should leave the underscore ones there, but also have a set without the > underscore. There are 134 records that have this issue. > >>>> > >>>> If Rich can make these two fixes (capturing the ORFs and dealing with > the underscore), then I think we will be good to go with Plasmodium. There > may be code in the Vibrio or Helicobacter profiles to help with the > underscores, but I'm not sure. > >>>> > >>>> Best, > >>>> Kam > >>>> > >>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote: > >>>>> Hi, > >>>>> > >>>>> I've taken a look at the list of IDs and did a quick comparison with > both the older released gdb and also a list I downloaded from the Broad > Institute Plasmodium database. I think we can safely go with the query on > the ORF tag for our export--all of those different ID forms are valid. > There are about 400 IDs that are different in the older released gdb than > in the new query; I'm going to further investigate those. I suspect that > the difference is mainly due to a +/- underscore issue that we might need to > solve. However, we should go forward with capturing all the IDs from the > ORF tag, I don't see a need to restrict to a particular pattern there. > >>>>> > >>>>> Best, > >>>>> Kam > >>>>> > >>>>> At 09:48 PM 3/14/2011, you wrote: > >>>>>> Hi all, > >>>>>> > >>>>>> So I went ahead and did raw sql queries of the Postgres data and > turned up the following: > >>>>>> > >>>>>> select * from genenametype where type = 'ordered locus' > >>>>>> Returned zero gene ids > >>>>>> > >>>>>> select * from genenametype where type = 'ORF' > >>>>>> Returned 5345 gene ids > >>>>>> The type = 'ORF' query was exported into excel and posted to the > biodb wiki on the Spring 2011 Plasmodium page. > >>>>>> > >>>>>> There are many many patterns in regards to gene ids, here the the > prefixes from my cursory look: > >>>>>> MAL > >>>>>> PF##_ > >>>>>> PFA > >>>>>> PFB > >>>>>> PFC > >>>>>> PFD > >>>>>> PFE > >>>>>> PFF > >>>>>> PFI > >>>>>> PFL > >>>>>> > >>>>>> Richard > >>>>>> > >>>>>> > >>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> > wrote: > >>>>>> Hi, > >>>>>> I looked up an assortment of IDs in UniProt and I can confirm that > it appears that the IDs are found in the ORF tag, not the OrderedLocus tag > (except for the one that got captured in the export). > >>>>>> Best, > >>>>>> Kam > >>>>>> At 08:09 AM 3/14/2011, you wrote: > >>>>>>> Thanks Dondi, > >>>>>>> > >>>>>>> Will review this after our call today. I have been a little worried > as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5 > Gb of log files so far... /yikes > >>>>>>> > >>>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be > working from home afterwards instead of in the bio lab. > >>>>>>> > >>>>>>> Richard > >>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio < > do...@lm...> wrote: > >>>>>>> Thanks for the updates, Rich. > >>>>>>> I gave things a once-over and may have a lead. Here is what I > found: > >>>>>>> - First, the TallyEngine customization for P. falciparum states the > following: > >>>>>>> # Plasmodium falciparum > >>>>>>> plasmodiumfalciparum_level_amount=2 > >>>>>>> > plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF > >>>>>>> > plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene > >>>>>>> plasmodiumfalciparum_query_level0=select count(*) from genenametype > where type = 'ORF'; > >>>>>>> plasmodiumfalciparum_query_level1=select count(*) from genenametype > where type = 'UniGene'; > >>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus > >>>>>>> plasmodiumfalciparum_table_name_level1=UniGene > >>>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are > the gene names whose type is 'ORF' ("level0" properties). > >>>>>>> - Now, this is what the P. falciparum species profile does when > harvesting IDs > (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > >>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, > c.value " + > >>>>>>> "from genenametype c inner join entrytype_genetype d " + > >>>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + > >>>>>>> "where (c.value similar to ? " + > >>>>>>> "or c.value similar to ? " + > >>>>>>> "or c.value similar to ?) " + > >>>>>>> "and type <> 'ordered locus names' " + > >>>>>>> "and type <> 'ORF' " + > >>>>>>> "group by d.entrytype_gene_hjid, c.value"; > >>>>>>> Note the condition on the second-to-last line --- the query > actually *omits* gene names whose type is 'ORF'! So the question is...which > is right? (I'm inclined to believe the Tally Engine here, since, the export > puts only one record in OrderedLocusNames) > >>>>>>> Still, comparing these two queries directly against the PostgreSQL > database would be educational, I think. Then, knowing which criteria are > correct, the appropriate action can then be taken, I think. > >>>>>>> Hope this helps... > >>>>>>> John David N. Dionisio, PhD > >>>>>>> Associate Professor, Computer Science > >>>>>>> Loyola Marymount University > >>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > >>>>>>> > Debug export is still going... 2.5GB of log files so far with > progress at 65%... > >>>>>>> > > >>>>>>> > I posted the link of the WARN log on the plasmodium page here: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . > >>>>>>> > Richard > >>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous < > rbr...@gm...> wrote: > >>>>>>> > Hi all, > >>>>>>> > > >>>>>>> > Have been working through several Plasmodium gdb exports in an > attempt to source why only one gene id makes it into the Ordered Locus > table. > >>>>>>> > > >>>>>>> > I have reviewed the logger file while set to "WARN" and wasn't > able to determine anything which would suggest an error. I will post this > log file to the wiki later today when I get home. > >>>>>>> > > >>>>>>> > I then upped the logger verbosity to "DEBUG" and file size to > 100MB with hopes that more detail will surface the issue, but my export is > on hour 20 and still going (although its nearly complete). What I didn't > expect was the size of the log files and that it seems only the last 3 are > kept with earlier logs being overwritten =( I fear that the information I > need it in one of the earlier files which are now lost. > >>>>>>> > > >>>>>>> > Unless a better suggestion is offered I'm going to rerun an > export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each > and hope that 3 GB total will be enough to hold the complete export log. > >>>>>>> > > >>>>>>> > More info as it comes... > >>>>>>> > > >>>>>>> > Richard > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist < > kda...@lm...> wrote: > >>>>>>> > Hi, > >>>>>>> > > >>>>>>> > I've completed testing the Plasmodium gdb I exported last > November and updated the SourceForge wiki. > >>>>>>> > > >>>>>>> > Plasmodium has it's own task list page, which I've updated here: > > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > >>>>>>> > > >>>>>>> > The testing report can be found here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > >>>>>>> > > >>>>>>> > The source files and gdb are on a new Plasmodium falciparum page > on the Fall 2010 BiolDB wiki: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > >>>>>>> > > >>>>>>> > Here is the list of bugs/action items that I've listed: > >>>>>>> > > >>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of > 5345 repored by the TallyEngine. This also affects all other tables related > to OrderedLocusNames. > >>>>>>> > > >>>>>>> > 2. The GeneId table in the database has 6 fewer IDs than > reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium > tuberculosis also have mysterious GeneId issues with the TallyEngine). > >>>>>>> > > >>>>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's lower > than the 2009 version of the gdb. There's no way to tell at this point > whether this is due to a change in annotation by UniProt or is a bug with > GenMAPP Builder. > >>>>>>> > > >>>>>>> > Thanks, > >>>>>>> > Kam > >>>>>>> > > >>>>>>> > > >>>>>>> > > ------------------------------------------------------------------------------ > >>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You > >>>>>>> > This paper provides an overview of data connectivity, details > >>>>>>> > its effect on application quality, and explores various > alternative > >>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d > >>>>>>> > _______________________________________________ > >>>>>>> > xmlpipedb-developer mailing list > >>>>>>> > xml...@li... > >>>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > <ATT00001..txt><ATT00002..txt> > >>>>>>> > ------------------------------------------------------------------------------ > >>>>>>> Colocation vs. Managed Hosting > >>>>>>> A question and answer guide to determining the best fit > >>>>>>> for your organization - today and in the future. > >>>>>>> http://p.sf.net/sfu/internap-sfd2d > >>>>>>> _______________________________________________ > >>>>>>> xmlpipedb-developer mailing list > >>>>>>> xml...@li... > >>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > ------------------------------------------------------------------------------ > >> Colocation vs. Managed Hosting > >> A question and answer guide to determining the best fit > >> for your organization - today and in the future. > >> http://p.sf.net/sfu/internap-sfd2d > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> > ------------------------------------------------------------------------------ > >> Colocation vs. Managed Hosting > >> A question and answer guide to determining the best fit > >> for your organization - today and in the future. > >> http://p.sf.net/sfu/internap-sfd2d > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> > >> > >> > ------------------------------------------------------------------------------ > >> Colocation vs. Managed Hosting > >> A question and answer guide to determining the best fit > >> for your organization - today and in the future. > >> http://p.sf.net/sfu/internap-sfd2d > >> _______________________________________________ > >> xmlpipedb-developer mailing list > >> xml...@li... > >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > >> > >> > > <ATT00001..txt><ATT00002..txt> > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > |
From: John D. N. D. <do...@lm...> - 2011-03-20 20:28:53
|
Hi Rich, Thanks for the updates. Did you post the new SQL query that you used somewhere? Let me know where I can take a look at it. Or, if you haven't put it out there yet, send it my way just for a quick look-see. Thanks! John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Mar 20, 2011, at 12:15 PM, Richard Brous wrote: > Having weird probs with gmail in browser where the reply button isn't working so have to send from my phone... Odd > > So I here is my analysis on the ORF only export gdb: > > Initially the raw SQL query returned 5345 records of which I posted an excel doc link to the plasmodium page on the biodb wiki. > > The gdb contained only 5338 genes id's so I started looking for duplicates or exclusions. > > What I found were 7 id's that were duplicated in the raw SQL query of postgres but were not exported into the gdb. > > The id's are: > PF10_0168 > PF11_0361 (duplicated 2x for total of 3 entries) > PF11_0377 > PF11_0405 > PFB0305c > PFB0391c > > Surprisingly (to me anyway) I noticed that the duplicates all had unique hjid's, which may or may not mean anything... > > In conclusion, it seems to me that the 5338 id's in the gdb are likely correct. > > Dr. D, does that make sense? > > Richard > > Sent from my iPhone > > On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote: > >> Yes I understand that you only want the 'ORF' but was trying to get what we need by modifying what we have and not rewite the whole query. >> >> I may in fact have to rewrite the whole thing anyway as my query didn't return what was expected =/ >> >> Also I'm glad you mentioned a possible issue with MAL pattern, I'll keep an eye out for missing id's here. >> >> Richard >> >> On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> wrote: >> Hi, >> >> I think we need to capture all of the IDs in the ORF tag and *NOT* do the pattern match at all. As far as I can tell with my analysis of the IDs in the query you posted, we need to keep them all, so we don't actually need to specify the patterns at this point. I would rather do that thinking towards the future when the Plasmodium people might add new patterns to the ID system. >> >> I believe that the >> >> "MAL[0-9]*P1.[0-9]*" >> >> pattern is also not pulling out everything it needs to. But instead >> of including more patterns, I would rather just loosen up the criteria to >> include all things in the ORF tag. >> >> Also, I just want to be clear about the underscore issue. That only >> affects IDs that begin with PFA, not the other IDs that begin with PF##_ >> >> Thanks, >> Kam >> >> >> >> >> At 12:47 PM 3/18/2011, Richard Brous wrote: >>> I'm down in the bio lab at the moment looking at this. >>> >>> I understand what needs to be done in regards to 1) keeping all ORF id's and then 2) querying the id's with underscores to then remove the underscores but maintaining the original underscore id's. >>> >>> I performed an export with the pattern match as-is and commented out the exclude 'ordered locus' and 'ORF'. The export completed but with only 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML file as seen from the raw sql query. >>> >>> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went ahead and added it into the pattern match string and called it as the others are called in the query. Once the export completes I'll confirm that in fact we have captured all the id's. Once confirmed, I will move onto the query to find id's with underscores and handle them as mentioned above. >>> >>> Richard >>> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> wrote: >>> Thanks for info. I will dig into this after my 10 am exam tomorrow. >>> >>> Richard >>> >>> Sent from my iPhone >>> >>> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote: >>> >>>> Hi, >>>> >>>> More information on the underscore issue: >>>> >>>> There is an ID with the pattern >>>> >>>> PFA_[0-9][0-9][0-9][0-9][wc] >>>> >>>> that needs to have the underscore removed so that reads instead >>>> >>>> PFA[0-9][0-9][0-9][0-9][wc] >>>> >>>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they are there without the underscore and won't be recognized with it. I think we should leave the underscore ones there, but also have a set without the underscore. There are 134 records that have this issue. >>>> >>>> If Rich can make these two fixes (capturing the ORFs and dealing with the underscore), then I think we will be good to go with Plasmodium. There may be code in the Vibrio or Helicobacter profiles to help with the underscores, but I'm not sure. >>>> >>>> Best, >>>> Kam >>>> >>>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote: >>>>> Hi, >>>>> >>>>> I've taken a look at the list of IDs and did a quick comparison with both the older released gdb and also a list I downloaded from the Broad Institute Plasmodium database. I think we can safely go with the query on the ORF tag for our export--all of those different ID forms are valid. There are about 400 IDs that are different in the older released gdb than in the new query; I'm going to further investigate those. I suspect that the difference is mainly due to a +/- underscore issue that we might need to solve. However, we should go forward with capturing all the IDs from the ORF tag, I don't see a need to restrict to a particular pattern there. >>>>> >>>>> Best, >>>>> Kam >>>>> >>>>> At 09:48 PM 3/14/2011, you wrote: >>>>>> Hi all, >>>>>> >>>>>> So I went ahead and did raw sql queries of the Postgres data and turned up the following: >>>>>> >>>>>> select * from genenametype where type = 'ordered locus' >>>>>> Returned zero gene ids >>>>>> >>>>>> select * from genenametype where type = 'ORF' >>>>>> Returned 5345 gene ids >>>>>> The type = 'ORF' query was exported into excel and posted to the biodb wiki on the Spring 2011 Plasmodium page. >>>>>> >>>>>> There are many many patterns in regards to gene ids, here the the prefixes from my cursory look: >>>>>> MAL >>>>>> PF##_ >>>>>> PFA >>>>>> PFB >>>>>> PFC >>>>>> PFD >>>>>> PFE >>>>>> PFF >>>>>> PFI >>>>>> PFL >>>>>> >>>>>> Richard >>>>>> >>>>>> >>>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> wrote: >>>>>> Hi, >>>>>> I looked up an assortment of IDs in UniProt and I can confirm that it appears that the IDs are found in the ORF tag, not the OrderedLocus tag (except for the one that got captured in the export). >>>>>> Best, >>>>>> Kam >>>>>> At 08:09 AM 3/14/2011, you wrote: >>>>>>> Thanks Dondi, >>>>>>> >>>>>>> Will review this after our call today. I have been a little worried as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5 Gb of log files so far... /yikes >>>>>>> >>>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be working from home afterwards instead of in the bio lab. >>>>>>> >>>>>>> Richard >>>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...> wrote: >>>>>>> Thanks for the updates, Rich. >>>>>>> I gave things a once-over and may have a lead. Here is what I found: >>>>>>> - First, the TallyEngine customization for P. falciparum states the following: >>>>>>> # Plasmodium falciparum >>>>>>> plasmodiumfalciparum_level_amount=2 >>>>>>> plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF >>>>>>> plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene >>>>>>> plasmodiumfalciparum_query_level0=select count(*) from genenametype where type = 'ORF'; >>>>>>> plasmodiumfalciparum_query_level1=select count(*) from genenametype where type = 'UniGene'; >>>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus >>>>>>> plasmodiumfalciparum_table_name_level1=UniGene >>>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene names whose type is 'ORF' ("level0" properties). >>>>>>> - Now, this is what the P. falciparum species profile does when harvesting IDs (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): >>>>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + >>>>>>> "from genenametype c inner join entrytype_genetype d " + >>>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + >>>>>>> "where (c.value similar to ? " + >>>>>>> "or c.value similar to ? " + >>>>>>> "or c.value similar to ?) " + >>>>>>> "and type <> 'ordered locus names' " + >>>>>>> "and type <> 'ORF' " + >>>>>>> "group by d.entrytype_gene_hjid, c.value"; >>>>>>> Note the condition on the second-to-last line --- the query actually *omits* gene names whose type is 'ORF'! So the question is...which is right? (I'm inclined to believe the Tally Engine here, since, the export puts only one record in OrderedLocusNames) >>>>>>> Still, comparing these two queries directly against the PostgreSQL database would be educational, I think. Then, knowing which criteria are correct, the appropriate action can then be taken, I think. >>>>>>> Hope this helps... >>>>>>> John David N. Dionisio, PhD >>>>>>> Associate Professor, Computer Science >>>>>>> Loyola Marymount University >>>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: >>>>>>> > Debug export is still going... 2.5GB of log files so far with progress at 65%... >>>>>>> > >>>>>>> > I posted the link of the WARN log on the plasmodium page here: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . >>>>>>> > Richard >>>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> wrote: >>>>>>> > Hi all, >>>>>>> > >>>>>>> > Have been working through several Plasmodium gdb exports in an attempt to source why only one gene id makes it into the Ordered Locus table. >>>>>>> > >>>>>>> > I have reviewed the logger file while set to "WARN" and wasn't able to determine anything which would suggest an error. I will post this log file to the wiki later today when I get home. >>>>>>> > >>>>>>> > I then upped the logger verbosity to "DEBUG" and file size to 100MB with hopes that more detail will surface the issue, but my export is on hour 20 and still going (although its nearly complete). What I didn't expect was the size of the log files and that it seems only the last 3 are kept with earlier logs being overwritten =( I fear that the information I need it in one of the earlier files which are now lost. >>>>>>> > >>>>>>> > Unless a better suggestion is offered I'm going to rerun an export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each and hope that 3 GB total will be enough to hold the complete export log. >>>>>>> > >>>>>>> > More info as it comes... >>>>>>> > >>>>>>> > Richard >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> wrote: >>>>>>> > Hi, >>>>>>> > >>>>>>> > I've completed testing the Plasmodium gdb I exported last November and updated the SourceForge wiki. >>>>>>> > >>>>>>> > Plasmodium has it's own task list page, which I've updated here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List >>>>>>> > >>>>>>> > The testing report can be found here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 >>>>>>> > >>>>>>> > The source files and gdb are on a new Plasmodium falciparum page on the Fall 2010 BiolDB wiki: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >>>>>>> > >>>>>>> > Here is the list of bugs/action items that I've listed: >>>>>>> > >>>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of 5345 repored by the TallyEngine. This also affects all other tables related to OrderedLocusNames. >>>>>>> > >>>>>>> > 2. The GeneId table in the database has 6 fewer IDs than reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium tuberculosis also have mysterious GeneId issues with the TallyEngine). >>>>>>> > >>>>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's lower than the 2009 version of the gdb. There's no way to tell at this point whether this is due to a change in annotation by UniProt or is a bug with GenMAPP Builder. >>>>>>> > >>>>>>> > Thanks, >>>>>>> > Kam >>>>>>> > >>>>>>> > >>>>>>> > ------------------------------------------------------------------------------ >>>>>>> > What You Don't Know About Data Connectivity CAN Hurt You >>>>>>> > This paper provides an overview of data connectivity, details >>>>>>> > its effect on application quality, and explores various alternative >>>>>>> > solutions. http://p.sf.net/sfu/progress-d2d >>>>>>> > _______________________________________________ >>>>>>> > xmlpipedb-developer mailing list >>>>>>> > xml...@li... >>>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > <ATT00001..txt><ATT00002..txt> >>>>>>> ------------------------------------------------------------------------------ >>>>>>> Colocation vs. Managed Hosting >>>>>>> A question and answer guide to determining the best fit >>>>>>> for your organization - today and in the future. >>>>>>> http://p.sf.net/sfu/internap-sfd2d >>>>>>> _______________________________________________ >>>>>>> xmlpipedb-developer mailing list >>>>>>> xml...@li... >>>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> ------------------------------------------------------------------------------ >> Colocation vs. Managed Hosting >> A question and answer guide to determining the best fit >> for your organization - today and in the future. >> http://p.sf.net/sfu/internap-sfd2d >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> ------------------------------------------------------------------------------ >> Colocation vs. Managed Hosting >> A question and answer guide to determining the best fit >> for your organization - today and in the future. >> http://p.sf.net/sfu/internap-sfd2d >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> >> >> ------------------------------------------------------------------------------ >> Colocation vs. Managed Hosting >> A question and answer guide to determining the best fit >> for your organization - today and in the future. >> http://p.sf.net/sfu/internap-sfd2d >> _______________________________________________ >> xmlpipedb-developer mailing list >> xml...@li... >> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> > <ATT00001..txt><ATT00002..txt> |
From: Richard B. <rbr...@gm...> - 2011-03-20 19:15:17
|
Having weird probs with gmail in browser where the reply button isn't working so have to send from my phone... Odd So I here is my analysis on the ORF only export gdb: Initially the raw SQL query returned 5345 records of which I posted an excel doc link to the plasmodium page on the biodb wiki. The gdb contained only 5338 genes id's so I started looking for duplicates or exclusions. What I found were 7 id's that were duplicated in the raw SQL query of postgres but were not exported into the gdb. The id's are: PF10_0168 PF11_0361 (duplicated 2x for total of 3 entries) PF11_0377 PF11_0405 PFB0305c PFB0391c Surprisingly (to me anyway) I noticed that the duplicates all had unique hjid's, which may or may not mean anything... In conclusion, it seems to me that the 5338 id's in the gdb are likely correct. Dr. D, does that make sense? Richard Sent from my iPhone On Mar 18, 2011, at 1:16 PM, Richard Brous <rbr...@gm...> wrote: > Yes I understand that you only want the 'ORF' but was trying to get what we need by modifying what we have and not rewite the whole query. > > I may in fact have to rewrite the whole thing anyway as my query didn't return what was expected =/ > > Also I'm glad you mentioned a possible issue with MAL pattern, I'll keep an eye out for missing id's here. > > Richard > > On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I think we need to capture all of the IDs in the ORF tag and *NOT* do the pattern match at all. As far as I can tell with my analysis of the IDs in the query you posted, we need to keep them all, so we don't actually need to specify the patterns at this point. I would rather do that thinking towards the future when the Plasmodium people might add new patterns to the ID system. > > I believe that the > > "MAL[0-9]*P1.[0-9]*" > > pattern is also not pulling out everything it needs to. But instead > of including more patterns, I would rather just loosen up the criteria to > include all things in the ORF tag. > > Also, I just want to be clear about the underscore issue. That only > affects IDs that begin with PFA, not the other IDs that begin with PF##_ > > Thanks, > Kam > > > > At 12:47 PM 3/18/2011, Richard Brous wrote: >> >> I'm down in the bio lab at the moment looking at this. >> >> I understand what needs to be done in regards to 1) keeping all ORF id's and then 2) querying the id's with underscores to then remove the underscores but maintaining the original underscore id's. >> >> I performed an export with the pattern match as-is and commented out the exclude 'ordered locus' and 'ORF'. The export completed but with only 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML file as seen from the raw sql query. >> >> Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went ahead and added it into the pattern match string and called it as the others are called in the query. Once the export completes I'll confirm that in fact we have captured all the id's. Once confirmed, I will move onto the query to find id's with underscores and handle them as mentioned above. >> >> Richard >> On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> wrote: >> Thanks for info. I will dig into this after my 10 am exam tomorrow. >> >> Richard >> >> Sent from my iPhone >> >> On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote: >> >>> Hi, >>> >>> More information on the underscore issue: >>> >>> There is an ID with the pattern >>> >>> PFA_[0-9][0-9][0-9][0-9][wc] >>> >>> that needs to have the underscore removed so that reads instead >>> >>> PFA[0-9][0-9][0-9][0-9][wc] >>> >>> I don't know why these IDs exist in UniProt, but in PlasmoDB, they are there without the underscore and won't be recognized with it. I think we should leave the underscore ones there, but also have a set without the underscore. There are 134 records that have this issue. >>> >>> If Rich can make these two fixes (capturing the ORFs and dealing with the underscore), then I think we will be good to go with Plasmodium. There may be code in the Vibrio or Helicobacter profiles to help with the underscores, but I'm not sure. >>> >>> Best, >>> Kam >>> >>> At 02:59 PM 3/16/2011, Kam Dahlquist wrote: >>>> Hi, >>>> >>>> I've taken a look at the list of IDs and did a quick comparison with both the older released gdb and also a list I downloaded from the Broad Institute Plasmodium database. I think we can safely go with the query on the ORF tag for our export--all of those different ID forms are valid. There are about 400 IDs that are different in the older released gdb than in the new query; I'm going to further investigate those. I suspect that the difference is mainly due to a +/- underscore issue that we might need to solve. However, we should go forward with capturing all the IDs from the ORF tag, I don't see a need to restrict to a particular pattern there. >>>> >>>> Best, >>>> Kam >>>> >>>> At 09:48 PM 3/14/2011, you wrote: >>>>> Hi all, >>>>> >>>>> So I went ahead and did raw sql queries of the Postgres data and turned up the following: >>>>> >>>>> select * from genenametype where type = 'ordered locus' >>>>> Returned zero gene ids >>>>> >>>>> select * from genenametype where type = 'ORF' >>>>> Returned 5345 gene ids >>>>> The type = 'ORF' query was exported into excel and posted to the biodb wiki on the Spring 2011 Plasmodium page. >>>>> >>>>> There are many many patterns in regards to gene ids, here the the prefixes from my cursory look: >>>>> MAL >>>>> PF##_ >>>>> PFA >>>>> PFB >>>>> PFC >>>>> PFD >>>>> PFE >>>>> PFF >>>>> PFI >>>>> PFL >>>>> >>>>> Richard >>>>> >>>>> >>>>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> wrote: >>>>> Hi, >>>>> I looked up an assortment of IDs in UniProt and I can confirm that it appears that the IDs are found in the ORF tag, not the OrderedLocus tag (except for the one that got captured in the export). >>>>> Best, >>>>> Kam >>>>> At 08:09 AM 3/14/2011, you wrote: >>>>>> >>>>>> Thanks Dondi, >>>>>> >>>>>> Will review this after our call today. I have been a little worried as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5 Gb of log files so far... /yikes >>>>>> >>>>>> Btw I have a work lunch meeting in Beverly Hills today so will be working from home afterwards instead of in the bio lab. >>>>>> >>>>>> Richard >>>>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...> wrote: >>>>>> Thanks for the updates, Rich. >>>>>> I gave things a once-over and may have a lead. Here is what I found: >>>>>> - First, the TallyEngine customization for P. falciparum states the following: >>>>>> # Plasmodium falciparum >>>>>> plasmodiumfalciparum_level_amount=2 >>>>>> plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF >>>>>> plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene >>>>>> plasmodiumfalciparum_query_level0=select count(*) from genenametype where type = 'ORF'; >>>>>> plasmodiumfalciparum_query_level1=select count(*) from genenametype where type = 'UniGene'; >>>>>> plasmodiumfalciparum_table_name_level0=Ordered Locus >>>>>> plasmodiumfalciparum_table_name_level1=UniGene >>>>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene names whose type is 'ORF' ("level0" properties). >>>>>> - Now, this is what the P. falciparum species profile does when harvesting IDs (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): >>>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + >>>>>> "from genenametype c inner join entrytype_genetype d " + >>>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + >>>>>> "where (c.value similar to ? " + >>>>>> "or c.value similar to ? " + >>>>>> "or c.value similar to ?) " + >>>>>> "and type <> 'ordered locus names' " + >>>>>> "and type <> 'ORF' " + >>>>>> "group by d.entrytype_gene_hjid, c.value"; >>>>>> Note the condition on the second-to-last line --- the query actually *omits* gene names whose type is 'ORF'! So the question is...which is right? (I'm inclined to believe the Tally Engine here, since, the export puts only one record in OrderedLocusNames) >>>>>> Still, comparing these two queries directly against the PostgreSQL database would be educational, I think. Then, knowing which criteria are correct, the appropriate action can then be taken, I think. >>>>>> Hope this helps... >>>>>> John David N. Dionisio, PhD >>>>>> Associate Professor, Computer Science >>>>>> Loyola Marymount University >>>>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: >>>>>> > Debug export is still going... 2.5GB of log files so far with progress at 65%... >>>>>> > >>>>>> > I posted the link of the WARN log on the plasmodium page here: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . >>>>>> > Richard >>>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> wrote: >>>>>> > Hi all, >>>>>> > >>>>>> > Have been working through several Plasmodium gdb exports in an attempt to source why only one gene id makes it into the Ordered Locus table. >>>>>> > >>>>>> > I have reviewed the logger file while set to "WARN" and wasn't able to determine anything which would suggest an error. I will post this log file to the wiki later today when I get home. >>>>>> > >>>>>> > I then upped the logger verbosity to "DEBUG" and file size to 100MB with hopes that more detail will surface the issue, but my export is on hour 20 and still going (although its nearly complete). What I didn't expect was the size of the log files and that it seems only the last 3 are kept with earlier logs being overwritten =( I fear that the information I need it in one of the earlier files which are now lost. >>>>>> > >>>>>> > Unless a better suggestion is offered I'm going to rerun an export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each and hope that 3 GB total will be enough to hold the complete export log. >>>>>> > >>>>>> > More info as it comes... >>>>>> > >>>>>> > Richard >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> wrote: >>>>>> > Hi, >>>>>> > >>>>>> > I've completed testing the Plasmodium gdb I exported last November and updated the SourceForge wiki. >>>>>> > >>>>>> > Plasmodium has it's own task list page, which I've updated here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List >>>>>> > >>>>>> > The testing report can be found here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 >>>>>> > >>>>>> > The source files and gdb are on a new Plasmodium falciparum page on the Fall 2010 BiolDB wiki: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >>>>>> > >>>>>> > Here is the list of bugs/action items that I've listed: >>>>>> > >>>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of 5345 repored by the TallyEngine. This also affects all other tables related to OrderedLocusNames. >>>>>> > >>>>>> > 2. The GeneId table in the database has 6 fewer IDs than reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium tuberculosis also have mysterious GeneId issues with the TallyEngine). >>>>>> > >>>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's lower than the 2009 version of the gdb. There's no way to tell at this point whether this is due to a change in annotation by UniProt or is a bug with GenMAPP Builder. >>>>>> > >>>>>> > Thanks, >>>>>> > Kam >>>>>> > >>>>>> > >>>>>> > ------------------------------------------------------------------------------ >>>>>> > What You Don't Know About Data Connectivity CAN Hurt You >>>>>> > This paper provides an overview of data connectivity, details >>>>>> > its effect on application quality, and explores various alternative >>>>>> > solutions. http://p.sf.net/sfu/progress-d2d >>>>>> > _______________________________________________ >>>>>> > xmlpipedb-developer mailing list >>>>>> > xml...@li... >>>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>>>> > >>>>>> > >>>>>> > >>>>>> > <ATT00001..txt><ATT00002..txt> >>>>>> ------------------------------------------------------------------------------ >>>>>> Colocation vs. Managed Hosting >>>>>> A question and answer guide to determining the best fit >>>>>> for your organization - today and in the future. >>>>>> http://p.sf.net/sfu/internap-sfd2d >>>>>> _______________________________________________ >>>>>> xmlpipedb-developer mailing list >>>>>> xml...@li... >>>>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>> > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: Richard B. <rbr...@gm...> - 2011-03-18 20:16:16
|
Yes I understand that you only want the 'ORF' but was trying to get what we need by modifying what we have and not rewite the whole query. I may in fact have to rewrite the whole thing anyway as my query didn't return what was expected =/ Also I'm glad you mentioned a possible issue with MAL pattern, I'll keep an eye out for missing id's here. Richard On Fri, Mar 18, 2011 at 12:04 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I think we need to capture all of the IDs in the ORF tag and *NOT* do the > pattern match at all. As far as I can tell with my analysis of the IDs in > the query you posted, we need to keep them all, so we don't actually need to > specify the patterns at this point. I would rather do that thinking towards > the future when the Plasmodium people might add new patterns to the ID > system. > > I believe that the > > "MAL[0-9]*P1.[0-9]*" > > pattern is also not pulling out everything it needs to. But instead > of including more patterns, I would rather just loosen up the criteria to > include all things in the ORF tag. > > Also, I just want to be clear about the underscore issue. That only > affects IDs that begin with PFA, not the other IDs that begin with PF##_ > > Thanks, > Kam > > > > > At 12:47 PM 3/18/2011, Richard Brous wrote: > > I'm down in the bio lab at the moment looking at this. > > I understand what needs to be done in regards to 1) keeping all ORF id's > and then 2) querying the id's with underscores to then remove the > underscores but maintaining the original underscore id's. > > I performed an export with the pattern match as-is and commented out the > exclude 'ordered locus' and 'ORF'. The export completed but with only 5110 > gene id's. So it seems we are missing 235 gene id's that are in the XML file > as seen from the raw sql query. > > Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went ahead > and added it into the pattern match string and called it as the others are > called in the query. Once the export completes I'll confirm that in fact we > have captured all the id's. Once confirmed, I will move onto the query to > find id's with underscores and handle them as mentioned above. > > Richard > On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> wrote: > Thanks for info. I will dig into this after my 10 am exam tomorrow. > > Richard > > Sent from my iPhone > > On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote: > > Hi, > > More information on the underscore issue: > > There is an ID with the pattern > > PFA_[0-9][0-9][0-9][0-9][wc] > > that needs to have the underscore removed so that reads instead > > PFA[0-9][0-9][0-9][0-9][wc] > > I don't know why these IDs exist in UniProt, but in PlasmoDB, they are > there without the underscore and won't be recognized with it. I think we > should leave the underscore ones there, but also have a set without the > underscore. There are 134 records that have this issue. > > If Rich can make these two fixes (capturing the ORFs and dealing with the > underscore), then I think we will be good to go with Plasmodium. There may > be code in the Vibrio or Helicobacter profiles to help with the underscores, > but I'm not sure. > > Best, > Kam > > At 02:59 PM 3/16/2011, Kam Dahlquist wrote: > > Hi, > > I've taken a look at the list of IDs and did a quick comparison with both > the older released gdb and also a list I downloaded from the Broad Institute > Plasmodium database. I think we can safely go with the query on the ORF tag > for our export--all of those different ID forms are valid. There are about > 400 IDs that are different in the older released gdb than in the new query; > I'm going to further investigate those. I suspect that the difference is > mainly due to a +/- underscore issue that we might need to solve. However, > we should go forward with capturing all the IDs from the ORF tag, I don't > see a need to restrict to a particular pattern there. > > Best, > Kam > > At 09:48 PM 3/14/2011, you wrote: > > Hi all, > > So I went ahead and did raw sql queries of the Postgres data and turned up > the following: > > select * from genenametype where type = 'ordered locus' > Returned zero gene ids > > select * from genenametype where type = 'ORF' > Returned 5345 gene ids > The type = 'ORF' query was exported into excel and posted to the biodb wiki > on the Spring 2011 Plasmodium page. > > There are many many patterns in regards to gene ids, here the the prefixes > from my cursory look: > MAL > PF##_ > PFA > PFB > PFC > PFD > PFE > PFF > PFI > PFL > > Richard > > > On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> > wrote: Hi, I looked up an assortment of IDs in UniProt and I can confirm > that it appears that the IDs are found in the ORF tag, not the OrderedLocus > tag (except for the one that got captured in the export). Best, Kam > At 08:09 AM 3/14/2011, you wrote: > > Thanks Dondi, Will review this after our call today. I have been a > little worried as the DEBUG export has been going for 2.5 days with progress > at 65% and 6.5 Gb of log files so far... /yikes Btw I have a work lunch > meeting in Beverly Hills today so will be working from home afterwards > instead of in the bio lab. Richard On Sun, Mar 13, 2011 at 9:55 PM, John > David N. Dionisio <do...@lm...> wrote: Thanks for the updates, Rich. I > gave things a once-over and may have a lead. Here is what I found: - > First, the TallyEngine customization for P. falciparum states the following: > # Plasmodium falciparum plasmodiumfalciparum_level_amount=2 plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF > plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene plasmodiumfalciparum_query_level0=select > count(*) from genenametype where type = 'ORF'; plasmodiumfalciparum_query_level1=select > count(*) from genenametype where type = 'UniGene'; plasmodiumfalciparum_table_name_level0=Ordered > Locus plasmodiumfalciparum_table_name_level1=UniGene Thus, what is being > counted by TallyEngine as "Ordered Locus" are the gene names whose type is > 'ORF' ("level0" properties). - Now, this is what the P. falciparum species > profile does when harvesting IDs > (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + > "from genenametype c inner join entrytype_genetype d " + > "on (c.entrytype_genetype_name_hjid = d.hjid) " + "where > (c.value similar to ? " + "or c.value similar to ? " + > "or c.value similar to ?) " + "and type <> 'ordered locus > names' " + "and type <> 'ORF' " + "group by > d.entrytype_gene_hjid, c.value"; Note the condition on the second-to-last > line --- the query actually *omits* gene names whose type is 'ORF'! So the > question is...which is right? (I'm inclined to believe the Tally Engine > here, since, the export puts only one record in OrderedLocusNames) Still, > comparing these two queries directly against the PostgreSQL database would > be educational, I think. Then, knowing which criteria are correct, the > appropriate action can then be taken, I think. Hope this helps... John > David N. Dionisio, PhD Associate Professor, Computer Science Loyola > Marymount University > On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > Debug export is still > going... 2.5GB of log files so far with progress at 65%... > > I posted > the link of the WARN log on the plasmodium page here: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . > > Richard > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous < > rbr...@gm...> wrote: > Hi all, > > Have been working through several > Plasmodium gdb exports in an attempt to source why only one gene id makes it > into the Ordered Locus table. > > I have reviewed the logger file while > set to "WARN" and wasn't able to determine anything which would suggest an > error. I will post this log file to the wiki later today when I get home. > > > I then upped the logger verbosity to "DEBUG" and file size to 100MB with > hopes that more detail will surface the issue, but my export is on hour 20 > and still going (although its nearly complete). What I didn't expect was the > size of the log files and that it seems only the last 3 are kept with > earlier logs being overwritten =( I fear that the information I need it in > one of the earlier files which are now lost. > > Unless a better > suggestion is offered I'm going to rerun an export again with 'DEBUG" > verbosity and up the file sizes to near 1 GB each and hope that 3 GB total > will be enough to hold the complete export log. > > More info as it > comes... > > Richard > > > > > > On Fri, Mar 4, 2011 at 3:17 PM, Kam > Dahlquist <kda...@lm...> wrote: > Hi, > > I've completed testing the > Plasmodium gdb I exported last November and updated the SourceForge wiki. > > > Plasmodium has it's own task list page, which I've updated here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > > > The testing report can be found here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > > > The source files and gdb are on a new Plasmodium falciparum page on the > Fall 2010 BiolDB wiki: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > > > Here is the list of bugs/action items that I've listed: > > 1. The > OrderedLocusNames table in the gdb only has 1 ID out of 5345 repored by the > TallyEngine. This also affects all other tables related to > OrderedLocusNames. > > 2. The GeneId table in the database has 6 fewer > IDs than reported by the TallyEngine (Mycobacterium smegmatis and > Mycobacterium tuberculosis also have mysterious GeneId issues with the > TallyEngine). > > 3. The count for EMBL IDs in the gdb also seems low, > it's lower than the 2009 version of the gdb. There's no way to tell at this > point whether this is due to a change in annotation by UniProt or is a bug > with GenMAPP Builder. > > Thanks, > Kam > > > > ------------------------------------------------------------------------------ > > What You Don't Know About Data Connectivity CAN Hurt You > This paper > provides an overview of data connectivity, details > its effect on > application quality, and explores various alternative > solutions. > http://p.sf.net/sfu/progress-d2d > > _______________________________________________ > xmlpipedb-developer > mailing list > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > <ATT00001..txt><ATT00002..txt> ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting A question and answer guide to determining > the best fit for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d _______________________________________________ > xmlpipedb-developer mailing list xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > ------------------------------------------------------------------------------ > > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: Kam D. <kda...@lm...> - 2011-03-18 20:04:51
|
Hi, I think we need to capture all of the IDs in the ORF tag and *NOT* do the pattern match at all. As far as I can tell with my analysis of the IDs in the query you posted, we need to keep them all, so we don't actually need to specify the patterns at this point. I would rather do that thinking towards the future when the Plasmodium people might add new patterns to the ID system. I believe that the "MAL[0-9]*P1.[0-9]*" pattern is also not pulling out everything it needs to. But instead of including more patterns, I would rather just loosen up the criteria to include all things in the ORF tag. Also, I just want to be clear about the underscore issue. That only affects IDs that begin with PFA, not the other IDs that begin with PF##_ Thanks, Kam At 12:47 PM 3/18/2011, Richard Brous wrote: >I'm down in the bio lab at the moment looking at this. > >I understand what needs to be done in regards to 1) keeping all ORF >id's and then 2) querying the id's with underscores to then remove >the underscores but maintaining the original underscore id's. > >I performed an export with the pattern match as-is and commented out >the exclude 'ordered locus' and 'ORF'. The export completed but with >only 5110 gene id's. So it seems we are missing 235 gene id's that >are in the XML file as seen from the raw sql query. > >Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I >went ahead and added it into the pattern match string and called it >as the others are called in the query. Once the export completes >I'll confirm that in fact we have captured all the id's. Once >confirmed, I will move onto the query to find id's with underscores >and handle them as mentioned above. > >Richard >On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous ><<mailto:rbr...@gm...>rbr...@gm...> wrote: >Thanks for info. I will dig into this after my 10 am exam tomorrow. > >Richard > >Sent from my iPhone > >On Mar 16, 2011, at 4:18 PM, Kam Dahlquist ><<mailto:kda...@lm...>kda...@lm...> wrote: > >>Hi, >> >>More information on the underscore issue: >> >>There is an ID with the pattern >> >>PFA_[0-9][0-9][0-9][0-9][wc] >> >>that needs to have the underscore removed so that reads instead >> >>PFA[0-9][0-9][0-9][0-9][wc] >> >>I don't know why these IDs exist in UniProt, but in PlasmoDB, they >>are there without the underscore and won't be recognized with >>it. I think we should leave the underscore ones there, but also >>have a set without the underscore. There are 134 records that have this issue. >> >>If Rich can make these two fixes (capturing the ORFs and dealing >>with the underscore), then I think we will be good to go with >>Plasmodium. There may be code in the Vibrio or Helicobacter >>profiles to help with the underscores, but I'm not sure. >> >>Best, >>Kam >> >>At 02:59 PM 3/16/2011, Kam Dahlquist wrote: >>>Hi, >>> >>>I've taken a look at the list of IDs and did a quick comparison >>>with both the older released gdb and also a list I downloaded from >>>the Broad Institute Plasmodium database. I think we can safely go >>>with the query on the ORF tag for our export--all of those >>>different ID forms are valid. There are about 400 IDs that are >>>different in the older released gdb than in the new query; I'm >>>going to further investigate those. I suspect that the difference >>>is mainly due to a +/- underscore issue that we might need to >>>solve. However, we should go forward with capturing all the IDs >>>from the ORF tag, I don't see a need to restrict to a particular pattern there. >>> >>>Best, >>>Kam >>> >>>At 09:48 PM 3/14/2011, you wrote: >>>>Hi all, >>>> >>>>So I went ahead and did raw sql queries of the Postgres data and >>>>turned up the following: >>>> >>>>select * from genenametype where type = 'ordered locus' >>>>Returned zero gene ids >>>> >>>>select * from genenametype where type = 'ORF' >>>>Returned 5345 gene ids >>>>The type = 'ORF' query was exported into excel and posted to the >>>>biodb wiki on the Spring 2011 Plasmodium page. >>>> >>>>There are many many patterns in regards to gene ids, here the the >>>>prefixes from my cursory look: >>>>MAL >>>>PF##_ >>>>PFA >>>>PFB >>>>PFC >>>>PFD >>>>PFE >>>>PFF >>>>PFI >>>>PFL >>>> >>>>Richard >>>> >>>> >>>>On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist >>>><<mailto:kda...@lm...>kda...@lm...> wrote: >>>>Hi, >>>>I looked up an assortment of IDs in UniProt and I can confirm >>>>that it appears that the IDs are found in the ORF tag, not the >>>>OrderedLocus tag (except for the one that got captured in the export). >>>>Best, >>>>Kam >>>>At 08:09 AM 3/14/2011, you wrote: >>>>>Thanks Dondi, >>>>> >>>>>Will review this after our call today. I have been a little >>>>>worried as the DEBUG export has been going for 2.5 days with >>>>>progress at 65% and 6.5 Gb of log files so far... /yikes >>>>> >>>>>Btw I have a work lunch meeting in Beverly Hills today so will >>>>>be working from home afterwards instead of in the bio lab. >>>>> >>>>>Richard >>>>>On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio >>>>><<mailto:do...@lm...>do...@lm...> wrote: >>>>>Thanks for the updates, Rich. >>>>>I gave things a once-over and may have a lead. Here is what I found: >>>>>- First, the TallyEngine customization for P. falciparum states >>>>>the following: >>>>># Plasmodium falciparum >>>>>plasmodiumfalciparum_level_amount=2 >>>>>plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF >>>>>plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene >>>>>plasmodiumfalciparum_query_level0=select count(*) from >>>>>genenametype where type = 'ORF'; >>>>>plasmodiumfalciparum_query_level1=select count(*) from >>>>>genenametype where type = 'UniGene'; >>>>>plasmodiumfalciparum_table_name_level0=Ordered Locus >>>>>plasmodiumfalciparum_table_name_level1=UniGene >>>>>Thus, what is being counted by TallyEngine as "Ordered Locus" >>>>>are the gene names whose type is 'ORF' ("level0" properties). >>>>>- Now, this is what the P. falciparum species profile does when >>>>>harvesting IDs >>>>>(PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): >>>>> >>>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, >>>>> c.value " + >>>>> "from genenametype c inner join entrytype_genetype d " + >>>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + >>>>> "where (c.value similar to ? " + >>>>> "or c.value similar to ? " + >>>>> "or c.value similar to ?) " + >>>>> "and type <> 'ordered locus names' " + >>>>> "and type <> 'ORF' " + >>>>> "group by d.entrytype_gene_hjid, c.value"; >>>>>Note the condition on the second-to-last line --- the query >>>>>actually *omits* gene names whose type is 'ORF'! So the >>>>>question is...which is right? (I'm inclined to believe the >>>>>Tally Engine here, since, the export puts only one record in >>>>>OrderedLocusNames) >>>>>Still, comparing these two queries directly against the >>>>>PostgreSQL database would be educational, I think. Then, >>>>>knowing which criteria are correct, the appropriate action can >>>>>then be taken, I think. >>>>>Hope this helps... >>>>>John David N. Dionisio, PhD >>>>>Associate Professor, Computer Science >>>>>Loyola Marymount University >>>>>On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: >>>>> > Debug export is still going... 2.5GB of log files so far with >>>>> progress at 65%... >>>>> > >>>>> > I posted the link of the WARN log on the plasmodium page >>>>> here: >>>>> <https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >>>>> . >>>>> > Richard >>>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous >>>>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>>>> > Hi all, >>>>> > >>>>> > Have been working through several Plasmodium gdb exports in >>>>> an attempt to source why only one gene id makes it into the >>>>> Ordered Locus table. >>>>> > >>>>> > I have reviewed the logger file while set to "WARN" and >>>>> wasn't able to determine anything which would suggest an error. >>>>> I will post this log file to the wiki later today when I get home. >>>>> > >>>>> > I then upped the logger verbosity to "DEBUG" and file size to >>>>> 100MB with hopes that more detail will surface the issue, but >>>>> my export is on hour 20 and still going (although its nearly >>>>> complete). What I didn't expect was the size of the log files >>>>> and that it seems only the last 3 are kept with earlier logs >>>>> being overwritten =( I fear that the information I need it in >>>>> one of the earlier files which are now lost. >>>>> > >>>>> > Unless a better suggestion is offered I'm going to rerun an >>>>> export again with 'DEBUG" verbosity and up the file sizes to >>>>> near 1 GB each and hope that 3 GB total will be enough to hold >>>>> the complete export log. >>>>> > >>>>> > More info as it comes... >>>>> > >>>>> > Richard >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist >>>>> <<mailto:kda...@lm...>kda...@lm...> wrote: >>>>> > Hi, >>>>> > >>>>> > I've completed testing the Plasmodium gdb I exported last >>>>> November and updated the SourceForge wiki. >>>>> > >>>>> > Plasmodium has it's own task list page, which I've updated >>>>> here: >>>>> <https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List>https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List >>>>> >>>>> > >>>>> > The testing report can be found >>>>> here: >>>>> <https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115>https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 >>>>> >>>>> > >>>>> > The source files and gdb are on a new Plasmodium falciparum >>>>> page on the Fall 2010 BiolDB >>>>> wiki: >>>>> <https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >>>>> >>>>> > >>>>> > Here is the list of bugs/action items that I've listed: >>>>> > >>>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out >>>>> of 5345 repored by the TallyEngine. This also affects all other >>>>> tables related to OrderedLocusNames. >>>>> > >>>>> > 2. The GeneId table in the database has 6 fewer IDs than >>>>> reported by the TallyEngine (Mycobacterium smegmatis and >>>>> Mycobacterium tuberculosis also have mysterious GeneId issues >>>>> with the TallyEngine). >>>>> > >>>>> > 3. The count for EMBL IDs in the gdb also seems low, it's >>>>> lower than the 2009 version of the gdb. There's no way to tell >>>>> at this point whether this is due to a change in annotation by >>>>> UniProt or is a bug with GenMAPP Builder. >>>>> > >>>>> > Thanks, >>>>> > Kam >>>>> > >>>>> > >>>>> > >>>>> ------------------------------------------------------------------------------ >>>>> > What You Don't Know About Data Connectivity CAN Hurt You >>>>> > This paper provides an overview of data connectivity, details >>>>> > its effect on application quality, and explores various alternative >>>>> > solutions. >>>>> <http://p.sf.net/sfu/progress-d2d>http://p.sf.net/sfu/progress-d2d >>>>> > _______________________________________________ >>>>> > xmlpipedb-developer mailing list >>>>> > >>>>> <mailto:xml...@li...>xml...@li... >>>>> >>>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>>> > >>>>> > >>>>> > >>>>> > <ATT00001..txt><ATT00002..txt> >>>>>------------------------------------------------------------------------------ >>>>> >>>>>Colocation vs. Managed Hosting >>>>>A question and answer guide to determining the best fit >>>>>for your organization - today and in the future. >>>>><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >>>>>_______________________________________________ >>>>>xmlpipedb-developer mailing list >>>>><mailto:xml...@li...>xml...@li... >>>>> >>>>>https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>>------------------------------------------------------------------------------ >>>> >>>>Colocation vs. Managed Hosting >>>>A question and answer guide to determining the best fit >>>>for your organization - today and in the future. >>>><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >>>>_______________________________________________ >>>>xmlpipedb-developer mailing list >>>><mailto:xml...@li...>xml...@li... >>>> >>>>https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>> >>------------------------------------------------------------------------------ >>Colocation vs. Managed Hosting >>A question and answer guide to determining the best fit >>for your organization - today and in the future. >><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >>_______________________________________________ >>xmlpipedb-developer mailing list >><mailto:xml...@li...>xml...@li... >>https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > |
From: Richard B. <rbr...@gm...> - 2011-03-18 19:47:35
|
I'm down in the bio lab at the moment looking at this. I understand what needs to be done in regards to 1) keeping all ORF id's and then 2) querying the id's with underscores to then remove the underscores but maintaining the original underscore id's. I performed an export with the pattern match as-is and commented out the exclude 'ordered locus' and 'ORF'. The export completed but with only 5110 gene id's. So it seems we are missing 235 gene id's that are in the XML file as seen from the raw sql query. Based on Dr. D's analysis of a missing pattern of PFA_####[aw] I went ahead and added it into the pattern match string and called it as the others are called in the query. Once the export completes I'll confirm that in fact we have captured all the id's. Once confirmed, I will move onto the query to find id's with underscores and handle them as mentioned above. Richard On Thu, Mar 17, 2011 at 7:34 AM, Richard Brous <rbr...@gm...> wrote: > Thanks for info. I will dig into this after my 10 am exam tomorrow. > > Richard > > Sent from my iPhone > > On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote: > > Hi, > > More information on the underscore issue: > > There is an ID with the pattern > > PFA_[0-9][0-9][0-9][0-9][wc] > > that needs to have the underscore removed so that reads instead > > PFA[0-9][0-9][0-9][0-9][wc] > > I don't know why these IDs exist in UniProt, but in PlasmoDB, they are > there without the underscore and won't be recognized with it. I think we > should leave the underscore ones there, but also have a set without the > underscore. There are 134 records that have this issue. > > If Rich can make these two fixes (capturing the ORFs and dealing with the > underscore), then I think we will be good to go with Plasmodium. There may > be code in the Vibrio or Helicobacter profiles to help with the underscores, > but I'm not sure. > > Best, > Kam > > At 02:59 PM 3/16/2011, Kam Dahlquist wrote: > > Hi, > > I've taken a look at the list of IDs and did a quick comparison with both > the older released gdb and also a list I downloaded from the Broad Institute > Plasmodium database. I think we can safely go with the query on the ORF tag > for our export--all of those different ID forms are valid. There are about > 400 IDs that are different in the older released gdb than in the new query; > I'm going to further investigate those. I suspect that the difference is > mainly due to a +/- underscore issue that we might need to solve. However, > we should go forward with capturing all the IDs from the ORF tag, I don't > see a need to restrict to a particular pattern there. > > Best, > Kam > > At 09:48 PM 3/14/2011, you wrote: > > Hi all, > > So I went ahead and did raw sql queries of the Postgres data and turned up > the following: > > select * from genenametype where type = 'ordered locus' > Returned zero gene ids > > select * from genenametype where type = 'ORF' > Returned 5345 gene ids > The type = 'ORF' query was exported into excel and posted to the biodb wiki > on the Spring 2011 Plasmodium page. > > There are many many patterns in regards to gene ids, here the the prefixes > from my cursory look: > MAL > PF##_ > PFA > PFB > PFC > PFD > PFE > PFF > PFI > PFL > > Richard > > > On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> > wrote: Hi, > I looked up an assortment of IDs in UniProt and I can confirm that it > appears that the IDs are found in the ORF tag, not the OrderedLocus tag > (except for the one that got captured in the export). > Best, Kam > > At 08:09 AM 3/14/2011, you wrote: > > Thanks Dondi, Will review this after our call today. I have been a > little worried as the DEBUG export has been going for 2.5 days with progress > at 65% and 6.5 Gb of log files so far... /yikes Btw I have a work lunch > meeting in Beverly Hills today so will be working from home afterwards > instead of in the bio lab. Richard > On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...> > wrote: Thanks for the updates, Rich. I gave things a once-over and may > have a lead. Here is what I found: - First, the TallyEngine customization > for P. falciparum states the following: > # Plasmodium falciparum plasmodiumfalciparum_level_amount=2 plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF > plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene plasmodiumfalciparum_query_level0=select > count(*) from genenametype where type = 'ORF'; plasmodiumfalciparum_query_level1=select > count(*) from genenametype where type = 'UniGene'; plasmodiumfalciparum_table_name_level0=Ordered > Locus plasmodiumfalciparum_table_name_level1=UniGene > Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene > names whose type is 'ORF' ("level0" properties). - Now, this is what the > P. falciparum species profile does when harvesting IDs > (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + > "from genenametype c inner join entrytype_genetype d " + > "on (c.entrytype_genetype_name_hjid = d.hjid) " + "where > (c.value similar to ? " + "or c.value similar to ? " + > "or c.value similar to ?) " + "and type <> 'ordered locus > names' " + "and type <> 'ORF' " + "group by > d.entrytype_gene_hjid, c.value"; Note the condition on the second-to-last > line --- the query actually *omits* gene names whose type is 'ORF'! So the > question is...which is right? (I'm inclined to believe the Tally Engine > here, since, the export puts only one record in OrderedLocusNames) Still, > comparing these two queries directly against the PostgreSQL database would > be educational, I think. Then, knowing which criteria are correct, the > appropriate action can then be taken, I think. > Hope this helps... John David N. Dionisio, PhD Associate Professor, > Computer Science Loyola Marymount University > > On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > Debug export is still > going... 2.5GB of log files so far with progress at 65%... > > I posted > the link of the WARN log on the plasmodium page here: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . > > Richard > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous < > rbr...@gm...> wrote: > Hi all, > > Have been working through several > Plasmodium gdb exports in an attempt to source why only one gene id makes it > into the Ordered Locus table. > > I have reviewed the logger file while > set to "WARN" and wasn't able to determine anything which would suggest an > error. I will post this log file to the wiki later today when I get home. > > > I then upped the logger verbosity to "DEBUG" and file size to 100MB with > hopes that more detail will surface the issue, but my export is on hour 20 > and still going (although its nearly complete). What I didn't expect was the > size of the log files and that it seems only the last 3 are kept with > earlier logs being overwritten =( I fear that the information I need it in > one of the earlier files which are now lost. > > Unless a better > suggestion is offered I'm going to rerun an export again with 'DEBUG" > verbosity and up the file sizes to near 1 GB each and hope that 3 GB total > will be enough to hold the complete export log. > > More info as it > comes... > > Richard > > > > > > On Fri, Mar 4, 2011 at 3:17 PM, Kam > Dahlquist <kda...@lm...> wrote: > Hi, > > I've completed testing the > Plasmodium gdb I exported last November and updated the SourceForge wiki. > > > Plasmodium has it's own task list page, which I've updated here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > > > The testing report can be found here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > > > The source files and gdb are on a new Plasmodium falciparum page on the > Fall 2010 BiolDB wiki: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > > > Here is the list of bugs/action items that I've listed: > > 1. The > OrderedLocusNames table in the gdb only has 1 ID out of 5345 repored by the > TallyEngine. This also affects all other tables related to > OrderedLocusNames. > > 2. The GeneId table in the database has 6 fewer > IDs than reported by the TallyEngine (Mycobacterium smegmatis and > Mycobacterium tuberculosis also have mysterious GeneId issues with the > TallyEngine). > > 3. The count for EMBL IDs in the gdb also seems low, > it's lower than the 2009 version of the gdb. There's no way to tell at this > point whether this is due to a change in annotation by UniProt or is a bug > with GenMAPP Builder. > > Thanks, > Kam > > > > ------------------------------------------------------------------------------ > > What You Don't Know About Data Connectivity CAN Hurt You > This paper > provides an overview of data connectivity, details > its effect on > application quality, and explores various alternative > solutions. > http://p.sf.net/sfu/progress-d2d > > _______________________________________________ > xmlpipedb-developer > mailing list > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > <ATT00001..txt><ATT00002..txt> ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting A question and answer guide to determining > the best fit for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d _______________________________________________ > xmlpipedb-developer mailing list xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting A question and answer guide to determining > the best fit for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d _______________________________________________ > xmlpipedb-developer mailing list xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: Richard B. <rbr...@gm...> - 2011-03-17 15:35:00
|
Thanks for info. I will dig into this after my 10 am exam tomorrow. Richard Sent from my iPhone On Mar 16, 2011, at 4:18 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > More information on the underscore issue: > > There is an ID with the pattern > > PFA_[0-9][0-9][0-9][0-9][wc] > > that needs to have the underscore removed so that reads instead > > PFA[0-9][0-9][0-9][0-9][wc] > > I don't know why these IDs exist in UniProt, but in PlasmoDB, they are there without the underscore and won't be recognized with it. I think we should leave the underscore ones there, but also have a set without the underscore. There are 134 records that have this issue. > > If Rich can make these two fixes (capturing the ORFs and dealing with the underscore), then I think we will be good to go with Plasmodium. There may be code in the Vibrio or Helicobacter profiles to help with the underscores, but I'm not sure. > > Best, > Kam > > At 02:59 PM 3/16/2011, Kam Dahlquist wrote: >> Hi, >> >> I've taken a look at the list of IDs and did a quick comparison with both the older released gdb and also a list I downloaded from the Broad Institute Plasmodium database. I think we can safely go with the query on the ORF tag for our export--all of those different ID forms are valid. There are about 400 IDs that are different in the older released gdb than in the new query; I'm going to further investigate those. I suspect that the difference is mainly due to a +/- underscore issue that we might need to solve. However, we should go forward with capturing all the IDs from the ORF tag, I don't see a need to restrict to a particular pattern there. >> >> Best, >> Kam >> >> At 09:48 PM 3/14/2011, you wrote: >>> Hi all, >>> >>> So I went ahead and did raw sql queries of the Postgres data and turned up the following: >>> >>> select * from genenametype where type = 'ordered locus' >>> Returned zero gene ids >>> >>> select * from genenametype where type = 'ORF' >>> Returned 5345 gene ids >>> The type = 'ORF' query was exported into excel and posted to the biodb wiki on the Spring 2011 Plasmodium page. >>> >>> There are many many patterns in regards to gene ids, here the the prefixes from my cursory look: >>> MAL >>> PF##_ >>> PFA >>> PFB >>> PFC >>> PFD >>> PFE >>> PFF >>> PFI >>> PFL >>> >>> Richard >>> >>> >>> On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> wrote: >>> Hi, >>> I looked up an assortment of IDs in UniProt and I can confirm that it appears that the IDs are found in the ORF tag, not the OrderedLocus tag (except for the one that got captured in the export). >>> Best, >>> Kam >>> >>> At 08:09 AM 3/14/2011, you wrote: >>>> >>>> Thanks Dondi, >>>> >>>> Will review this after our call today. I have been a little worried as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5 Gb of log files so far... /yikes >>>> >>>> Btw I have a work lunch meeting in Beverly Hills today so will be working from home afterwards instead of in the bio lab. >>>> >>>> Richard >>>> On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...> wrote: >>>> Thanks for the updates, Rich. >>>> I gave things a once-over and may have a lead. Here is what I found: >>>> - First, the TallyEngine customization for P. falciparum states the following: >>>> # Plasmodium falciparum >>>> plasmodiumfalciparum_level_amount=2 >>>> plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF >>>> plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene >>>> plasmodiumfalciparum_query_level0=select count(*) from genenametype where type = 'ORF'; >>>> plasmodiumfalciparum_query_level1=select count(*) from genenametype where type = 'UniGene'; >>>> plasmodiumfalciparum_table_name_level0=Ordered Locus >>>> plasmodiumfalciparum_table_name_level1=UniGene >>>> Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene names whose type is 'ORF' ("level0" properties). >>>> - Now, this is what the P. falciparum species profile does when harvesting IDs (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): >>>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + >>>> "from genenametype c inner join entrytype_genetype d " + >>>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + >>>> "where (c.value similar to ? " + >>>> "or c.value similar to ? " + >>>> "or c.value similar to ?) " + >>>> "and type <> 'ordered locus names' " + >>>> "and type <> 'ORF' " + >>>> "group by d.entrytype_gene_hjid, c.value"; >>>> Note the condition on the second-to-last line --- the query actually *omits* gene names whose type is 'ORF'! So the question is...which is right? (I'm inclined to believe the Tally Engine here, since, the export puts only one record in OrderedLocusNames) >>>> Still, comparing these two queries directly against the PostgreSQL database would be educational, I think. Then, knowing which criteria are correct, the appropriate action can then be taken, I think. >>>> Hope this helps... >>>> John David N. Dionisio, PhD >>>> Associate Professor, Computer Science >>>> Loyola Marymount University >>>> >>>> On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: >>>> > Debug export is still going... 2.5GB of log files so far with progress at 65%... >>>> > >>>> > I posted the link of the WARN log on the plasmodium page here: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . >>>> > Richard >>>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> wrote: >>>> > Hi all, >>>> > >>>> > Have been working through several Plasmodium gdb exports in an attempt to source why only one gene id makes it into the Ordered Locus table. >>>> > >>>> > I have reviewed the logger file while set to "WARN" and wasn't able to determine anything which would suggest an error. I will post this log file to the wiki later today when I get home. >>>> > >>>> > I then upped the logger verbosity to "DEBUG" and file size to 100MB with hopes that more detail will surface the issue, but my export is on hour 20 and still going (although its nearly complete). What I didn't expect was the size of the log files and that it seems only the last 3 are kept with earlier logs being overwritten =( I fear that the information I need it in one of the earlier files which are now lost. >>>> > >>>> > Unless a better suggestion is offered I'm going to rerun an export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each and hope that 3 GB total will be enough to hold the complete export log. >>>> > >>>> > More info as it comes... >>>> > >>>> > Richard >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> wrote: >>>> > Hi, >>>> > >>>> > I've completed testing the Plasmodium gdb I exported last November and updated the SourceForge wiki. >>>> > >>>> > Plasmodium has it's own task list page, which I've updated here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List >>>> > >>>> > The testing report can be found here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 >>>> > >>>> > The source files and gdb are on a new Plasmodium falciparum page on the Fall 2010 BiolDB wiki: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >>>> > >>>> > Here is the list of bugs/action items that I've listed: >>>> > >>>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of 5345 repored by the TallyEngine. This also affects all other tables related to OrderedLocusNames. >>>> > >>>> > 2. The GeneId table in the database has 6 fewer IDs than reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium tuberculosis also have mysterious GeneId issues with the TallyEngine). >>>> > >>>> > 3. The count for EMBL IDs in the gdb also seems low, it's lower than the 2009 version of the gdb. There's no way to tell at this point whether this is due to a change in annotation by UniProt or is a bug with GenMAPP Builder. >>>> > >>>> > Thanks, >>>> > Kam >>>> > >>>> > >>>> > ------------------------------------------------------------------------------ >>>> > What You Don't Know About Data Connectivity CAN Hurt You >>>> > This paper provides an overview of data connectivity, details >>>> > its effect on application quality, and explores various alternative >>>> > solutions. http://p.sf.net/sfu/progress-d2d >>>> > _______________________________________________ >>>> > xmlpipedb-developer mailing list >>>> > xml...@li... >>>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>> > >>>> > >>>> > >>>> > <ATT00001..txt><ATT00002..txt> >>>> ------------------------------------------------------------------------------ >>>> Colocation vs. Managed Hosting >>>> A question and answer guide to determining the best fit >>>> for your organization - today and in the future. >>>> http://p.sf.net/sfu/internap-sfd2d >>>> _______________________________________________ >>>> xmlpipedb-developer mailing list >>>> xml...@li... >>>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>>> >>> >>> ------------------------------------------------------------------------------ >>> Colocation vs. Managed Hosting >>> A question and answer guide to determining the best fit >>> for your organization - today and in the future. >>> http://p.sf.net/sfu/internap-sfd2d >>> _______________________________________________ >>> xmlpipedb-developer mailing list >>> xml...@li... >>> https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> >>> >> > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer |
From: Kam D. <kda...@lm...> - 2011-03-16 23:18:55
|
Hi, More information on the underscore issue: There is an ID with the pattern PFA_[0-9][0-9][0-9][0-9][wc] that needs to have the underscore removed so that reads instead PFA[0-9][0-9][0-9][0-9][wc] I don't know why these IDs exist in UniProt, but in PlasmoDB, they are there without the underscore and won't be recognized with it. I think we should leave the underscore ones there, but also have a set without the underscore. There are 134 records that have this issue. If Rich can make these two fixes (capturing the ORFs and dealing with the underscore), then I think we will be good to go with Plasmodium. There may be code in the Vibrio or Helicobacter profiles to help with the underscores, but I'm not sure. Best, Kam At 02:59 PM 3/16/2011, Kam Dahlquist wrote: >Hi, > >I've taken a look at the list of IDs and did a quick comparison with >both the older released gdb and also a list I downloaded from the >Broad Institute Plasmodium database. I think we can safely go with >the query on the ORF tag for our export--all of those different ID >forms are valid. There are about 400 IDs that are different in the >older released gdb than in the new query; I'm going to further >investigate those. I suspect that the difference is mainly due to a >+/- underscore issue that we might need to solve. However, we >should go forward with capturing all the IDs from the ORF tag, I >don't see a need to restrict to a particular pattern there. > >Best, >Kam > >At 09:48 PM 3/14/2011, you wrote: >>Hi all, >> >>So I went ahead and did raw sql queries of the Postgres data and >>turned up the following: >> >>select * from genenametype where type = 'ordered locus' >>Returned zero gene ids >> >>select * from genenametype where type = 'ORF' >>Returned 5345 gene ids >>The type = 'ORF' query was exported into excel and posted to the >>biodb wiki on the Spring 2011 Plasmodium page. >> >>There are many many patterns in regards to gene ids, here the the >>prefixes from my cursory look: >>MAL >>PF##_ >>PFA >>PFB >>PFC >>PFD >>PFE >>PFF >>PFI >>PFL >> >>Richard >> >> >>On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist >><<mailto:kda...@lm...>kda...@lm...> wrote: >>Hi, >>I looked up an assortment of IDs in UniProt and I can confirm that >>it appears that the IDs are found in the ORF tag, not the >>OrderedLocus tag (except for the one that got captured in the export). >>Best, >>Kam >> >>At 08:09 AM 3/14/2011, you wrote: >>>Thanks Dondi, >>> >>>Will review this after our call today. I have been a little >>>worried as the DEBUG export has been going for 2.5 days with >>>progress at 65% and 6.5 Gb of log files so far... /yikes >>> >>>Btw I have a work lunch meeting in Beverly Hills today so will be >>>working from home afterwards instead of in the bio lab. >>> >>>Richard >>>On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio >>><<mailto:do...@lm...>do...@lm...> wrote: >>>Thanks for the updates, Rich. >>>I gave things a once-over and may have a lead. Here is what I found: >>>- First, the TallyEngine customization for P. falciparum states >>>the following: >>># Plasmodium falciparum >>>plasmodiumfalciparum_level_amount=2 >>>plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF >>>plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene >>>plasmodiumfalciparum_query_level0=select count(*) from >>>genenametype where type = 'ORF'; >>>plasmodiumfalciparum_query_level1=select count(*) from >>>genenametype where type = 'UniGene'; >>>plasmodiumfalciparum_table_name_level0=Ordered Locus >>>plasmodiumfalciparum_table_name_level1=UniGene >>>Thus, what is being counted by TallyEngine as "Ordered Locus" are >>>the gene names whose type is 'ORF' ("level0" properties). >>>- Now, this is what the P. falciparum species profile does when >>>harvesting IDs >>>(PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): >>> String sqlQuery = "select d.entrytype_gene_hjid as hjid, >>> c.value " + >>> "from genenametype c inner join entrytype_genetype d " + >>> "on (c.entrytype_genetype_name_hjid = d.hjid) " + >>> "where (c.value similar to ? " + >>> "or c.value similar to ? " + >>> "or c.value similar to ?) " + >>> "and type <> 'ordered locus names' " + >>> "and type <> 'ORF' " + >>> "group by d.entrytype_gene_hjid, c.value"; >>>Note the condition on the second-to-last line --- the query >>>actually *omits* gene names whose type is 'ORF'! So the question >>>is...which is right? (I'm inclined to believe the Tally Engine >>>here, since, the export puts only one record in OrderedLocusNames) >>>Still, comparing these two queries directly against the PostgreSQL >>>database would be educational, I think. Then, knowing which >>>criteria are correct, the appropriate action can then be taken, I think. >>>Hope this helps... >>>John David N. Dionisio, PhD >>>Associate Professor, Computer Science >>>Loyola Marymount University >>> >>>On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: >>> > Debug export is still going... 2.5GB of log files so far with >>> progress at 65%... >>> > >>> > I posted the link of the WARN log on the plasmodium page here: >>> <https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >>> . >>> > Richard >>> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous >>> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >>> > Hi all, >>> > >>> > Have been working through several Plasmodium gdb exports in an >>> attempt to source why only one gene id makes it into the Ordered Locus table. >>> > >>> > I have reviewed the logger file while set to "WARN" and wasn't >>> able to determine anything which would suggest an error. I will >>> post this log file to the wiki later today when I get home. >>> > >>> > I then upped the logger verbosity to "DEBUG" and file size to >>> 100MB with hopes that more detail will surface the issue, but my >>> export is on hour 20 and still going (although its nearly >>> complete). What I didn't expect was the size of the log files and >>> that it seems only the last 3 are kept with earlier logs being >>> overwritten =( I fear that the information I need it in one of >>> the earlier files which are now lost. >>> > >>> > Unless a better suggestion is offered I'm going to rerun an >>> export again with 'DEBUG" verbosity and up the file sizes to near >>> 1 GB each and hope that 3 GB total will be enough to hold the >>> complete export log. >>> > >>> > More info as it comes... >>> > >>> > Richard >>> > >>> > >>> > >>> > >>> > >>> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist >>> <<mailto:kda...@lm...>kda...@lm...> wrote: >>> > Hi, >>> > >>> > I've completed testing the Plasmodium gdb I exported last >>> November and updated the SourceForge wiki. >>> > >>> > Plasmodium has it's own task list page, which I've updated >>> here: >>> <https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List>https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List >>> >>> > >>> > The testing report can be found >>> here: >>> <https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115>https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 >>> >>> > >>> > The source files and gdb are on a new Plasmodium falciparum >>> page on the Fall 2010 BiolDB >>> wiki: >>> <https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >>> >>> > >>> > Here is the list of bugs/action items that I've listed: >>> > >>> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of >>> 5345 repored by the TallyEngine. This also affects all other >>> tables related to OrderedLocusNames. >>> > >>> > 2. The GeneId table in the database has 6 fewer IDs than >>> reported by the TallyEngine (Mycobacterium smegmatis and >>> Mycobacterium tuberculosis also have mysterious GeneId issues >>> with the TallyEngine). >>> > >>> > 3. The count for EMBL IDs in the gdb also seems low, it's >>> lower than the 2009 version of the gdb. There's no way to tell at >>> this point whether this is due to a change in annotation by >>> UniProt or is a bug with GenMAPP Builder. >>> > >>> > Thanks, >>> > Kam >>> > >>> > >>> > >>> ------------------------------------------------------------------------------ >>> > What You Don't Know About Data Connectivity CAN Hurt You >>> > This paper provides an overview of data connectivity, details >>> > its effect on application quality, and explores various alternative >>> > solutions. >>> <http://p.sf.net/sfu/progress-d2d>http://p.sf.net/sfu/progress-d2d >>> > _______________________________________________ >>> > xmlpipedb-developer mailing list >>> > >>> <mailto:xml...@li...>xml...@li... >>> >>> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> > >>> > >>> > >>> > <ATT00001..txt><ATT00002..txt> >>>------------------------------------------------------------------------------ >>> >>>Colocation vs. Managed Hosting >>>A question and answer guide to determining the best fit >>>for your organization - today and in the future. >>><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >>>_______________________________________________ >>>xmlpipedb-developer mailing list >>><mailto:xml...@li...>xml...@li... >>> >>>https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >>> >>------------------------------------------------------------------------------ >> >>Colocation vs. Managed Hosting >>A question and answer guide to determining the best fit >>for your organization - today and in the future. >><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >>_______________________________________________ >>xmlpipedb-developer mailing list >><mailto:xml...@li...>xml...@li... >> >>https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> > |
From: Kam D. <kda...@lm...> - 2011-03-16 21:59:58
|
Hi, I've taken a look at the list of IDs and did a quick comparison with both the older released gdb and also a list I downloaded from the Broad Institute Plasmodium database. I think we can safely go with the query on the ORF tag for our export--all of those different ID forms are valid. There are about 400 IDs that are different in the older released gdb than in the new query; I'm going to further investigate those. I suspect that the difference is mainly due to a +/- underscore issue that we might need to solve. However, we should go forward with capturing all the IDs from the ORF tag, I don't see a need to restrict to a particular pattern there. Best, Kam At 09:48 PM 3/14/2011, you wrote: >Hi all, > >So I went ahead and did raw sql queries of the Postgres data and >turned up the following: > >select * from genenametype where type = 'ordered locus' >Returned zero gene ids > >select * from genenametype where type = 'ORF' >Returned 5345 gene ids >The type = 'ORF' query was exported into excel and posted to the >biodb wiki on the Spring 2011 Plasmodium page. > >There are many many patterns in regards to gene ids, here the the >prefixes from my cursory look: >MAL >PF##_ >PFA >PFB >PFC >PFD >PFE >PFF >PFI >PFL > >Richard > > >On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist ><<mailto:kda...@lm...>kda...@lm...> wrote: >Hi, > >I looked up an assortment of IDs in UniProt and I can confirm that >it appears that the IDs are found in the ORF tag, not the >OrderedLocus tag (except for the one that got captured in the export). > >Best, >Kam > > >At 08:09 AM 3/14/2011, you wrote: >>Thanks Dondi, >> >>Will review this after our call today. I have been a little worried >>as the DEBUG export has been going for 2.5 days with progress at >>65% and 6.5 Gb of log files so far... /yikes >> >>Btw I have a work lunch meeting in Beverly Hills today so will be >>working from home afterwards instead of in the bio lab. >> >>Richard >> >>On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio >><<mailto:do...@lm...>do...@lm...> wrote: >>Thanks for the updates, Rich. >>I gave things a once-over and may have a lead. Here is what I found: >>- First, the TallyEngine customization for P. falciparum states the >>following: >> >># Plasmodium falciparum >>plasmodiumfalciparum_level_amount=2 >>plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF >>plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene >>plasmodiumfalciparum_query_level0=select count(*) from genenametype >>where type = 'ORF'; >>plasmodiumfalciparum_query_level1=select count(*) from genenametype >>where type = 'UniGene'; >>plasmodiumfalciparum_table_name_level0=Ordered Locus >>plasmodiumfalciparum_table_name_level1=UniGene >> >>Thus, what is being counted by TallyEngine as "Ordered Locus" are >>the gene names whose type is 'ORF' ("level0" properties). >>- Now, this is what the P. falciparum species profile does when >>harvesting IDs >>(PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): >> >> String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + >> "from genenametype c inner join entrytype_genetype d " + >> "on (c.entrytype_genetype_name_hjid = d.hjid) " + >> "where (c.value similar to ? " + >> "or c.value similar to ? " + >> "or c.value similar to ?) " + >> "and type <> 'ordered locus names' " + >> "and type <> 'ORF' " + >> "group by d.entrytype_gene_hjid, c.value"; >> >>Note the condition on the second-to-last line --- the query >>actually *omits* gene names whose type is 'ORF'! So the question >>is...which is right? (I'm inclined to believe the Tally Engine >>here, since, the export puts only one record in OrderedLocusNames) >>Still, comparing these two queries directly against the PostgreSQL >>database would be educational, I think. Then, knowing which >>criteria are correct, the appropriate action can then be taken, I think. >> >>Hope this helps... >>John David N. Dionisio, PhD >>Associate Professor, Computer Science >>Loyola Marymount University >> >> >>On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: >> > Debug export is still going... 2.5GB of log files so far with >> progress at 65%... >> > >> > I posted the link of the WARN log on the plasmodium page here: >> <https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >> . >> > Richard >> > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous >> <<mailto:rbr...@gm...>rbr...@gm...> wrote: >> > Hi all, >> > >> > Have been working through several Plasmodium gdb exports in an >> attempt to source why only one gene id makes it into the Ordered Locus table. >> > >> > I have reviewed the logger file while set to "WARN" and wasn't >> able to determine anything which would suggest an error. I will >> post this log file to the wiki later today when I get home. >> > >> > I then upped the logger verbosity to "DEBUG" and file size to >> 100MB with hopes that more detail will surface the issue, but my >> export is on hour 20 and still going (although its nearly >> complete). What I didn't expect was the size of the log files and >> that it seems only the last 3 are kept with earlier logs being >> overwritten =( I fear that the information I need it in one of the >> earlier files which are now lost. >> > >> > Unless a better suggestion is offered I'm going to rerun an >> export again with 'DEBUG" verbosity and up the file sizes to near >> 1 GB each and hope that 3 GB total will be enough to hold the >> complete export log. >> > >> > More info as it comes... >> > >> > Richard >> > >> > >> > >> > >> > >> > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist >> <<mailto:kda...@lm...>kda...@lm...> wrote: >> > Hi, >> > >> > I've completed testing the Plasmodium gdb I exported last >> November and updated the SourceForge wiki. >> > >> > Plasmodium has it's own task list page, which I've updated >> here: >> <https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List>https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List >> >> > >> > The testing report can be found >> here: >> <https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115>https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 >> >> > >> > The source files and gdb are on a new Plasmodium falciparum page >> on the Fall 2010 BiolDB >> wiki: >> <https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum >> >> > >> > Here is the list of bugs/action items that I've listed: >> > >> > 1. The OrderedLocusNames table in the gdb only has 1 ID out of >> 5345 repored by the TallyEngine. This also affects all other >> tables related to OrderedLocusNames. >> > >> > 2. The GeneId table in the database has 6 fewer IDs than >> reported by the TallyEngine (Mycobacterium smegmatis and >> Mycobacterium tuberculosis also have mysterious GeneId issues with >> the TallyEngine). >> > >> > 3. The count for EMBL IDs in the gdb also seems low, it's lower >> than the 2009 version of the gdb. There's no way to tell at this >> point whether this is due to a change in annotation by UniProt or >> is a bug with GenMAPP Builder. >> > >> > Thanks, >> > Kam >> > >> > >> > >> ------------------------------------------------------------------------------ >> > What You Don't Know About Data Connectivity CAN Hurt You >> > This paper provides an overview of data connectivity, details >> > its effect on application quality, and explores various alternative >> > solutions. >> <http://p.sf.net/sfu/progress-d2d>http://p.sf.net/sfu/progress-d2d >> > _______________________________________________ >> > xmlpipedb-developer mailing list >> > >> <mailto:xml...@li...>xml...@li... >> >> > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> > >> > >> > >> > <ATT00001..txt><ATT00002..txt> >> >>------------------------------------------------------------------------------ >> >>Colocation vs. Managed Hosting >>A question and answer guide to determining the best fit >>for your organization - today and in the future. >><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >>_______________________________________________ >>xmlpipedb-developer mailing list >><mailto:xml...@li...>xml...@li... >> >>https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer >> >> > >------------------------------------------------------------------------------ >Colocation vs. Managed Hosting >A question and answer guide to determining the best fit >for your organization - today and in the future. ><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >_______________________________________________ >xmlpipedb-developer mailing list ><mailto:xml...@li...>xml...@li... >https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > |
From: Richard B. <rbr...@gm...> - 2011-03-15 04:48:57
|
Hi all, So I went ahead and did raw sql queries of the Postgres data and turned up the following: select * from genenametype where type = 'ordered locus' Returned zero gene ids select * from genenametype where type = 'ORF' Returned 5345 gene ids The type = 'ORF' query was exported into excel and posted to the biodb wiki on the Spring 2011 Plasmodium page. There are many many patterns in regards to gene ids, here the the prefixes from my cursory look: MAL PF##_ PFA PFB PFC PFD PFE PFF PFI PFL Richard On Mon, Mar 14, 2011 at 10:32 AM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I looked up an assortment of IDs in UniProt and I can confirm that it > appears that the IDs are found in the ORF tag, not the OrderedLocus tag > (except for the one that got captured in the export). > > Best, > Kam > > > At 08:09 AM 3/14/2011, you wrote: > > Thanks Dondi, > > Will review this after our call today. I have been a little worried as the > DEBUG export has been going for 2.5 days with progress at 65% and 6.5 Gb of > log files so far... /yikes > > Btw I have a work lunch meeting in Beverly Hills today so will be working > from home afterwards instead of in the bio lab. > > Richard > > On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...> > wrote: > Thanks for the updates, Rich. > > I gave things a once-over and may have a lead. Here is what I found: > > - First, the TallyEngine customization for P. falciparum states the > following: > > > # Plasmodium falciparum > plasmodiumfalciparum_level_amount=2 > > plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF > plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene > > plasmodiumfalciparum_query_level0=select count(*) from genenametype where > type = 'ORF'; > plasmodiumfalciparum_query_level1=select count(*) from genenametype where > type = 'UniGene'; > > plasmodiumfalciparum_table_name_level0=Ordered Locus > plasmodiumfalciparum_table_name_level1=UniGene > > > Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene > names whose type is 'ORF' ("level0" properties). > > - Now, this is what the P. falciparum species profile does when harvesting > IDs > (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > > > String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + > "from genenametype c inner join entrytype_genetype d " + > "on (c.entrytype_genetype_name_hjid = d.hjid) " + > "where (c.value similar to ? " + > "or c.value similar to ? " + > "or c.value similar to ?) " + > "and type <> 'ordered locus names' " + > "and type <> 'ORF' " + > "group by d.entrytype_gene_hjid, c.value"; > > > Note the condition on the second-to-last line --- the query actually > *omits* gene names whose type is 'ORF'! So the question is...which is > right? (I'm inclined to believe the Tally Engine here, since, the export > puts only one record in OrderedLocusNames) > > Still, comparing these two queries directly against the PostgreSQL database > would be educational, I think. Then, knowing which criteria are correct, > the appropriate action can then be taken, I think. > > Hope this helps... > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > > On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > > > Debug export is still going... 2.5GB of log files so far with progress at > 65%... > > > > I posted the link of the WARN log on the plasmodium page here: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum . > > Richard > > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> > wrote: > > Hi all, > > > > Have been working through several Plasmodium gdb exports in an attempt to > source why only one gene id makes it into the Ordered Locus table. > > > > I have reviewed the logger file while set to "WARN" and wasn't able to > determine anything which would suggest an error. I will post this log file > to the wiki later today when I get home. > > > > I then upped the logger verbosity to "DEBUG" and file size to 100MB with > hopes that more detail will surface the issue, but my export is on hour 20 > and still going (although its nearly complete). What I didn't expect was the > size of the log files and that it seems only the last 3 are kept with > earlier logs being overwritten =( I fear that the information I need it in > one of the earlier files which are now lost. > > > > Unless a better suggestion is offered I'm going to rerun an export again > with 'DEBUG" verbosity and up the file sizes to near 1 GB each and hope that > 3 GB total will be enough to hold the complete export log. > > > > More info as it comes... > > > > Richard > > > > > > > > > > > > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> > wrote: > > Hi, > > > > I've completed testing the Plasmodium gdb I exported last November and > updated the SourceForge wiki. > > > > Plasmodium has it's own task list page, which I've updated here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > > > > The testing report can be found here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > > > > The source files and gdb are on a new Plasmodium falciparum page on the > Fall 2010 BiolDB wiki: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > > > > Here is the list of bugs/action items that I've listed: > > > > 1. The OrderedLocusNames table in the gdb only has 1 ID out of 5345 > repored by the TallyEngine. This also affects all other tables related to > OrderedLocusNames. > > > > 2. The GeneId table in the database has 6 fewer IDs than reported by the > TallyEngine (Mycobacterium smegmatis and Mycobacterium tuberculosis also > have mysterious GeneId issues with the TallyEngine). > > > > 3. The count for EMBL IDs in the gdb also seems low, it's lower than the > 2009 version of the gdb. There's no way to tell at this point whether this > is due to a change in annotation by UniProt or is a bug with GenMAPP > Builder. > > > > Thanks, > > Kam > > > > > > > ------------------------------------------------------------------------------ > > What You Don't Know About Data Connectivity CAN Hurt You > > This paper provides an overview of data connectivity, details > > its effect on application quality, and explores various alternative > > solutions. http://p.sf.net/sfu/progress-d2d > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > |
From: Kam D. <kda...@lm...> - 2011-03-14 17:33:06
|
Hi, I looked up an assortment of IDs in UniProt and I can confirm that it appears that the IDs are found in the ORF tag, not the OrderedLocus tag (except for the one that got captured in the export). Best, Kam At 08:09 AM 3/14/2011, you wrote: >Thanks Dondi, > >Will review this after our call today. I have been a little worried >as the DEBUG export has been going for 2.5 days with progress at 65% >and 6.5 Gb of log files so far... /yikes > >Btw I have a work lunch meeting in Beverly Hills today so will be >working from home afterwards instead of in the bio lab. > >Richard > >On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio ><<mailto:do...@lm...>do...@lm...> wrote: >Thanks for the updates, Rich. > >I gave things a once-over and may have a lead. Here is what I found: > >- First, the TallyEngine customization for P. falciparum states the following: > > ># Plasmodium falciparum >plasmodiumfalciparum_level_amount=2 > >plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF >plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene > >plasmodiumfalciparum_query_level0=select count(*) from genenametype >where type = 'ORF'; >plasmodiumfalciparum_query_level1=select count(*) from genenametype >where type = 'UniGene'; > >plasmodiumfalciparum_table_name_level0=Ordered Locus >plasmodiumfalciparum_table_name_level1=UniGene > > >Thus, what is being counted by TallyEngine as "Ordered Locus" are >the gene names whose type is 'ORF' ("level0" properties). > >- Now, this is what the P. falciparum species profile does when >harvesting IDs >(PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > > > String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + > "from genenametype c inner join entrytype_genetype d " + > "on (c.entrytype_genetype_name_hjid = d.hjid) " + > "where (c.value similar to ? " + > "or c.value similar to ? " + > "or c.value similar to ?) " + > "and type <> 'ordered locus names' " + > "and type <> 'ORF' " + > "group by d.entrytype_gene_hjid, c.value"; > > >Note the condition on the second-to-last line --- the query actually >*omits* gene names whose type is 'ORF'! So the question is...which >is right? (I'm inclined to believe the Tally Engine here, since, >the export puts only one record in OrderedLocusNames) > >Still, comparing these two queries directly against the PostgreSQL >database would be educational, I think. Then, knowing which >criteria are correct, the appropriate action can then be taken, I think. > >Hope this helps... > >John David N. Dionisio, PhD >Associate Professor, Computer Science >Loyola Marymount University > > > >On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > > > Debug export is still going... 2.5GB of log files so far with > progress at 65%... > > > > I posted the link of the WARN log on the plasmodium page here: > <https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum. > > Richard > > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous > <<mailto:rbr...@gm...>rbr...@gm...> wrote: > > Hi all, > > > > Have been working through several Plasmodium gdb exports in an > attempt to source why only one gene id makes it into the Ordered Locus table. > > > > I have reviewed the logger file while set to "WARN" and wasn't > able to determine anything which would suggest an error. I will > post this log file to the wiki later today when I get home. > > > > I then upped the logger verbosity to "DEBUG" and file size to > 100MB with hopes that more detail will surface the issue, but my > export is on hour 20 and still going (although its nearly > complete). What I didn't expect was the size of the log files and > that it seems only the last 3 are kept with earlier logs being > overwritten =( I fear that the information I need it in one of the > earlier files which are now lost. > > > > Unless a better suggestion is offered I'm going to rerun an > export again with 'DEBUG" verbosity and up the file sizes to near 1 > GB each and hope that 3 GB total will be enough to hold the > complete export log. > > > > More info as it comes... > > > > Richard > > > > > > > > > > > > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist > <<mailto:kda...@lm...>kda...@lm...> wrote: > > Hi, > > > > I've completed testing the Plasmodium gdb I exported last > November and updated the SourceForge wiki. > > > > Plasmodium has it's own task list page, which I've updated > here: > <https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List>https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > > > > The testing report can be found > here: > <https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115>https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > > > > The source files and gdb are on a new Plasmodium falciparum page > on the Fall 2010 BiolDB > wiki: > <https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum>https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > > > > Here is the list of bugs/action items that I've listed: > > > > 1. The OrderedLocusNames table in the gdb only has 1 ID out of > 5345 repored by the TallyEngine. This also affects all other tables > related to OrderedLocusNames. > > > > 2. The GeneId table in the database has 6 fewer IDs than > reported by the TallyEngine (Mycobacterium smegmatis and > Mycobacterium tuberculosis also have mysterious GeneId issues with > the TallyEngine). > > > > 3. The count for EMBL IDs in the gdb also seems low, it's lower > than the 2009 version of the gdb. There's no way to tell at this > point whether this is due to a change in annotation by UniProt or > is a bug with GenMAPP Builder. > > > > Thanks, > > Kam > > > > > > > ------------------------------------------------------------------------------ > > What You Don't Know About Data Connectivity CAN Hurt You > > This paper provides an overview of data connectivity, details > > its effect on application quality, and explores various alternative > > solutions. > <http://p.sf.net/sfu/progress-d2d>http://p.sf.net/sfu/progress-d2d > > _______________________________________________ > > xmlpipedb-developer mailing list > > > <mailto:xml...@li...>xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > <ATT00001..txt><ATT00002..txt> > > >------------------------------------------------------------------------------ >Colocation vs. Managed Hosting >A question and answer guide to determining the best fit >for your organization - today and in the future. ><http://p.sf.net/sfu/internap-sfd2d>http://p.sf.net/sfu/internap-sfd2d >_______________________________________________ >xmlpipedb-developer mailing list ><mailto:xml...@li...>xml...@li... >https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > |
From: Richard B. <rbr...@gm...> - 2011-03-14 15:09:33
|
Thanks Dondi, Will review this after our call today. I have been a little worried as the DEBUG export has been going for 2.5 days with progress at 65% and 6.5 Gb of log files so far... /yikes Btw I have a work lunch meeting in Beverly Hills today so will be working from home afterwards instead of in the bio lab. Richard On Sun, Mar 13, 2011 at 9:55 PM, John David N. Dionisio <do...@lm...>wrote: > Thanks for the updates, Rich. > > I gave things a once-over and may have a lead. Here is what I found: > > - First, the TallyEngine customization for P. falciparum states the > following: > > > # Plasmodium falciparum > plasmodiumfalciparum_level_amount=2 > > plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF > plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene > > plasmodiumfalciparum_query_level0=select count(*) from genenametype where > type = 'ORF'; > plasmodiumfalciparum_query_level1=select count(*) from genenametype where > type = 'UniGene'; > > plasmodiumfalciparum_table_name_level0=Ordered Locus > plasmodiumfalciparum_table_name_level1=UniGene > > > Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene > names whose type is 'ORF' ("level0" properties). > > - Now, this is what the P. falciparum species profile does when harvesting > IDs > (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): > > > String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + > "from genenametype c inner join entrytype_genetype d " + > "on (c.entrytype_genetype_name_hjid = d.hjid) " + > "where (c.value similar to ? " + > "or c.value similar to ? " + > "or c.value similar to ?) " + > "and type <> 'ordered locus names' " + > "and type <> 'ORF' " + > "group by d.entrytype_gene_hjid, c.value"; > > > Note the condition on the second-to-last line --- the query actually > *omits* gene names whose type is 'ORF'! So the question is...which is > right? (I'm inclined to believe the Tally Engine here, since, the export > puts only one record in OrderedLocusNames) > > Still, comparing these two queries directly against the PostgreSQL database > would be educational, I think. Then, knowing which criteria are correct, > the appropriate action can then be taken, I think. > > Hope this helps... > > John David N. Dionisio, PhD > Associate Professor, Computer Science > Loyola Marymount University > > > > On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > > > Debug export is still going... 2.5GB of log files so far with progress at > 65%... > > > > I posted the link of the WARN log on the plasmodium page here: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum. > > Richard > > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> > wrote: > > Hi all, > > > > Have been working through several Plasmodium gdb exports in an attempt to > source why only one gene id makes it into the Ordered Locus table. > > > > I have reviewed the logger file while set to "WARN" and wasn't able to > determine anything which would suggest an error. I will post this log file > to the wiki later today when I get home. > > > > I then upped the logger verbosity to "DEBUG" and file size to 100MB with > hopes that more detail will surface the issue, but my export is on hour 20 > and still going (although its nearly complete). What I didn't expect was the > size of the log files and that it seems only the last 3 are kept with > earlier logs being overwritten =( I fear that the information I need it in > one of the earlier files which are now lost. > > > > Unless a better suggestion is offered I'm going to rerun an export again > with 'DEBUG" verbosity and up the file sizes to near 1 GB each and hope that > 3 GB total will be enough to hold the complete export log. > > > > More info as it comes... > > > > Richard > > > > > > > > > > > > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> > wrote: > > Hi, > > > > I've completed testing the Plasmodium gdb I exported last November and > updated the SourceForge wiki. > > > > Plasmodium has it's own task list page, which I've updated here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > > > > The testing report can be found here: > https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > > > > The source files and gdb are on a new Plasmodium falciparum page on the > Fall 2010 BiolDB wiki: > https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > > > > Here is the list of bugs/action items that I've listed: > > > > 1. The OrderedLocusNames table in the gdb only has 1 ID out of 5345 > repored by the TallyEngine. This also affects all other tables related to > OrderedLocusNames. > > > > 2. The GeneId table in the database has 6 fewer IDs than reported by the > TallyEngine (Mycobacterium smegmatis and Mycobacterium tuberculosis also > have mysterious GeneId issues with the TallyEngine). > > > > 3. The count for EMBL IDs in the gdb also seems low, it's lower than the > 2009 version of the gdb. There's no way to tell at this point whether this > is due to a change in annotation by UniProt or is a bug with GenMAPP > Builder. > > > > Thanks, > > Kam > > > > > > > ------------------------------------------------------------------------------ > > What You Don't Know About Data Connectivity CAN Hurt You > > This paper provides an overview of data connectivity, details > > its effect on application quality, and explores various alternative > > solutions. http://p.sf.net/sfu/progress-d2d > > _______________________________________________ > > xmlpipedb-developer mailing list > > xml...@li... > > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > > > > > <ATT00001..txt><ATT00002..txt> > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > |
From: John D. N. D. <do...@lm...> - 2011-03-14 04:55:53
|
Thanks for the updates, Rich. I gave things a once-over and may have a lead. Here is what I found: - First, the TallyEngine customization for P. falciparum states the following: # Plasmodium falciparum plasmodiumfalciparum_level_amount=2 plasmodiumfalciparum_element_level0=uniprot/entry/gene/name&type&ORF plasmodiumfalciparum_element_level1=uniprot/entry/gene/name&type&UniGene plasmodiumfalciparum_query_level0=select count(*) from genenametype where type = 'ORF'; plasmodiumfalciparum_query_level1=select count(*) from genenametype where type = 'UniGene'; plasmodiumfalciparum_table_name_level0=Ordered Locus plasmodiumfalciparum_table_name_level1=UniGene Thus, what is being counted by TallyEngine as "Ordered Locus" are the gene names whose type is 'ORF' ("level0" properties). - Now, this is what the P. falciparum species profile does when harvesting IDs (PlasmodiumFalciparumUniProtSpeciesProfile.getSystemTableManagerCustomizations): String sqlQuery = "select d.entrytype_gene_hjid as hjid, c.value " + "from genenametype c inner join entrytype_genetype d " + "on (c.entrytype_genetype_name_hjid = d.hjid) " + "where (c.value similar to ? " + "or c.value similar to ? " + "or c.value similar to ?) " + "and type <> 'ordered locus names' " + "and type <> 'ORF' " + "group by d.entrytype_gene_hjid, c.value"; Note the condition on the second-to-last line --- the query actually *omits* gene names whose type is 'ORF'! So the question is...which is right? (I'm inclined to believe the Tally Engine here, since, the export puts only one record in OrderedLocusNames) Still, comparing these two queries directly against the PostgreSQL database would be educational, I think. Then, knowing which criteria are correct, the appropriate action can then be taken, I think. Hope this helps... John David N. Dionisio, PhD Associate Professor, Computer Science Loyola Marymount University On Mar 12, 2011, at 9:48 AM, Richard Brous wrote: > Debug export is still going... 2.5GB of log files so far with progress at 65%... > > I posted the link of the WARN log on the plasmodium page here: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum. > Richard > On Fri, Mar 11, 2011 at 1:06 PM, Richard Brous <rbr...@gm...> wrote: > Hi all, > > Have been working through several Plasmodium gdb exports in an attempt to source why only one gene id makes it into the Ordered Locus table. > > I have reviewed the logger file while set to "WARN" and wasn't able to determine anything which would suggest an error. I will post this log file to the wiki later today when I get home. > > I then upped the logger verbosity to "DEBUG" and file size to 100MB with hopes that more detail will surface the issue, but my export is on hour 20 and still going (although its nearly complete). What I didn't expect was the size of the log files and that it seems only the last 3 are kept with earlier logs being overwritten =( I fear that the information I need it in one of the earlier files which are now lost. > > Unless a better suggestion is offered I'm going to rerun an export again with 'DEBUG" verbosity and up the file sizes to near 1 GB each and hope that 3 GB total will be enough to hold the complete export log. > > More info as it comes... > > Richard > > > > > > On Fri, Mar 4, 2011 at 3:17 PM, Kam Dahlquist <kda...@lm...> wrote: > Hi, > > I've completed testing the Plasmodium gdb I exported last November and updated the SourceForge wiki. > > Plasmodium has it's own task list page, which I've updated here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Plasmodium_falciparum_Task_List > > The testing report can be found here: https://sourceforge.net/apps/mediawiki/xmlpipedb/index.php?title=Gene_Database_Testing_Report_P._falciparum_20101115 > > The source files and gdb are on a new Plasmodium falciparum page on the Fall 2010 BiolDB wiki: https://www.cs.lmu.edu/biodb/fall2010/index.php/Plasmodium_falciparum > > Here is the list of bugs/action items that I've listed: > > 1. The OrderedLocusNames table in the gdb only has 1 ID out of 5345 repored by the TallyEngine. This also affects all other tables related to OrderedLocusNames. > > 2. The GeneId table in the database has 6 fewer IDs than reported by the TallyEngine (Mycobacterium smegmatis and Mycobacterium tuberculosis also have mysterious GeneId issues with the TallyEngine). > > 3. The count for EMBL IDs in the gdb also seems low, it's lower than the 2009 version of the gdb. There's no way to tell at this point whether this is due to a change in annotation by UniProt or is a bug with GenMAPP Builder. > > Thanks, > Kam > > > ------------------------------------------------------------------------------ > What You Don't Know About Data Connectivity CAN Hurt You > This paper provides an overview of data connectivity, details > its effect on application quality, and explores various alternative > solutions. http://p.sf.net/sfu/progress-d2d > _______________________________________________ > xmlpipedb-developer mailing list > xml...@li... > https://lists.sourceforge.net/lists/listinfo/xmlpipedb-developer > > > > <ATT00001..txt><ATT00002..txt> |