From: Ed R. <ed_...@be...> - 2004-10-21 13:35:23
|
We had this problem in a couple of other projects I worked on. We solved it by having two ID fields in the subject table, an internal/original and an external/updated ID. Might it not be better to have the initial load of the taxon fill in an original ID and a current ID and then, if it ever changes, update a LastID and a CurrentID field to keep track of where NCBI is? This way, we can use the original ID to stay consistent internally to the GUS records and use LastID and CurrentID to keep track of where NCBI is going. All of this would happen in the update routine of the plugin. It would require two new fields in the SRes.taxon table.

-Ed

> From: pi...@pc...
> Date: 2004/10/20 Wed PM 03:47:17 EDT
> To: gus...@li...
> Subject: [Gusdev-gusdev] LoadTaxonomy.pm
>
> The LoadTaxonomy plugin was written to load taxonomy information from tables
> downloaded from NCBI. It was intended that the plugin update and not replace
> rows in the taxonomy tables and that the taxon_ids remain stable. It was
> assumed that NCBI didn't replace their tax_ids but in fact they do, although at
> a low rate. This results in duplications in the taxon_ids that represent the
> same taxonomic group. There is a merged.dmp file included in their tar ball
> that contains a list of old to new tax_id mappings and seems to be cumulative.
> I have written them to confirm that all replacements are in the file and that
> it is cumulative.
>
> I would like to add a subroutine to the LoadTaxon plugin that would be called
> first to replace the deprecated ncbi_tax_ids in sres.taxon with their
> replacements and then continue with the plugin as it is. This would require
> the addition of another option for the merged.dmp file. I was not going to
> make this an optional task but I will try to build in a time saver for first
> time use.
>
> Any comments? Please respond quickly as I need to run the plugin ASAP.
>
> -Debbie
>
> _______________________________________________
> Gusdev-gusdev mailing list
> Gus...@li...
> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev

Ed Robinson
255 Deerfield Rd
Bogart, GA 30622
(706)425-9181

--Learn more about the face of your neighbor, and less about your own. -Sargent Shriver |
From: <pi...@pc...> - 2004-10-20 19:47:27
|
The LoadTaxonomy plugin was written to load taxonomy information from tables downloaded from NCBI. It was intended that the plugin update and not replace rows in the taxonomy tables and that the taxon_ids remain stable. It was assumed that NCBI didn't replace their tax_ids but in fact they do, although at a low rate. This results in duplications in the taxon_ids that represent the same taxonomic group. There is a merged.dmp file included in their tar ball that contains a list of old to new tax_id mappings and seems to be cumulative. I have written them to confirm that all replacements are in the file and that it is cumulative.

I would like to add a subroutine to the LoadTaxon plugin that would be called first to replace the deprecated ncbi_tax_ids in sres.taxon with their replacements and then continue with the plugin as it is. This would require the addition of another option for the merged.dmp file. I was not going to make this an optional task but I will try to build in a time saver for first time use.

Any comments? Please respond quickly as I need to run the plugin ASAP.

-Debbie |
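[Editor's note] The merged.dmp file Debbie describes follows the NCBI taxdump convention of fields delimited by tab-pipe-tab and lines terminated by a tab-pipe. A minimal sketch of the old-to-new mapping step she proposes might look like the following; the function names are hypothetical, and the transitive `resolve` step is a defensive extra in case a replacement tax_id was itself later merged:

```python
def parse_merged_dmp(lines):
    """Build {old_tax_id: new_tax_id} from merged.dmp-style lines.

    NCBI dump files delimit fields with "\t|\t" and end lines with "\t|".
    """
    mapping = {}
    for line in lines:
        old_id, new_id = line.rstrip("\n").rstrip("\t|").split("\t|\t")
        mapping[int(old_id)] = int(new_id)
    return mapping


def resolve(tax_id, mapping):
    """Follow merges transitively; unmapped ids are returned unchanged."""
    seen = set()
    while tax_id in mapping and tax_id not in seen:
        seen.add(tax_id)
        tax_id = mapping[tax_id]
    return tax_id
```

An update pass over sres.taxon would then look up each stored ncbi_tax_id with `resolve` before the normal plugin logic runs.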
From: Paul M. <pj...@sa...> - 2004-10-20 19:07:42
|
Ed,

I've increased a lot of our description fields to 2048 but committing them to CVS has no real effect. Angel and Co. dump the schema they have from the Oracle DB installed at CBIL. I think this issue has been addressed already on this mailing list - I can't find the recent email about a new install system for GUS that will address this...

Paul/

On 20 Oct 2004, at 19:46, Ed Robinson wrote:
> NRDB has a number of descriptions which are longer than the
> varchar(255) field GUS allots for them. Presently these are being
> truncated when they are entered into the database. Would anyone be
> opposed to upping the description field size to fit these in?
>
> -Ed R
>
> Ed Robinson
> 255 Deerfield Rd
> Bogart, GA 30622
> (706)425-9181
>
> --Learn more about the face of your neighbor, and less about your own.
> -Sargent Shriver |
From: Ed R. <ed_...@be...> - 2004-10-20 18:46:30
|
NRDB has a number of descriptions which are longer than the varchar(255) field GUS allots for them. Presently these are being truncated when they are entered into the database. Would anyone be opposed to upping the description field size to fit these in? -Ed R Ed Robinson 255 Deerfield Rd Bogart, GA 30622 (706)425-9181 --Learn more about the face of your neighbor, and less about your own. -Sargent Shriver |
From: Steve F. <sfi...@pc...> - 2004-10-19 00:14:55
|
ed-

i think we should pull together a working group to look into these issues. give me a day or so to see if i can organize something

steve

Ed Robinson wrote:
> We have two databases up and running in GUS. One, the Crypto database, used GBParser. The other, for TCruzi, was data in TIGR XML, and used that plugin. Superficial analysis of the installs, though, seems to indicate that the data from the GBParser loaded into slightly different tables than the TCruzi data, and this data load (for the GBParser) was incomplete for the tables that should have been filled.
>
> One example: Only half the features in NAFeatureImp map to something in NAFeatureNAProtein.
>
> Another example is NAProtein links back to NAFeatureNAProtein, but it doesn't link back to GeneFeature because there are 0 joins from GeneFeature to NAFeatureNAProtein.
>
> Anyway, we have a number of questions we would like to open up about the consistency of these plugins. Should they both be using the GUS Schema the same ways? Does the schema contain different tables to be used only by different plugins? Do we, in fact, have a model of what tables data sources are supposed to load that was used when writing these plugins? We all assume that the schema is supposed to be used consistently by all plugins no matter what their data source. But are there old, abandoned tables that we shouldn't be filling any longer?
>
> -ed
>
> Ed Robinson
> 255 Deerfield Rd
> Bogart, GA 30622
> (706)425-9181
>
> --Learn more about the face of your neighbor, and less about your own.
> -Sargent Shriver |
From: Ed R. <ed_...@be...> - 2004-10-18 18:11:42
|
We have two databases up and running in GUS. One, the Crypto database, used GBParser. The other, for TCruzi, was data in TIGR XML, and used that plugin. Superficial analysis of the installs, though, seems to indicate that the data from the GBParser loaded into slightly different tables than the TCruzi data, and this data load (for the GBParser) was incomplete for the tables that should have been filled.

One example: Only half the features in NAFeatureImp map to something in NAFeatureNAProtein.

Another example is NAProtein links back to NAFeatureNAProtein, but it doesn't link back to GeneFeature because there are 0 joins from GeneFeature to NAFeatureNAProtein.

Anyway, we have a number of questions we would like to open up about the consistency of these plugins. Should they both be using the GUS Schema the same ways? Does the schema contain different tables to be used only by different plugins? Do we, in fact, have a model of what tables data sources are supposed to load that was used when writing these plugins? We all assume that the schema is supposed to be used consistently by all plugins no matter what their data source. But are there old, abandoned tables that we shouldn't be filling any longer?

-ed

Ed Robinson
255 Deerfield Rd
Bogart, GA 30622
(706)425-9181

--Learn more about the face of your neighbor, and less about your own. -Sargent Shriver |
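[Editor's note] The "only half the features map" check Ed describes can be expressed as an outer-join count. This is a sketch only: the schema prefix (dots) and the join column name (na_feature_id) are assumptions, not taken from a schema dump:

```sql
-- Count NAFeatureImp rows with and without a partner in NAFeatureNAProtein.
-- Column and schema names are illustrative assumptions.
SELECT COUNT(*)                AS total_features,
       COUNT(p.na_feature_id)  AS mapped_features
FROM   dots.NAFeatureImp f
       LEFT JOIN dots.NAFeatureNAProtein p
              ON p.na_feature_id = f.na_feature_id;
```

If mapped_features is well below total_features, the load was incomplete in the way Ed reports.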
From: <pi...@pc...> - 2004-10-16 15:52:05
|
I recently (within the last 3 weeks) loaded an entirely new version of nrdb and it took less than 24 hours. This should have been equivalent to a first load. I think that something else was wrong when you ran the plugin, possibly with the database (e.g. indexes missing, a need to update statistics). I agree that LoadNRDB needs an upgrade but I think its poor performance in this case is due to some other problem.

-Debbie

Quoting Ed Robinson <ed_...@be...>:

> As many of you know, we have been doing quite a few GUS installs down here,
> and this has pushed me to try and simplify this process as much as possible.
> I am now far enough along on a couple things to bring them up on the list.
>
> First, installing NRDB the first time in GUS is a horribly painful process
> using the existing plugin, and this pain seems to be needless since it is an
> empty database. To this end, I have written a couple scripts and a batch
> process for Oracle SQLLoader which accomplishes in about an hour what takes a
> few weeks with the plugin. However, to make this work, I have to reserve
> early rows in a number of SRes tables for meaningful entries in columns such
> as row_alg_invocation_id. Hence, my first discussion item: Should we
> consider reserving early values in a number of the SRes tables to serve as
> standard values? We already require that some rows be entered in GUS early
> on to make some things work, such as LoadReviewType. It would seem that we
> should pre-populate some of these tables with basic values that we can then
> refer to as standard values for bootstrapping operations such as a bulk load
> of NRDB. Does anyone else see any value in this and, if so, what fields
> should we create standard entries for? Also, is there anything else that
> would be amenable to a batch process for bootstrapping? (Note: I do NOT
> think any organism-specific data is amenable to bootstrapping. That is what
> an (object based) pipeline is for. Also, this batch process is only good if
> you are using Oracle, but a similar process can be written there too.)
>
> This also gets me to some of the other scripts we use to bootstrap GUS, such
> as the predefined set of ExternalDatabases we load. The XML which I use to
> load this is pretty messy, and not well documented. Does anyone mind if I
> clean it up? If the answer is yes, is there anything I should know about
> this file? It seems that the XML for this table load is a nice one to
> clean up and make standard for GUS installations all over since it will push
> GUS to be standardized across installations. What else should we
> standardize?
>
> Which now brings me to the last item I want to open up, which is that I am
> close to completing a full GUS installation wrapper script which essentially
> makes a GUS installation a click-and-play operation. One of our
> deliverables is supposed to be an easy-to-install GUS package. Regardless of
> the state of GUS with regards to an official release, this script is going to
> make my life a whole lot easier. I figure it might be nice to package the
> whole kit-n-kaboodle up into one nice fat tarball with a simple set of
> instructions for download from someplace. Is anyone else interested in this?
>
> Finally, one quick question I have about the NRDB load is that working on it
> showed me that the description field in AASequenceIMP is too short for many
> of the descriptions in NRDB. Do we want to up the description field size for
> dots.aasequenceimp?
>
> Anyway, any feedback on this would be appreciated.
>
> -Ed R
>
> Ed Robinson
> 255 Deerfield Rd
> Bogart, GA 30622
> (706)425-9181
>
> --Learn more about the face of your neighbor, and less about your own.
> -Sargent Shriver |
From: Michael S. <msa...@pc...> - 2004-10-15 20:26:09
|
Ed,

Ed Robinson wrote:
> Which now brings me to the last item I want to open up which is that
> I am close to completing a full GUS installation wrapper script which
> essentially makes a GUS installation a click-and-play operation.
> One of our deliverables is supposed to be an easy to install GUS
> package. Regardless of the state of GUS with regards to an official
> release, this script is going to make my life a whole lot easier. I
> figure it might be nice to package the whole kit-n-kaboodle up into
> one nice fat tarball with a simple set of instructions for download
> from someplace. Is anyone else interested in this?

As previously mentioned on this list, there is an effort underway to improve the administration of GUS, including a new turn-key installation mechanism (for both Oracle and Postgres). It's my hope to have this implemented by the next GUS release (within the next couple of months). I think it's great that you have a wrapper script, which is probably of interest to several groups on this list, but I would warn against duplication of effort and encourage you to sign up and participate in this effort:

https://mail.pcbi.upenn.edu/mailman/listinfo/gusdba
http://mail.pcbi.upenn.edu/pipermail/gusdba/ (Archives)

Perhaps you could send your script to that list-- I think it will serve as a really good foundation for our installer efforts.

Thanks,
Mike |
From: Paul M. <pj...@sa...> - 2004-10-15 20:04:55
|
On 15 Oct 2004, at 20:41, Ed Robinson wrote:
> This also gets me to some of the other scripts we use to bootstrap
> GUS, such as the predefined set of ExternalDatabases we load. The XML
> which I use to load this is pretty messy, and not well documented.
> Does anyone mind if I clean it up? If the answer is yes, is there
> anything I should know about this file? it seems that the XML for
> this table load is a nice one to clean-up and make standard for GUS
> installations all over since it will push gus to be standardized
> across installations. What else should we standardize?

I documented and included the XML we required at Sanger to set up a new instance of GUS:

http://www.gusdb.org/wiki/index.php/Bootstrap%20data
http://www.gusdb.org/wiki/index.php/Bootstrap%20data%20II

The first link is probably the common stuff that can be shared between sites. The ExternalDatabase XML is based on some XML we originally got from CBIL and I just added DBs as required, so it probably needs to be cleaned up. The second is more Sanger specific but it gives people XML they can at least edit and understand how someone else has implemented it.

Paul. |
From: Ed R. <ed_...@be...> - 2004-10-15 19:41:33
|
As many of you know, we have been doing quite a few GUS installs down here, and this has pushed me to try and simplify this process as much as possible. I am now far enough along on a couple things to bring them up on the list.

First, installing NRDB the first time in GUS is a horribly painful process using the existing plugin, and this pain seems to be needless since it is an empty database. To this end, I have written a couple scripts and a batch process for Oracle SQLLoader which accomplishes in about an hour what takes a few weeks with the plugin. However, to make this work, I have to reserve early rows in a number of SRes tables for meaningful entries in columns such as row_alg_invocation_id. Hence, my first discussion item: Should we consider reserving early values in a number of the SRes tables to serve as standard values? We already require that some rows be entered in GUS early on to make some things work, such as LoadReviewType. It would seem that we should pre-populate some of these tables with basic values that we can then refer to as standard values for bootstrapping operations such as a bulk load of NRDB. Does anyone else see any value in this and, if so, what fields should we create standard entries for? Also, is there anything else that would be amenable to a batch process for bootstrapping? (Note: I do NOT think any organism-specific data is amenable to bootstrapping. That is what an (object based) pipeline is for. Also, this batch process is only good if you are using Oracle, but a similar process can be written there too.)

This also gets me to some of the other scripts we use to bootstrap GUS, such as the predefined set of ExternalDatabases we load. The XML which I use to load this is pretty messy, and not well documented. Does anyone mind if I clean it up? If the answer is yes, is there anything I should know about this file? It seems that the XML for this table load is a nice one to clean up and make standard for GUS installations all over since it will push GUS to be standardized across installations. What else should we standardize?

Which now brings me to the last item I want to open up, which is that I am close to completing a full GUS installation wrapper script which essentially makes a GUS installation a click-and-play operation. One of our deliverables is supposed to be an easy-to-install GUS package. Regardless of the state of GUS with regards to an official release, this script is going to make my life a whole lot easier. I figure it might be nice to package the whole kit-n-kaboodle up into one nice fat tarball with a simple set of instructions for download from someplace. Is anyone else interested in this?

Finally, one quick question I have about the NRDB load is that working on it showed me that the description field in AASequenceIMP is too short for many of the descriptions in NRDB. Do we want to up the description field size for dots.aasequenceimp?

Anyway, any feedback on this would be appreciated.

-Ed R

Ed Robinson
255 Deerfield Rd
Bogart, GA 30622
(706)425-9181

--Learn more about the face of your neighbor, and less about your own. -Sargent Shriver |
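[Editor's note] For context on the batch approach Ed mentions, an Oracle SQL*Loader control file for a bulk append generally has the shape below. Everything here - data file name, target table, and column list - is an illustrative assumption, not Ed's actual script:

```
-- Hypothetical SQL*Loader control file for a tab-delimited bulk load.
-- Table and column names are illustrative only.
LOAD DATA
INFILE 'nrdb.tab'
APPEND
INTO TABLE dots.externalaasequence
FIELDS TERMINATED BY X'09'
TRAILING NULLCOLS
(
  aa_sequence_id,
  source_id,
  description            CHAR(2048),
  row_alg_invocation_id  CONSTANT '1'
)
```

The CONSTANT clause is where the "reserved early rows" Ed proposes would pay off: a bulk load can stamp every row with a known bootstrap row_alg_invocation_id instead of invoking the plugin machinery per row.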
From: Steve F. <sfi...@pc...> - 2004-10-13 15:33:37
|
mark-

oops. it is supposed to be an optional attribute of <sqlQuery>, and the model code supports it. but, the wdkModel.rng schema file forgot about it. i have added it to wdkModel.rng. you can get it from cvs:

WDK/Model/lib/rng/wdkModel.rng

couple of things:
1. i'd be interested to know why you want to turn off caching.
2. glad to hear you are using an xml editor. which one are you using?

steve

Mark S. Heiges wrote:
> The WDK wiki docs say I can turn off query caching by setting the
> isCacheable attribute to false. Where do I put that attribute? It's
> not mentioned in the wdkModel.rng so my xml editor complains if I put
> it anywhere in the modelXmlFile
>
> Mark |
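[Editor's note] Per Steve's answer, the attribute goes on the <sqlQuery> element itself. A minimal sketch - the name attribute and <sql> child are illustrative assumptions, only the element name and isCacheable come from the thread:

```xml
<!-- isCacheable as an optional attribute of <sqlQuery>; other names
     in this fragment are illustrative assumptions. -->
<sqlQuery name="ExampleQuery" isCacheable="false">
  <sql>SELECT ...</sql>
</sqlQuery>
```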
From: Mark S. H. <mh...@ug...> - 2004-10-13 14:57:55
|
The WDK wiki docs say I can turn off query caching by setting the isCacheable attribute to false. Where do I put that attribute? It's not mentioned in the wdkModel.rng so my xml editor complains if I put it anywhere in the modelXmlFile Mark |
From: <pi...@pc...> - 2004-10-12 19:51:18
|
I'm soliciting opinions on the issue of rewriting LoadTaxon.pm because of the following problem.

The LoadTaxon plugin is designed to load taxon information from NCBI into multiple tables including sres.taxon and sres.taxonname. On occasion tax_ids are deleted or old tax_ids are replaced with new tax_ids. The merged.dmp and delnodes.dmp files contained in the same tar ball are not always consistent with these changes. For instance, for the following example, neither tax_id was in any available merged.dmp file (up through July) and only appeared in a delnodes file in a later tar ball.

An example: Lactobacillus kefiranofaciens:
tax_id = 190905 through March/2004
         267818 April/2004 and later

Written with the assumption that tax_ids were relatively stable, the LoadTaxon plugin deletes only taxonname entries when the row for a specific taxon_id and name_class has been replaced. Taxon rows are updated but never deleted. This is not really an acceptable approach because eventually there is an accumulation of rows in taxonname that duplicate a name with the same name_class but different taxon_ids.

I think the LoadTaxon plugin should be rewritten to avoid duplications but to retain taxon_id stability. The plugin should not delete taxon rows because of the obvious referencing constraint problems but should update the rows as it does now, including an update of ncbi_tax_id. Does anyone have any suggestions or insight on this subject? |
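[Editor's note] The duplication Debbie describes can be surfaced with a grouping query. Sketch only: the column names follow the message (name, name_class, taxon_id); anything else is an assumption:

```sql
-- Find names that appear under more than one taxon_id with the same
-- name_class: the accumulation the rewrite is meant to prevent.
SELECT name, name_class, COUNT(DISTINCT taxon_id) AS n_taxa
FROM   sres.taxonname
GROUP  BY name, name_class
HAVING COUNT(DISTINCT taxon_id) > 1;
```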
From: Jinal J. <jjh...@vb...> - 2004-10-11 15:55:49
|
Thanks a lot Sucheta and Jeetendra,

You are right, the gbparser expects a space after the "ORIGIN"

On Monday 11 October 2004 11:45 am, Jeetendra Soneja wrote:
> Hi Jinal,
> I had the same problem. After trying several files, I figured out
> that the problem was the trailing spaces after the keyword ORIGIN.
>
> Try inserting a few spaces after ORIGIN. It worked for me at least. I
> think I had mailed about this problem before, but didn't find any
> better solution.
>
> --Jeetendra.
>
> > I am trying to load a genbank file and the gbparser keeps giving me this
> > message and finally goes out of memory and dies. Any idea what can be the
> > problem? This genbank file was created by me from our newly sequenced
> > genome and I am pretty sure that there is something missing in the format
> > which the gbparser expects and that's why it dies. I validated my genbank
> > file against "sequin" available from ncbi and everything looks good.
> >
> > Here is the message (the line "VLD CBIL::Bio::GenBank::Origin 1" goes
> > on and on)
> >
> > VLD CBIL::Bio::GenBank::Features 1
> > VLD CBIL::Bio::GenBank::BaseCount 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1
> > VLD CBIL::Bio::GenBank::Origin 1 |
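[Editor's note] The workaround confirmed in this thread - the parser expecting whitespace after the ORIGIN keyword - is easy to script when regenerating GenBank files. A hedged sketch, not part of gbparser itself; the function name and the single-space padding are assumptions:

```python
def pad_origin_lines(src_path, dst_path):
    """Copy a GenBank file, appending a trailing space to bare ORIGIN lines.

    Workaround sketch for parsers that expect whitespace after ORIGIN;
    the one-space padding is an assumption, not gbparser's documented rule.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.rstrip("\n") == "ORIGIN":
                line = "ORIGIN \n"
            dst.write(line)
```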
From: Jeetendra S. <so...@vb...> - 2004-10-11 15:45:38
|
Hi Jinal,

I had the same problem. After trying several files, I figured out that the problem was the trailing spaces after the keyword ORIGIN.

Try inserting a few spaces after ORIGIN. It worked for me at least. I think I had mailed about this problem before, but didn't find any better solution.

--Jeetendra.

> I am trying to load a genbank file and the gbparser keeps giving me this
> message and finally goes out of memory and dies. Any idea what can be the
> problem? This genbank file was created by me from our newly sequenced
> genome and I am pretty sure that there is something missing in the format
> which the gbparser expects and that's why it dies. I validated my genbank
> file against "sequin" available from ncbi and everything looks good.
>
> Here is the message (the line "VLD CBIL::Bio::GenBank::Origin 1" goes
> on and on)
>
> VLD CBIL::Bio::GenBank::Features 1
> VLD CBIL::Bio::GenBank::BaseCount 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1
> VLD CBIL::Bio::GenBank::Origin 1

--
Jeetendra Soneja
Research Associate
Virginia Bioinformatics Institute
1880 Pratt Dr., Building XV
Blacksburg, VA 24060, USA
Phone: (540)-231-2789
http://www.vbi.vt.edu |
From: Sucheta T. <su...@vb...> - 2004-10-11 15:41:43
|
Hi Jinal,

I know why it is happening. I have uploaded all the files in your home directory. I have commented somewhere in the make_genbank.pl program about this error. Please check that. I have also uploaded the sample files onto your home directory.

--sucheta

At 11:27 AM 10/11/2004 -0400, Jinal Jhaveri wrote:
>I am trying to load a genbank file and the gbparser keeps giving me this
>message and finally goes out of memory and dies. Any idea what can be the
>problem? This genbank file was created by me from our newly sequenced genome
>and I am pretty sure that there is something missing in the format which the
>gbparser expects and that's why it dies. I validated my genbank file against
>"sequin" available from ncbi and everything looks good.
>
>Here is the message (the line "VLD CBIL::Bio::GenBank::Origin 1" goes
>on and on)
>
>VLD CBIL::Bio::GenBank::Features 1
>VLD CBIL::Bio::GenBank::BaseCount 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1
>VLD CBIL::Bio::GenBank::Origin 1 |
From: Jinal J. <jjh...@vb...> - 2004-10-11 15:27:54
|
I am trying to load a genbank file and the gbparser keeps giving me this message and finally goes out of memory and dies. Any idea what can be the problem? This genbank file was created by me from our newly sequenced genome and I am pretty sure that there is something missing in the format which the gbparser expects and that's why it dies. I validated my genbank file against "sequin" available from ncbi and everything looks good.

Here is the message (the line "VLD CBIL::Bio::GenBank::Origin 1" goes on and on)

VLD CBIL::Bio::GenBank::Features 1
VLD CBIL::Bio::GenBank::BaseCount 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1
VLD CBIL::Bio::GenBank::Origin 1 |
From: Elisabetta M. <man...@pc...> - 2004-10-11 14:28:44
|
Hi Juan, I think we might need a bit more info from you to help solve the problem. It sounds like you are getting this error while using the RAD.StudyAnnotator. Where exactly did you get the error message copied below from (you ran the SA pushing the Test button or it's from log files or from where?) My understanding is that the xml file generated by the StudyAnnotator is correct and you can load it separately using the UpdateGusFromXML plugin but the problem is that it's not being loaded by the StudyAnnotator upon submitting. Could you email me, Junmin, and Steve a copy of your .gus.properties file (after removing passwords) and also a copy of your studyAnnotatorWebConfig.prop that you are using to build the Study Annotator? Elisabetta --- On Fri, 8 Oct 2004, Juan Carlos Perin wrote: > We get the following error when trying to create a new study and add new > contacts. It seems that the xml files that are generated in order to > update, are not being executed. Possibly because they are not being found > by the php script executing GA updategusfromxml, or perhaps something is not > right in .gus.properties, or another config file. We are stuck because we > have tried many combinations of fixes, but none seem to work, and unless we > manually run updategusfromxml on the command line with the xml files that > are generated, things don't get updated. Any ideas would help. THANKS > > > ArrayRequired property 'coreSchemaName' must be specified in /checkout/GUS > at /checkout/GUS/lib/perl/CBIL/Util/PropertySet.pm line 53.GUS::ObjR > elP::DbiDbHandle connect('','GUSrw',...) failed: (no error string) at > /checkout/GUS/lib/perl/GUS/ObjRelP/DbiDatabase.pm line 174Can't bless non > -reference value at /checkout/GUS/lib/perl/GUS/ObjRelP/DbiDbHandle.pm line > 19.Required property 'coreSchemaName' must be specified in /checkout > /GUS at /checkout/GUS/lib/perl/CBIL/Util/PropertySet.pm line 53. 
> _______________________________________________
> Gusdev-gusdev mailing list
> Gus...@li...
> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
|
From: Juan C. P. <bi...@ge...> - 2004-10-08 20:50:47
|
We get the following error when trying to create a new study and add new contacts. It seems that the xml files that are generated for the update are not being executed, possibly because they are not being found by the php script executing GA updategusfromxml, or perhaps something is not right in .gus.properties or another config file. We are stuck: we have tried many combinations of fixes, but none seem to work, and unless we manually run updategusfromxml on the command line with the generated xml files, things don't get updated. Any ideas would help. THANKS

ArrayRequired property 'coreSchemaName' must be specified in /checkout/GUS at /checkout/GUS/lib/perl/CBIL/Util/PropertySet.pm line 53.
GUS::ObjRelP::DbiDbHandle connect('','GUSrw',...) failed: (no error string) at /checkout/GUS/lib/perl/GUS/ObjRelP/DbiDatabase.pm line 174
Can't bless non-reference value at /checkout/GUS/lib/perl/GUS/ObjRelP/DbiDbHandle.pm line 19.
Required property 'coreSchemaName' must be specified in /checkout/GUS at /checkout/GUS/lib/perl/CBIL/Util/PropertySet.pm line 53.
|
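For what it's worth, the repeated 'coreSchemaName' complaint is the signature of a required-property check failing before the database connect, so the connect then runs with empty credentials. A simplified stand-in for that check (this is a sketch, not CBIL::Util::PropertySet itself; the properties-file path is hypothetical) shows the mechanism:

```perl
#!/usr/bin/perl
# Sketch: a PropertySet-style loader that dies when a required property
# is missing. If the web server's environment cannot find or read the
# properties file, every required property appears unset, producing the
# "Required property '...' must be specified" error quoted above.
use strict;
use warnings;

sub read_properties {
    my ( $file, @required ) = @_;
    my %props;
    open my $fh, '<', $file or die "cannot open $file: $!\n";
    while (<$fh>) {
        next if /^\s*(#|$)/;    # skip comments and blank lines
        $props{$1} = $2 if /^\s*(\S+?)\s*=\s*(.*?)\s*$/;
    }
    close $fh;
    for my $name (@required) {
        die "Required property '$name' must be specified in $file\n"
            unless defined $props{$name} && length $props{$name};
    }
    return \%props;
}

# Hypothetical usage: the php-invoked GA process must see the same file
# (and have read permission on it) as your command-line runs do.
my $props = read_properties( "$ENV{HOME}/.gus.properties", 'coreSchemaName' );
print "coreSchemaName=$props->{coreSchemaName}\n";
```

A quick test is to run the same GA command as the web server's user: if the property file resolves under your account but not the server's, you get exactly this failure mode.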
From: Steve F. <sfi...@pc...> - 2004-10-08 18:44:16
|
Arnaud-

how about this for arg names:
- queryIdColumnName
- subjectIdColumnName
(and give a nice string in the h=> )

also, about the use statements: those are wrong and naughty. they presume that the table names passed in as args are one or another of the objects in the use statements. we should remove them. instead, we need to do the use at "runtime", not "compile time." here, i think (good luck!), is how:

my $args = ???  # here construct the args needed to find the object in the db
my $require_q = "{require $queryTable; $queryTable->new($args) }";
my $queryobj = eval $require_q;

and the same for the subject.

steve

Arnaud Kerhornou wrote: > Ok, > > Steve Fischer wrote: > >> arnaud- >> >> ok, would you like to do the upgrade to LoadBlastSimFast? >> > So the plugin requires two extra parameters, '-queryIdAttr' and > '-subjectIdAttr', is that right ? > >> if i understand correctly this would avoid having to modify the >> schema, is that right? > > > I think so. > > Another thing, the plugin requires 'use' statements for loading the > sequence objects we want to attach similarity data to. > Could we bypass somehow this declaration as in theory we would want to > attach similarity data to any view on the top of NASequenceImp or > AASequenceImp. By instanciating AASequence or NASequence superclass > objects and using the subclass_view attribute to affect to the correct > view the data, would it be feasible this way ? > >> >> steve >> >> Arnaud Kerhornou wrote: >> >>> >>> Steve Fischer wrote: >>> >>>> Arnaud- >>>> >>>> ok, i've looked at LoadBlastSimFast.pm. I see the addition of >>>> the logic to use name to get the query object (and i know that we >>>> discussed this in mail back in august). >>>> >>>> I am having some second thoughts about that change as it stands. >>>> The original intent of the plugin was that the sequences submitted >>>> to the blast process have been extracted from the database and >>>> therefore have the primary key in their definition line.
>>>> >>>> I think I understand that it would be useful to be able to skip >>>> that step, ie, blast sequences using their native identifiers, and >>>> then have the plugin discover what their internal primary key >>>> is. That's what you want to do right? >>>> >>> that's right. >>> >>>> Does anybody know of any reason why that would not be ok? >>>> >>>> Assuming that nobody has any objections, maybe the best solution >>>> would be to improve the plugin to take optional arguments that >>>> specify the name of the query and/or subject identifier >>>> attributes? For example: -queryIdAttr source_id. This would >>>> give us full flexibility (and also avoid the slightly risky >>>> assumption that a digits-only identifier must be the primary key) >>>> >>> that sounds sensible. >>> >>>> steve >>>> >>>> Arnaud Kerhornou wrote: >>>> >>>>> >>>>> Steve Fischer wrote: >>>>> >>>>>> Arnaud- >>>>>> >>>>>> see below. >>>>>> >>>>>> steve >>>>>> >>>>>> Arnaud Kerhornou wrote: >>>>>> >>>>>>> Hi everyone >>>>>>> >>>>>>> To be able to reproduce the OrthoMCL method, I would like to >>>>>>> raise two issues we've got: >>>>>>> >>>>>>> * The first issue relates to the view where are stored the >>>>>>> protein sequences. I was thinking to use the >>>>>>> TranslatedAASequence view as this one contains the translated >>>>>>> sequences of our gene models. The problem I have is that it is >>>>>>> missing a name attribute so I can not match the blast output >>>>>>> query and subject names with the data into GUS (I didn't want to >>>>>>> use the TranslatedAASequence primary keys as the identifiers of >>>>>>> my proteins of interest). >>>>>>> Could we add a name attribute to this view ? >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> hmm. not quite following. what would this name be, where >>>>>> would it be derived from? >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> By default we assign the systematic id of the corresponding CDS to >>>>> the protein name. 
>>>>> >>>>>> why not use source_id and/or secondary_identifier? >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> We could do that, but in any case that would involve to modify the >>>>> code of the loading BLAST output plugin (LoadBlastSimFast.pm) to >>>>> get the sequences entries. At the moment the match is made on the >>>>> primary key (which I want to avoid) or the name attribute. The >>>>> source_id attribute would do instead of the name attribute. It >>>>> must work for any blast (DNA Vs DNA or Protein Vs Protein) with >>>>> the various potential GUS sequence objects we want to attach >>>>> similarity data to. As far as I can see the source_id attribute is >>>>> present in all of them (AASequenceImp and NASequenceImp tables). >>>>> >>>>>> or, presumably this translated sequence has a relationship back >>>>>> to its na sequence (although i don't immediately see that in the >>>>>> schema browser), so couldn't you get a name or source_id from there? >>>>>> >>>>> That would require a more sophisticate query to get the sequence >>>>> entry. >>>>> >>>>>>> >>>>>>> * The second issue relates to the BLAST output parsing, done by >>>>>>> a module called BlastAnal.pm in the CBIL package. This module >>>>>>> seems to parse BLAST output file with only one query sequence. I >>>>>>> have more than one query sequence reported so I had to change >>>>>>> the code of this module to allow more than one query sequence. >>>>>>> Can my code be integrated to CBIL package ? Note that I didn't >>>>>>> change the interface of this module so it doesn't affect the >>>>>>> scripts that are using it, I'm thinking in particular of >>>>>>> parseBlastFilesForSimilarity.pl >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> this sounds ok. how about we just take a quick look at this >>>>>> together while you are visiting? then we can fold it into the >>>>>> code base. do you want to send it by mail? >>>>>> >>>>> That's fine, the module is attached. 
>>>>>>> cheers
>>>>>>> Arnaud
|
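Steve's runtime-require idea can also be written without string eval, by converting the class name to a file path and wrapping `require` in a block eval. This is a generic sketch of that technique; the class name and constructor arguments below are placeholders, not the actual LoadBlastSimFast interface:

```perl
#!/usr/bin/perl
# Sketch: load an object class named only at runtime, instead of
# hard-coding "use" statements for every possible sequence view.
use strict;
use warnings;

sub new_object_for_table {
    my ( $class, %args ) = @_;

    # Turn e.g. "GUS::Model::DoTS::ExternalNASequence" into
    # "GUS/Model/DoTS/ExternalNASequence.pm" and load it.
    ( my $path = $class ) =~ s{::}{/}g;
    eval { require "$path.pm" };
    die "could not load $class: $@" if $@;

    # Constructor arguments are whatever the caller needs to find
    # the row in the db (placeholders here).
    return $class->new( \%args );
}

# Hypothetical usage: the table name arrives as a plugin argument.
# my $queryobj = new_object_for_table( $queryTable, source_id => $id );
```

The block eval avoids interpolation pitfalls of `eval "require ..."` while giving the same effect: the `use` happens when the plugin runs, for whichever view was named on the command line.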
From: Arnaud K. <ax...@sa...> - 2004-10-08 16:32:00
|
Hi,

While loading our whole set of data into the Feature-Land/Central Dogma part of GUS, we have run into some issues with the items below. I would like to know what other groups using GUS think about them.

* Ontologies:
=> Parsing the OBO format. Now that OBO has become the standard format for ontologies, is there any plan to update the GO loading plugin so it can parse this format?
=> GO and the Sequence Ontology are in different tables. Could we generalize the schema for handling ontologies by designing a single set of tables used for loading any ontology? In future we will need additional ontologies; I'm thinking of the Evidence Code Ontology for example, but this should apply to any ontology in the OBO project (http://obo.sourceforge.net).

* Gene naming nomenclature: There was a thread a while ago about redesigning the gene naming nomenclature in GUS; I don't think anything has been implemented yet. This is important for us and is not yet fully covered by GUS. These are our requirements:
* /systematic_id - final systematic name for when the chromosome is finished or a stable sequence is submitted; will be the title for the gene page in the absence of a primary_name.
* /temporary_systematic_id - temporary systematic name used during projects where the sequence is unfinished, i.e. a temporary name for the shotgun sequences.
* /previous_systematic_id - systematic names *no longer in use*. Can be seen on GeneDB pages and queried. When a systematic_id is assigned, the temporary_systematic_id can be moved to this qualifier.
* /synonym - other gene names still in use, to be displayed on the gene page.
* /obsolete_name - redundant gene names that can be queried but *are not visible on the gene page*, e.g. errors.
* /primary_name - published or agreed unique *user-friendly gene name*, following the convention set out for kinetoplastids; will be the title for the gene page. NB: this is *equivalent to* the EMBL qualifier *standard_name*, so it should be used "to give full gene name, but use /gene to give gene symbol".
* /reserved_name - pre-publication names that will, presumably, become the primary_name.

* Curation and comments: At the moment we cannot store our curation and note qualifiers in GUS. There is a DoTS.Comments table, but how can we attach comments to feature objects?

* SRes.BibliographicReference: How can these be associated with sequence and feature objects?

* Core.Algorithm table: Is this table only for registering plugins, or can we use it as a controlled vocabulary table for storing software entries? We'd like to store software entries, for example a GeneFeature predicted by geneid v1.2 or a SignalPeptide feature generated by SignalP v3.0.

* Similarity data: Can this be used for loading any similarity data, e.g. Fasta, BLAST, exonerate, Blat etc.? I think it is possible, but how can we store which software was used (this refers to the previous item)?

* Curation of homology data: Is it possible to specify that an orthologous or paralogous group has been manually curated (by adding a review_status_id attribute?)

cheers
Arnaud |
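On the OBO question: an OBO file is a sequence of `[Term]` stanzas of `tag: value` lines, so the parsing step of a generic ontology loader is small. A minimal sketch (illustrative only; not the existing GO plugin, and ignoring trailing comments and escapes that a full parser would handle):

```perl
#!/usr/bin/perl
# Sketch: collect [Term] stanzas from an OBO stream into a list of
# hashes, keeping repeated tags (e.g. multiple is_a lines) as arrays.
use strict;
use warnings;

sub parse_obo_terms {
    my ($fh) = @_;
    my ( @terms, $current );
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^\[Term\]/ ) {
            push @terms, $current if $current;   # close previous stanza
            $current = {};
        }
        elsif ( $current && $line =~ /^(\w+):\s*(.*)$/ ) {
            push @{ $current->{$1} }, $2;        # tags may repeat
        }
    }
    push @terms, $current if $current;           # close last stanza
    return \@terms;
}

# Example: parse a tiny in-memory OBO fragment.
my $obo = "[Term]\nid: GO:0008150\nname: biological_process\n";
open my $fh, '<', \$obo or die $!;
my $terms = parse_obo_terms($fh);
print "$terms->[0]{id}[0]\t$terms->[0]{name}[0]\n";
```

Because every ontology in the OBO project shares this stanza syntax, one loader over a generic term/relationship table pair would cover GO, SO, and the Evidence Code Ontology alike.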
From: Arnaud K. <ax...@sa...> - 2004-10-08 15:35:15
|
Ok,

Steve Fischer wrote:
> arnaud-
>
> ok, would you like to do the upgrade to LoadBlastSimFast?
>
So the plugin requires two extra parameters, '-queryIdAttr' and '-subjectIdAttr', is that right?

> if i understand correctly this would avoid having to modify the
> schema, is that right?

I think so.

Another thing: the plugin requires 'use' statements to load the sequence objects we want to attach similarity data to. Could we somehow bypass this declaration, since in theory we would want to attach similarity data to any view on top of NASequenceImp or AASequenceImp? Would it be feasible to instantiate AASequence or NASequence superclass objects and use the subclass_view attribute to assign the data to the correct view?

> > steve > > Arnaud Kerhornou wrote: > >> >> Steve Fischer wrote: >> >>> Arnaud- >>> >>> ok, i've looked at LoadBlastSimFast.pm. I see the addition of >>> the logic to use name to get the query object (and i know that we >>> discussed this in mail back in august). >>> >>> I am having some second thoughts about that change as it stands. The >>> original intent of the plugin was that the sequences submitted to >>> the blast process have been extracted from the database and >>> therefore have the primary key in their definition line. >>> >>> I think I understand that it would be useful to be able to skip that >>> step, ie, blast sequences using their native identifiers, and then >>> have the plugin discover what their internal primary key is. >>> That's what you want to do right? >>> >> that's right. >> >>> Does anybody know of any reason why that would not be ok? >>> >>> Assuming that nobody has any objections, maybe the best solution >>> would be to improve the plugin to take optional arguments that >>> specify the name of the query and/or subject identifier >>> attributes? For example: -queryIdAttr source_id.
This would >>> give us full flexibility (and also avoid the slightly risky >>> assumption that a digits-only identifier must be the primary key) >>> >> that sounds sensible. >> >>> steve >>> >>> Arnaud Kerhornou wrote: >>> >>>> >>>> Steve Fischer wrote: >>>> >>>>> Arnaud- >>>>> >>>>> see below. >>>>> >>>>> steve >>>>> >>>>> Arnaud Kerhornou wrote: >>>>> >>>>>> Hi everyone >>>>>> >>>>>> To be able to reproduce the OrthoMCL method, I would like to >>>>>> raise two issues we've got: >>>>>> >>>>>> * The first issue relates to the view where are stored the >>>>>> protein sequences. I was thinking to use the TranslatedAASequence >>>>>> view as this one contains the translated sequences of our gene >>>>>> models. The problem I have is that it is missing a name attribute >>>>>> so I can not match the blast output query and subject names with >>>>>> the data into GUS (I didn't want to use the TranslatedAASequence >>>>>> primary keys as the identifiers of my proteins of interest). >>>>>> Could we add a name attribute to this view ? >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> hmm. not quite following. what would this name be, where would >>>>> it be derived from? >>>> >>>> >>>> >>>> >>>> By default we assign the systematic id of the corresponding CDS to >>>> the protein name. >>>> >>>>> why not use source_id and/or secondary_identifier? >>>> >>>> >>>> >>>> >>>> We could do that, but in any case that would involve to modify the >>>> code of the loading BLAST output plugin (LoadBlastSimFast.pm) to >>>> get the sequences entries. At the moment the match is made on the >>>> primary key (which I want to avoid) or the name attribute. The >>>> source_id attribute would do instead of the name attribute. It must >>>> work for any blast (DNA Vs DNA or Protein Vs Protein) with the >>>> various potential GUS sequence objects we want to attach similarity >>>> data to. As far as I can see the source_id attribute is present in >>>> all of them (AASequenceImp and NASequenceImp tables). 
>>>> >>>>> or, presumably this translated sequence has a relationship back >>>>> to its na sequence (although i don't immediately see that in the >>>>> schema browser), so couldn't you get a name or source_id from there? >>>>> >>>> That would require a more sophisticate query to get the sequence >>>> entry. >>>> >>>>>> >>>>>> * The second issue relates to the BLAST output parsing, done by a >>>>>> module called BlastAnal.pm in the CBIL package. This module seems >>>>>> to parse BLAST output file with only one query sequence. I have >>>>>> more than one query sequence reported so I had to change the code >>>>>> of this module to allow more than one query sequence. Can my code >>>>>> be integrated to CBIL package ? Note that I didn't change the >>>>>> interface of this module so it doesn't affect the scripts that >>>>>> are using it, I'm thinking in particular of >>>>>> parseBlastFilesForSimilarity.pl >>>>> >>>>> >>>>> >>>>> >>>>> this sounds ok. how about we just take a quick look at this >>>>> together while you are visiting? then we can fold it into the >>>>> code base. do you want to send it by mail? >>>>> >>>> That's fine, the module is attached. >>>> >>>>>> >>>>>> cheers >>>>>> Arnaud >>>>>> >>>>>> > |
From: Steve F. <sfi...@pc...> - 2004-10-08 15:11:25
|
arnaud- ok, would you like to do the upgrade to LoadBlastSimFast? if i understand correctly this would avoid having to modify the schema, is that right? steve Arnaud Kerhornou wrote: > > Steve Fischer wrote: > >> Arnaud- >> >> ok, i've looked at LoadBlastSimFast.pm. I see the addition of the >> logic to use name to get the query object (and i know that we >> discussed this in mail back in august). >> >> I am having some second thoughts about that change as it stands. >> The original intent of the plugin was that the sequences submitted to >> the blast process have been extracted from the database and therefore >> have the primary key in their definition line. >> >> I think I understand that it would be useful to be able to skip that >> step, ie, blast sequences using their native identifiers, and then >> have the plugin discover what their internal primary key is. >> That's what you want to do right? >> > that's right. > >> Does anybody know of any reason why that would not be ok? >> >> Assuming that nobody has any objections, maybe the best solution >> would be to improve the plugin to take optional arguments that >> specify the name of the query and/or subject identifier >> attributes? For example: -queryIdAttr source_id. This would >> give us full flexibility (and also avoid the slightly risky >> assumption that a digits-only identifier must be the primary key) >> > that sounds sensible. > >> steve >> >> Arnaud Kerhornou wrote: >> >>> >>> Steve Fischer wrote: >>> >>>> Arnaud- >>>> >>>> see below. >>>> >>>> steve >>>> >>>> Arnaud Kerhornou wrote: >>>> >>>>> Hi everyone >>>>> >>>>> To be able to reproduce the OrthoMCL method, I would like to raise >>>>> two issues we've got: >>>>> >>>>> * The first issue relates to the view where are stored the protein >>>>> sequences. I was thinking to use the TranslatedAASequence view as >>>>> this one contains the translated sequences of our gene models. 
The >>>>> problem I have is that it is missing a name attribute so I can not >>>>> match the blast output query and subject names with the data into >>>>> GUS (I didn't want to use the TranslatedAASequence primary keys as >>>>> the identifiers of my proteins of interest). >>>>> Could we add a name attribute to this view ? >>>> >>>> >>>> >>>> >>>> hmm. not quite following. what would this name be, where would >>>> it be derived from? >>> >>> >>> >>> By default we assign the systematic id of the corresponding CDS to >>> the protein name. >>> >>>> why not use source_id and/or secondary_identifier? >>> >>> >>> >>> We could do that, but in any case that would involve to modify the >>> code of the loading BLAST output plugin (LoadBlastSimFast.pm) to get >>> the sequences entries. At the moment the match is made on the >>> primary key (which I want to avoid) or the name attribute. The >>> source_id attribute would do instead of the name attribute. It must >>> work for any blast (DNA Vs DNA or Protein Vs Protein) with the >>> various potential GUS sequence objects we want to attach similarity >>> data to. As far as I can see the source_id attribute is present in >>> all of them (AASequenceImp and NASequenceImp tables). >>> >>>> or, presumably this translated sequence has a relationship back >>>> to its na sequence (although i don't immediately see that in the >>>> schema browser), so couldn't you get a name or source_id from there? >>>> >>> That would require a more sophisticate query to get the sequence entry. >>> >>>>> >>>>> * The second issue relates to the BLAST output parsing, done by a >>>>> module called BlastAnal.pm in the CBIL package. This module seems >>>>> to parse BLAST output file with only one query sequence. I have >>>>> more than one query sequence reported so I had to change the code >>>>> of this module to allow more than one query sequence. Can my code >>>>> be integrated to CBIL package ? 
Note that I didn't change the >>>>> interface of this module so it doesn't affect the scripts that are >>>>> using it, I'm thinking in particular of >>>>> parseBlastFilesForSimilarity.pl >>>> >>>> >>>> >>>> this sounds ok. how about we just take a quick look at this >>>> together while you are visiting? then we can fold it into the >>>> code base. do you want to send it by mail? >>>> >>> That's fine, the module is attached. >>> >>>>> >>>>> cheers >>>>> Arnaud >>>>> >>>>> >>>> >>> ------------------------------------------------------------------------ >>> >>> >>> #!/usr/bin/perl >>> >>> ## perl module for dealing with blast results... >>> ## will parse both --noseqs and normal blast. >>> ## does not parse out the alignments, rather meta information for >>> each subject and hsp >>> >>> ## Brian Brunk 5/17/99 >>> >>> package CBIL::Bio::Blast::BlastAnal; >>> >>> use CBIL::Bio::Blast::Subject; >>> use CBIL::Bio::Blast::HSP; >>> use CBIL::Bio::Blast::Chimera; >>> use CBIL::Bio::Blast::Repeat; >>> >>> use strict; >>> >>> my $debug = 0; >>> >>> ##simple constructor...need to make more robust if inheriting for >>> other purposes >>> sub new { >>> my $self = {}; >>> my $class = shift; >>> my $d = shift; >>> bless $self,$class; >>> $debug = defined $d ? $d : 0; >>> print STDERR "DEBUGGING ON\n" if $debug == 1; >>> $self->{"countSubjects"} = 0; >>> >>> # Added a hashtable storing the hits for each query sequence >>> my %tmp; >>> $self->{hits_by_queryId} = \%tmp; >>> $self->{queryIndex} = 0; >>> >>> return $self; >>> } >>> >>> >>> >>> >>> ##DataStructure returned by parseBlast...must have -noseqs on >>> commandline >>> ##new datastructure.... >>> ##{queryLength},{queryName} >>> ##{subjects} = array of Subject objects... 
>>> >>> ## takes in a blast result as an array and builds the entire >>> datastructure >>> sub parseBlast{ >>> my $self = shift; >>> my($minLength,$minPercent,$minPvalue,$regex,$blast,$remMaskedFromLen,$rpsblast) >>> = @_; >>> my %blast; >>> my >>> ($strand,$dir,$qStart,$qEnd,$sStart,$sEnd,$pValue,$matchLength,$matchPercent); >>> >>> my $sbjct; >>> my $queryMatch; >>> my $haveQStart = 0; >>> my $haveSStart = 0; >>> my $parseType; >>> my $frame; >>> my $score; >>> my $identities; >>> my $positives; >>> >>> die 'these are the args to >>> parseBlastNoSeqs($minLength,$minPercent,$minPvalue,$regex,$blast)\n' >>> if scalar(@{$blast}) <= 10; >>> foreach (@{$blast}) { >>> # print STDERR $_; >>> if (/^(\S*BLAST\S+)/){ ##gets the algorighm used for the query... >>> my $algo = $1; >>> print STDERR "algorithm = $algo\n" if $debug; >>> if($algo =~ /BLASTX/i){ >>> $parseType = 1; >>> }else{ >>> $parseType = 0; >>> } >>> print STDERR "ParseType='$parseType'\n" if $debug; >>> } >>> if (/^Query=\s*(\S+)/) { >>> # Storing Hits/HSPs for previous query Id >>> if (defined $self->{queryName}) { >>> my $countSubjects = $self->{countSubjects}; >>> my $subjects = $self->{subjects}; >>> my %hits = ( >>> 'countSubjects' => $countSubjects, >>> 'subjects' => $subjects >>> ); >>> my $hits_by_queryId = >>> $self->{hits_by_queryId}; >>> $hits_by_queryId->{$self->{queryName}} = \%hits; >>> } >>> $self->{queryName} = $1; ##query name/id >>> >>> # Reinitialisation >>> $self->{"countSubjects"} = 0 ; >>> my @tmp = (); >>> $self->{subjects} = \@tmp; } >>> elsif (/^\s+\((\S+)\sletters\)/) { >>> $self->{queryLength} = $1; ##query length >>> $self->{queryLength} =~ s/,//g; >>> } >>> if (/^\>$regex/ || /ctxfactor/ || ($rpsblast && /^Lambda/)) { >>> ##ctxfactor catches the last one...Lambda last of rpsblast >>> my $sbjctId; >>> if (/^\>$regex/){ >>> $sbjctId = $1 ? 
$1 : $2; >>> }else{ >>> print STDERR "$self->{queryName}: Unable to match subject >>> using regex '$regex' on:\n $_" unless (/ctxfactor/ || /^Lambda/); >>> } >>> print STDERR "Matching subjectId = $sbjctId\n" if $debug; >>> ## print STDERR "REGEX matched $_"; >>> if ($haveQStart) { >>> print STDERR "Have last QStart:", $sbjct->getID()," >>> SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, >>> Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1; >>> ##have the complete HSP... >>> ##want to add only if meets minimum reqs....do this on an HSP >>> basis... >>> # print $sbjct->getID()," $sStart,$sEnd: >>> Length=$matchLength, Percent=$matchPercent, pValue=$pValue\n"; >>> if ($matchLength >= $minLength && $matchPercent >= >>> $minPercent && $pValue <= $minPvalue) { >>> print STDERR "Match meets reqs\n" if $debug; >>> if($remMaskedFromLen){ ##want to remove Xs from the match >>> length.... >>> # print STDERR "removing X from match\n"; >>> # print STDERR "Before: $queryMatch\n"; >>> $queryMatch =~ s/X//g; >>> # print STDERR "After: $queryMatch\n"; >>> if(length($queryMatch) < 10){ ##don't remove if final >>> length < 10 >>> print STDERR "RemoveMaskedResiduesFromLength: >>> error...length = ".length($queryMatch)." following removal of X's\n"; >>> }else{ >>> # print "match before $matchLength\n"; >>> $matchLength = length($queryMatch); >>> # print "match after $matchLength\n"; >>> } >>> } >>> >>> $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, >>> $pValue, $dir, $frame, $score,$identities,$positives) if $sbjct; >>> } $haveQStart = 0; ##reset for next one... >>> $haveSStart = 0; >>> } if (defined $sbjct && $sbjct->countHSPs() >= 1 && >>> $sbjct->getID() ne $self->{"queryName"}) { >>> $self->addSubject($sbjct); >>> } >>> ##return heree if end... 
>>> return if (/ctxfactor/ || /^Lambda/); >>> ## print STDERR "New Subject $sbjctId\n"; >>> $sbjct = CBIL::Bio::Blast::Subject->new($sbjctId) if $sbjctId; >>> ## $sbjct->setID($sbjctId); >>> } >>> if (/^\s+Length\s=\s(\S+)/) { >>> if($sbjct){ >>> my $sbjctLength = $1; >>> $sbjctLength =~ s/,//g; >>> $sbjct->setLength($sbjctLength); >>> $sbjct->setQueryLength($self->{"queryLength"}); ##set the >>> query length for use in } >>> ##end remaining calculations >>> } >>> if (/^\s*Score\s=\s+(\d+).*=\s(\S+)$/ || ($rpsblast && >>> /^\s*Score\s*=\s*\d+\sbits\s\((\d+).*=\s(\S+)$/)) { >>> my $tmpScore = $1; >>> my $tmpValue = $2; >>> >>> ##need to add here if $haveQStart != 0; >>> if ($haveQStart) { >>> ##have the complete HSP... >>> ##want to add only if meets minimum reqs....do this on an HSP >>> basis... >>> print STDERR "Adding Sbjct (score) ",$sbjct->getID()," >>> SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, >>> Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1; >>> if ($matchLength >= $minLength && $matchPercent >= >>> $minPercent && $pValue <= $minPvalue && $sbjct) { >>> print STDERR "Adding Sbjct\n" if $debug; >>> if($remMaskedFromLen){ ##want to remove Xs from the match >>> length.... >>> # print STDERR "removing X from match\n"; >>> # print STDERR "Before: $queryMatch\n"; >>> $queryMatch =~ s/X//g; >>> # print STDERR "After: $queryMatch\n"; >>> if(length($queryMatch) < 10){ ##don't remove if final >>> length < 10 >>> print STDERR "RemoveMaskedResiduesFromLength: >>> error...length = ".length($queryMatch)." following removal of X's\n"; >>> }else{ >>> # print "match before $matchLength\n"; >>> $matchLength = length($queryMatch); >>> # print "match after $matchLength\n"; >>> } >>> } >>> >>> $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, >>> $pValue, $dir, $frame, $score,$identities,$positives); } >>> $haveQStart = 0; ##reset for next one... 
>>> $haveSStart = 0; >>> } >>> $pValue = $tmpValue; >>> $score = $tmpScore; >>> } >>> ##get the strand in the Identities line as the following messes >>> things up!!! >>> ##get the strand...will work for all blast types... >>> # if (/^\s*(Minus|Plus)\sStrand/){ >>> # $strand = $1; >>> # } >>> ##following specific to blastx >>> if ($parseType && >>> /^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Frame\s=\s.*(.\d)$/) >>> { >>> $identities = $1; $matchLength = $2; $matchPercent = $3; >>> $positives = $4; $frame = $5; >>> $dir = $frame =~ /^\+/ ? 1 : 0; >>> ##following specific to blastn >>> }elsif >>> (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Strand\s=\s(\w+)/) >>> { >>> $identities = $1; $matchLength = $2; $matchPercent = $3; >>> $positives = $4; $strand = $5; >>> $dir = $strand eq 'Minus' ? 0 : 1; >>> ##following for blastp the default if others not matched... >>> }elsif >>> (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+)/){ >>> $identities = $1; $matchLength = $2; $matchPercent = $3; >>> $positives = $4; >>> $dir = 1; ##always 1 for blastp which is what this catches... >>> }elsif (/^\s*Frame\s=\s(.\d)/){ ##frame for rpsblast >>> $frame = $1; >>> $dir = $frame =~ /^\+/ ? 
1 : 0; >>> } >>> >>> if (/^Query:\s+(\d+)\s(.*)\s(\d+)\s*$/) { >>> print STDERR "Matching query: $1 $3\n" if $debug; >>> if ($haveQStart == 0) { >>> $queryMatch = ""; >>> $qStart = $1; $haveQStart = 1; >>> } >>> $qEnd = $3; >>> my $tmpQMatch = $2; >>> $tmpQMatch =~ s/\s//g; ##gets rid of any spaces >>> $queryMatch .= $tmpQMatch; >>> } elsif (/^Sbjct:\s+(\d+)\s.*\s(\d+)\s*$/) { >>> if ($haveSStart == 0) { >>> $sStart = $1; $haveSStart = 1; >>> } >>> $sEnd = $2; >>> } >>> } >>> >>> # Storing Hits/HSPs for the last query Id >>> if (defined $self->{queryName}) { >>> my $countSubjects = $self->{countSubjects}; >>> my $subjects = $self->{subjects}; >>> my %hits = ( >>> 'countSubjects' => $countSubjects, >>> 'subjects' => $subjects >>> ); >>> my $hits_by_queryId = $self->{hits_by_queryId}; >>> $hits_by_queryId->{$self->{queryName}} = \%hits; >>> } >>> >>> return; >>> } >>> sub setQueryLength { >>> my($self,$l) = @_; >>> $self->{queryLength} = $l; >>> } >>> >>> sub getQueryLength { >>> my($self) = @_; >>> return $self->{queryLength}; >>> } >>> >>> sub sortSubjectsByMinQueryStart{ >>> my $self = shift; >>> if (!exists $self->{"sortByQStart"}) { >>> @{$self->{"sortByQStart"}} = sort{$a->getMinQueryStart() <=> >>> $b->getMinQueryStart()} $self->getSubjects(); >>> } >>> return @{$self->{"sortByQStart"}}; >>> } >>> >>> sub sortSubjectsByMaxQueryEnd{ >>> my $self = shift; >>> if (!exists $self->{"sortByQEnd"}) { >>> @{$self->{"sortByQEnd"}} = sort{$a->getMaxQueryEnd() <=> >>> $b->getMaxQueryEnd()} $self->getSubjects(); >>> } >>> return @{$self->{"sortByQEnd"}}; >>> } >>> >>> ##detecting chimeras >>> ############################################################ >>> ## Breakpoint = 0 * OK >>> ## Breakpoint = 1 > [Ch, AS, OK, !] 
>>> ## Dangling Ends = none > [Ch, AS, OK] ## Spanned = 1 >>> * AS, OK >>> ## Spanned = 0 > [Ch, AS, OK] >>> ## Deep = 1 * Ch >>> ## Deep = 0 * Ch, AS, OK (review) >>> ## Dangling Ends = uni > [Ch, As] >>> ## Spanned = 1 > [Ch, AS] >>> ## QvsSub = 1 * Ch >>> ## QvsSub = 0 * AS >>> ## Spanned = 0 > [Ch, AS] >>> ## Deep = 1 * Ch >>> ## Deep = 0 * Ch, AS (review) >>> ## Dangling Ends = bi > [Ch, !] >>> ## Spanned = 1 > [Ch, !] >>> ## QvsSub = 1 * Ch >>> ## QvsSub = 0 * ! (review) ##not sure about this >>> one....should it be OK >>> ## Spanned = 0 * Ch >>> ############################################################ >>> ##return values: 0=OK, 1=chimera, 2=review no dangling, 3=review 1 >>> dangling, 4=review 2 dangling ends. >>> ##add to ichimera obj... >>> sub detectChimera{ >>> my($self,$slop) = @_; ##$slop is the amount of >>> slop in the breakpoint de to blast >>> ##loop through each possible chimera breakpoint and test if chimera >>> my $qvssub = 1; >>> my @span; >>> foreach my $c ($self->findChimeraBreakpoints($slop)) { >>> @span = >>> $self->findSubjectsSpanningChimeraBreakpoint($c->getBreakpoint(),$slop); >>> >>> $c->{"numberSpanned"} = scalar(@span); >>> >>> ##first dangling == 0 >>> if ($c->getDangling($slop) == 0) { ##dangle by at least $slop >>> if (scalar(@span) > 0) { >>> $c->{"isChimera"} = 0; ##spans and no dangling ends >>> therefore OK >>> } else { >>> if ($c->getDepth() == 1) { >>> $c->{"isChimera"} = 1; >>> } else { >>> $c->{"isChimera"} = 2; ##will requiere review >>> } >>> } >>> ##dangling on only one end... >>> } elsif ($c->getDangling($slop) == 1) { >>> if (scalar(@span) > 0) { >>> $qvssub = 1; ##are very >>> similar....looking for one that isn't... 
>>> foreach my $sp (@span) { >>> if ($sp->compareQueryVsSubject() == 0) { >>> ##good subject spanning....looks lke should be >>> alternative splicing >>> $c->{"isChimera"} = 0; >>> $qvssub = 0; >>> } >>> } >>> $c->{"isChimera"} = 1 if $qvssub == 1; } else { >>> if ($c->getDepth() == 1) { >>> $c->{"isChimera"} = 1; >>> } else { #will need to review.... >>> $c->{"isChimera"} = 3; >>> } >>> } >>> ##dangling on two ends... >>> } elsif ($c->getDangling($slop) == 2) { >>> if (scalar(@span) > 0) { >>> $qvssub = 1; ##are very >>> similar....looking for one that isn't... >>> foreach my $sp (@span) { >>> if ($sp->compareQueryVsSubject() == 0) { >>> ##unclear here....need to review: >>> $c->{"isChimera"} = 4; >>> $qvssub = 0; >>> ## last; >>> } >>> } >>> $c->{"isChimera"} = 1 if $qvssub == 1; } else { >>> $c->{"isChimera"} = 1; ##is a chimera >>> } >>> } else { print "ERROR: getDangling returned invalid >>> value\n"; >>> } >>> } >>> ##now need to go through each of chimeras and output results... >>> my $maxChim = 0; >>> my $isChimera = 0; >>> my $errString = ""; >>> if (exists $self->{"chimeras"}) { >>> ## $errString = "Chimeras for $self->{'queryName'}\n"; >>> my $cnt = 1; >>> foreach my $chim (@{$self->{"chimeras"}}) { >>> last if $chim eq ""; ##this is just a holder that >>> indicates that the findChimera >>> ##method has been run and there are not chimeras... 
>>> $maxChim = $chim->{'isChimera'} if $chim->{'isChimera'} > >>> $maxChim; >>> $isChimera = 1 if $chim->{'isChimera'} == 1; >>> $errString .= " Chimera $cnt: $chim->{'isChimera'}, >>> danglingEnds=".$chim->getDangling($slop).", >>> numberSpanned=$chim->{'numberSpanned'}, depth=".$chim->getDepth().", >>> numberLeft=".$chim->countLeft().", numRight=".$chim->countRight().", >>> breakpoint=".$chim->getBreakpoint()."\n"; >>> # $errString .= $chim->toString(); >>> $cnt++; >>> } >>> } >>> return ($isChimera,$maxChim,$errString); >>> } >>> >>> sub getChimeras { >>> my($self,$slop) = @_; >>> if(!exists $self->{'chimeras'}){ >>> $self->detectChimera($slop); >>> } >>> return @{$self->{'chimeras'}}; >>> } >>> >>> ## returns array of Chimera objs >>> sub findChimeraBreakpoints{ >>> my($self,$slop) = @_; >>> if ($self->{'findChimera'} != 1) { >>> my @l = $self->sortSubjectsByMaxQueryEnd(); >>> my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want >>> right one reversed >>> my $lindex = 0; >>> my $rindex = 0; >>> for (my $li = $lindex;$li < scalar(@l);$li++) { >>> last if $l[$li]->getMaxQueryEnd() - >>> $r[$rindex]->getMinQueryStart() > $slop; >>> for (my $ri = $rindex;$ri < scalar(@r);$ri++) { >>> last if $l[$li]->getMaxQueryEnd() - >>> $r[$ri]->getMinQueryStart() > $slop; >>> if (abs($l[$li]->getMaxQueryEnd() - >>> $r[$ri]->getMinQueryStart()) <= $slop) { >>> ##create new chimera object and pass arrays into extendBrkpt() >>> ($li,$ri) = $self->buildChimera($li,$ri,$slop); ##pass back >>> the index of last l and r >>> last; >>> } >>> } >>> } >>> $self->{'findChimera'} = 1; ##indicates that have done the >>> analysis >>> } >>> if (!exists $self->{'chimeras'}) { >>> return; >>> } >>> return @{$self->{"chimeras"}}; >>> } >>> >>> sub buildChimera{ >>> my($self,$lindex,$rindex,$slop) = @_; >>> my @l = $self->sortSubjectsByMaxQueryEnd(); >>> my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want right >>> one reversed >>> my $chim = CBIL::Bio::Blast::Chimera->new(); 
>>> ##now add the current and any others to the chimera >>> ## the first time in loop will add the current one to $chim >>> ##first do left >>> my $ledge = $l[$lindex]->getMaxQueryEnd(); >>> for (my $li = $lindex;$li < scalar(@l);$li++) { >>> if ($l[$li]->getMaxQueryEnd() - $ledge <= $slop) { >>> $chim->addLeftSubject($l[$li]); >>> } else { >>> $lindex = $li; >>> last; >>> } >>> $lindex = $li; >>> } >>> ##then do right >>> my $redge = $r[$rindex]->getMinQueryStart(); >>> for (my $ri = $rindex;$ri < scalar(@r);$ri++) { >>> if ($redge - $r[$ri]->getMinQueryStart() <= $slop) { >>> $chim->addRightSubject($r[$ri]); } else { >>> $rindex = $ri; >>> last; >>> } >>> $rindex = $ri; >>> } >>> push(@{$self->{"chimeras"}},$chim); >>> return($lindex,$rindex); >>> } >>> >>> ##returns array of subjects >>> ##$slop is amount by which it needs to extend beyond the brkpt. >>> sub findSubjectsSpanningChimeraBreakpoint{ >>> my($self,$loc,$slop) = @_; >>> my @span; >>> foreach my $s ($self->sortSubjectsByMinQueryStart()) { >>> last if $s->getMinQueryStart() > $loc - $slop; >>> if ($s->getMaxQueryEnd() > $loc + $slop) { >>> ## print "Spans $loc: ",$s->getMinQueryStart(), >>> "-",$s->getMaxQueryEnd(),"\n"; >>> push(@span,$s); >>> } >>> } >>> return @span; >>> } >>> >>> ######################################## >>> ## Detecting unblocked repeats... >>> ## If No stack - OK >>> ## ## If Stack >>> ## ## If No overhanging ends in query then OK (query is subsumed >>> by subjects) >>> ## -will block the query sequences and throw out sequences >>> that do not have >>> ## >50 bp of un-blocked sequence >>> ## ## If Overhanging ends in query (may be only one if repeat is >>> at end) >>> ## ## If 1 HSP then likely REPEAT => ignore all matches that >>> don't extend >>> ## outside this region. 
>>> ## ## Overhanging ends in subjects = 1 => >>> definitely REPEAT >>> ## ## Overhanging ends in subjects = 0 => ??very unlikely >>> scenario..not clear >>> ## ## If >1 HSP (could be multiple copies of same repeat or >>> perhaps alt. splicing >>> ## OR could it be a valid match that also has a repeat in >>> it...no should in >>> ## this case match entirely and have just 1 HSP) >>> ## ## If subject locations overlap/superimpose then REPEAT >>> ## =ignore all matches that do not extend outside these >>> locations >>> ## between them would be "outside" the locations >>> ## ## If subject locations distinct then likely >>> alternative splicing >>> ## OR alternatively the query and subject could contain >>> two distinct >>> ## different repeats (I will be blocking for repeats >>> so the likelihood >>> ## that there are two different repeats that are both >>> novel is small). >>> ## ## ?do I want to do further analysis to detemine >>> consistency with >>> ## alternative splicing model before ruling OK...? >>> ## ######################################## >>> >>> ##this method is the call to find repeats...should return an array >>> of subjects to >>> ##ignore... >>> >>> ## find 5' edge (minQueryStart) and gather up then check to see if >>> there is a ## correlation with 3' (maxQueryEnd) >>> ## edge...if so is potential repeat and need to check the above rules.. >>> >>> #what to return here....region to ignore when determining good >>> matches... >>> >>> sub detectRepeats { >>> my($self,$slop) = @_; >>> >>> if ($self->getSubjectCount() < 5 ) { ##need at least 5 to detecxt >>> repeat... >>> return; >>> } >>> >>> my @rep = $self->findRepeats($slop); ##array of repeats.... >>> >>> if ($debug == 1) { >>> print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n"; >>> foreach my $r (@rep) { >>> print STDERR $r->toString(1),"\n"; >>> } >>> } >>> >>> ##want to merge repeats if they are overlapping... 
>>> my @sort = sort{$a->getRepeatStart() <=> $b->getRepeatStart()} @rep; >>> undef @rep; ##use for merged >>> ones... >>> my $merge = 0; >>> my $nr; >>> for (my $i = 0;$i<scalar(@sort) - 1;$i++) { >>> if ($sort[$i+1]->getRepeatStart() < $sort[$i]->getRepeatEnd()) { >>> ##are overlapping...merge >>> if ($merge == 0) { >>> $nr = CBIL::Bio::Blast::Repeat->new(); >>> $merge = 1; >>> } >>> foreach my $s ($sort[$i]->getSubjects()) { >>> $nr->addSubject($s); >>> } >>> } elsif ($merge == 1) { >>> $merge = 0; >>> foreach my $s ($sort[$i]->getSubjects()) { >>> $nr->addSubject($s); >>> } >>> push(@rep,$nr); >>> } else { >>> ##does not overlap so just add to rep >>> push(@rep,$sort[$i]); >>> } >>> } >>> ##need to do last if $merge == 1; >>> if ($merge == 1) { ##need to add the i+1 >>> (last) case >>> foreach my $s ($sort[-1]->getSubjects()) { >>> $nr->addSubject($s); >>> } >>> push(@rep,$nr); >>> } >>> ##repeats merged and should be non-overlapping >>> >>> if ($debug == 1) { >>> print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n"; >>> foreach my $r (@rep) { >>> print STDERR $r->toString(1),"\n"; >>> } >>> } >>> return @rep; >>> } >>> >>> ##finds all repeat stacks.. >>> # populates array of subjects that are repeats ... >>> @{$self->{'repeats'}} >>> >>> sub findRepeats{ >>> my($self,$slop) = @_; >>> my $rep; >>> if (!exists $self->{'repeats'}) { >>> my $firstElement = 1; >>> my $inRepeat = 0; >>> my @s = $self->sortSubjectsByMinQueryStart(); >>> ###NOTE: want to only take repeat from a single point rather >>> than each minQuery start >>> my $lEdge = $s[0]->getMinQueryStart(); ##this is the first edge >>> for (my $i = 0;$i < scalar(@s) - 1;$i++) { if >>> ($s[$i+1]->getMinQueryStart() - $lEdge <= $slop) { >>> if ($firstElement == 1) { >>> $rep = CBIL::Bio::Blast::Repeat->new(); ##create new Repeat >>> object.. 
>>> $firstElement = 0; >>> $inRepeat = 1; >>> } >>> print STDERR "REPEAT: adding subject ",$s[$i]->getID(),"\n" >>> if $debug == 1; >>> $rep->addSubject($s[$i]); >>> if ($i == scalar(@s) - 2) { ##need to add next one as am >>> in last loop >>> $rep->addSubject($s[$i+1]); >>> my $verified = $rep->verifyRepeat($slop); ##does all the >>> verification that this is a repeat >>> push(@{$self->{'repeats'}},$rep) if $verified == 1; >>> } >>> } elsif ($inRepeat == 1) { ##need this to get the $i+1 >>> as am working one ahead >>> $lEdge = $s[$i]->getMinQueryStart(); ##new left edge >>> $firstElement = 1; ##now looking for the first >>> element >>> $inRepeat = 0; >>> ##add the last one nd then verify repeats....Repeat method.. >>> ##if really a repeat then add to @$self->{'repeats'} >>> $rep->addSubject($s[$i]); >>> print STDERR "Repeat contains ",$rep->countSubjects(), " >>> subjects\n" if $debug == 1; >>> $rep->verifyRepeat($slop); ##does all the verification that >>> this is a repeat >>> push(@{$self->{'repeats'}},$rep) if($rep->countSubjects() > 1 >>> && $rep->getVerified() == 1); >>> } else { >>> $lEdge = $s[$i]->getMinQueryStart(); ##new left edge >>> $firstElement = 1; ##now looking for the first >>> element >>> } >>> } >>> if (!exists $self->{'repeats'}) { ##there aren't any....make 0 >>> $self->{'repeats'}->[0] = ""; >>> } >>> } >>> if ($self->{'repeats'}->[0] eq "") { >>> return; >>> } >>> return @{$self->{'repeats'}}; >>> } >>> >>> ##NOTE: the repeat object is initially populated with things that >>> have a common >>> ## left edge. It is the job of the Repeat to then verify that this >>> constitutes >>> ## a repeat and remove any subjects that don't meet ehe rules for a >>> repeat stack above >>> #3will need a 'verified' bit to know if the repeat bj is >>> verified...else verify >>> ## before doing anyhing else. 
>>> >>> >>> sub getQuery5endMis{ >>> my $self = shift; >>> if (!exists $self->{"query5pMis"}) { >>> $self->{"query5pMis"} = 10000; ##arbritrarily large as want smallest >>> foreach my $s ($self->getSubjects()) { >>> if ($self->{"query5pMis"} > $s->getQuery5endMis()) { >>> $self->{"query5pMis"} = $s->getQuery5endMis(); >>> } } >>> } >>> return $self->{"query5pMis"}; >>> } >>> >>> sub getQuery3endMis{ >>> my $self = shift; >>> if (!exists $self->{"query3pMis"}) { >>> $self->{"query3pMis"} = 10000; ##arbritrarily large as want smallest >>> foreach my $s ($self->getSubjects()) { >>> if ($self->{"query3pMis"} > $s->getQuery3endMis()) { >>> $self->{"query3pMis"} = $s->getQuery3endMis(); >>> } } >>> } >>> return $self->{"query3pMis"}; >>> } >>> >>> ##returns 0 if leftover at ends and 1 if covered to ends >>> sub getCoveredBySubjects{ >>> my $self = shift; >>> my $left = shift; ##amount to be >>> leftover.... >>> if (! $left) { $left = 20; } >>> if ($self->getQuery5endMis() <= $left && $self->getQuery3endMis() >>> <= $left) { >>> return 1; >>> } >>> return 0; >>> } >>> >>> sub analyzeSubjectsBySubject{ >>> my $self = shift; >>> return if $self->getSubjectCount() == 0; >>> my @singletons; >>> my %multiples; >>> my $haveImmuno = 0; ##keeps track if is >>> immunoglobulin gene... >>> my $order = 0; ##order >>> my $orientation = 0; ##orientation >>> my $subOver = 0; ##number of subject >>> overlaps >>> my $subGaps = 0; ##number of subject Gaps >>> my @multiples; >>> foreach my $s ($self->getSubjects()) { >>> my $id = $s->getID(); >>> # if($id <= 100009){ >>> # $haveImmuno = 1; >>> # next; ##don't want to include these in the analysis >>> # } >>> if ($s->countHSPs() == 1) { ##don't print out singletons HSPS >>> push(@singletons,$id); >>> next; >>> } >>> push(@multiples,$id); >>> print "\n\>DT.",$id,", Length=",$s->getLength(),", Num >>> HSPs=",$s->countHSPs(),($s->checkConsistency() ? 
" Order and Orient >>> Consistent" : " Order or Orient InConsistent"),"\n" if $debug == 1; >>> ($subOver,$subGaps) = $s->checkSubjectGaps(); >>> print "\tSubject Overlaps = $subOver, SubjectGaps = $subGaps\n" >>> if $debug == 1; >>> $order = $s->checkOrderConsistency(); >>> $orientation = $s->checkOrientationConsistency(); >>> print STDERR $s->toString() if $order + $orientation != 2; >>> $multiples{$id} = >>> [$s->countHSPs(),$s->countContainedQueryHSPs(),$order,$orientation,$subOver >>> == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0]; ##this is the data for multiples! >>> } >>> print "\nSingletons: ",scalar(@singletons),"\n" if $debug == 1; >>> >>> # if($haveImmuno == 1){print "MATCHES immunoglobulin genes\n";} >>> return(\@singletons,\%multiples,$haveImmuno); >>> } >>> >>> sub getStats{ >>> my $self = shift; >>> my $minNum = shift; ##minimum number of HSPs >>> my $ret; >>> # $minNum eq "" ? $minNum = 1 : $minNum; >>> print "QueryName = $self->{queryName}, QueryLength = >>> $self->{queryLength}\n"; >>> if ($self->{"countSubjects"} == 0) { >>> $ret = "\tNo Hits\n"; >>> return $ret; >>> } >>> foreach my $s ($self->getSubjects()) { >>> next if $s->countHSPs() < $minNum; ##don't print out singletons HSPS >>> $ret .= $s->toString(); >>> } >>> return $ret; >>> } >>> >>> sub printStats { >>> my($self,$minNum) = @_; >>> print $self->getStats($minNum); >>> } >>> >>> sub addSubject{ >>> my($self,$sbjct) = @_; >>> push(@{$self->{subjects}},$sbjct); >>> $self->{"countSubjects"}++; >>> } >>> >>> sub getSubjectCount{ >>> my $self = shift; >>> # return $self->{"countSubjects"}; >>> >>> my $queryId = shift; >>> my $hits_by_queryId = $self->{hits_by_queryId}; >>> my $hits = $hits_by_queryId->{$queryId}; >>> return $hits->{countSubjects}; >>> } >>> >>> sub getSubjects{ >>> my $self = shift; >>> # return if $self->{"countSubjects"} == 0; >>> # return @{$self->{subjects}}; >>> >>> my $queryId = shift; >>> my $hits_by_queryId = $self->{hits_by_queryId}; >>> my $hits = 
$hits_by_queryId->{$queryId}; >>> my $count = $hits->{countSubjects}; >>> my $subjects = $hits->{subjects}; >>> >>> return if ($count == 0); return @$subjects; >>> } >>> >>> sub getQueryID{ >>> my $self = shift; # return $self->{"queryName"}; >>> >>> my $index = $self->{queryIndex}; >>> my $hits_by_queryId = $self->{hits_by_queryId}; >>> my @queryIds = keys (%$hits_by_queryId); >>> if ($index < @queryIds) { >>> $self->{queryIndex} = $index + 1; >>> my $queryId = $queryIds[$index]; >>> return $queryId; >>> } >>> else { >>> return undef; >>> } } >>> >>> sub printThreshHoldHits { >>> my $self = shift; >>> return if $self->{"countSubjects"} == 0; >>> foreach my $s ($self->getSubjects()) { >>> print $self->getQueryID(), "\t",$s->getID(),"\n"; >>> } >>> } >>> >>> sub printGeneModels{ >>> my $self = shift; >>> my @cgm; >>> print "QueryName = $self->{queryName}, QueryLength = >>> $self->{queryLength}\n"; >>> if ($self->{"countSubjects"} == 0) { >>> print "\tNo Hits\n"; >>> return; >>> } >>> foreach my $s ($self->getSubjects()) { >>> print "\n",$s->toString(); >>> @cgm = $s->getConsistentGeneModel(); >>> print " CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : >>> "No"), ", ",scalar(@cgm), " HSPs\n"; >>> foreach my $h (@cgm) { >>> print " ",$h->toString(); >>> } >>> } >>> } >>> >>> sub compareGeneModelWithTotCons{ >>> my $self = shift; >>> # print "\nQueryName = $self->{queryName}, QueryLength = >>> $self->{queryLength}\n"; >>> my $cgmgt1 = 0; >>> my $corCGM = 0; >>> my $totCor = 0; >>> my $totSubs = 0; >>> my @vals; >>> foreach my $s ($self->getSubjects()) { >>> next if $s->countHSPs() < 2; ##don't do singletons >>> $totSubs++; >>> # print "\n",$s->toString(); >>> # print $s->getID(),": ",$s->countHSPs()," HSPs\n"; >>> # print " CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? 
>>> "Yes" : "No"), ", ",scalar($s->getConsistentGeneModel()), " HSPs\n"; >>> my($subOver,$subGaps) = $s->checkSubjectGaps(); >>> @vals = >>> ($s->checkOrderConsistency(),$s->checkOrientationConsistency(),$subOver >>> == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0); >>> # print " Old Analysis: Order=$vals[0], Orient=$vals[1], >>> Over=$vals[2], Gaps=$vals[3]\n"; >>> $totCor++ if $vals[0]+$vals[1]+$vals[2]+$vals[3] == 4; >>> $corCGM += $s->checkGoodCGM(); >>> $cgmgt1++ if scalar($s->getConsistentGeneModel()) > 1; >>> } >>> return($totSubs,$corCGM,$totCor,$cgmgt1); >>> } >>> >>> >>> 1; >>> >>> |
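The change discussed in this thread — making parseBlast handle reports with more than one query by filing finished subjects under their query id in a `hits_by_queryId` hash — can be sketched in Python (a simplified illustration, not the CBIL module; function and variable names here are hypothetical):

```python
# Sketch of the per-query bookkeeping the modified parseBlast performs:
# instead of one flat subject list, completed hits are grouped under the
# query id found on each "Query=" line, so a single report containing
# many query sections can be parsed in one pass.
import re

def group_hits_by_query(report_lines):
    """Return {query_id: [subject_id, ...]} from simplified BLAST-style text."""
    hits_by_query = {}
    current_query = None
    for line in report_lines:
        m = re.match(r'^Query=\s*(\S+)', line)
        if m:
            current_query = m.group(1)          # start a new query section
            hits_by_query.setdefault(current_query, [])
            continue
        m = re.match(r'^>(\S+)', line)          # subject definition line
        if m and current_query is not None:
            hits_by_query[current_query].append(m.group(1))
    return hits_by_query

report = [
    "Query= q1",
    ">subjA",
    ">subjB",
    "Query= q2",
    ">subjC",
]
print(group_hits_by_query(report))  # → {'q1': ['subjA', 'subjB'], 'q2': ['subjC']}
```

This mirrors the Perl above, where the subjects accumulated for the previous query are stashed in `$self->{hits_by_queryId}` each time a new `Query=` line (or end of input) is seen, and `getSubjects`/`getSubjectCount` then take a query id argument.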
From: Arnaud K. <ax...@sa...> - 2004-10-08 13:41:02
|
Steve Fischer wrote: > Arnaud- > > ok, i've looked at LoadBlastSimFast.pm. I see the addition of the > logic to use name to get the query object (and i know that we > discussed this in mail back in august). > > I am having some second thoughts about that change as it stands. > > The original intent of the plugin was that the sequences submitted to > the blast process have been extracted from the database and therefore > have the primary key in their definition line. > > I think I understand that it would be useful to be able to skip that > step, ie, blast sequences using their native identifiers, and then > have the plugin discover what their internal primary key is. That's > what you want to do right? > that's right. > Does anybody know of any reason why that would not be ok? > > Assuming that nobody has any objections, maybe the best solution would > be to improve the plugin to take optional arguments that specify the > name of the query and/or subject identifier attributes? For > example: -queryIdAttr source_id. This would give us full > flexibility (and also avoid the slightly risky assumption that a > digits-only identifier must be the primary key) > that sounds sensible. > steve > > Arnaud Kerhornou wrote: > >> >> Steve Fischer wrote: >> >>> Arnaud- >>> >>> see below. >>> >>> steve >>> >>> Arnaud Kerhornou wrote: >>> >>>> Hi everyone >>>> >>>> To be able to reproduce the OrthoMCL method, I would like to raise >>>> two issues we've got: >>>> >>>> * The first issue relates to the view where the protein sequences >>>> are stored. I was thinking of using the TranslatedAASequence view as >>>> this one contains the translated sequences of our gene models. The >>>> problem I have is that it is missing a name attribute so I cannot >>>> match the blast output query and subject names with the data in >>>> GUS (I didn't want to use the TranslatedAASequence primary keys as >>>> the identifiers of my proteins of interest). 
>>>> Could we add a name attribute to this view? >>> >>> >>> >>> hmm. not quite following. what would this name be, where would >>> it be derived from? >> >> >> By default we assign the systematic id of the corresponding CDS to >> the protein name. >> >>> why not use source_id and/or secondary_identifier? >> >> >> We could do that, but in any case that would involve modifying the >> code of the plugin that loads BLAST output (LoadBlastSimFast.pm) to get >> the sequence entries. At the moment the match is made on the primary >> key (which I want to avoid) or the name attribute. The source_id >> attribute would do instead of the name attribute. It must work for >> any blast (DNA vs DNA or Protein vs Protein) with the various >> potential GUS sequence objects we want to attach similarity data to. >> As far as I can see the source_id attribute is present in all of them >> (AASequenceImp and NASequenceImp tables). >> >>> or, presumably this translated sequence has a relationship back to >>> its na sequence (although i don't immediately see that in the schema >>> browser), so couldn't you get a name or source_id from there? >>> >> That would require a more sophisticated query to get the sequence entry. >> >>>> >>>> * The second issue relates to the BLAST output parsing, done by a >>>> module called BlastAnal.pm in the CBIL package. This module seems >>>> to parse BLAST output files with only one query sequence. I have >>>> more than one query sequence reported so I had to change the code >>>> of this module to allow more than one query sequence. Can my code >>>> be integrated into the CBIL package? Note that I didn't change the >>>> interface of this module so it doesn't affect the scripts that are >>>> using it, I'm thinking in particular of >>>> parseBlastFilesForSimilarity.pl 
>>> >> That's fine, the module is attached. >> >>>> >>>> cheers >>>> Arnaud >>>> >>>> >>> >>------------------------------------------------------------------------ >> >>#!/usr/bin/perl >> >>## perl module for dealing with blast results... >>## will parse both --noseqs and normal blast. >>## does not parse out the alignments, rather meta information for each subject and hsp >> >>## Brian Brunk 5/17/99 >> >>package CBIL::Bio::Blast::BlastAnal; >> >>use CBIL::Bio::Blast::Subject; >>use CBIL::Bio::Blast::HSP; >>use CBIL::Bio::Blast::Chimera; >>use CBIL::Bio::Blast::Repeat; >> >>use strict; >> >>my $debug = 0; >> >>##simple constructor...need to make more robust if inheriting for other purposes >>sub new { >> my $self = {}; >> my $class = shift; >> my $d = shift; >> bless $self,$class; >> $debug = defined $d ? $d : 0; >> print STDERR "DEBUGGING ON\n" if $debug == 1; >> $self->{"countSubjects"} = 0; >> >> # Added a hashtable storing the hits for each query sequence >> my %tmp; >> $self->{hits_by_queryId} = \%tmp; >> $self->{queryIndex} = 0; >> >> return $self; >>} >> >> >> >> >>##DataStructure returned by parseBlast...must have -noseqs on commandline >>##new datastructure.... >>##{queryLength},{queryName} >>##{subjects} = array of Subject objects... >> >>## takes in a blast result as an array and builds the entire datastructure >>sub parseBlast{ >> my $self = shift; >> my($minLength,$minPercent,$minPvalue,$regex,$blast,$remMaskedFromLen,$rpsblast) = @_; >> my %blast; >> my ($strand,$dir,$qStart,$qEnd,$sStart,$sEnd,$pValue,$matchLength,$matchPercent); >> my $sbjct; >> my $queryMatch; >> my $haveQStart = 0; >> my $haveSStart = 0; >> my $parseType; >> my $frame; >> my $score; >> my $identities; >> my $positives; >> >> die 'these are the args to parseBlastNoSeqs($minLength,$minPercent,$minPvalue,$regex,$blast)\n' if scalar(@{$blast}) <= 10; >> >> foreach (@{$blast}) { >> # print STDERR $_; >> if (/^(\S*BLAST\S+)/){ ##gets the algorighm used for the query... 
>> my $algo = $1; >> print STDERR "algorithm = $algo\n" if $debug; >> if($algo =~ /BLASTX/i){ >> $parseType = 1; >> }else{ >> $parseType = 0; >> } >> print STDERR "ParseType='$parseType'\n" if $debug; >> } >> if (/^Query=\s*(\S+)/) { >> >> # Storing Hits/HSPs for previous query Id >> >> if (defined $self->{queryName}) { >> my $countSubjects = $self->{countSubjects}; >> my $subjects = $self->{subjects}; >> my %hits = ( >> 'countSubjects' => $countSubjects, >> 'subjects' => $subjects >> ); >> >> my $hits_by_queryId = $self->{hits_by_queryId}; >> $hits_by_queryId->{$self->{queryName}} = \%hits; >> } >> >> $self->{queryName} = $1; ##query name/id >> >> # Reinitialisation >> $self->{"countSubjects"} = 0 ; >> my @tmp = (); >> $self->{subjects} = \@tmp; >> >> } elsif (/^\s+\((\S+)\sletters\)/) { >> $self->{queryLength} = $1; ##query length >> $self->{queryLength} =~ s/,//g; >> } >> if (/^\>$regex/ || /ctxfactor/ || ($rpsblast && /^Lambda/)) { ##ctxfactor catches the last one...Lambda last of rpsblast >> my $sbjctId; >> if (/^\>$regex/){ >> $sbjctId = $1 ? $1 : $2; >> }else{ >> print STDERR "$self->{queryName}: Unable to match subject using regex '$regex' on:\n $_" unless (/ctxfactor/ || /^Lambda/); >> } >> print STDERR "Matching subjectId = $sbjctId\n" if $debug; >> ## print STDERR "REGEX matched $_"; >> if ($haveQStart) { >> print STDERR "Have last QStart:", $sbjct->getID()," SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1; >> ##have the complete HSP... >> ##want to add only if meets minimum reqs....do this on an HSP basis... >> # print $sbjct->getID()," $sStart,$sEnd: Length=$matchLength, Percent=$matchPercent, pValue=$pValue\n"; >> if ($matchLength >= $minLength && $matchPercent >= $minPercent && $pValue <= $minPvalue) { >> print STDERR "Match meets reqs\n" if $debug; >> if($remMaskedFromLen){ ##want to remove Xs from the match length.... 
>># print STDERR "removing X from match\n"; >># print STDERR "Before: $queryMatch\n"; >> $queryMatch =~ s/X//g; >># print STDERR "After: $queryMatch\n"; >> if(length($queryMatch) < 10){ ##don't remove if final length < 10 >> print STDERR "RemoveMaskedResiduesFromLength: error...length = ".length($queryMatch)." following removal of X's\n"; >> }else{ >># print "match before $matchLength\n"; >> $matchLength = length($queryMatch); >># print "match after $matchLength\n"; >> } >> } >> $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, $pValue, $dir, $frame, $score,$identities,$positives) if $sbjct; >> } >> $haveQStart = 0; ##reset for next one... >> $haveSStart = 0; >> } >> if (defined $sbjct && $sbjct->countHSPs() >= 1 && $sbjct->getID() ne $self->{"queryName"}) { >> $self->addSubject($sbjct); >> } >> ##return heree if end... >> return if (/ctxfactor/ || /^Lambda/); >> >> ## print STDERR "New Subject $sbjctId\n"; >> $sbjct = CBIL::Bio::Blast::Subject->new($sbjctId) if $sbjctId; >> ## $sbjct->setID($sbjctId); >> >> } >> if (/^\s+Length\s=\s(\S+)/) { >> if($sbjct){ >> my $sbjctLength = $1; >> $sbjctLength =~ s/,//g; >> $sbjct->setLength($sbjctLength); >> $sbjct->setQueryLength($self->{"queryLength"}); ##set the query length for use in >> } >> ##end remaining calculations >> } >> if (/^\s*Score\s=\s+(\d+).*=\s(\S+)$/ || ($rpsblast && /^\s*Score\s*=\s*\d+\sbits\s\((\d+).*=\s(\S+)$/)) { >> my $tmpScore = $1; >> my $tmpValue = $2; >> >> ##need to add here if $haveQStart != 0; >> if ($haveQStart) { >> ##have the complete HSP... >> ##want to add only if meets minimum reqs....do this on an HSP basis... 
>> print STDERR "Adding Sbjct (score) ",$sbjct->getID()," SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1; >> if ($matchLength >= $minLength && $matchPercent >= $minPercent && $pValue <= $minPvalue && $sbjct) { >> print STDERR "Adding Sbjct\n" if $debug; >> if($remMaskedFromLen){ ##want to remove Xs from the match length.... >># print STDERR "removing X from match\n"; >># print STDERR "Before: $queryMatch\n"; >> $queryMatch =~ s/X//g; >># print STDERR "After: $queryMatch\n"; >> if(length($queryMatch) < 10){ ##don't remove if final length < 10 >> print STDERR "RemoveMaskedResiduesFromLength: error...length = ".length($queryMatch)." following removal of X's\n"; >> }else{ >># print "match before $matchLength\n"; >> $matchLength = length($queryMatch); >># print "match after $matchLength\n"; >> } >> } >> $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, $pValue, $dir, $frame, $score,$identities,$positives); >> } >> $haveQStart = 0; ##reset for next one... >> $haveSStart = 0; >> } >> $pValue = $tmpValue; >> $score = $tmpScore; >> } >> ##get the strand in the Identities line as the following messes things up!!! >> ##get the strand...will work for all blast types... >># if (/^\s*(Minus|Plus)\sStrand/){ >># $strand = $1; >># } >> ##following specific to blastx >> if ($parseType && /^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Frame\s=\s.*(.\d)$/) { >> $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; $frame = $5; >> $dir = $frame =~ /^\+/ ? 1 : 0; >> ##following specific to blastn >> }elsif (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Strand\s=\s(\w+)/) { >> $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; $strand = $5; >> $dir = $strand eq 'Minus' ? 0 : 1; >> ##following for blastp the default if others not matched... 
>> }elsif (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+)/){ >> $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; >> $dir = 1; ##always 1 for blastp which is what this catches... >> }elsif (/^\s*Frame\s=\s(.\d)/){ ##frame for rpsblast >> $frame = $1; >> $dir = $frame =~ /^\+/ ? 1 : 0; >> } >> >> if (/^Query:\s+(\d+)\s(.*)\s(\d+)\s*$/) { >> print STDERR "Matching query: $1 $3\n" if $debug; >> if ($haveQStart == 0) { >> $queryMatch = ""; >> $qStart = $1; >> $haveQStart = 1; >> } >> $qEnd = $3; >> my $tmpQMatch = $2; >> $tmpQMatch =~ s/\s//g; ##gets rid of any spaces >> $queryMatch .= $tmpQMatch; >> } elsif (/^Sbjct:\s+(\d+)\s.*\s(\d+)\s*$/) { >> if ($haveSStart == 0) { >> $sStart = $1; >> $haveSStart = 1; >> } >> $sEnd = $2; >> } >> } >> >> # Storing Hits/HSPs for the last query Id >> >> if (defined $self->{queryName}) { >> my $countSubjects = $self->{countSubjects}; >> my $subjects = $self->{subjects}; >> my %hits = ( >> 'countSubjects' => $countSubjects, >> 'subjects' => $subjects >> ); >> my $hits_by_queryId = $self->{hits_by_queryId}; >> $hits_by_queryId->{$self->{queryName}} = \%hits; >> } >> >> return; >>} >> >>sub setQueryLength { >> my($self,$l) = @_; >> $self->{queryLength} = $l; >>} >> >>sub getQueryLength { >> my($self) = @_; >> return $self->{queryLength}; >>} >> >>sub sortSubjectsByMinQueryStart{ >> my $self = shift; >> if (!exists $self->{"sortByQStart"}) { >> @{$self->{"sortByQStart"}} = sort{$a->getMinQueryStart() <=> $b->getMinQueryStart()} $self->getSubjects(); >> } >> return @{$self->{"sortByQStart"}}; >>} >> >>sub sortSubjectsByMaxQueryEnd{ >> my $self = shift; >> if (!exists $self->{"sortByQEnd"}) { >> @{$self->{"sortByQEnd"}} = sort{$a->getMaxQueryEnd() <=> $b->getMaxQueryEnd()} $self->getSubjects(); >> } >> return @{$self->{"sortByQEnd"}}; >>} >> >>##detecting chimeras >>############################################################ >>## Breakpoint = 0 * OK >>## Breakpoint = 1 > [Ch, AS, OK, !] 
>>## Dangling Ends = none > [Ch, AS, OK] >>## Spanned = 1 * AS, OK >>## Spanned = 0 > [Ch, AS, OK] >>## Deep = 1 * Ch >>## Deep = 0 * Ch, AS, OK (review) >>## Dangling Ends = uni > [Ch, As] >>## Spanned = 1 > [Ch, AS] >>## QvsSub = 1 * Ch >>## QvsSub = 0 * AS >>## Spanned = 0 > [Ch, AS] >>## Deep = 1 * Ch >>## Deep = 0 * Ch, AS (review) >>## Dangling Ends = bi > [Ch, !] >>## Spanned = 1 > [Ch, !] >>## QvsSub = 1 * Ch >>## QvsSub = 0 * ! (review) ##not sure about this one....should it be OK >>## Spanned = 0 * Ch >>############################################################ >> >>##return values: 0=OK, 1=chimera, 2=review no dangling, 3=review 1 dangling, 4=review 2 dangling ends. >>##add to ichimera obj... >>sub detectChimera{ >> my($self,$slop) = @_; ##$slop is the amount of slop in the breakpoint de to blast >> ##loop through each possible chimera breakpoint and test if chimera >> my $qvssub = 1; >> my @span; >> foreach my $c ($self->findChimeraBreakpoints($slop)) { >> >> @span = $self->findSubjectsSpanningChimeraBreakpoint($c->getBreakpoint(),$slop); >> $c->{"numberSpanned"} = scalar(@span); >> >> ##first dangling == 0 >> if ($c->getDangling($slop) == 0) { ##dangle by at least $slop >> if (scalar(@span) > 0) { >> $c->{"isChimera"} = 0; ##spans and no dangling ends therefore OK >> } else { >> if ($c->getDepth() == 1) { >> $c->{"isChimera"} = 1; >> } else { >> $c->{"isChimera"} = 2; ##will requiere review >> } >> } >> >> ##dangling on only one end... >> } elsif ($c->getDangling($slop) == 1) { >> if (scalar(@span) > 0) { >> $qvssub = 1; ##are very similar....looking for one that isn't... >> foreach my $sp (@span) { >> if ($sp->compareQueryVsSubject() == 0) { >> ##good subject spanning....looks lke should be alternative splicing >> $c->{"isChimera"} = 0; >> $qvssub = 0; >> } >> } >> $c->{"isChimera"} = 1 if $qvssub == 1; >> } else { >> if ($c->getDepth() == 1) { >> $c->{"isChimera"} = 1; >> } else { #will need to review.... 
>> $c->{"isChimera"} = 3; >> } >> } >> >> ##dangling on two ends... >> } elsif ($c->getDangling($slop) == 2) { >> >> if (scalar(@span) > 0) { >> $qvssub = 1; ##are very similar....looking for one that isn't... >> foreach my $sp (@span) { >> if ($sp->compareQueryVsSubject() == 0) { >> ##unclear here....need to review: >> $c->{"isChimera"} = 4; >> $qvssub = 0; >> ## last; >> } >> } >> $c->{"isChimera"} = 1 if $qvssub == 1; >> } else { >> $c->{"isChimera"} = 1; ##is a chimera >> } >> >> } else { >> print "ERROR: getDangling returned invalid value\n"; >> } >> } >> ##now need to go through each of chimeras and output results... >> my $maxChim = 0; >> my $isChimera = 0; >> my $errString = ""; >> if (exists $self->{"chimeras"}) { >> ## $errString = "Chimeras for $self->{'queryName'}\n"; >> my $cnt = 1; >> foreach my $chim (@{$self->{"chimeras"}}) { >> last if $chim eq ""; ##this is just a holder that indicates that the findChimera >> ##method has been run and there are not chimeras... >> $maxChim = $chim->{'isChimera'} if $chim->{'isChimera'} > $maxChim; >> $isChimera = 1 if $chim->{'isChimera'} == 1; >> $errString .= " Chimera $cnt: $chim->{'isChimera'}, danglingEnds=".$chim->getDangling($slop).", numberSpanned=$chim->{'numberSpanned'}, depth=".$chim->getDepth().", numberLeft=".$chim->countLeft().", numRight=".$chim->countRight().", breakpoint=".$chim->getBreakpoint()."\n"; >> # $errString .= $chim->toString(); >> $cnt++; >> } >> } >> return ($isChimera,$maxChim,$errString); >>} >> >>sub getChimeras { >> my($self,$slop) = @_; >> if(!exists $self->{'chimeras'}){ >> $self->detectChimera($slop); >> } >> return @{$self->{'chimeras'}}; >>} >> >>## returns array of Chimera objs >>sub findChimeraBreakpoints{ >> my($self,$slop) = @_; >> if ($self->{'findChimera'} != 1) { >> my @l = $self->sortSubjectsByMaxQueryEnd(); >> my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want right one reversed >> my $lindex = 0; >> my $rindex = 0; >> for (my $li = $lindex;$li < 
scalar(@l);$li++) { >> last if $l[$li]->getMaxQueryEnd() - $r[$rindex]->getMinQueryStart() > $slop; >> for (my $ri = $rindex;$ri < scalar(@r);$ri++) { >> last if $l[$li]->getMaxQueryEnd() - $r[$ri]->getMinQueryStart() > $slop; >> if (abs($l[$li]->getMaxQueryEnd() - $r[$ri]->getMinQueryStart()) <= $slop) { >> ##create new chimera object and pass arrays into extendBrkpt() >> ($li,$ri) = $self->buildChimera($li,$ri,$slop); ##pass back the index of last l and r >> last; >> } >> } >> } >> $self->{'findChimera'} = 1; ##indicates that have done the analysis >> } >> if (!exists $self->{'chimeras'}) { >> return; >> } >> return @{$self->{"chimeras"}}; >>} >> >>sub buildChimera{ >> my($self,$lindex,$rindex,$slop) = @_; >> my @l = $self->sortSubjectsByMaxQueryEnd(); >> my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want right one reversed >> my $chim = CBIL::Bio::Blast::Chimera->new(); >> ##now add the current and any others to the chimera >> ## the first time in loop will add the current one to $chim >> ##first do left >> my $ledge = $l[$lindex]->getMaxQueryEnd(); >> for (my $li = $lindex;$li < scalar(@l);$li++) { >> if ($l[$li]->getMaxQueryEnd() - $ledge <= $slop) { >> $chim->addLeftSubject($l[$li]); >> } else { >> $lindex = $li; >> last; >> } >> $lindex = $li; >> } >> ##then do right >> my $redge = $r[$rindex]->getMinQueryStart(); >> for (my $ri = $rindex;$ri < scalar(@r);$ri++) { >> if ($redge - $r[$ri]->getMinQueryStart() <= $slop) { >> $chim->addRightSubject($r[$ri]); >> } else { >> $rindex = $ri; >> last; >> } >> $rindex = $ri; >> } >> push(@{$self->{"chimeras"}},$chim); >> return($lindex,$rindex); >>} >> >>##returns array of subjects >>##$slop is amount by which it needs to extend beyond the brkpt. 
>>sub findSubjectsSpanningChimeraBreakpoint{
>>  my($self,$loc,$slop) = @_;
>>  my @span;
>>  foreach my $s ($self->sortSubjectsByMinQueryStart()) {
>>    last if $s->getMinQueryStart() > $loc - $slop;
>>    if ($s->getMaxQueryEnd() > $loc + $slop) {
>>      ## print "Spans $loc: ",$s->getMinQueryStart(), "-",$s->getMaxQueryEnd(),"\n";
>>      push(@span,$s);
>>    }
>>  }
>>  return @span;
>>}
>>
>>########################################
>>## Detecting unblocked repeats...
>>## If No stack - OK
>>##
>>## If Stack
>>##
>>## If No overhanging ends in query then OK (query is subsumed by subjects)
>>## -will block the query sequences and throw out sequences that do not have
>>## >50 bp of un-blocked sequence
>>##
>>## If Overhanging ends in query (may be only one if repeat is at end)
>>##
>>## If 1 HSP then likely REPEAT => ignore all matches that don't extend
>>## outside this region.
>>##
>>## Overhanging ends in subjects = 1 => definitely REPEAT
>>##
>>## Overhanging ends in subjects = 0 => ??very unlikely scenario..not clear
>>##
>>## If >1 HSP (could be multiple copies of same repeat or perhaps alt. splicing
>>## OR could it be a valid match that also has a repeat in it...no should in
>>## this case match entirely and have just 1 HSP)
>>##
>>## If subject locations overlap/superimpose then REPEAT
>>## =ignore all matches that do not extend outside these locations
>>## between them would be "outside" the locations
>>##
>>## If subject locations distinct then likely alternative splicing
>>## OR alternatively the query and subject could contain two distinct
>>## different repeats (I will be blocking for repeats so the likelihood
>>## that there are two different repeats that are both novel is small).
>>##
>>## ?do I want to do further analysis to determine consistency with
>>## alternative splicing model before ruling OK...?
>>## ########################################
>>
>>##this method is the call to find repeats...should return an array of subjects to
>>##ignore...
>>
>>## find 5' edge (minQueryStart) and gather up then check to see if there is a
>>## correlation with 3' (maxQueryEnd)
>>## edge...if so is potential repeat and need to check the above rules..
>>
>>#what to return here....region to ignore when determining good matches...
>>
>>sub detectRepeats {
>>  my($self,$slop) = @_;
>>
>>  if ($self->getSubjectCount() < 5 ) { ##need at least 5 to detect repeat...
>>    return;
>>  }
>>
>>  my @rep = $self->findRepeats($slop); ##array of repeats....
>>
>>  if ($debug == 1) {
>>    print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n";
>>    foreach my $r (@rep) {
>>      print STDERR $r->toString(1),"\n";
>>    }
>>  }
>>
>>  ##want to merge repeats if they are overlapping...
>>  my @sort = sort{$a->getRepeatStart() <=> $b->getRepeatStart()} @rep;
>>  undef @rep; ##use for merged ones...
>>  my $merge = 0;
>>  my $nr;
>>  for (my $i = 0;$i<scalar(@sort) - 1;$i++) {
>>    if ($sort[$i+1]->getRepeatStart() < $sort[$i]->getRepeatEnd()) {
>>      ##are overlapping...merge
>>      if ($merge == 0) {
>>        $nr = CBIL::Bio::Blast::Repeat->new();
>>        $merge = 1;
>>      }
>>      foreach my $s ($sort[$i]->getSubjects()) {
>>        $nr->addSubject($s);
>>      }
>>    } elsif ($merge == 1) {
>>      $merge = 0;
>>      foreach my $s ($sort[$i]->getSubjects()) {
>>        $nr->addSubject($s);
>>      }
>>      push(@rep,$nr);
>>    } else {
>>      ##does not overlap so just add to rep
>>      push(@rep,$sort[$i]);
>>    }
>>  }
>>  ##need to do last if $merge == 1;
>>  if ($merge == 1) { ##need to add the i+1 (last) case
>>    foreach my $s ($sort[-1]->getSubjects()) {
>>      $nr->addSubject($s);
>>    }
>>    push(@rep,$nr);
>>  }
>>  ##repeats merged and should be non-overlapping
>>
>>  if ($debug == 1) {
>>    print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n";
>>    foreach my $r (@rep) {
>>      print STDERR $r->toString(1),"\n";
>>    }
>>  }
>>  return @rep;
>>}
>>
>>##finds all repeat stacks..
>># populates array of subjects that are repeats ... @{$self->{'repeats'}}
>>
>>sub findRepeats{
>>  my($self,$slop) = @_;
>>  my $rep;
>>  if (!exists $self->{'repeats'}) {
>>    my $firstElement = 1;
>>    my $inRepeat = 0;
>>    my @s = $self->sortSubjectsByMinQueryStart();
>>    ###NOTE: want to only take repeat from a single point rather than each minQuery start
>>    my $lEdge = $s[0]->getMinQueryStart(); ##this is the first edge
>>    for (my $i = 0;$i < scalar(@s) - 1;$i++) {
>>      if ($s[$i+1]->getMinQueryStart() - $lEdge <= $slop) {
>>        if ($firstElement == 1) {
>>          $rep = CBIL::Bio::Blast::Repeat->new(); ##create new Repeat object..
>>          $firstElement = 0;
>>          $inRepeat = 1;
>>        }
>>        print STDERR "REPEAT: adding subject ",$s[$i]->getID(),"\n" if $debug == 1;
>>        $rep->addSubject($s[$i]);
>>        if ($i == scalar(@s) - 2) { ##need to add next one as am in last loop
>>          $rep->addSubject($s[$i+1]);
>>          my $verified = $rep->verifyRepeat($slop); ##does all the verification that this is a repeat
>>          push(@{$self->{'repeats'}},$rep) if $verified == 1;
>>        }
>>
>>      } elsif ($inRepeat == 1) { ##need this to get the $i+1 as am working one ahead
>>        $lEdge = $s[$i]->getMinQueryStart(); ##new left edge
>>        $firstElement = 1; ##now looking for the first element
>>        $inRepeat = 0;
>>        ##add the last one and then verify repeats....Repeat method..
>>        ##if really a repeat then add to @$self->{'repeats'}
>>        $rep->addSubject($s[$i]);
>>        print STDERR "Repeat contains ",$rep->countSubjects(), " subjects\n" if $debug == 1;
>>        $rep->verifyRepeat($slop); ##does all the verification that this is a repeat
>>        push(@{$self->{'repeats'}},$rep) if($rep->countSubjects() > 1 && $rep->getVerified() == 1);
>>      } else {
>>        $lEdge = $s[$i]->getMinQueryStart(); ##new left edge
>>        $firstElement = 1; ##now looking for the first element
>>      }
>>    }
>>
>>    if (!exists $self->{'repeats'}) { ##there aren't any....make 0
>>      $self->{'repeats'}->[0] = "";
>>    }
>>  }
>>  if ($self->{'repeats'}->[0] eq "") {
>>    return;
>>  }
>>  return @{$self->{'repeats'}};
>>}
>>
>>##NOTE: the repeat object is initially populated with things that have a common
>>## left edge. It is the job of the Repeat to then verify that this constitutes
>>## a repeat and remove any subjects that don't meet the rules for a repeat stack above
>>##will need a 'verified' bit to know if the repeat obj is verified...else verify
>>## before doing anything else.
>>
>>
>>sub getQuery5endMis{
>>  my $self = shift;
>>  if (!exists $self->{"query5pMis"}) {
>>    $self->{"query5pMis"} = 10000; ##arbitrarily large as want smallest
>>    foreach my $s ($self->getSubjects()) {
>>      if ($self->{"query5pMis"} > $s->getQuery5endMis()) {
>>        $self->{"query5pMis"} = $s->getQuery5endMis();
>>      }
>>    }
>>  }
>>  return $self->{"query5pMis"};
>>}
>>
>>sub getQuery3endMis{
>>  my $self = shift;
>>  if (!exists $self->{"query3pMis"}) {
>>    $self->{"query3pMis"} = 10000; ##arbitrarily large as want smallest
>>    foreach my $s ($self->getSubjects()) {
>>      if ($self->{"query3pMis"} > $s->getQuery3endMis()) {
>>        $self->{"query3pMis"} = $s->getQuery3endMis();
>>      }
>>    }
>>  }
>>  return $self->{"query3pMis"};
>>}
>>
>>##returns 0 if leftover at ends and 1 if covered to ends
>>sub getCoveredBySubjects{
>>  my $self = shift;
>>  my $left = shift; ##amount to be leftover....
>>  if (!$left) { $left = 20; }
>>  if ($self->getQuery5endMis() <= $left && $self->getQuery3endMis() <= $left) {
>>    return 1;
>>  }
>>  return 0;
>>}
>>
>>sub analyzeSubjectsBySubject{
>>  my $self = shift;
>>  return if $self->getSubjectCount() == 0;
>>  my @singletons;
>>  my %multiples;
>>  my $haveImmuno = 0; ##keeps track if is immunoglobulin gene...
>>  my $order = 0; ##order
>>  my $orientation = 0; ##orientation
>>  my $subOver = 0; ##number of subject overlaps
>>  my $subGaps = 0; ##number of subject Gaps
>>  my @multiples;
>>  foreach my $s ($self->getSubjects()) {
>>    my $id = $s->getID();
>>    # if($id <= 100009){
>>    #   $haveImmuno = 1;
>>    #   next; ##don't want to include these in the analysis
>>    # }
>>    if ($s->countHSPs() == 1) { ##don't print out singletons HSPS
>>      push(@singletons,$id);
>>      next;
>>    }
>>    push(@multiples,$id);
>>    print "\n\>DT.",$id,", Length=",$s->getLength(),", Num HSPs=",$s->countHSPs(),($s->checkConsistency() ? " Order and Orient Consistent" : " Order or Orient InConsistent"),"\n" if $debug == 1;
>>    ($subOver,$subGaps) = $s->checkSubjectGaps();
>>    print "\tSubject Overlaps = $subOver, SubjectGaps = $subGaps\n" if $debug == 1;
>>    $order = $s->checkOrderConsistency();
>>    $orientation = $s->checkOrientationConsistency();
>>    print STDERR $s->toString() if $order + $orientation != 2;
>>    $multiples{$id} = [$s->countHSPs(),$s->countContainedQueryHSPs(),$order,$orientation,$subOver == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0]; ##this is the data for multiples!
>>  }
>>  print "\nSingletons: ",scalar(@singletons),"\n" if $debug == 1;
>>
>>  # if($haveImmuno == 1){print "MATCHES immunoglobulin genes\n";}
>>  return(\@singletons,\%multiples,$haveImmuno);
>>}
>>
>>sub getStats{
>>  my $self = shift;
>>  my $minNum = shift; ##minimum number of HSPs
>>  my $ret;
>>  # $minNum eq "" ? $minNum = 1 : $minNum;
>>  print "QueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>>  if ($self->{"countSubjects"} == 0) {
>>    $ret = "\tNo Hits\n";
>>    return $ret;
>>  }
>>  foreach my $s ($self->getSubjects()) {
>>    next if $s->countHSPs() < $minNum; ##don't print out singletons HSPS
>>    $ret .= $s->toString();
>>  }
>>  return $ret;
>>}
>>
>>sub printStats {
>>  my($self,$minNum) = @_;
>>  print $self->getStats($minNum);
>>}
>>
>>sub addSubject{
>>  my($self,$sbjct) = @_;
>>  push(@{$self->{subjects}},$sbjct);
>>  $self->{"countSubjects"}++;
>>}
>>
>>sub getSubjectCount{
>>  my $self = shift;
>>  # return $self->{"countSubjects"};
>>
>>  my $queryId = shift;
>>  my $hits_by_queryId = $self->{hits_by_queryId};
>>  my $hits = $hits_by_queryId->{$queryId};
>>  return $hits->{countSubjects};
>>}
>>
>>sub getSubjects{
>>  my $self = shift;
>>  # return if $self->{"countSubjects"} == 0;
>>  # return @{$self->{subjects}};
>>
>>  my $queryId = shift;
>>  my $hits_by_queryId = $self->{hits_by_queryId};
>>  my $hits = $hits_by_queryId->{$queryId};
>>  my $count = $hits->{countSubjects};
>>  my $subjects = $hits->{subjects};
>>
>>  return if ($count == 0);
>>  return @$subjects;
>>}
>>
>>sub getQueryID{
>>  my $self = shift;
>>  # return $self->{"queryName"};
>>
>>  my $index = $self->{queryIndex};
>>  my $hits_by_queryId = $self->{hits_by_queryId};
>>  my @queryIds = keys (%$hits_by_queryId);
>>  if ($index < @queryIds) {
>>    $self->{queryIndex} = $index + 1;
>>    my $queryId = $queryIds[$index];
>>    return $queryId;
>>  }
>>  else {
>>    return undef;
>>  }
>>}
>>
>>sub printThreshHoldHits {
>>  my $self = shift;
>>  return if $self->{"countSubjects"} == 0;
>>  foreach my $s ($self->getSubjects()) {
>>    print $self->getQueryID(), "\t",$s->getID(),"\n";
>>  }
>>}
>>
>>sub printGeneModels{
>>  my $self = shift;
>>  my @cgm;
>>  print "QueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>>  if ($self->{"countSubjects"} == 0) {
>>    print "\tNo Hits\n";
>>    return;
>>  }
>>  foreach my $s ($self->getSubjects()) {
>>    print "\n",$s->toString();
>>    @cgm = $s->getConsistentGeneModel();
>>    print " CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : "No"), ", ",scalar(@cgm), " HSPs\n";
>>    foreach my $h (@cgm) {
>>      print " ",$h->toString();
>>    }
>>  }
>>}
>>
>>sub compareGeneModelWithTotCons{
>>  my $self = shift;
>>  # print "\nQueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>>  my $cgmgt1 = 0;
>>  my $corCGM = 0;
>>  my $totCor = 0;
>>  my $totSubs = 0;
>>  my @vals;
>>  foreach my $s ($self->getSubjects()) {
>>    next if $s->countHSPs() < 2; ##don't do singletons
>>    $totSubs++;
>>    # print "\n",$s->toString();
>>    # print $s->getID(),": ",$s->countHSPs()," HSPs\n";
>>    # print " CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : "No"), ", ",scalar($s->getConsistentGeneModel()), " HSPs\n";
>>    my($subOver,$subGaps) = $s->checkSubjectGaps();
>>    @vals = ($s->checkOrderConsistency(),$s->checkOrientationConsistency(),$subOver == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0);
>>    # print " Old Analysis: Order=$vals[0], Orient=$vals[1], Over=$vals[2], Gaps=$vals[3]\n";
>>    $totCor++ if $vals[0]+$vals[1]+$vals[2]+$vals[3] == 4;
>>    $corCGM += $s->checkGoodCGM();
>>    $cgmgt1++ if scalar($s->getConsistentGeneModel()) > 1;
>>  }
>>  return($totSubs,$corCGM,$totCor,$cgmgt1);
>>}
>>
>>
>>1;
>>
>>
|
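[Editorial note] The multi-query change discussed in this thread boils down to one storage pattern: a hash keyed by query name whose values hold that query's subject count and subject list, so a single parser object can carry hits for many queries. A minimal standalone sketch of that pattern, outside the module (the `store_hits` helper and the sample data here are hypothetical, for illustration only; only the `hits_by_queryId`/`countSubjects`/`subjects` keys come from the attached code):

```perl
#!/usr/bin/perl
# Sketch of the per-query hit storage used by the modified parseBlast:
# each finished query's subjects are parked under its name before the
# parser's working state is reset for the next query.
use strict;
use warnings;

my %hits_by_queryId;

# store_hits: hypothetical helper mirroring what parseBlast does each
# time it sees the next "Query=" line (or end of input).
sub store_hits {
    my ($queryName, $subjects) = @_;
    $hits_by_queryId{$queryName} = {
        countSubjects => scalar(@$subjects),
        subjects      => $subjects,
    };
}

store_hits('q1', ['subjA', 'subjB']);
store_hits('q2', ['subjC']);

# retrieval is then keyed by query id, as the reworked
# getSubjects()/getSubjectCount() accessors do
foreach my $q (sort keys %hits_by_queryId) {
    print "$q: $hits_by_queryId{$q}{countSubjects} subjects\n";
}
```

This is also why `getQueryID()` in the attached module iterates over the stored query names rather than returning a single `queryName`.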
From: Steve F. <sfi...@pc...> - 2004-10-08 12:06:47
|
Arnaud- ok, i've looked at LoadBlastSimFast.pm. I see the addition of the logic to use name to get the query object (and i know that we discussed this in mail back in august). I am having some second thoughts about that change as it stands. The original intent of the plugin was that the sequences submitted to the blast process have been extracted from the database and therefore have the primary key in their definition line. I think I understand that it would be useful to be able to skip that step, ie, blast sequences using their native identifiers, and then have the plugin discover what their internal primary key is. That's what you want to do right? Does anybody know of any reason why that would not be ok? Assuming that nobody has any objections, maybe the best solution would be to improve the plugin to take optional arguments that specify the name of the query and/or subject identifier attributes? For example: -queryIdAttr source_id. This would give us full flexibility (and also avoid the slightly risky assumption that a digits-only identifier must be the primary key) steve Arnaud Kerhornou wrote: > > Steve Fischer wrote: > >> Arnaud- >> >> see below. >> >> steve >> >> Arnaud Kerhornou wrote: >> >>> Hi everyone >>> >>> To be able to reproduce the OrthoMCL method, I would like to raise >>> two issues we've got: >>> >>> * The first issue relates to the view where are stored the protein >>> sequences. I was thinking to use the TranslatedAASequence view as >>> this one contains the translated sequences of our gene models. The >>> problem I have is that it is missing a name attribute so I can not >>> match the blast output query and subject names with the data into >>> GUS (I didn't want to use the TranslatedAASequence primary keys as >>> the identifiers of my proteins of interest). >>> Could we add a name attribute to this view ? >> >> >> >> hmm. not quite following. what would this name be, where would it >> be derived from? 
>
>
> By default we assign the systematic id of the corresponding CDS to the
> protein name.
>
>> why not use source_id and/or secondary_identifier?
>
> We could do that, but in any case it would involve modifying the
> code of the BLAST output loading plugin (LoadBlastSimFast.pm) to get
> the sequence entries. At the moment the match is made on the primary
> key (which I want to avoid) or the name attribute. The source_id
> attribute would do instead of the name attribute. It must work for any
> blast (DNA vs DNA or protein vs protein) with the various potential
> GUS sequence objects we want to attach similarity data to. As far as I
> can see the source_id attribute is present in all of them
> (AASequenceImp and NASequenceImp tables).
>
>> or, presumably this translated sequence has a relationship back to
>> its na sequence (although i don't immediately see that in the schema
>> browser), so couldn't you get a name or source_id from there?
>>
> That would require a more sophisticated query to get the sequence entry.
>
>>>
>>> * The second issue relates to the BLAST output parsing, done by a
>>> module called BlastAnal.pm in the CBIL package. This module seems to
>>> parse BLAST output files with only one query sequence. I have more
>>> than one query sequence reported, so I had to change the code of this
>>> module to allow more than one query sequence. Can my code be
>>> integrated into the CBIL package? Note that I didn't change the
>>> interface of this module, so it doesn't affect the scripts that are
>>> using it; I'm thinking in particular of parseBlastFilesForSimilarity.pl
>>
>> this sounds ok. how about we just take a quick look at this
>> together while you are visiting? then we can fold it into the code
>> base. do you want to send it by mail?
>>
> That's fine, the module is attached.
>
>>>
>>> cheers
>>> Arnaud
>>>
>>>
>>
>------------------------------------------------------------------------
>
>#!/usr/bin/perl
>
>## perl module for dealing with blast results...
>## will parse both --noseqs and normal blast.
>## does not parse out the alignments, rather meta information for each subject and hsp
>
>## Brian Brunk 5/17/99
>
>package CBIL::Bio::Blast::BlastAnal;
>
>use CBIL::Bio::Blast::Subject;
>use CBIL::Bio::Blast::HSP;
>use CBIL::Bio::Blast::Chimera;
>use CBIL::Bio::Blast::Repeat;
>
>use strict;
>
>my $debug = 0;
>
>##simple constructor...need to make more robust if inheriting for other purposes
>sub new {
>  my $self = {};
>  my $class = shift;
>  my $d = shift;
>  bless $self,$class;
>  $debug = defined $d ? $d : 0;
>  print STDERR "DEBUGGING ON\n" if $debug == 1;
>  $self->{"countSubjects"} = 0;
>
>  # Added a hashtable storing the hits for each query sequence
>  my %tmp;
>  $self->{hits_by_queryId} = \%tmp;
>  $self->{queryIndex} = 0;
>
>  return $self;
>}
>
>
>
>
>##DataStructure returned by parseBlast...must have -noseqs on commandline
>##new datastructure....
>##{queryLength},{queryName}
>##{subjects} = array of Subject objects...
>
>## takes in a blast result as an array and builds the entire datastructure
>sub parseBlast{
>  my $self = shift;
>  my($minLength,$minPercent,$minPvalue,$regex,$blast,$remMaskedFromLen,$rpsblast) = @_;
>  my %blast;
>  my ($strand,$dir,$qStart,$qEnd,$sStart,$sEnd,$pValue,$matchLength,$matchPercent);
>  my $sbjct;
>  my $queryMatch;
>  my $haveQStart = 0;
>  my $haveSStart = 0;
>  my $parseType;
>  my $frame;
>  my $score;
>  my $identities;
>  my $positives;
>
>  die 'these are the args to parseBlastNoSeqs($minLength,$minPercent,$minPvalue,$regex,$blast)\n' if scalar(@{$blast}) <= 10;
>
>  foreach (@{$blast}) {
>    # print STDERR $_;
>    if (/^(\S*BLAST\S+)/){ ##gets the algorithm used for the query...
> my $algo = $1; > print STDERR "algorithm = $algo\n" if $debug; > if($algo =~ /BLASTX/i){ > $parseType = 1; > }else{ > $parseType = 0; > } > print STDERR "ParseType='$parseType'\n" if $debug; > } > if (/^Query=\s*(\S+)/) { > > # Storing Hits/HSPs for previous query Id > > if (defined $self->{queryName}) { > my $countSubjects = $self->{countSubjects}; > my $subjects = $self->{subjects}; > my %hits = ( > 'countSubjects' => $countSubjects, > 'subjects' => $subjects > ); > > my $hits_by_queryId = $self->{hits_by_queryId}; > $hits_by_queryId->{$self->{queryName}} = \%hits; > } > > $self->{queryName} = $1; ##query name/id > > # Reinitialisation > $self->{"countSubjects"} = 0 ; > my @tmp = (); > $self->{subjects} = \@tmp; > > } elsif (/^\s+\((\S+)\sletters\)/) { > $self->{queryLength} = $1; ##query length > $self->{queryLength} =~ s/,//g; > } > if (/^\>$regex/ || /ctxfactor/ || ($rpsblast && /^Lambda/)) { ##ctxfactor catches the last one...Lambda last of rpsblast > my $sbjctId; > if (/^\>$regex/){ > $sbjctId = $1 ? $1 : $2; > }else{ > print STDERR "$self->{queryName}: Unable to match subject using regex '$regex' on:\n $_" unless (/ctxfactor/ || /^Lambda/); > } > print STDERR "Matching subjectId = $sbjctId\n" if $debug; > ## print STDERR "REGEX matched $_"; > if ($haveQStart) { > print STDERR "Have last QStart:", $sbjct->getID()," SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1; > ##have the complete HSP... > ##want to add only if meets minimum reqs....do this on an HSP basis... > # print $sbjct->getID()," $sStart,$sEnd: Length=$matchLength, Percent=$matchPercent, pValue=$pValue\n"; > if ($matchLength >= $minLength && $matchPercent >= $minPercent && $pValue <= $minPvalue) { > print STDERR "Match meets reqs\n" if $debug; > if($remMaskedFromLen){ ##want to remove Xs from the match length.... 
># print STDERR "removing X from match\n";
># print STDERR "Before: $queryMatch\n";
>            $queryMatch =~ s/X//g;
># print STDERR "After: $queryMatch\n";
>            if(length($queryMatch) < 10){ ##don't remove if final length < 10
>              print STDERR "RemoveMaskedResiduesFromLength: error...length = ".length($queryMatch)." following removal of X's\n";
>            }else{
># print "match before $matchLength\n";
>              $matchLength = length($queryMatch);
># print "match after $matchLength\n";
>            }
>          }
>          $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, $pValue, $dir, $frame, $score,$identities,$positives) if $sbjct;
>        }
>        $haveQStart = 0; ##reset for next one...
>        $haveSStart = 0;
>      }
>      if (defined $sbjct && $sbjct->countHSPs() >= 1 && $sbjct->getID() ne $self->{"queryName"}) {
>        $self->addSubject($sbjct);
>      }
>      ##return here if end...
>      return if (/ctxfactor/ || /^Lambda/);
>
>      ## print STDERR "New Subject $sbjctId\n";
>      $sbjct = CBIL::Bio::Blast::Subject->new($sbjctId) if $sbjctId;
>      ## $sbjct->setID($sbjctId);
>
>    }
>    if (/^\s+Length\s=\s(\S+)/) {
>      if($sbjct){
>        my $sbjctLength = $1;
>        $sbjctLength =~ s/,//g;
>        $sbjct->setLength($sbjctLength);
>        $sbjct->setQueryLength($self->{"queryLength"}); ##set the query length for use in
>      }
>      ##end remaining calculations
>    }
>    if (/^\s*Score\s=\s+(\d+).*=\s(\S+)$/ || ($rpsblast && /^\s*Score\s*=\s*\d+\sbits\s\((\d+).*=\s(\S+)$/)) {
>      my $tmpScore = $1;
>      my $tmpValue = $2;
>
>      ##need to add here if $haveQStart != 0;
>      if ($haveQStart) {
>        ##have the complete HSP...
>        ##want to add only if meets minimum reqs....do this on an HSP basis...
>        print STDERR "Adding Sbjct (score) ",$sbjct->getID()," SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1;
>        if ($matchLength >= $minLength && $matchPercent >= $minPercent && $pValue <= $minPvalue && $sbjct) {
>          print STDERR "Adding Sbjct\n" if $debug;
>          if($remMaskedFromLen){ ##want to remove Xs from the match length....
># print STDERR "removing X from match\n"; ># print STDERR "Before: $queryMatch\n"; > $queryMatch =~ s/X//g; ># print STDERR "After: $queryMatch\n"; > if(length($queryMatch) < 10){ ##don't remove if final length < 10 > print STDERR "RemoveMaskedResiduesFromLength: error...length = ".length($queryMatch)." following removal of X's\n"; > }else{ ># print "match before $matchLength\n"; > $matchLength = length($queryMatch); ># print "match after $matchLength\n"; > } > } > $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, $pValue, $dir, $frame, $score,$identities,$positives); > } > $haveQStart = 0; ##reset for next one... > $haveSStart = 0; > } > $pValue = $tmpValue; > $score = $tmpScore; > } > ##get the strand in the Identities line as the following messes things up!!! > ##get the strand...will work for all blast types... ># if (/^\s*(Minus|Plus)\sStrand/){ ># $strand = $1; ># } > ##following specific to blastx > if ($parseType && /^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Frame\s=\s.*(.\d)$/) { > $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; $frame = $5; > $dir = $frame =~ /^\+/ ? 1 : 0; > ##following specific to blastn > }elsif (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Strand\s=\s(\w+)/) { > $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; $strand = $5; > $dir = $strand eq 'Minus' ? 0 : 1; > ##following for blastp the default if others not matched... > }elsif (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+)/){ > $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; > $dir = 1; ##always 1 for blastp which is what this catches... > }elsif (/^\s*Frame\s=\s(.\d)/){ ##frame for rpsblast > $frame = $1; > $dir = $frame =~ /^\+/ ? 
1 : 0; > } > > if (/^Query:\s+(\d+)\s(.*)\s(\d+)\s*$/) { > print STDERR "Matching query: $1 $3\n" if $debug; > if ($haveQStart == 0) { > $queryMatch = ""; > $qStart = $1; > $haveQStart = 1; > } > $qEnd = $3; > my $tmpQMatch = $2; > $tmpQMatch =~ s/\s//g; ##gets rid of any spaces > $queryMatch .= $tmpQMatch; > } elsif (/^Sbjct:\s+(\d+)\s.*\s(\d+)\s*$/) { > if ($haveSStart == 0) { > $sStart = $1; > $haveSStart = 1; > } > $sEnd = $2; > } > } > > # Storing Hits/HSPs for the last query Id > > if (defined $self->{queryName}) { > my $countSubjects = $self->{countSubjects}; > my $subjects = $self->{subjects}; > my %hits = ( > 'countSubjects' => $countSubjects, > 'subjects' => $subjects > ); > my $hits_by_queryId = $self->{hits_by_queryId}; > $hits_by_queryId->{$self->{queryName}} = \%hits; > } > > return; >} > >sub setQueryLength { > my($self,$l) = @_; > $self->{queryLength} = $l; >} > >sub getQueryLength { > my($self) = @_; > return $self->{queryLength}; >} > >sub sortSubjectsByMinQueryStart{ > my $self = shift; > if (!exists $self->{"sortByQStart"}) { > @{$self->{"sortByQStart"}} = sort{$a->getMinQueryStart() <=> $b->getMinQueryStart()} $self->getSubjects(); > } > return @{$self->{"sortByQStart"}}; >} > >sub sortSubjectsByMaxQueryEnd{ > my $self = shift; > if (!exists $self->{"sortByQEnd"}) { > @{$self->{"sortByQEnd"}} = sort{$a->getMaxQueryEnd() <=> $b->getMaxQueryEnd()} $self->getSubjects(); > } > return @{$self->{"sortByQEnd"}}; >} > >##detecting chimeras >############################################################ >## Breakpoint = 0 * OK >## Breakpoint = 1 > [Ch, AS, OK, !] >## Dangling Ends = none > [Ch, AS, OK] >## Spanned = 1 * AS, OK >## Spanned = 0 > [Ch, AS, OK] >## Deep = 1 * Ch >## Deep = 0 * Ch, AS, OK (review) >## Dangling Ends = uni > [Ch, As] >## Spanned = 1 > [Ch, AS] >## QvsSub = 1 * Ch >## QvsSub = 0 * AS >## Spanned = 0 > [Ch, AS] >## Deep = 1 * Ch >## Deep = 0 * Ch, AS (review) >## Dangling Ends = bi > [Ch, !] >## Spanned = 1 > [Ch, !] 
>## QvsSub = 1 * Ch
>## QvsSub = 0 * ! (review) ##not sure about this one....should it be OK
>## Spanned = 0 * Ch
>############################################################
>
>##return values: 0=OK, 1=chimera, 2=review no dangling, 3=review 1 dangling, 4=review 2 dangling ends.
>##add to chimera obj...
>sub detectChimera{
>  my($self,$slop) = @_; ##$slop is the amount of slop in the breakpoint due to blast
>  ##loop through each possible chimera breakpoint and test if chimera
>  my $qvssub = 1;
>  my @span;
>  foreach my $c ($self->findChimeraBreakpoints($slop)) {
>
>    @span = $self->findSubjectsSpanningChimeraBreakpoint($c->getBreakpoint(),$slop);
>    $c->{"numberSpanned"} = scalar(@span);
>
>    ##first dangling == 0
>    if ($c->getDangling($slop) == 0) { ##dangle by at least $slop
>      if (scalar(@span) > 0) {
>        $c->{"isChimera"} = 0; ##spans and no dangling ends therefore OK
>      } else {
>        if ($c->getDepth() == 1) {
>          $c->{"isChimera"} = 1;
>        } else {
>          $c->{"isChimera"} = 2; ##will require review
>        }
>      }
>
>    ##dangling on only one end...
>    } elsif ($c->getDangling($slop) == 1) {
>      if (scalar(@span) > 0) {
>        $qvssub = 1; ##are very similar....looking for one that isn't...
>        foreach my $sp (@span) {
>          if ($sp->compareQueryVsSubject() == 0) {
>            ##good subject spanning....looks like it should be alternative splicing
>            $c->{"isChimera"} = 0;
>            $qvssub = 0;
>          }
>        }
>        $c->{"isChimera"} = 1 if $qvssub == 1;
>      } else {
>        if ($c->getDepth() == 1) {
>          $c->{"isChimera"} = 1;
>        } else { #will need to review....
>          $c->{"isChimera"} = 3;
>        }
>      }
>
>    ##dangling on two ends...
>    } elsif ($c->getDangling($slop) == 2) {
>
>      if (scalar(@span) > 0) {
>        $qvssub = 1; ##are very similar....looking for one that isn't...
> foreach my $sp (@span) { > if ($sp->compareQueryVsSubject() == 0) { > ##unclear here....need to review: > $c->{"isChimera"} = 4; > $qvssub = 0; > ## last; > } > } > $c->{"isChimera"} = 1 if $qvssub == 1; > } else { > $c->{"isChimera"} = 1; ##is a chimera > } > > } else { > print "ERROR: getDangling returned invalid value\n"; > } > } > ##now need to go through each of chimeras and output results... > my $maxChim = 0; > my $isChimera = 0; > my $errString = ""; > if (exists $self->{"chimeras"}) { > ## $errString = "Chimeras for $self->{'queryName'}\n"; > my $cnt = 1; > foreach my $chim (@{$self->{"chimeras"}}) { > last if $chim eq ""; ##this is just a holder that indicates that the findChimera > ##method has been run and there are not chimeras... > $maxChim = $chim->{'isChimera'} if $chim->{'isChimera'} > $maxChim; > $isChimera = 1 if $chim->{'isChimera'} == 1; > $errString .= " Chimera $cnt: $chim->{'isChimera'}, danglingEnds=".$chim->getDangling($slop).", numberSpanned=$chim->{'numberSpanned'}, depth=".$chim->getDepth().", numberLeft=".$chim->countLeft().", numRight=".$chim->countRight().", breakpoint=".$chim->getBreakpoint()."\n"; > # $errString .= $chim->toString(); > $cnt++; > } > } > return ($isChimera,$maxChim,$errString); >} > >sub getChimeras { > my($self,$slop) = @_; > if(!exists $self->{'chimeras'}){ > $self->detectChimera($slop); > } > return @{$self->{'chimeras'}}; >} > >## returns array of Chimera objs >sub findChimeraBreakpoints{ > my($self,$slop) = @_; > if ($self->{'findChimera'} != 1) { > my @l = $self->sortSubjectsByMaxQueryEnd(); > my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want right one reversed > my $lindex = 0; > my $rindex = 0; > for (my $li = $lindex;$li < scalar(@l);$li++) { > last if $l[$li]->getMaxQueryEnd() - $r[$rindex]->getMinQueryStart() > $slop; > for (my $ri = $rindex;$ri < scalar(@r);$ri++) { > last if $l[$li]->getMaxQueryEnd() - $r[$ri]->getMinQueryStart() > $slop; > if (abs($l[$li]->getMaxQueryEnd() - 
$r[$ri]->getMinQueryStart()) <= $slop) {
>          ##create new chimera object and pass arrays into extendBrkpt()
>          ($li,$ri) = $self->buildChimera($li,$ri,$slop);  ##pass back the index of last l and r
>          last;
>        }
>      }
>    }
>    $self->{'findChimera'} = 1;  ##indicates that the analysis has been done
>  }
>  if (!exists $self->{'chimeras'}) {
>    return;
>  }
>  return @{$self->{"chimeras"}};
>}
>
>sub buildChimera{
>  my($self,$lindex,$rindex,$slop) = @_;
>  my @l = $self->sortSubjectsByMaxQueryEnd();
>  my @r = reverse($self->sortSubjectsByMinQueryStart());  ##want right one reversed
>  my $chim = CBIL::Bio::Blast::Chimera->new();
>  ##now add the current and any others to the chimera
>  ##  the first time in the loop will add the current one to $chim
>  ##first do left
>  my $ledge = $l[$lindex]->getMaxQueryEnd();
>  for (my $li = $lindex;$li < scalar(@l);$li++) {
>    if ($l[$li]->getMaxQueryEnd() - $ledge <= $slop) {
>      $chim->addLeftSubject($l[$li]);
>    } else {
>      $lindex = $li;
>      last;
>    }
>    $lindex = $li;
>  }
>  ##then do right
>  my $redge = $r[$rindex]->getMinQueryStart();
>  for (my $ri = $rindex;$ri < scalar(@r);$ri++) {
>    if ($redge - $r[$ri]->getMinQueryStart() <= $slop) {
>      $chim->addRightSubject($r[$ri]);
>    } else {
>      $rindex = $ri;
>      last;
>    }
>    $rindex = $ri;
>  }
>  push(@{$self->{"chimeras"}},$chim);
>  return($lindex,$rindex);
>}
>
>##returns array of subjects
>##$slop is the amount by which it needs to extend beyond the brkpt.
>sub findSubjectsSpanningChimeraBreakpoint{
>  my($self,$loc,$slop) = @_;
>  my @span;
>  foreach my $s ($self->sortSubjectsByMinQueryStart()) {
>    last if $s->getMinQueryStart() > $loc - $slop;
>    if ($s->getMaxQueryEnd() > $loc + $slop) {
>      ##  print "Spans $loc: ",$s->getMinQueryStart(), "-",$s->getMaxQueryEnd(),"\n";
>      push(@span,$s);
>    }
>  }
>  return @span;
>}
>
>########################################
>## Detecting unblocked repeats...
>## If No stack - OK
>##
>## If Stack
>##
>##   If No overhanging ends in query then OK (query is subsumed by subjects)
>##     -will block the query sequences and throw out sequences that do not have
>##      >50 bp of un-blocked sequence
>##
>##   If Overhanging ends in query (may be only one if repeat is at end)
>##
>##     If 1 HSP then likely REPEAT => ignore all matches that don't extend
>##       outside this region.
>##
>##       Overhanging ends in subjects = 1 => definitely REPEAT
>##
>##       Overhanging ends in subjects = 0 => ??very unlikely scenario..not clear
>##
>##     If >1 HSP (could be multiple copies of the same repeat or perhaps alt. splicing
>##       OR could it be a valid match that also has a repeat in it...no, it should in
>##       this case match entirely and have just 1 HSP)
>##
>##       If subject locations overlap/superimpose then REPEAT
>##         =ignore all matches that do not extend outside these locations
>##          between them would be "outside" the locations
>##
>##       If subject locations distinct then likely alternative splicing
>##         OR alternatively the query and subject could contain two distinct
>##         different repeats (I will be blocking for repeats so the likelihood
>##         that there are two different repeats that are both novel is small).
>##
>##         ?do I want to do further analysis to determine consistency with
>##          the alternative splicing model before ruling OK...?
>## ########################################
>
>##this method is the call to find repeats...should return an array of subjects to
>##ignore...
>
>## find 5' edge (minQueryStart) and gather up, then check to see if there is a
>## correlation with the 3' (maxQueryEnd)
>## edge...if so it is a potential repeat and need to check the above rules..
>
>##what to return here....region to ignore when determining good matches...
>
>sub detectRepeats {
>  my($self,$slop) = @_;
>
>  if ($self->getSubjectCount() < 5 ) {  ##need at least 5 to detect a repeat...
>    return;
>  }
>
>  my @rep = $self->findRepeats($slop);  ##array of repeats....
>
>  if ($debug == 1) {
>    print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n";
>    foreach my $r (@rep) {
>      print STDERR $r->toString(1),"\n";
>    }
>  }
>
>  ##want to merge repeats if they are overlapping...
>  my @sort = sort{$a->getRepeatStart() <=> $b->getRepeatStart()} @rep;
>  undef @rep;  ##use for merged ones...
>  my $merge = 0;
>  my $nr;
>  for (my $i = 0;$i<scalar(@sort) - 1;$i++) {
>    if ($sort[$i+1]->getRepeatStart() < $sort[$i]->getRepeatEnd()) {
>      ##are overlapping...merge
>      if ($merge == 0) {
>        $nr = CBIL::Bio::Blast::Repeat->new();
>        $merge = 1;
>      }
>      foreach my $s ($sort[$i]->getSubjects()) {
>        $nr->addSubject($s);
>      }
>    } elsif ($merge == 1) {
>      $merge = 0;
>      foreach my $s ($sort[$i]->getSubjects()) {
>        $nr->addSubject($s);
>      }
>      push(@rep,$nr);
>    } else {
>      ##does not overlap so just add to rep
>      push(@rep,$sort[$i]);
>    }
>  }
>  ##need to do last if $merge == 1;
>  if ($merge == 1) {  ##need to add the i+1 (last) case
>    foreach my $s ($sort[-1]->getSubjects()) {
>      $nr->addSubject($s);
>    }
>    push(@rep,$nr);
>  }
>  ##repeats merged and should be non-overlapping
>
>  if ($debug == 1) {
>    print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n";
>    foreach my $r (@rep) {
>      print STDERR $r->toString(1),"\n";
>    }
>  }
>  return @rep;
>}
>
>##finds all repeat stacks..
>## populates array of subjects that are repeats ... @{$self->{'repeats'}}
>
>sub findRepeats{
>  my($self,$slop) = @_;
>  my $rep;
>  if (!exists $self->{'repeats'}) {
>    my $firstElement = 1;
>    my $inRepeat = 0;
>    my @s = $self->sortSubjectsByMinQueryStart();
>    ###NOTE: want to only take repeat from a single point rather than each minQuery start
>    my $lEdge = $s[0]->getMinQueryStart();  ##this is the first edge
>    for (my $i = 0;$i < scalar(@s) - 1;$i++) {
>      if ($s[$i+1]->getMinQueryStart() - $lEdge <= $slop) {
>        if ($firstElement == 1) {
>          $rep = CBIL::Bio::Blast::Repeat->new();  ##create new Repeat object..
>          $firstElement = 0;
>          $inRepeat = 1;
>        }
>        print STDERR "REPEAT: adding subject ",$s[$i]->getID(),"\n" if $debug == 1;
>        $rep->addSubject($s[$i]);
>        if ($i == scalar(@s) - 2) {  ##need to add next one as am in last loop
>          $rep->addSubject($s[$i+1]);
>          my $verified = $rep->verifyRepeat($slop);  ##does all the verification that this is a repeat
>          push(@{$self->{'repeats'}},$rep) if $verified == 1;
>        }
>
>      } elsif ($inRepeat == 1) {  ##need this to get the $i+1 as am working one ahead
>        $lEdge = $s[$i]->getMinQueryStart();  ##new left edge
>        $firstElement = 1;  ##now looking for the first element
>        $inRepeat = 0;
>        ##add the last one and then verify repeats....Repeat method..
>        ##if really a repeat then add to @$self->{'repeats'}
>        $rep->addSubject($s[$i]);
>        print STDERR "Repeat contains ",$rep->countSubjects(), " subjects\n" if $debug == 1;
>        $rep->verifyRepeat($slop);  ##does all the verification that this is a repeat
>        push(@{$self->{'repeats'}},$rep) if($rep->countSubjects() > 1 && $rep->getVerified() == 1);
>      } else {
>        $lEdge = $s[$i]->getMinQueryStart();  ##new left edge
>        $firstElement = 1;  ##now looking for the first element
>      }
>    }
>
>    if (!exists $self->{'repeats'}) {  ##there aren't any....make 0
>      $self->{'repeats'}->[0] = "";
>    }
>  }
>  if ($self->{'repeats'}->[0] eq "") {
>    return;
>  }
>  return @{$self->{'repeats'}};
>}
>
>##NOTE: the repeat object is initially populated with things that have a common
>##  left edge.  It is the job of the Repeat to then verify that this constitutes
>##  a repeat and remove any subjects that don't meet the rules for a repeat stack above
>##will need a 'verified' bit to know if the repeat obj is verified...else verify
>##  before doing anything else.
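The stack-then-merge idea in findRepeats/detectRepeats above can be sketched on plain intervals. This is a minimal standalone sketch in Python for illustration (the module itself is Perl); the function names and the (start, end) tuple representation are assumptions, not the module's API, and it simplifies the left-edge bookkeeping the Perl code does. Phase 1 groups hits whose query starts agree within `slop` ("stacks"); phase 2 merges stacks whose query spans overlap.

```python
def find_stacks(hits, slop):
    """Group (start, end) intervals sharing a common left edge within slop."""
    hits = sorted(hits)                     # sort by min query start
    stacks, current, ledge = [], [hits[0]], hits[0][0]
    for iv in hits[1:]:
        if iv[0] - ledge <= slop:
            current.append(iv)              # same left edge: extend the stack
        else:
            if len(current) > 1:            # a stack needs more than one member
                stacks.append(current)
            current, ledge = [iv], iv[0]    # start a new stack at a new edge
    if len(current) > 1:
        stacks.append(current)
    return stacks

def merge_stacks(stacks):
    """Merge stacks whose overall query spans overlap, pooling their members."""
    spans = [(min(a for a, _ in s), max(b for _, b in s), s) for s in stacks]
    spans.sort(key=lambda t: (t[0], t[1]))
    merged = []
    for start, end, members in spans:
        if merged and start < merged[-1][1]:  # overlaps previous span: merge
            prev = merged[-1]
            merged[-1] = (prev[0], max(prev[1], end), prev[2] + members)
        else:
            merged.append((start, end, members))
    return merged
```

Keeping the two phases separate mirrors the Perl code's split between findRepeats (build candidate stacks) and detectRepeats (merge overlapping ones into non-overlapping repeat regions).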
>
>
>sub getQuery5endMis{
>  my $self = shift;
>  if (!exists $self->{"query5pMis"}) {
>    $self->{"query5pMis"} = 10000;  ##arbitrarily large as want smallest
>    foreach my $s ($self->getSubjects()) {
>      if ($self->{"query5pMis"} > $s->getQuery5endMis()) {
>        $self->{"query5pMis"} = $s->getQuery5endMis();
>      }
>    }
>  }
>  return $self->{"query5pMis"};
>}
>
>sub getQuery3endMis{
>  my $self = shift;
>  if (!exists $self->{"query3pMis"}) {
>    $self->{"query3pMis"} = 10000;  ##arbitrarily large as want smallest
>    foreach my $s ($self->getSubjects()) {
>      if ($self->{"query3pMis"} > $s->getQuery3endMis()) {
>        $self->{"query3pMis"} = $s->getQuery3endMis();
>      }
>    }
>  }
>  return $self->{"query3pMis"};
>}
>
>##returns 0 if leftover at ends and 1 if covered to ends
>sub getCoveredBySubjects{
>  my $self = shift;
>  my $left = shift;  ##amount allowed to be left over....
>  if (! $left) { $left = 20; }
>  if ($self->getQuery5endMis() <= $left && $self->getQuery3endMis() <= $left) {
>    return 1;
>  }
>  return 0;
>}
>
>sub analyzeSubjectsBySubject{
>  my $self = shift;
>  return if $self->getSubjectCount() == 0;
>  my @singletons;
>  my %multiples;
>  my $haveImmuno = 0;  ##keeps track of whether this is an immunoglobulin gene...
>  my $order = 0;  ##order
>  my $orientation = 0;  ##orientation
>  my $subOver = 0;  ##number of subject overlaps
>  my $subGaps = 0;  ##number of subject gaps
>  my @multiples;
>  foreach my $s ($self->getSubjects()) {
>    my $id = $s->getID();
>    #  if($id <= 100009){
>    #    $haveImmuno = 1;
>    #    next;  ##don't want to include these in the analysis
>    #  }
>    if ($s->countHSPs() == 1) {  ##don't print out singleton HSPs
>      push(@singletons,$id);
>      next;
>    }
>    push(@multiples,$id);
>    print "\n\>DT.",$id,", Length=",$s->getLength(),", Num HSPs=",$s->countHSPs(),($s->checkConsistency() ?
 " Order and Orient Consistent" : " Order or Orient Inconsistent"),"\n" if $debug == 1;
>    ($subOver,$subGaps) = $s->checkSubjectGaps();
>    print "\tSubject Overlaps = $subOver, SubjectGaps = $subGaps\n" if $debug == 1;
>    $order = $s->checkOrderConsistency();
>    $orientation = $s->checkOrientationConsistency();
>    print STDERR $s->toString() if $order + $orientation != 2;
>    $multiples{$id} = [$s->countHSPs(),$s->countContainedQueryHSPs(),$order,$orientation,$subOver == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0];  ##this is the data for multiples!
>  }
>  print "\nSingletons: ",scalar(@singletons),"\n" if $debug == 1;
>
>  #  if($haveImmuno == 1){print "MATCHES immunoglobulin genes\n";}
>  return(\@singletons,\%multiples,$haveImmuno);
>}
>
>sub getStats{
>  my $self = shift;
>  my $minNum = shift;  ##minimum number of HSPs
>  my $ret;
>  #  $minNum eq "" ? $minNum = 1 : $minNum;
>  print "QueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>  if ($self->{"countSubjects"} == 0) {
>    $ret = "\tNo Hits\n";
>    return $ret;
>  }
>  foreach my $s ($self->getSubjects()) {
>    next if $s->countHSPs() < $minNum;  ##don't print out singleton HSPs
>    $ret .= $s->toString();
>  }
>  return $ret;
>}
>
>sub printStats {
>  my($self,$minNum) = @_;
>  print $self->getStats($minNum);
>}
>
>sub addSubject{
>  my($self,$sbjct) = @_;
>  push(@{$self->{subjects}},$sbjct);
>  $self->{"countSubjects"}++;
>}
>
>sub getSubjectCount{
>  my $self = shift;
>  #  return $self->{"countSubjects"};
>
>  my $queryId = shift;
>  my $hits_by_queryId = $self->{hits_by_queryId};
>  my $hits = $hits_by_queryId->{$queryId};
>  return $hits->{countSubjects};
>}
>
>sub getSubjects{
>  my $self = shift;
>  #  return if $self->{"countSubjects"} == 0;
>  #  return @{$self->{subjects}};
>
>  my $queryId = shift;
>  my $hits_by_queryId = $self->{hits_by_queryId};
>  my $hits = $hits_by_queryId->{$queryId};
>  my $count = $hits->{countSubjects};
>  my $subjects = $hits->{subjects};
>
>  return if ($count == 0);
>  return @$subjects;
>}
>
>sub getQueryID{
>  my $self = shift;
>  #  return $self->{"queryName"};
>
>  my $index = $self->{queryIndex};
>  my $hits_by_queryId = $self->{hits_by_queryId};
>  my @queryIds = keys (%$hits_by_queryId);
>  if ($index < @queryIds) {
>    $self->{queryIndex} = $index + 1;
>    my $queryId = $queryIds[$index];
>    return $queryId;
>  }
>  else {
>    return undef;
>  }
>}
>
>sub printThreshHoldHits {
>  my $self = shift;
>  return if $self->{"countSubjects"} == 0;
>  foreach my $s ($self->getSubjects()) {
>    print $self->getQueryID(), "\t",$s->getID(),"\n";
>  }
>}
>
>sub printGeneModels{
>  my $self = shift;
>  my @cgm;
>  print "QueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>  if ($self->{"countSubjects"} == 0) {
>    print "\tNo Hits\n";
>    return;
>  }
>  foreach my $s ($self->getSubjects()) {
>    print "\n",$s->toString();
>    @cgm = $s->getConsistentGeneModel();
>    print "  CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : "No"), ", ",scalar(@cgm), " HSPs\n";
>    foreach my $h (@cgm) {
>      print "    ",$h->toString();
>    }
>  }
>}
>
>sub compareGeneModelWithTotCons{
>  my $self = shift;
>  #  print "\nQueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>  my $cgmgt1 = 0;
>  my $corCGM = 0;
>  my $totCor = 0;
>  my $totSubs = 0;
>  my @vals;
>  foreach my $s ($self->getSubjects()) {
>    next if $s->countHSPs() < 2;  ##don't do singletons
>    $totSubs++;
>    #  print "\n",$s->toString();
>    #  print $s->getID(),": ",$s->countHSPs()," HSPs\n";
>    #  print "  CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : "No"), ", ",scalar($s->getConsistentGeneModel()), " HSPs\n";
>    my($subOver,$subGaps) = $s->checkSubjectGaps();
>    @vals = ($s->checkOrderConsistency(),$s->checkOrientationConsistency(),$subOver == 0 ? 1 : 0,$subGaps == 0 ?
 1 : 0);
>    #  print "  Old Analysis: Order=$vals[0], Orient=$vals[1], Over=$vals[2], Gaps=$vals[3]\n";
>    $totCor++ if $vals[0]+$vals[1]+$vals[2]+$vals[3] == 4;
>    $corCGM += $s->checkGoodCGM();
>    $cgmgt1++ if scalar($s->getConsistentGeneModel()) > 1;
>  }
>  return($totSubs,$corCGM,$totCor,$cgmgt1);
>}
>
>
>1;
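For reference, the breakpoint test at the heart of findChimeraBreakpoints and findSubjectsSpanningChimeraBreakpoint reduces to a simple interval scan: a candidate chimera breakpoint is a query position where some subjects' alignments end and others' begin, within `slop` bases (the slop absorbs BLAST's imprecision at alignment ends, per the comment in detectChimera). This Python sketch is illustrative only (hypothetical names and (start, end) tuples, not the module's Perl API) and omits the grouping that buildChimera does:

```python
def find_breakpoints(subjects, slop):
    """Candidate breakpoints where one subject's alignment ends and
    another's begins within slop. subjects: list of (start, end)."""
    ends = sorted(e for _, e in subjects)
    starts = sorted(s for s, _ in subjects)
    bps = set()
    for e in ends:
        for s in starts:
            if s > e + slop:
                break               # starts are ascending; the rest are too far right
            if abs(e - s) <= slop:  # an end meets a start: candidate breakpoint
                bps.add((e + s) // 2)
    return sorted(bps)

def spanning_subjects(subjects, loc, slop):
    """Subjects extending more than slop past both sides of loc; any such
    subject argues against the query being chimeric at loc."""
    return [iv for iv in subjects if iv[0] < loc - slop and iv[1] > loc + slop]
```

As in detectChimera above, finding a breakpoint is only the first step: a subject that cleanly spans the breakpoint downgrades the call from "chimera" to "OK" or "review".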