From: Arnaud K. <ax...@sa...> - 2004-10-08 13:41:02
|
Steve Fischer wrote: > Arnaud- > > ok, i've looked at LoadBlastSimFast.pm. I see the addition of the > logic to use name to get the query object (and i know that we > discussed this in mail back in august). > > I am having some second thoughts about that change as it stands. > > The original intent of the plugin was that the sequences submitted to > the blast process have been extracted from the database and therefore > have the primary key in their definition line. > > I think I understand that it would be useful to be able to skip that > step, ie, blast sequences using their native identifiers, and then > have the plugin discover what their internal primary key is. That's > what you want to do right? > that's right. > Does anybody know of any reason why that would not be ok? > > Assuming that nobody has any objections, maybe the best solution would > be to improve the plugin to take optional arguments that specify the > name of the query and/or subject identifier attributes? For > example: -queryIdAttr source_id. This would give us full > flexibility (and also avoid the slightly risky assumption that a > digits-only identifier must be the primary key) > that sounds sensible. > steve > > Arnaud Kerhornou wrote: > >> >> Steve Fischer wrote: >> >>> Arnaud- >>> >>> see below. >>> >>> steve >>> >>> Arnaud Kerhornou wrote: >>> >>>> Hi everyone >>>> >>>> To be able to reproduce the OrthoMCL method, I would like to raise >>>> two issues we've got: >>>> >>>> * The first issue relates to the view where are stored the protein >>>> sequences. I was thinking to use the TranslatedAASequence view as >>>> this one contains the translated sequences of our gene models. The >>>> problem I have is that it is missing a name attribute so I can not >>>> match the blast output query and subject names with the data into >>>> GUS (I didn't want to use the TranslatedAASequence primary keys as >>>> the identifiers of my proteins of interest). >>>> Could we add a name attribute to this view ? >>> >>> >>> >>> hmm. not quite following. what would this name be, where would >>> it be derived from? >> >> >> By default we assign the systematic id of the corresponding CDS to >> the protein name. >> >>> why not use source_id and/or secondary_identifier? >> >> >> We could do that, but in any case that would involve to modify the >> code of the loading BLAST output plugin (LoadBlastSimFast.pm) to get >> the sequences entries. At the moment the match is made on the primary >> key (which I want to avoid) or the name attribute. The source_id >> attribute would do instead of the name attribute. It must work for >> any blast (DNA Vs DNA or Protein Vs Protein) with the various >> potential GUS sequence objects we want to attach similarity data to. >> As far as I can see the source_id attribute is present in all of them >> (AASequenceImp and NASequenceImp tables). >> >>> or, presumably this translated sequence has a relationship back to >>> its na sequence (although i don't immediately see that in the schema >>> browser), so couldn't you get a name or source_id from there? >>> >> That would require a more sophisticate query to get the sequence entry. >> >>>> >>>> * The second issue relates to the BLAST output parsing, done by a >>>> module called BlastAnal.pm in the CBIL package. This module seems >>>> to parse BLAST output file with only one query sequence. I have >>>> more than one query sequence reported so I had to change the code >>>> of this module to allow more than one query sequence. Can my code >>>> be integrated to CBIL package ? Note that I didn't change the >>>> interface of this module so it doesn't affect the scripts that are >>>> using it, I'm thinking in particular of >>>> parseBlastFilesForSimilarity.pl >>> >>> >>> this sounds ok. how about we just take a quick look at this >>> together while you are visiting? then we can fold it into the code >>> base. do you want to send it by mail? >>> >> That's fine, the module is attached. >> >>>> >>>> cheers >>>> Arnaud >>>> >>>> >>> >>------------------------------------------------------------------------ >> >>#!/usr/bin/perl >> >>## perl module for dealing with blast results... >>## will parse both --noseqs and normal blast. >>## does not parse out the alignments, rather meta information for each subject and hsp >> >>## Brian Brunk 5/17/99 >> >>package CBIL::Bio::Blast::BlastAnal; >> >>use CBIL::Bio::Blast::Subject; >>use CBIL::Bio::Blast::HSP; >>use CBIL::Bio::Blast::Chimera; >>use CBIL::Bio::Blast::Repeat; >> >>use strict; >> >>my $debug = 0; >> >>##simple constructor...need to make more robust if inheriting for other purposes >>sub new { >> my $self = {}; >> my $class = shift; >> my $d = shift; >> bless $self,$class; >> $debug = defined $d ? $d : 0; >> print STDERR "DEBUGGING ON\n" if $debug == 1; >> $self->{"countSubjects"} = 0; >> >> # Added a hashtable storing the hits for each query sequence >> my %tmp; >> $self->{hits_by_queryId} = \%tmp; >> $self->{queryIndex} = 0; >> >> return $self; >>} >> >> >> >> >>##DataStructure returned by parseBlast...must have -noseqs on commandline >>##new datastructure.... >>##{queryLength},{queryName} >>##{subjects} = array of Subject objects... >> >>## takes in a blast result as an array and builds the entire datastructure >>sub parseBlast{ >> my $self = shift; >> my($minLength,$minPercent,$minPvalue,$regex,$blast,$remMaskedFromLen,$rpsblast) = @_; >> my %blast; >> my ($strand,$dir,$qStart,$qEnd,$sStart,$sEnd,$pValue,$matchLength,$matchPercent); >> my $sbjct; >> my $queryMatch; >> my $haveQStart = 0; >> my $haveSStart = 0; >> my $parseType; >> my $frame; >> my $score; >> my $identities; >> my $positives; >> >> die 'these are the args to parseBlastNoSeqs($minLength,$minPercent,$minPvalue,$regex,$blast)\n' if scalar(@{$blast}) <= 10; >> >> foreach (@{$blast}) { >> # print STDERR $_; >> if (/^(\S*BLAST\S+)/){ ##gets the algorighm used for the query... >> my $algo = $1; >> print STDERR "algorithm = $algo\n" if $debug; >> if($algo =~ /BLASTX/i){ >> $parseType = 1; >> }else{ >> $parseType = 0; >> } >> print STDERR "ParseType='$parseType'\n" if $debug; >> } >> if (/^Query=\s*(\S+)/) { >> >> # Storing Hits/HSPs for previous query Id >> >> if (defined $self->{queryName}) { >> my $countSubjects = $self->{countSubjects}; >> my $subjects = $self->{subjects}; >> my %hits = ( >> 'countSubjects' => $countSubjects, >> 'subjects' => $subjects >> ); >> >> my $hits_by_queryId = $self->{hits_by_queryId}; >> $hits_by_queryId->{$self->{queryName}} = \%hits; >> } >> >> $self->{queryName} = $1; ##query name/id >> >> # Reinitialisation >> $self->{"countSubjects"} = 0 ; >> my @tmp = (); >> $self->{subjects} = \@tmp; >> >> } elsif (/^\s+\((\S+)\sletters\)/) { >> $self->{queryLength} = $1; ##query length >> $self->{queryLength} =~ s/,//g; >> } >> if (/^\>$regex/ || /ctxfactor/ || ($rpsblast && /^Lambda/)) { ##ctxfactor catches the last one...Lambda last of rpsblast >> my $sbjctId; >> if (/^\>$regex/){ >> $sbjctId = $1 ? $1 : $2; >> }else{ >> print STDERR "$self->{queryName}: Unable to match subject using regex '$regex' on:\n $_" unless (/ctxfactor/ || /^Lambda/); >> } >> print STDERR "Matching subjectId = $sbjctId\n" if $debug; >> ## print STDERR "REGEX matched $_"; >> if ($haveQStart) { >> print STDERR "Have last QStart:", $sbjct->getID()," SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1; >> ##have the complete HSP... >> ##want to add only if meets minimum reqs....do this on an HSP basis... >> # print $sbjct->getID()," $sStart,$sEnd: Length=$matchLength, Percent=$matchPercent, pValue=$pValue\n"; >> if ($matchLength >= $minLength && $matchPercent >= $minPercent && $pValue <= $minPvalue) { >> print STDERR "Match meets reqs\n" if $debug; >> if($remMaskedFromLen){ ##want to remove Xs from the match length.... >># print STDERR "removing X from match\n"; >># print STDERR "Before: $queryMatch\n"; >> $queryMatch =~ s/X//g; >># print STDERR "After: $queryMatch\n"; >> if(length($queryMatch) < 10){ ##don't remove if final length < 10 >> print STDERR "RemoveMaskedResiduesFromLength: error...length = ".length($queryMatch)." following removal of X's\n"; >> }else{ >># print "match before $matchLength\n"; >> $matchLength = length($queryMatch); >># print "match after $matchLength\n"; >> } >> } >> $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, $pValue, $dir, $frame, $score,$identities,$positives) if $sbjct; >> } >> $haveQStart = 0; ##reset for next one... >> $haveSStart = 0; >> } >> if (defined $sbjct && $sbjct->countHSPs() >= 1 && $sbjct->getID() ne $self->{"queryName"}) { >> $self->addSubject($sbjct); >> } >> ##return heree if end... >> return if (/ctxfactor/ || /^Lambda/); >> >> ## print STDERR "New Subject $sbjctId\n"; >> $sbjct = CBIL::Bio::Blast::Subject->new($sbjctId) if $sbjctId; >> ## $sbjct->setID($sbjctId); >> >> } >> if (/^\s+Length\s=\s(\S+)/) { >> if($sbjct){ >> my $sbjctLength = $1; >> $sbjctLength =~ s/,//g; >> $sbjct->setLength($sbjctLength); >> $sbjct->setQueryLength($self->{"queryLength"}); ##set the query length for use in >> } >> ##end remaining calculations >> } >> if (/^\s*Score\s=\s+(\d+).*=\s(\S+)$/ || ($rpsblast && /^\s*Score\s*=\s*\d+\sbits\s\((\d+).*=\s(\S+)$/)) { >> my $tmpScore = $1; >> my $tmpValue = $2; >> >> ##need to add here if $haveQStart != 0; >> if ($haveQStart) { >> ##have the complete HSP... >> ##want to add only if meets minimum reqs....do this on an HSP basis... >> print STDERR "Adding Sbjct (score) ",$sbjct->getID()," SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1; >> if ($matchLength >= $minLength && $matchPercent >= $minPercent && $pValue <= $minPvalue && $sbjct) { >> print STDERR "Adding Sbjct\n" if $debug; >> if($remMaskedFromLen){ ##want to remove Xs from the match length.... >># print STDERR "removing X from match\n"; >># print STDERR "Before: $queryMatch\n"; >> $queryMatch =~ s/X//g; >># print STDERR "After: $queryMatch\n"; >> if(length($queryMatch) < 10){ ##don't remove if final length < 10 >> print STDERR "RemoveMaskedResiduesFromLength: error...length = ".length($queryMatch)." following removal of X's\n"; >> }else{ >># print "match before $matchLength\n"; >> $matchLength = length($queryMatch); >># print "match after $matchLength\n"; >> } >> } >> $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, $pValue, $dir, $frame, $score,$identities,$positives); >> } >> $haveQStart = 0; ##reset for next one... >> $haveSStart = 0; >> } >> $pValue = $tmpValue; >> $score = $tmpScore; >> } >> ##get the strand in the Identities line as the following messes things up!!! >> ##get the strand...will work for all blast types... >># if (/^\s*(Minus|Plus)\sStrand/){ >># $strand = $1; >># } >> ##following specific to blastx >> if ($parseType && /^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Frame\s=\s.*(.\d)$/) { >> $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; $frame = $5; >> $dir = $frame =~ /^\+/ ? 1 : 0; >> ##following specific to blastn >> }elsif (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Strand\s=\s(\w+)/) { >> $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; $strand = $5; >> $dir = $strand eq 'Minus' ? 0 : 1; >> ##following for blastp the default if others not matched... >> }elsif (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+)/){ >> $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; >> $dir = 1; ##always 1 for blastp which is what this catches... >> }elsif (/^\s*Frame\s=\s(.\d)/){ ##frame for rpsblast >> $frame = $1; >> $dir = $frame =~ /^\+/ ? 1 : 0; >> } >> >> if (/^Query:\s+(\d+)\s(.*)\s(\d+)\s*$/) { >> print STDERR "Matching query: $1 $3\n" if $debug; >> if ($haveQStart == 0) { >> $queryMatch = ""; >> $qStart = $1; >> $haveQStart = 1; >> } >> $qEnd = $3; >> my $tmpQMatch = $2; >> $tmpQMatch =~ s/\s//g; ##gets rid of any spaces >> $queryMatch .= $tmpQMatch; >> } elsif (/^Sbjct:\s+(\d+)\s.*\s(\d+)\s*$/) { >> if ($haveSStart == 0) { >> $sStart = $1; >> $haveSStart = 1; >> } >> $sEnd = $2; >> } >> } >> >> # Storing Hits/HSPs for the last query Id >> >> if (defined $self->{queryName}) { >> my $countSubjects = $self->{countSubjects}; >> my $subjects = $self->{subjects}; >> my %hits = ( >> 'countSubjects' => $countSubjects, >> 'subjects' => $subjects >> ); >> my $hits_by_queryId = $self->{hits_by_queryId}; >> $hits_by_queryId->{$self->{queryName}} = \%hits; >> } >> >> return; >>} >> >>sub setQueryLength { >> my($self,$l) = @_; >> $self->{queryLength} = $l; >>} >> >>sub getQueryLength { >> my($self) = @_; >> return $self->{queryLength}; >>} >> >>sub sortSubjectsByMinQueryStart{ >> my $self = shift; >> if (!exists $self->{"sortByQStart"}) { >> @{$self->{"sortByQStart"}} = sort{$a->getMinQueryStart() <=> $b->getMinQueryStart()} $self->getSubjects(); >> } >> return @{$self->{"sortByQStart"}}; >>} >> >>sub sortSubjectsByMaxQueryEnd{ >> my $self = shift; >> if (!exists $self->{"sortByQEnd"}) { >> @{$self->{"sortByQEnd"}} = sort{$a->getMaxQueryEnd() <=> $b->getMaxQueryEnd()} $self->getSubjects(); >> } >> return @{$self->{"sortByQEnd"}}; >>} >> >>##detecting chimeras >>############################################################ >>## Breakpoint = 0 * OK >>## Breakpoint = 1 > [Ch, AS, OK, !] >>## Dangling Ends = none > [Ch, AS, OK] >>## Spanned = 1 * AS, OK >>## Spanned = 0 > [Ch, AS, OK] >>## Deep = 1 * Ch >>## Deep = 0 * Ch, AS, OK (review) >>## Dangling Ends = uni > [Ch, As] >>## Spanned = 1 > [Ch, AS] >>## QvsSub = 1 * Ch >>## QvsSub = 0 * AS >>## Spanned = 0 > [Ch, AS] >>## Deep = 1 * Ch >>## Deep = 0 * Ch, AS (review) >>## Dangling Ends = bi > [Ch, !] >>## Spanned = 1 > [Ch, !] >>## QvsSub = 1 * Ch >>## QvsSub = 0 * ! (review) ##not sure about this one....should it be OK >>## Spanned = 0 * Ch >>############################################################ >> >>##return values: 0=OK, 1=chimera, 2=review no dangling, 3=review 1 dangling, 4=review 2 dangling ends. >>##add to ichimera obj... >>sub detectChimera{ >> my($self,$slop) = @_; ##$slop is the amount of slop in the breakpoint de to blast >> ##loop through each possible chimera breakpoint and test if chimera >> my $qvssub = 1; >> my @span; >> foreach my $c ($self->findChimeraBreakpoints($slop)) { >> >> @span = $self->findSubjectsSpanningChimeraBreakpoint($c->getBreakpoint(),$slop); >> $c->{"numberSpanned"} = scalar(@span); >> >> ##first dangling == 0 >> if ($c->getDangling($slop) == 0) { ##dangle by at least $slop >> if (scalar(@span) > 0) { >> $c->{"isChimera"} = 0; ##spans and no dangling ends therefore OK >> } else { >> if ($c->getDepth() == 1) { >> $c->{"isChimera"} = 1; >> } else { >> $c->{"isChimera"} = 2; ##will requiere review >> } >> } >> >> ##dangling on only one end... >> } elsif ($c->getDangling($slop) == 1) { >> if (scalar(@span) > 0) { >> $qvssub = 1; ##are very similar....looking for one that isn't... >> foreach my $sp (@span) { >> if ($sp->compareQueryVsSubject() == 0) { >> ##good subject spanning....looks lke should be alternative splicing >> $c->{"isChimera"} = 0; >> $qvssub = 0; >> } >> } >> $c->{"isChimera"} = 1 if $qvssub == 1; >> } else { >> if ($c->getDepth() == 1) { >> $c->{"isChimera"} = 1; >> } else { #will need to review.... >> $c->{"isChimera"} = 3; >> } >> } >> >> ##dangling on two ends... >> } elsif ($c->getDangling($slop) == 2) { >> >> if (scalar(@span) > 0) { >> $qvssub = 1; ##are very similar....looking for one that isn't... >> foreach my $sp (@span) { >> if ($sp->compareQueryVsSubject() == 0) { >> ##unclear here....need to review: >> $c->{"isChimera"} = 4; >> $qvssub = 0; >> ## last; >> } >> } >> $c->{"isChimera"} = 1 if $qvssub == 1; >> } else { >> $c->{"isChimera"} = 1; ##is a chimera >> } >> >> } else { >> print "ERROR: getDangling returned invalid value\n"; >> } >> } >> ##now need to go through each of chimeras and output results... >> my $maxChim = 0; >> my $isChimera = 0; >> my $errString = ""; >> if (exists $self->{"chimeras"}) { >> ## $errString = "Chimeras for $self->{'queryName'}\n"; >> my $cnt = 1; >> foreach my $chim (@{$self->{"chimeras"}}) { >> last if $chim eq ""; ##this is just a holder that indicates that the findChimera >> ##method has been run and there are not chimeras... >> $maxChim = $chim->{'isChimera'} if $chim->{'isChimera'} > $maxChim; >> $isChimera = 1 if $chim->{'isChimera'} == 1; >> $errString .= " Chimera $cnt: $chim->{'isChimera'}, danglingEnds=".$chim->getDangling($slop).", numberSpanned=$chim->{'numberSpanned'}, depth=".$chim->getDepth().", numberLeft=".$chim->countLeft().", numRight=".$chim->countRight().", breakpoint=".$chim->getBreakpoint()."\n"; >> # $errString .= $chim->toString(); >> $cnt++; >> } >> } >> return ($isChimera,$maxChim,$errString); >>} >> >>sub getChimeras { >> my($self,$slop) = @_; >> if(!exists $self->{'chimeras'}){ >> $self->detectChimera($slop); >> } >> return @{$self->{'chimeras'}}; >>} >> >>## returns array of Chimera objs >>sub findChimeraBreakpoints{ >> my($self,$slop) = @_; >> if ($self->{'findChimera'} != 1) { >> my @l = $self->sortSubjectsByMaxQueryEnd(); >> my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want right one reversed >> my $lindex = 0; >> my $rindex = 0; >> for (my $li = $lindex;$li < scalar(@l);$li++) { >> last if $l[$li]->getMaxQueryEnd() - $r[$rindex]->getMinQueryStart() > $slop; >> for (my $ri = $rindex;$ri < scalar(@r);$ri++) { >> last if $l[$li]->getMaxQueryEnd() - $r[$ri]->getMinQueryStart() > $slop; >> if (abs($l[$li]->getMaxQueryEnd() - $r[$ri]->getMinQueryStart()) <= $slop) { >> ##create new chimera object and pass arrays into extendBrkpt() >> ($li,$ri) = $self->buildChimera($li,$ri,$slop); ##pass back the index of last l and r >> last; >> } >> } >> } >> $self->{'findChimera'} = 1; ##indicates that have done the analysis >> } >> if (!exists $self->{'chimeras'}) { >> return; >> } >> return @{$self->{"chimeras"}}; >>} >> >>sub buildChimera{ >> my($self,$lindex,$rindex,$slop) = @_; >> my @l = $self->sortSubjectsByMaxQueryEnd(); >> my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want right one reversed >> my $chim = CBIL::Bio::Blast::Chimera->new(); >> ##now add the current and any others to the chimera >> ## the first time in loop will add the current one to $chim >> ##first do left >> my $ledge = $l[$lindex]->getMaxQueryEnd(); >> for (my $li = $lindex;$li < scalar(@l);$li++) { >> if ($l[$li]->getMaxQueryEnd() - $ledge <= $slop) { >> $chim->addLeftSubject($l[$li]); >> } else { >> $lindex = $li; >> last; >> } >> $lindex = $li; >> } >> ##then do right >> my $redge = $r[$rindex]->getMinQueryStart(); >> for (my $ri = $rindex;$ri < scalar(@r);$ri++) { >> if ($redge - $r[$ri]->getMinQueryStart() <= $slop) { >> $chim->addRightSubject($r[$ri]); >> } else { >> $rindex = $ri; >> last; >> } >> $rindex = $ri; >> } >> push(@{$self->{"chimeras"}},$chim); >> return($lindex,$rindex); >>} >> >>##returns array of subjects >>##$slop is amount by which it needs to extend beyond the brkpt. >>sub findSubjectsSpanningChimeraBreakpoint{ >> my($self,$loc,$slop) = @_; >> my @span; >> foreach my $s ($self->sortSubjectsByMinQueryStart()) { >> last if $s->getMinQueryStart() > $loc - $slop; >> if ($s->getMaxQueryEnd() > $loc + $slop) { >> ## print "Spans $loc: ",$s->getMinQueryStart(), "-",$s->getMaxQueryEnd(),"\n"; >> push(@span,$s); >> } >> } >> return @span; >>} >> >>######################################## >>## Detecting unblocked repeats... >>## If No stack - OK >>## >>## If Stack >>## >>## If No overhanging ends in query then OK (query is subsumed by subjects) >>## -will block the query sequences and throw out sequences that do not have >>## >50 bp of un-blocked sequence >>## >>## If Overhanging ends in query (may be only one if repeat is at end) >>## >>## If 1 HSP then likely REPEAT => ignore all matches that don't extend >>## outside this region. >>## >>## Overhanging ends in subjects = 1 => definitely REPEAT >>## >>## Overhanging ends in subjects = 0 => ??very unlikely scenario..not clear >>## >>## If >1 HSP (could be multiple copies of same repeat or perhaps alt. splicing >>## OR could it be a valid match that also has a repeat in it...no should in >>## this case match entirely and have just 1 HSP) >>## >>## If subject locations overlap/superimpose then REPEAT >>## =ignore all matches that do not extend outside these locations >>## between them would be "outside" the locations >>## >>## If subject locations distinct then likely alternative splicing >>## OR alternatively the query and subject could contain two distinct >>## different repeats (I will be blocking for repeats so the likelihood >>## that there are two different repeats that are both novel is small). >>## >>## ?do I want to do further analysis to detemine consistency with >>## alternative splicing model before ruling OK...? >>## ######################################## >> >>##this method is the call to find repeats...should return an array of subjects to >>##ignore... >> >>## find 5' edge (minQueryStart) and gather up then check to see if there is a >>## correlation with 3' (maxQueryEnd) >>## edge...if so is potential repeat and need to check the above rules.. >> >>#what to return here....region to ignore when determining good matches... >> >>sub detectRepeats { >> my($self,$slop) = @_; >> >> if ($self->getSubjectCount() < 5 ) { ##need at least 5 to detecxt repeat... >> return; >> } >> >> my @rep = $self->findRepeats($slop); ##array of repeats.... >> >> if ($debug == 1) { >> print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n"; >> foreach my $r (@rep) { >> print STDERR $r->toString(1),"\n"; >> } >> } >> >> ##want to merge repeats if they are overlapping... >> my @sort = sort{$a->getRepeatStart() <=> $b->getRepeatStart()} @rep; >> undef @rep; ##use for merged ones... >> my $merge = 0; >> my $nr; >> for (my $i = 0;$i<scalar(@sort) - 1;$i++) { >> if ($sort[$i+1]->getRepeatStart() < $sort[$i]->getRepeatEnd()) { >> ##are overlapping...merge >> if ($merge == 0) { >> $nr = CBIL::Bio::Blast::Repeat->new(); >> $merge = 1; >> } >> foreach my $s ($sort[$i]->getSubjects()) { >> $nr->addSubject($s); >> } >> } elsif ($merge == 1) { >> $merge = 0; >> foreach my $s ($sort[$i]->getSubjects()) { >> $nr->addSubject($s); >> } >> push(@rep,$nr); >> } else { >> ##does not overlap so just add to rep >> push(@rep,$sort[$i]); >> } >> } >> ##need to do last if $merge == 1; >> if ($merge == 1) { ##need to add the i+1 (last) case >> foreach my $s ($sort[-1]->getSubjects()) { >> $nr->addSubject($s); >> } >> push(@rep,$nr); >> } >> ##repeats merged and should be non-overlapping >> >> if ($debug == 1) { >> print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n"; >> foreach my $r (@rep) { >> print STDERR $r->toString(1),"\n"; >> } >> } >> return @rep; >>} >> >>##finds all repeat stacks.. >># populates array of subjects that are repeats ... @{$self->{'repeats'}} >> >>sub findRepeats{ >> my($self,$slop) = @_; >> my $rep; >> if (!exists $self->{'repeats'}) { >> my $firstElement = 1; >> my $inRepeat = 0; >> my @s = $self->sortSubjectsByMinQueryStart(); >> ###NOTE: want to only take repeat from a single point rather than each minQuery start >> my $lEdge = $s[0]->getMinQueryStart(); ##this is the first edge >> for (my $i = 0;$i < scalar(@s) - 1;$i++) { >> if ($s[$i+1]->getMinQueryStart() - $lEdge <= $slop) { >> if ($firstElement == 1) { >> $rep = CBIL::Bio::Blast::Repeat->new(); ##create new Repeat object.. >> $firstElement = 0; >> $inRepeat = 1; >> } >> print STDERR "REPEAT: adding subject ",$s[$i]->getID(),"\n" if $debug == 1; >> $rep->addSubject($s[$i]); >> if ($i == scalar(@s) - 2) { ##need to add next one as am in last loop >> $rep->addSubject($s[$i+1]); >> my $verified = $rep->verifyRepeat($slop); ##does all the verification that this is a repeat >> push(@{$self->{'repeats'}},$rep) if $verified == 1; >> } >> >> } elsif ($inRepeat == 1) { ##need this to get the $i+1 as am working one ahead >> $lEdge = $s[$i]->getMinQueryStart(); ##new left edge >> $firstElement = 1; ##now looking for the first element >> $inRepeat = 0; >> ##add the last one nd then verify repeats....Repeat method.. >> ##if really a repeat then add to @$self->{'repeats'} >> $rep->addSubject($s[$i]); >> print STDERR "Repeat contains ",$rep->countSubjects(), " subjects\n" if $debug == 1; >> $rep->verifyRepeat($slop); ##does all the verification that this is a repeat >> push(@{$self->{'repeats'}},$rep) if($rep->countSubjects() > 1 && $rep->getVerified() == 1); >> } else { >> $lEdge = $s[$i]->getMinQueryStart(); ##new left edge >> $firstElement = 1; ##now looking for the first element >> } >> } >> >> if (!exists $self->{'repeats'}) { ##there aren't any....make 0 >> $self->{'repeats'}->[0] = ""; >> } >> } >> if ($self->{'repeats'}->[0] eq "") { >> return; >> } >> return @{$self->{'repeats'}}; >>} >> >>##NOTE: the repeat object is initially populated with things that have a common >>## left edge. It is the job of the Repeat to then verify that this constitutes >>## a repeat and remove any subjects that don't meet ehe rules for a repeat stack above >>#3will need a 'verified' bit to know if the repeat bj is verified...else verify >>## before doing anyhing else. >> >> >>sub getQuery5endMis{ >> my $self = shift; >> if (!exists $self->{"query5pMis"}) { >> $self->{"query5pMis"} = 10000; ##arbritrarily large as want smallest >> foreach my $s ($self->getSubjects()) { >> if ($self->{"query5pMis"} > $s->getQuery5endMis()) { >> $self->{"query5pMis"} = $s->getQuery5endMis(); >> } >> } >> } >> return $self->{"query5pMis"}; >>} >> >>sub getQuery3endMis{ >> my $self = shift; >> if (!exists $self->{"query3pMis"}) { >> $self->{"query3pMis"} = 10000; ##arbritrarily large as want smallest >> foreach my $s ($self->getSubjects()) { >> if ($self->{"query3pMis"} > $s->getQuery3endMis()) { >> $self->{"query3pMis"} = $s->getQuery3endMis(); >> } >> } >> } >> return $self->{"query3pMis"}; >>} >> >>##returns 0 if leftover at ends and 1 if covered to ends >>sub getCoveredBySubjects{ >> my $self = shift; >> my $left = shift; ##amount to be leftover.... >> if (! $left) { $left = 20; } >> if ($self->getQuery5endMis() <= $left && $self->getQuery3endMis() <= $left) { >> return 1; >> } >> return 0; >>} >> >>sub analyzeSubjectsBySubject{ >> my $self = shift; >> return if $self->getSubjectCount() == 0; >> my @singletons; >> my %multiples; >> my $haveImmuno = 0; ##keeps track if is immunoglobulin gene... >> my $order = 0; ##order >> my $orientation = 0; ##orientation >> my $subOver = 0; ##number of subject overlaps >> my $subGaps = 0; ##number of subject Gaps >> my @multiples; >> foreach my $s ($self->getSubjects()) { >> my $id = $s->getID(); >> # if($id <= 100009){ >> # $haveImmuno = 1; >> # next; ##don't want to include these in the analysis >> # } >> if ($s->countHSPs() == 1) { ##don't print out singletons HSPS >> push(@singletons,$id); >> next; >> } >> push(@multiples,$id); >> print "\n\>DT.",$id,", Length=",$s->getLength(),", Num HSPs=",$s->countHSPs(),($s->checkConsistency() ? " Order and Orient Consistent" : " Order or Orient InConsistent"),"\n" if $debug == 1; >> ($subOver,$subGaps) = $s->checkSubjectGaps(); >> print "\tSubject Overlaps = $subOver, SubjectGaps = $subGaps\n" if $debug == 1; >> $order = $s->checkOrderConsistency(); >> $orientation = $s->checkOrientationConsistency(); >> print STDERR $s->toString() if $order + $orientation != 2; >> $multiples{$id} = [$s->countHSPs(),$s->countContainedQueryHSPs(),$order,$orientation,$subOver == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0]; ##this is the data for multiples! >> } >> print "\nSingletons: ",scalar(@singletons),"\n" if $debug == 1; >> >> # if($haveImmuno == 1){print "MATCHES immunoglobulin genes\n";} >> return(\@singletons,\%multiples,$haveImmuno); >>} >> >>sub getStats{ >> my $self = shift; >> my $minNum = shift; ##minimum number of HSPs >> my $ret; >> # $minNum eq "" ? $minNum = 1 : $minNum; >> print "QueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n"; >> if ($self->{"countSubjects"} == 0) { >> $ret = "\tNo Hits\n"; >> return $ret; >> } >> foreach my $s ($self->getSubjects()) { >> next if $s->countHSPs() < $minNum; ##don't print out singletons HSPS >> $ret .= $s->toString(); >> } >> return $ret; >>} >> >>sub printStats { >> my($self,$minNum) = @_; >> print $self->getStats($minNum); >>} >> >>sub addSubject{ >> my($self,$sbjct) = @_; >> push(@{$self->{subjects}},$sbjct); >> $self->{"countSubjects"}++; >>} >> >>sub getSubjectCount{ >> my $self = shift; >> # return $self->{"countSubjects"}; >> >> my $queryId = shift; >> my $hits_by_queryId = $self->{hits_by_queryId}; >> my $hits = $hits_by_queryId->{$queryId}; >> return $hits->{countSubjects}; >>} >> >>sub getSubjects{ >> my $self = shift; >> # return if $self->{"countSubjects"} == 0; >> # return @{$self->{subjects}}; >> >> my $queryId = shift; >> my $hits_by_queryId = $self->{hits_by_queryId}; >> my $hits = $hits_by_queryId->{$queryId}; >> my $count = $hits->{countSubjects}; >> my $subjects = $hits->{subjects}; >> >> return if ($count == 0); >> return @$subjects; >>} >> >>sub getQueryID{ >> my $self = shift; >> # return $self->{"queryName"}; >> >> my $index = $self->{queryIndex}; >> my $hits_by_queryId = $self->{hits_by_queryId}; >> my @queryIds = keys (%$hits_by_queryId); >> if ($index < @queryIds) { >> $self->{queryIndex} = $index + 1; >> my $queryId = $queryIds[$index]; >> return $queryId; >> } >> else { >> return undef; >> } >>} >> >>sub printThreshHoldHits { >> my $self = shift; >> return if $self->{"countSubjects"} == 0; >> foreach my $s ($self->getSubjects()) { >> print $self->getQueryID(), "\t",$s->getID(),"\n"; >> } >>} >> >>sub printGeneModels{ >> my $self = shift; >> my @cgm; >> print "QueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n"; >> if ($self->{"countSubjects"} == 0) { >> print "\tNo Hits\n"; >> return; >> } >> foreach my $s ($self->getSubjects()) { >> print "\n",$s->toString(); >> @cgm = $s->getConsistentGeneModel(); >> print " CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : "No"), ", ",scalar(@cgm), " HSPs\n"; >> foreach my $h (@cgm) { >> print " ",$h->toString(); >> } >> } >>} >> >>sub compareGeneModelWithTotCons{ >> my $self = shift; >> # print "\nQueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n"; >> my $cgmgt1 = 0; >> my $corCGM = 0; >> my $totCor = 0; >> my $totSubs = 0; >> my @vals; >> foreach my $s ($self->getSubjects()) { >> next if $s->countHSPs() < 2; ##don't do singletons >> $totSubs++; >> # print "\n",$s->toString(); >> # print $s->getID(),": ",$s->countHSPs()," HSPs\n"; >> # print " CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : "No"), ", ",scalar($s->getConsistentGeneModel()), " HSPs\n"; >> my($subOver,$subGaps) = $s->checkSubjectGaps(); >> @vals = ($s->checkOrderConsistency(),$s->checkOrientationConsistency(),$subOver == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0); >> # print " Old Analysis: Order=$vals[0], Orient=$vals[1], Over=$vals[2], Gaps=$vals[3]\n"; >> $totCor++ if $vals[0]+$vals[1]+$vals[2]+$vals[3] == 4; >> $corCGM += $s->checkGoodCGM(); >> $cgmgt1++ if scalar($s->getConsistentGeneModel()) > 1; >> } >> return($totSubs,$corCGM,$totCor,$cgmgt1); >>} >> >> >>1; >> >> |