[Gusdev-gusdev] LoadBlastSimFast issue

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Arnaud-

ok, i've looked at LoadBlastSimFast.pm.     I see the addition of the 
logic to use name to get the query object (and i know that we discussed 
this in mail back in august).

I am having some second thoughts about that change as it stands.  

The original intent of the plugin was that the sequences submitted to 
the blast process have been extracted from the database and therefore 
have the primary key in their definition line.

I think I understand that it would be useful to be able to skip that 
step, ie, blast sequences using their native identifiers, and then have 
the plugin discover what their internal primary key is.    That's what 
you want to do right?

Does anybody know of any reason why that would not be ok?

Assuming that nobody has any objections, maybe the best solution would 
be to improve the plugin to take optional arguments that specify the 
name of the query and/or subject identifier attributes?    For example:  
-queryIdAttr source_id.   This would give us full flexibility (and also 
avoid the slightly risky assumption that a digits-only identifier must 
be the primary key)

steve

Arnaud Kerhornou wrote:

>
> Steve Fischer wrote:
>
>> Arnaud-
>>
>> see below.
>>
>> steve
>>
>> Arnaud Kerhornou wrote:
>>
>>> Hi everyone
>>>
>>> To be able to reproduce the OrthoMCL method, I would like to raise 
>>> two issues we've got:
>>>
>>> * The first issue relates to the view where are stored the protein 
>>> sequences. I was thinking to use the TranslatedAASequence view as 
>>> this one contains the translated sequences of our gene models. The 
>>> problem I have is that it is missing a name attribute so I can not 
>>> match the blast output query and subject names with the data into 
>>> GUS (I didn't want to use the TranslatedAASequence primary keys as 
>>> the identifiers of my proteins of interest).
>>> Could we add a name attribute to this view ?
>>
>>
>>
>> hmm.   not quite following.   what would this name be, where would it 
>> be derived from?
>
>
> By default we assign the systematic id of the corresponding CDS to the 
> protein name.
>
>>   why not use source_id and/or secondary_identifier? 
>
>
> We could do that, but in any case that would involve to modify the 
> code of the loading BLAST output plugin (LoadBlastSimFast.pm) to get 
> the sequences entries. At the moment the match is made on the primary 
> key (which I want to avoid) or the name attribute. The source_id 
> attribute would do instead of the name attribute. It must work for any 
> blast (DNA Vs DNA or Protein Vs Protein) with the various potential 
> GUS sequence objects we want to attach similarity data to. As far as I 
> can see the source_id attribute is present in all of them 
> (AASequenceImp and NASequenceImp tables).
>
>>   or, presumably this translated sequence has a relationship back to 
>> its na sequence (although i don't immediately see that in the schema 
>> browser), so couldn't you get a name or source_id from there?
>>
> That would require a more sophisticate query to get the sequence entry.
>
>>>
>>> * The second issue relates to the BLAST output parsing, done by a 
>>> module called BlastAnal.pm in the CBIL package. This module seems to 
>>> parse BLAST output file with only one query sequence. I have more 
>>> than one query sequence reported so I had to change the code of this 
>>> module to allow more than one query sequence. Can my code be 
>>> integrated  to CBIL package ? Note that I didn't change the 
>>> interface of this module so it doesn't affect the scripts that are 
>>> using it, I'm thinking in particular of parseBlastFilesForSimilarity.pl
>>
>>
>> this sounds ok.   how about we just take a quick look at this 
>> together while you are visiting?   then we can fold it into the code 
>> base.   do you want to send it by mail?
>>
> That's fine, the module is attached.
>
>>>
>>> cheers
>>> Arnaud
>>>
>>>
>>
>------------------------------------------------------------------------
>
>#!/usr/bin/perl
>
>## perl module for dealing with blast results...
>## will parse both  --noseqs and normal blast.
>## does not parse out the alignments, rather meta information for each subject and hsp
>
>## Brian Brunk 5/17/99
>
>package CBIL::Bio::Blast::BlastAnal;
>
>use CBIL::Bio::Blast::Subject;
>use CBIL::Bio::Blast::HSP;
>use CBIL::Bio::Blast::Chimera;
>use CBIL::Bio::Blast::Repeat;
>
>use strict;
>
>my $debug = 0;
>
>##simple constructor...need to make more robust if inheriting for other purposes
>sub new {
>  my $self = {};
>  my $class = shift;
>  my $d = shift;
>  bless $self,$class;
>  $debug = defined $d ? $d : 0;
>  print STDERR "DEBUGGING ON\n" if $debug == 1;
>  $self->{"countSubjects"} = 0;
>
>  # Added a hashtable storing the hits for each query sequence
>  my %tmp;
>  $self->{hits_by_queryId} = \%tmp;
>  $self->{queryIndex} = 0;
>
>  return $self;
>}
>
>
>
>
>##DataStructure returned by parseBlast...must have -noseqs on commandline
>##new datastructure....
>##{queryLength},{queryName}
>##{subjects} = array of Subject objects...
>
>## takes in a blast result as an array and builds the entire datastructure
>sub parseBlast{
>  my $self = shift;
>  my($minLength,$minPercent,$minPvalue,$regex,$blast,$remMaskedFromLen,$rpsblast) = @_;
>  my %blast;
>  my ($strand,$dir,$qStart,$qEnd,$sStart,$sEnd,$pValue,$matchLength,$matchPercent);
>  my $sbjct;
>  my $queryMatch;
>  my $haveQStart = 0;
>  my $haveSStart = 0;
>  my $parseType;
>  my $frame;
>  my $score;
>  my $identities;
>  my $positives;
>  
>  die 'these are the args to parseBlastNoSeqs($minLength,$minPercent,$minPvalue,$regex,$blast)\n' if scalar(@{$blast}) <= 10;
>   
>  foreach (@{$blast}) {
>    #    print STDERR $_;
>    if (/^(\S*BLAST\S+)/){   ##gets the algorighm used for the query...
>      my $algo = $1;
>      print STDERR "algorithm = $algo\n" if $debug;
>      if($algo =~ /BLASTX/i){
>        $parseType = 1;
>      }else{
>        $parseType = 0;
>      }
>      print STDERR "ParseType='$parseType'\n" if $debug;
>    }
>    if (/^Query=\s*(\S+)/) {
>    
>      # Storing Hits/HSPs for previous query Id
>    
>      if (defined $self->{queryName}) {
>        my $countSubjects = $self->{countSubjects};
>        my $subjects       = $self->{subjects};
>        my %hits = (
>                    'countSubjects' => $countSubjects,
>                    'subjects'      => $subjects
>                   );
>                   
>        my $hits_by_queryId = $self->{hits_by_queryId};           
>        $hits_by_queryId->{$self->{queryName}} = \%hits;
>      }
>    
>      $self->{queryName} = $1;	##query name/id
>
>      # Reinitialisation
>      $self->{"countSubjects"} = 0 ;
>      my @tmp = ();
>      $self->{subjects} = \@tmp;                          
>      
>    } elsif (/^\s+\((\S+)\sletters\)/) {
>      $self->{queryLength} = $1; ##query length
>      $self->{queryLength} =~ s/,//g;
>    }
>    if (/^\>$regex/ || /ctxfactor/ || ($rpsblast && /^Lambda/)) { ##ctxfactor catches the last one...Lambda last of rpsblast
>      my $sbjctId;
>      if (/^\>$regex/){
>        $sbjctId = $1 ? $1 : $2;
>      }else{
>        print STDERR "$self->{queryName}: Unable to match subject using regex '$regex' on:\n  $_" unless (/ctxfactor/ || /^Lambda/);
>      }
>      print STDERR "Matching subjectId = $sbjctId\n" if $debug;
>      ##      print STDERR "REGEX matched $_";
>      if ($haveQStart) {
>        print STDERR "Have last QStart:", $sbjct->getID()," SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1;
>        ##have the complete HSP...
>        ##want to add only if meets minimum reqs....do this on an HSP basis...
>        #	print $sbjct->getID()," $sStart,$sEnd: Length=$matchLength, Percent=$matchPercent, pValue=$pValue\n";
>        if ($matchLength >= $minLength && $matchPercent >= $minPercent && $pValue <= $minPvalue) {
>          print STDERR "Match meets reqs\n" if $debug;
>          if($remMaskedFromLen){  ##want to remove Xs from the match length....
>#            print STDERR "removing X from match\n";
>#            print STDERR "Before: $queryMatch\n";
>            $queryMatch =~ s/X//g;
>#            print STDERR "After: $queryMatch\n";
>            if(length($queryMatch) < 10){ ##don't remove if final length < 10
>              print STDERR "RemoveMaskedResiduesFromLength: error...length = ".length($queryMatch)." following removal of X's\n";
>            }else{
>#              print "match before $matchLength\n";
>              $matchLength = length($queryMatch);
>#              print "match after $matchLength\n";
>            }
>          }
>          $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, $pValue, $dir, $frame, $score,$identities,$positives) if $sbjct;
>        } 
>        $haveQStart = 0;				##reset for next one...
>        $haveSStart = 0;
>      } 
>      if (defined $sbjct && $sbjct->countHSPs() >= 1 && $sbjct->getID() ne $self->{"queryName"}) {
>        $self->addSubject($sbjct);
>      }
>      ##return heree if end...
>      return if (/ctxfactor/ || /^Lambda/);
>      
>      ##      print STDERR "New Subject $sbjctId\n";
>      $sbjct = CBIL::Bio::Blast::Subject->new($sbjctId) if $sbjctId;
>      ##      $sbjct->setID($sbjctId);
>      
>    }
>    if (/^\s+Length\s=\s(\S+)/) {
>      if($sbjct){
>        my $sbjctLength = $1;
>        $sbjctLength =~ s/,//g;
>        $sbjct->setLength($sbjctLength);
>        $sbjct->setQueryLength($self->{"queryLength"});	##set the query length for use in 
>      }
>      ##end remaining calculations
>    }
>    if (/^\s*Score\s=\s+(\d+).*=\s(\S+)$/ || ($rpsblast && /^\s*Score\s*=\s*\d+\sbits\s\((\d+).*=\s(\S+)$/)) {
>      my $tmpScore = $1;
>      my $tmpValue = $2;
>
>      ##need to add here if $haveQStart != 0;
>      if ($haveQStart) {
>        ##have the complete HSP...
>        ##want to add only if meets minimum reqs....do this on an HSP basis...
>        print STDERR "Adding Sbjct (score) ",$sbjct->getID()," SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1;
>        if ($matchLength >= $minLength && $matchPercent >= $minPercent && $pValue <= $minPvalue && $sbjct) {
>          print STDERR "Adding Sbjct\n" if $debug;
>          if($remMaskedFromLen){  ##want to remove Xs from the match length....
>#            print STDERR "removing X from match\n";
>#            print STDERR "Before: $queryMatch\n";
>            $queryMatch =~ s/X//g;
>#            print STDERR "After: $queryMatch\n";
>            if(length($queryMatch) < 10){ ##don't remove if final length < 10
>              print STDERR "RemoveMaskedResiduesFromLength: error...length = ".length($queryMatch)." following removal of X's\n";
>            }else{
>#              print "match before $matchLength\n";
>              $matchLength = length($queryMatch);
>#              print "match after $matchLength\n";
>            }
>          }
>          $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, $pValue, $dir, $frame, $score,$identities,$positives); 
>        }
>        $haveQStart = 0;				##reset for next one...
>        $haveSStart = 0;
>      }
>      $pValue = $tmpValue;
>      $score = $tmpScore;
>    }
>    ##get the strand in the Identities line as the following messes things up!!!
>    ##get the strand...will work for all blast types...
>#    if (/^\s*(Minus|Plus)\sStrand/){
>#      $strand = $1;
>#    }
>    ##following specific to blastx
>    if ($parseType && /^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Frame\s=\s.*(.\d)$/) {
>      $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; $frame = $5;
>      $dir = $frame =~ /^\+/ ? 1 : 0;
>      ##following specific to blastn
>    }elsif (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Strand\s=\s(\w+)/) {
>      $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; $strand = $5;
>      $dir = $strand eq 'Minus' ? 0 : 1;
>      ##following for blastp the default if others not matched...
>    }elsif (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+)/){
>      $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4;
>      $dir = 1;  ##always 1 for blastp which is what this catches...
>    }elsif (/^\s*Frame\s=\s(.\d)/){ ##frame for rpsblast
>      $frame = $1;
>      $dir = $frame =~ /^\+/ ? 1 : 0;
>    }
>
>    if (/^Query:\s+(\d+)\s(.*)\s(\d+)\s*$/) {
>      print STDERR "Matching query: $1 $3\n" if $debug;
>      if ($haveQStart == 0) {
>        $queryMatch = "";
>        $qStart = $1; 
>        $haveQStart = 1;
>      }
>      $qEnd = $3;
>      my $tmpQMatch = $2;
>      $tmpQMatch =~ s/\s//g;  ##gets rid of any spaces
>      $queryMatch .= $tmpQMatch;
>    } elsif (/^Sbjct:\s+(\d+)\s.*\s(\d+)\s*$/) {
>      if ($haveSStart == 0) {
>        $sStart = $1; 
>        $haveSStart = 1;
>      }
>      $sEnd = $2;
>    }
>  }
>  
>  # Storing Hits/HSPs for the last query Id
>    
>  if (defined $self->{queryName}) {
>    my $countSubjects = $self->{countSubjects};
>    my $subjects       = $self->{subjects};
>    my %hits = (
>                'countSubjects' => $countSubjects,
>                'subjects'      => $subjects
>               );
>    my $hits_by_queryId = $self->{hits_by_queryId};           
>    $hits_by_queryId->{$self->{queryName}} = \%hits;
>  }
>  
>  return;
>}  
>
>sub setQueryLength {
>  my($self,$l) = @_;
>  $self->{queryLength} = $l;
>}
>
>sub getQueryLength {
>  my($self) = @_;
>  return $self->{queryLength};
>}
>
>sub sortSubjectsByMinQueryStart{
>  my $self = shift;
>  if (!exists $self->{"sortByQStart"}) {
>    @{$self->{"sortByQStart"}} = sort{$a->getMinQueryStart() <=> $b->getMinQueryStart()} $self->getSubjects();
>  }
>  return @{$self->{"sortByQStart"}};
>}
>
>sub sortSubjectsByMaxQueryEnd{
>  my $self = shift;
>  if (!exists $self->{"sortByQEnd"}) {
>    @{$self->{"sortByQEnd"}} = sort{$a->getMaxQueryEnd() <=> $b->getMaxQueryEnd()} $self->getSubjects();
>  }
>  return @{$self->{"sortByQEnd"}};
>}
>
>##detecting chimeras
>############################################################
>## Breakpoint = 0          * OK
>## Breakpoint = 1          > [Ch, AS, OK, !]
>##   Dangling Ends = none  > [Ch, AS, OK] 
>##     Spanned = 1         * AS, OK
>##     Spanned = 0         > [Ch, AS, OK]
>##       Deep = 1          * Ch
>##       Deep = 0          * Ch, AS, OK (review)
>##   Dangling Ends = uni   > [Ch, As]
>##     Spanned = 1         > [Ch, AS]
>##       QvsSub = 1        * Ch
>##       QvsSub = 0        * AS
>##     Spanned = 0         > [Ch, AS]
>##       Deep = 1          * Ch
>##       Deep = 0          * Ch, AS (review)
>##   Dangling Ends = bi    > [Ch, !]
>##     Spanned = 1         > [Ch, !]
>##       QvsSub = 1        * Ch
>##       QvsSub = 0        * ! (review) ##not sure about this one....should it be OK
>##     Spanned = 0         * Ch
>############################################################    
>
>##return values: 0=OK, 1=chimera, 2=review no dangling, 3=review 1 dangling, 4=review 2 dangling ends.
>##add to ichimera obj...
>sub detectChimera{
>  my($self,$slop) = @_;					##$slop is the amount of slop in the breakpoint de to blast
>  ##loop through each possible chimera breakpoint and test if chimera
>  my $qvssub = 1;
>  my @span;
>  foreach my $c ($self->findChimeraBreakpoints($slop)) {
>    
>    @span = $self->findSubjectsSpanningChimeraBreakpoint($c->getBreakpoint(),$slop);
>    $c->{"numberSpanned"} = scalar(@span);
>
>    ##first dangling == 0
>    if ($c->getDangling($slop) == 0) { ##dangle by at least $slop
>      if (scalar(@span) > 0) {
>        $c->{"isChimera"} = 0;	##spans and no dangling ends therefore OK
>      } else {
>        if ($c->getDepth() == 1) {
>          $c->{"isChimera"} = 1;
>        } else {
>          $c->{"isChimera"} = 2; ##will requiere review
>        }
>      }
>      
>      ##dangling on only one end...
>    } elsif ($c->getDangling($slop) == 1) {
>      if (scalar(@span) > 0) {
>        $qvssub = 1;						##are very similar....looking for one that isn't...
>        foreach my $sp (@span) {
>          if ($sp->compareQueryVsSubject() == 0) {
>            ##good subject spanning....looks lke should be alternative splicing
>            $c->{"isChimera"} = 0;
>            $qvssub = 0;
>          }
>        }
>        $c->{"isChimera"} = 1 if $qvssub == 1; 
>      } else {
>        if ($c->getDepth() == 1) {
>          $c->{"isChimera"} = 1;
>        } else {								#will need to review....
>          $c->{"isChimera"} = 3;
>        }
>      }
>      
>      ##dangling on two ends...
>    } elsif ($c->getDangling($slop) == 2) {
>      
>      if (scalar(@span) > 0) {
>        $qvssub = 1;						##are very similar....looking for one that isn't...
>        foreach my $sp (@span) {
>          if ($sp->compareQueryVsSubject() == 0) {
>            ##unclear here....need to review:
>            $c->{"isChimera"} = 4;
>            $qvssub = 0;
>            ##            last;
>          }
>        }
>        $c->{"isChimera"} = 1 if $qvssub == 1; 
>      } else {
>        $c->{"isChimera"} = 1;  ##is a chimera
>      }
>      
>    } else {  
>      print "ERROR: getDangling returned invalid value\n";
>    }
>  }
>  ##now need to go through each of chimeras and output results...
>  my $maxChim = 0;
>  my $isChimera = 0;
>  my $errString = "";
>  if (exists $self->{"chimeras"}) {
>    ##    $errString = "Chimeras for $self->{'queryName'}\n";
>    my $cnt = 1;
>    foreach my $chim (@{$self->{"chimeras"}}) {
>      last if $chim eq "";			##this is just a holder that indicates that the findChimera
>      ##method has been run and there are not chimeras...
>      $maxChim = $chim->{'isChimera'} if $chim->{'isChimera'} > $maxChim;
>      $isChimera = 1 if $chim->{'isChimera'} == 1;
>      $errString .= "  Chimera $cnt: $chim->{'isChimera'}, danglingEnds=".$chim->getDangling($slop).", numberSpanned=$chim->{'numberSpanned'}, depth=".$chim->getDepth().", numberLeft=".$chim->countLeft().", numRight=".$chim->countRight().", breakpoint=".$chim->getBreakpoint()."\n";
>      #      $errString .= $chim->toString();
>      $cnt++;
>    }
>  }
>  return ($isChimera,$maxChim,$errString);
>}
>
>sub getChimeras {
>  my($self,$slop) = @_;
>  if(!exists $self->{'chimeras'}){
>    $self->detectChimera($slop);
>  }
>  return @{$self->{'chimeras'}};
>}
>
>## returns array of Chimera objs
>sub findChimeraBreakpoints{
>  my($self,$slop) = @_;
>  if ($self->{'findChimera'} != 1) {
>    my @l = $self->sortSubjectsByMaxQueryEnd();
>    my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want right one reversed
>    my $lindex = 0;
>    my $rindex = 0;
>    for (my $li = $lindex;$li < scalar(@l);$li++) {
>      last if $l[$li]->getMaxQueryEnd() - $r[$rindex]->getMinQueryStart() > $slop;
>      for (my $ri = $rindex;$ri < scalar(@r);$ri++) {
>        last if $l[$li]->getMaxQueryEnd() - $r[$ri]->getMinQueryStart() > $slop;
>        if (abs($l[$li]->getMaxQueryEnd() - $r[$ri]->getMinQueryStart()) <= $slop) {
>          ##create new chimera object and pass arrays into extendBrkpt()
>          ($li,$ri) = $self->buildChimera($li,$ri,$slop); ##pass back the index of last l and r
>          last;
>        }
>      }
>    }
>    $self->{'findChimera'} = 1;	##indicates that have done the analysis
>  }
>  if (!exists $self->{'chimeras'}) {
>    return;
>  }
>  return @{$self->{"chimeras"}};
>}
>
>sub buildChimera{
>  my($self,$lindex,$rindex,$slop) = @_;
>  my @l = $self->sortSubjectsByMaxQueryEnd();
>  my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want right one reversed
>  my $chim = CBIL::Bio::Blast::Chimera->new();
>  ##now add the current and any others to the chimera
>  ## the first time in loop will add the current one to $chim
>  ##first do left
>  my $ledge = $l[$lindex]->getMaxQueryEnd();
>  for (my $li = $lindex;$li < scalar(@l);$li++) {
>    if ($l[$li]->getMaxQueryEnd() - $ledge <= $slop) {
>      $chim->addLeftSubject($l[$li]);
>    } else {
>      $lindex = $li;
>      last;
>    }
>    $lindex = $li;
>  }
>  ##then do right
>  my $redge = $r[$rindex]->getMinQueryStart();
>  for (my $ri = $rindex;$ri < scalar(@r);$ri++) {
>    if ($redge - $r[$ri]->getMinQueryStart() <= $slop) {
>      $chim->addRightSubject($r[$ri]); 
>    } else {
>      $rindex = $ri;
>      last;
>    }
>    $rindex = $ri;
>  }
>  push(@{$self->{"chimeras"}},$chim);
>  return($lindex,$rindex);
>}
>
>##returns array of subjects
>##$slop is amount by which it needs to extend beyond the brkpt.
>sub findSubjectsSpanningChimeraBreakpoint{
>  my($self,$loc,$slop) = @_;
>  my @span;
>  foreach my $s ($self->sortSubjectsByMinQueryStart()) {
>    last if $s->getMinQueryStart() > $loc - $slop;
>    if ($s->getMaxQueryEnd() > $loc + $slop) {
>      ##      print "Spans $loc: ",$s->getMinQueryStart(), "-",$s->getMaxQueryEnd(),"\n";
>      push(@span,$s);
>    }
>  }
>  return @span;
>}
>
>########################################
>## Detecting unblocked repeats...
>## If No stack - OK
>## 
>## If Stack
>## 
>##    If No overhanging ends in query then OK (query is subsumed by subjects)
>##       -will block the query sequences and throw out sequences that do not have
>##             >50 bp of un-blocked sequence
>## 
>##    If Overhanging ends in query (may be only one if repeat is at end)
>## 
>##       If 1 HSP then likely REPEAT => ignore all matches that don't extend
>##          outside this region.
>##          
>##          Overhanging ends in subjects = 1 => definitely REPEAT
>## 
>##          Overhanging ends in subjects = 0 => ??very unlikely scenario..not clear
>## 
>##       If >1 HSP (could be multiple copies of same repeat or perhaps alt. splicing
>##         OR could it be a valid match that also has a repeat in it...no should in
>##         this case match entirely and have just 1 HSP)
>## 
>##          If subject locations overlap/superimpose then REPEAT 
>##             =ignore all matches that do not extend outside these locations
>##                     between them would be "outside" the locations
>##          
>##          If subject locations distinct then likely alternative splicing
>##             OR alternatively the query and subject could contain two distinct
>##               different repeats (I will be blocking for repeats so the likelihood
>##               that there are two different repeats that are both novel is small).
>## 
>##             ?do I want to do further analysis to detemine consistency with
>##                 alternative splicing model before ruling OK...?
>## ########################################
>
>##this method is the call to find repeats...should return an array of subjects to
>##ignore...
>
>## find 5' edge (minQueryStart) and gather up then check to see if there is a 
>## correlation with 3' (maxQueryEnd)
>## edge...if so is potential repeat and need to check the above rules..
>
>#what to return here....region to ignore when determining good matches...
>
>sub detectRepeats {
>  my($self,$slop) = @_;
>
>  if ($self->getSubjectCount() < 5 ) { ##need at least 5 to detecxt repeat...
>    return;
>  }
>
>  my @rep = $self->findRepeats($slop); ##array of repeats....
>
>  if ($debug == 1) {
>    print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n";
>    foreach my $r (@rep) {
>      print STDERR $r->toString(1),"\n";
>    }
>  }
>
>  ##want to merge repeats if they are overlapping...
>  my @sort = sort{$a->getRepeatStart() <=> $b->getRepeatStart()} @rep;
>  undef @rep;										##use for merged ones...
>  my $merge = 0;
>  my $nr;
>  for (my $i = 0;$i<scalar(@sort) - 1;$i++) {
>    if ($sort[$i+1]->getRepeatStart() < $sort[$i]->getRepeatEnd()) {
>      ##are overlapping...merge
>      if ($merge == 0) {
>        $nr = CBIL::Bio::Blast::Repeat->new();
>        $merge = 1;
>      }
>      foreach my $s ($sort[$i]->getSubjects()) {
>        $nr->addSubject($s);
>      }
>    } elsif ($merge == 1) {
>      $merge = 0;
>      foreach my $s ($sort[$i]->getSubjects()) {
>        $nr->addSubject($s);
>      }
>      push(@rep,$nr);
>    } else {
>      ##does not overlap so just add to rep
>      push(@rep,$sort[$i]);
>    }
>  }
>  ##need to do last if $merge == 1;
>  if ($merge == 1) {						##need to add the i+1 (last) case
>    foreach my $s ($sort[-1]->getSubjects()) {
>      $nr->addSubject($s);
>    }
>    push(@rep,$nr);
>  }
>  ##repeats merged and should be non-overlapping
>
>  if ($debug == 1) {
>    print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n";
>    foreach my $r (@rep) {
>      print STDERR $r->toString(1),"\n";
>    }
>  }
>  return @rep;
>}
>
>##finds all repeat stacks..
># populates array of subjects that are repeats ... @{$self->{'repeats'}}
>
>sub findRepeats{
>  my($self,$slop) = @_;
>  my $rep;
>  if (!exists $self->{'repeats'}) {
>    my $firstElement = 1;
>    my $inRepeat = 0;
>    my @s = $self->sortSubjectsByMinQueryStart();
>    ###NOTE:  want to only take repeat from a single point rather than each minQuery start
>    my $lEdge = $s[0]->getMinQueryStart(); ##this is the first edge
>    for (my  $i = 0;$i < scalar(@s) - 1;$i++) { 
>      if ($s[$i+1]->getMinQueryStart() - $lEdge <= $slop) {
>        if ($firstElement == 1) {
>          $rep = CBIL::Bio::Blast::Repeat->new(); ##create new Repeat object..
>          $firstElement = 0;
>          $inRepeat = 1;
>        }
>        print STDERR "REPEAT: adding subject ",$s[$i]->getID(),"\n" if $debug == 1;
>        $rep->addSubject($s[$i]);
>        if ($i == scalar(@s) - 2) {	##need to add next one as am in last loop
>          $rep->addSubject($s[$i+1]);
>          my $verified = $rep->verifyRepeat($slop); ##does all the verification that this is a repeat
>          push(@{$self->{'repeats'}},$rep) if $verified == 1;
>        }
>        
>      } elsif ($inRepeat == 1) { ##need this to get the $i+1 as am working one ahead
>        $lEdge = $s[$i]->getMinQueryStart(); ##new left edge
>        $firstElement = 1;			##now looking for the first element
>        $inRepeat = 0;
>        ##add the last one nd then verify repeats....Repeat method..
>        ##if really a repeat then add to @$self->{'repeats'}
>        $rep->addSubject($s[$i]);
>        print STDERR "Repeat contains ",$rep->countSubjects(), " subjects\n" if $debug == 1;
>        $rep->verifyRepeat($slop); ##does all the verification that this is a repeat
>        push(@{$self->{'repeats'}},$rep) if($rep->countSubjects() > 1 && $rep->getVerified() == 1);
>      } else {
>        $lEdge = $s[$i]->getMinQueryStart(); ##new left edge
>        $firstElement = 1;			##now looking for the first element
>      }
>    }
>    
>    if (!exists $self->{'repeats'}) {	##there aren't any....make 0
>      $self->{'repeats'}->[0] = "";
>    }
>  }
>  if ($self->{'repeats'}->[0] eq "") {
>    return;
>  }
>  return @{$self->{'repeats'}};
>}
>
>##NOTE:  the repeat object is initially populated with things that have a common
>## left edge.  It is the job of the Repeat to then verify that this constitutes
>## a repeat and remove any subjects that don't meet ehe rules for a repeat stack above
>#3will need a 'verified' bit to know if the repeat bj is verified...else verify
>## before doing anyhing else.
>
>
>sub getQuery5endMis{
>  my $self = shift;
>  if (!exists $self->{"query5pMis"}) {
>    $self->{"query5pMis"} = 10000; ##arbritrarily large as want smallest
>    foreach my $s ($self->getSubjects()) {
>      if ($self->{"query5pMis"} > $s->getQuery5endMis()) {
>        $self->{"query5pMis"} = $s->getQuery5endMis();
>      } 
>    }
>  }
>  return $self->{"query5pMis"};
>}
>
>sub getQuery3endMis{
>  my $self = shift;
>  if (!exists $self->{"query3pMis"}) {
>    $self->{"query3pMis"} = 10000; ##arbritrarily large as want smallest
>    foreach my $s ($self->getSubjects()) {
>      if ($self->{"query3pMis"} > $s->getQuery3endMis()) {
>        $self->{"query3pMis"} = $s->getQuery3endMis();
>      } 
>    }
>  }
>  return $self->{"query3pMis"};
>}
>
>##returns 0 if leftover at ends and 1 if covered to ends
>sub getCoveredBySubjects{
>  my $self = shift;
>  my $left = shift;							##amount to be leftover....
>  if (! $left) { $left = 20; }
>  if ($self->getQuery5endMis() <= $left && $self->getQuery3endMis() <= $left) {
>    return 1;
>  }
>  return 0;
>}
>
>sub analyzeSubjectsBySubject{
>  my $self = shift;
>  return if $self->getSubjectCount() == 0;
>  my @singletons;
>  my %multiples;
>  my $haveImmuno = 0;						##keeps track if is immunoglobulin gene...
>  my $order = 0;								##order
>  my $orientation = 0;					##orientation
>  my $subOver = 0;							##number of subject overlaps
>  my $subGaps = 0;							##number of subject Gaps
>  my @multiples;
>  foreach my $s ($self->getSubjects()) {
>    my $id = $s->getID();
>    #    if($id <= 100009){
>    #      $haveImmuno = 1;
>    #      next;  ##don't want to include these in the analysis
>    #    }
>    if ($s->countHSPs() == 1) {	##don't print out singletons HSPS
>      push(@singletons,$id);
>      next;
>    }
>    push(@multiples,$id);
>    print "\n\>DT.",$id,", Length=",$s->getLength(),", Num HSPs=",$s->countHSPs(),($s->checkConsistency() ? " Order and Orient Consistent" : " Order or Orient InConsistent"),"\n" if $debug == 1;
>    ($subOver,$subGaps) = $s->checkSubjectGaps();
>    print "\tSubject Overlaps = $subOver, SubjectGaps = $subGaps\n" if $debug == 1;
>    $order = $s->checkOrderConsistency();
>    $orientation = $s->checkOrientationConsistency();
>    print STDERR $s->toString() if $order + $orientation != 2;
>    $multiples{$id} = [$s->countHSPs(),$s->countContainedQueryHSPs(),$order,$orientation,$subOver == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0]; ##this is the data for multiples!
>  }
>  print "\nSingletons: ",scalar(@singletons),"\n" if $debug == 1;
>
>  #  if($haveImmuno == 1){print "MATCHES immunoglobulin genes\n";}
>  return(\@singletons,\%multiples,$haveImmuno);
>}
>
>sub getStats{
>  my $self = shift;
>  my $minNum = shift;						##minimum number of HSPs
>  my $ret;
>  #  $minNum eq "" ? $minNum = 1 : $minNum;
>  print "QueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>  if ($self->{"countSubjects"} == 0) {
>    $ret = "\tNo Hits\n";
>    return $ret;
>  }
>  foreach my $s ($self->getSubjects()) {
>    next if $s->countHSPs() < $minNum; ##don't print out singletons HSPS
>    $ret .= $s->toString();
>  }
>  return $ret;
>}
>
>sub printStats {
>  my($self,$minNum) = @_;
>  print $self->getStats($minNum);
>}
>
>sub addSubject{
>  my($self,$sbjct) = @_;
>  push(@{$self->{subjects}},$sbjct);
>  $self->{"countSubjects"}++;
>}
>
>sub getSubjectCount{
>  my $self = shift;
>  # return $self->{"countSubjects"};
>  
>  my $queryId = shift;
>  my $hits_by_queryId = $self->{hits_by_queryId};
>  my $hits = $hits_by_queryId->{$queryId};
>  return $hits->{countSubjects};
>}
>
>sub getSubjects{
>  my $self = shift;
>  # return if $self->{"countSubjects"} == 0;
>  # return @{$self->{subjects}};
>  
>  my $queryId = shift;
>  my $hits_by_queryId = $self->{hits_by_queryId};
>  my $hits     = $hits_by_queryId->{$queryId};
>  my $count    = $hits->{countSubjects};
>  my $subjects = $hits->{subjects};
>  
>  return if ($count == 0);  
>  return @$subjects;
>}
>
>sub getQueryID{
>  my $self = shift;  
>  # return $self->{"queryName"};
>  
>  my $index = $self->{queryIndex};
>  my $hits_by_queryId = $self->{hits_by_queryId};
>  my @queryIds = keys (%$hits_by_queryId);
>  if ($index < @queryIds) {
>    $self->{queryIndex} = $index + 1;
>    my $queryId = $queryIds[$index];
>    return $queryId;
>  }
>  else {
>    return undef;
>  }  
>}
>
>sub printThreshHoldHits {
>  my $self = shift;
>  return if $self->{"countSubjects"} == 0;
>  foreach my $s ($self->getSubjects()) {
>    print $self->getQueryID(), "\t",$s->getID(),"\n";
>  }
>}
>
>sub printGeneModels{
>  my $self = shift;
>  my @cgm;
>  print "QueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>  if ($self->{"countSubjects"} == 0) {
>    print "\tNo Hits\n";
>    return;
>  }
>  foreach my $s ($self->getSubjects()) {
>    print "\n",$s->toString();
>    @cgm = $s->getConsistentGeneModel();
>    print "  CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : "No"), ", ",scalar(@cgm), " HSPs\n";
>    foreach my $h (@cgm) {
>      print "    ",$h->toString();
>    }
>  }
>}
>
>sub compareGeneModelWithTotCons{
>  my $self = shift;
>  #  print "\nQueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>  my $cgmgt1 = 0;
>  my $corCGM = 0;
>  my $totCor = 0;
>  my $totSubs = 0;
>  my @vals;
>  foreach my $s ($self->getSubjects()) {
>    next if $s->countHSPs() < 2; ##don't do singletons
>    $totSubs++;
>    #    print "\n",$s->toString();
>    #    print $s->getID(),": ",$s->countHSPs()," HSPs\n";
>    #    print "  CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : "No"), ", ",scalar($s->getConsistentGeneModel()), " HSPs\n";
>    my($subOver,$subGaps) = $s->checkSubjectGaps();
>    @vals = ($s->checkOrderConsistency(),$s->checkOrientationConsistency(),$subOver == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0);
>    #    print "  Old Analysis: Order=$vals[0], Orient=$vals[1], Over=$vals[2], Gaps=$vals[3]\n";
>    $totCor++ if $vals[0]+$vals[1]+$vals[2]+$vals[3] == 4;
>    $corCGM += $s->checkGoodCGM();
>    $cgmgt1++ if scalar($s->getConsistentGeneModel()) > 1;
>  }
>  return($totSubs,$corCGM,$totCor,$cgmgt1);
>}
>
>
>1;
>  
>