[Gusdev-gusdev] Re: LoadBlastSimFast issue

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Steve Fischer wrote:

> Arnaud-
>
> ok, i've looked at LoadBlastSimFast.pm.     I see the addition of the 
> logic to use name to get the query object (and i know that we 
> discussed this in mail back in august).
>
> I am having some second thoughts about that change as it stands.  
>
> The original intent of the plugin was that the sequences submitted to 
> the blast process have been extracted from the database and therefore 
> have the primary key in their definition line.
>
> I think I understand that it would be useful to be able to skip that 
> step, ie, blast sequences using their native identifiers, and then 
> have the plugin discover what their internal primary key is.    That's 
> what you want to do right?
>
that's right.

> Does anybody know of any reason why that would not be ok?
>
> Assuming that nobody has any objections, maybe the best solution would 
> be to improve the plugin to take optional arguments that specify the 
> name of the query and/or subject identifier attributes?    For 
> example:  -queryIdAttr source_id.   This would give us full 
> flexibility (and also avoid the slightly risky assumption that a 
> digits-only identifier must be the primary key)
>
that sounds sensible.

> steve
>
> Arnaud Kerhornou wrote:
>
>>
>> Steve Fischer wrote:
>>
>>> Arnaud-
>>>
>>> see below.
>>>
>>> steve
>>>
>>> Arnaud Kerhornou wrote:
>>>
>>>> Hi everyone
>>>>
>>>> To be able to reproduce the OrthoMCL method, I would like to raise 
>>>> two issues we've got:
>>>>
>>>> * The first issue relates to the view where are stored the protein 
>>>> sequences. I was thinking to use the TranslatedAASequence view as 
>>>> this one contains the translated sequences of our gene models. The 
>>>> problem I have is that it is missing a name attribute so I can not 
>>>> match the blast output query and subject names with the data into 
>>>> GUS (I didn't want to use the TranslatedAASequence primary keys as 
>>>> the identifiers of my proteins of interest).
>>>> Could we add a name attribute to this view ?
>>>
>>>
>>>
>>> hmm.   not quite following.   what would this name be, where would 
>>> it be derived from?
>>
>>
>> By default we assign the systematic id of the corresponding CDS to 
>> the protein name.
>>
>>>   why not use source_id and/or secondary_identifier? 
>>
>>
>> We could do that, but in any case that would involve to modify the 
>> code of the loading BLAST output plugin (LoadBlastSimFast.pm) to get 
>> the sequences entries. At the moment the match is made on the primary 
>> key (which I want to avoid) or the name attribute. The source_id 
>> attribute would do instead of the name attribute. It must work for 
>> any blast (DNA Vs DNA or Protein Vs Protein) with the various 
>> potential GUS sequence objects we want to attach similarity data to. 
>> As far as I can see the source_id attribute is present in all of them 
>> (AASequenceImp and NASequenceImp tables).
>>
>>>   or, presumably this translated sequence has a relationship back to 
>>> its na sequence (although i don't immediately see that in the schema 
>>> browser), so couldn't you get a name or source_id from there?
>>>
>> That would require a more sophisticate query to get the sequence entry.
>>
>>>>
>>>> * The second issue relates to the BLAST output parsing, done by a 
>>>> module called BlastAnal.pm in the CBIL package. This module seems 
>>>> to parse BLAST output file with only one query sequence. I have 
>>>> more than one query sequence reported so I had to change the code 
>>>> of this module to allow more than one query sequence. Can my code 
>>>> be integrated  to CBIL package ? Note that I didn't change the 
>>>> interface of this module so it doesn't affect the scripts that are 
>>>> using it, I'm thinking in particular of 
>>>> parseBlastFilesForSimilarity.pl
>>>
>>>
>>> this sounds ok.   how about we just take a quick look at this 
>>> together while you are visiting?   then we can fold it into the code 
>>> base.   do you want to send it by mail?
>>>
>> That's fine, the module is attached.
>>
>>>>
>>>> cheers
>>>> Arnaud
>>>>
>>>>
>>>
>>------------------------------------------------------------------------
>>
>>#!/usr/bin/perl
>>
>>## perl module for dealing with blast results...
>>## will parse both  --noseqs and normal blast.
>>## does not parse out the alignments, rather meta information for each subject and hsp
>>
>>## Brian Brunk 5/17/99
>>
>>package CBIL::Bio::Blast::BlastAnal;
>>
>>use CBIL::Bio::Blast::Subject;
>>use CBIL::Bio::Blast::HSP;
>>use CBIL::Bio::Blast::Chimera;
>>use CBIL::Bio::Blast::Repeat;
>>
>>use strict;
>>
>>my $debug = 0;
>>
>>##simple constructor...need to make more robust if inheriting for other purposes
>>sub new {
>>  my $self = {};
>>  my $class = shift;
>>  my $d = shift;
>>  bless $self,$class;
>>  $debug = defined $d ? $d : 0;
>>  print STDERR "DEBUGGING ON\n" if $debug == 1;
>>  $self->{"countSubjects"} = 0;
>>
>>  # Added a hashtable storing the hits for each query sequence
>>  my %tmp;
>>  $self->{hits_by_queryId} = \%tmp;
>>  $self->{queryIndex} = 0;
>>
>>  return $self;
>>}
>>
>>
>>
>>
>>##DataStructure returned by parseBlast...must have -noseqs on commandline
>>##new datastructure....
>>##{queryLength},{queryName}
>>##{subjects} = array of Subject objects...
>>
>>## takes in a blast result as an array and builds the entire datastructure
>>sub parseBlast{
>>  my $self = shift;
>>  my($minLength,$minPercent,$minPvalue,$regex,$blast,$remMaskedFromLen,$rpsblast) = @_;
>>  my %blast;
>>  my ($strand,$dir,$qStart,$qEnd,$sStart,$sEnd,$pValue,$matchLength,$matchPercent);
>>  my $sbjct;
>>  my $queryMatch;
>>  my $haveQStart = 0;
>>  my $haveSStart = 0;
>>  my $parseType;
>>  my $frame;
>>  my $score;
>>  my $identities;
>>  my $positives;
>>  
>>  die 'these are the args to parseBlastNoSeqs($minLength,$minPercent,$minPvalue,$regex,$blast)\n' if scalar(@{$blast}) <= 10;
>>   
>>  foreach (@{$blast}) {
>>    #    print STDERR $_;
>>    if (/^(\S*BLAST\S+)/){   ##gets the algorighm used for the query...
>>      my $algo = $1;
>>      print STDERR "algorithm = $algo\n" if $debug;
>>      if($algo =~ /BLASTX/i){
>>        $parseType = 1;
>>      }else{
>>        $parseType = 0;
>>      }
>>      print STDERR "ParseType='$parseType'\n" if $debug;
>>    }
>>    if (/^Query=\s*(\S+)/) {
>>    
>>      # Storing Hits/HSPs for previous query Id
>>    
>>      if (defined $self->{queryName}) {
>>        my $countSubjects = $self->{countSubjects};
>>        my $subjects       = $self->{subjects};
>>        my %hits = (
>>                    'countSubjects' => $countSubjects,
>>                    'subjects'      => $subjects
>>                   );
>>                   
>>        my $hits_by_queryId = $self->{hits_by_queryId};           
>>        $hits_by_queryId->{$self->{queryName}} = \%hits;
>>      }
>>    
>>      $self->{queryName} = $1;	##query name/id
>>
>>      # Reinitialisation
>>      $self->{"countSubjects"} = 0 ;
>>      my @tmp = ();
>>      $self->{subjects} = \@tmp;                          
>>      
>>    } elsif (/^\s+\((\S+)\sletters\)/) {
>>      $self->{queryLength} = $1; ##query length
>>      $self->{queryLength} =~ s/,//g;
>>    }
>>    if (/^\>$regex/ || /ctxfactor/ || ($rpsblast && /^Lambda/)) { ##ctxfactor catches the last one...Lambda last of rpsblast
>>      my $sbjctId;
>>      if (/^\>$regex/){
>>        $sbjctId = $1 ? $1 : $2;
>>      }else{
>>        print STDERR "$self->{queryName}: Unable to match subject using regex '$regex' on:\n  $_" unless (/ctxfactor/ || /^Lambda/);
>>      }
>>      print STDERR "Matching subjectId = $sbjctId\n" if $debug;
>>      ##      print STDERR "REGEX matched $_";
>>      if ($haveQStart) {
>>        print STDERR "Have last QStart:", $sbjct->getID()," SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1;
>>        ##have the complete HSP...
>>        ##want to add only if meets minimum reqs....do this on an HSP basis...
>>        #	print $sbjct->getID()," $sStart,$sEnd: Length=$matchLength, Percent=$matchPercent, pValue=$pValue\n";
>>        if ($matchLength >= $minLength && $matchPercent >= $minPercent && $pValue <= $minPvalue) {
>>          print STDERR "Match meets reqs\n" if $debug;
>>          if($remMaskedFromLen){  ##want to remove Xs from the match length....
>>#            print STDERR "removing X from match\n";
>>#            print STDERR "Before: $queryMatch\n";
>>            $queryMatch =~ s/X//g;
>>#            print STDERR "After: $queryMatch\n";
>>            if(length($queryMatch) < 10){ ##don't remove if final length < 10
>>              print STDERR "RemoveMaskedResiduesFromLength: error...length = ".length($queryMatch)." following removal of X's\n";
>>            }else{
>>#              print "match before $matchLength\n";
>>              $matchLength = length($queryMatch);
>>#              print "match after $matchLength\n";
>>            }
>>          }
>>          $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, $pValue, $dir, $frame, $score,$identities,$positives) if $sbjct;
>>        } 
>>        $haveQStart = 0;				##reset for next one...
>>        $haveSStart = 0;
>>      } 
>>      if (defined $sbjct && $sbjct->countHSPs() >= 1 && $sbjct->getID() ne $self->{"queryName"}) {
>>        $self->addSubject($sbjct);
>>      }
>>      ##return heree if end...
>>      return if (/ctxfactor/ || /^Lambda/);
>>      
>>      ##      print STDERR "New Subject $sbjctId\n";
>>      $sbjct = CBIL::Bio::Blast::Subject->new($sbjctId) if $sbjctId;
>>      ##      $sbjct->setID($sbjctId);
>>      
>>    }
>>    if (/^\s+Length\s=\s(\S+)/) {
>>      if($sbjct){
>>        my $sbjctLength = $1;
>>        $sbjctLength =~ s/,//g;
>>        $sbjct->setLength($sbjctLength);
>>        $sbjct->setQueryLength($self->{"queryLength"});	##set the query length for use in 
>>      }
>>      ##end remaining calculations
>>    }
>>    if (/^\s*Score\s=\s+(\d+).*=\s(\S+)$/ || ($rpsblast && /^\s*Score\s*=\s*\d+\sbits\s\((\d+).*=\s(\S+)$/)) {
>>      my $tmpScore = $1;
>>      my $tmpValue = $2;
>>
>>      ##need to add here if $haveQStart != 0;
>>      if ($haveQStart) {
>>        ##have the complete HSP...
>>        ##want to add only if meets minimum reqs....do this on an HSP basis...
>>        print STDERR "Adding Sbjct (score) ",$sbjct->getID()," SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1;
>>        if ($matchLength >= $minLength && $matchPercent >= $minPercent && $pValue <= $minPvalue && $sbjct) {
>>          print STDERR "Adding Sbjct\n" if $debug;
>>          if($remMaskedFromLen){  ##want to remove Xs from the match length....
>>#            print STDERR "removing X from match\n";
>>#            print STDERR "Before: $queryMatch\n";
>>            $queryMatch =~ s/X//g;
>>#            print STDERR "After: $queryMatch\n";
>>            if(length($queryMatch) < 10){ ##don't remove if final length < 10
>>              print STDERR "RemoveMaskedResiduesFromLength: error...length = ".length($queryMatch)." following removal of X's\n";
>>            }else{
>>#              print "match before $matchLength\n";
>>              $matchLength = length($queryMatch);
>>#              print "match after $matchLength\n";
>>            }
>>          }
>>          $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, $pValue, $dir, $frame, $score,$identities,$positives); 
>>        }
>>        $haveQStart = 0;				##reset for next one...
>>        $haveSStart = 0;
>>      }
>>      $pValue = $tmpValue;
>>      $score = $tmpScore;
>>    }
>>    ##get the strand in the Identities line as the following messes things up!!!
>>    ##get the strand...will work for all blast types...
>>#    if (/^\s*(Minus|Plus)\sStrand/){
>>#      $strand = $1;
>>#    }
>>    ##following specific to blastx
>>    if ($parseType && /^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Frame\s=\s.*(.\d)$/) {
>>      $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; $frame = $5;
>>      $dir = $frame =~ /^\+/ ? 1 : 0;
>>      ##following specific to blastn
>>    }elsif (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Strand\s=\s(\w+)/) {
>>      $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4; $strand = $5;
>>      $dir = $strand eq 'Minus' ? 0 : 1;
>>      ##following for blastp the default if others not matched...
>>    }elsif (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+)/){
>>      $identities = $1; $matchLength = $2; $matchPercent = $3; $positives = $4;
>>      $dir = 1;  ##always 1 for blastp which is what this catches...
>>    }elsif (/^\s*Frame\s=\s(.\d)/){ ##frame for rpsblast
>>      $frame = $1;
>>      $dir = $frame =~ /^\+/ ? 1 : 0;
>>    }
>>
>>    if (/^Query:\s+(\d+)\s(.*)\s(\d+)\s*$/) {
>>      print STDERR "Matching query: $1 $3\n" if $debug;
>>      if ($haveQStart == 0) {
>>        $queryMatch = "";
>>        $qStart = $1; 
>>        $haveQStart = 1;
>>      }
>>      $qEnd = $3;
>>      my $tmpQMatch = $2;
>>      $tmpQMatch =~ s/\s//g;  ##gets rid of any spaces
>>      $queryMatch .= $tmpQMatch;
>>    } elsif (/^Sbjct:\s+(\d+)\s.*\s(\d+)\s*$/) {
>>      if ($haveSStart == 0) {
>>        $sStart = $1; 
>>        $haveSStart = 1;
>>      }
>>      $sEnd = $2;
>>    }
>>  }
>>  
>>  # Storing Hits/HSPs for the last query Id
>>    
>>  if (defined $self->{queryName}) {
>>    my $countSubjects = $self->{countSubjects};
>>    my $subjects       = $self->{subjects};
>>    my %hits = (
>>                'countSubjects' => $countSubjects,
>>                'subjects'      => $subjects
>>               );
>>    my $hits_by_queryId = $self->{hits_by_queryId};           
>>    $hits_by_queryId->{$self->{queryName}} = \%hits;
>>  }
>>  
>>  return;
>>}  
>>
>>sub setQueryLength {
>>  my($self,$l) = @_;
>>  $self->{queryLength} = $l;
>>}
>>
>>sub getQueryLength {
>>  my($self) = @_;
>>  return $self->{queryLength};
>>}
>>
>>sub sortSubjectsByMinQueryStart{
>>  my $self = shift;
>>  if (!exists $self->{"sortByQStart"}) {
>>    @{$self->{"sortByQStart"}} = sort{$a->getMinQueryStart() <=> $b->getMinQueryStart()} $self->getSubjects();
>>  }
>>  return @{$self->{"sortByQStart"}};
>>}
>>
>>sub sortSubjectsByMaxQueryEnd{
>>  my $self = shift;
>>  if (!exists $self->{"sortByQEnd"}) {
>>    @{$self->{"sortByQEnd"}} = sort{$a->getMaxQueryEnd() <=> $b->getMaxQueryEnd()} $self->getSubjects();
>>  }
>>  return @{$self->{"sortByQEnd"}};
>>}
>>
>>##detecting chimeras
>>############################################################
>>## Breakpoint = 0          * OK
>>## Breakpoint = 1          > [Ch, AS, OK, !]
>>##   Dangling Ends = none  > [Ch, AS, OK] 
>>##     Spanned = 1         * AS, OK
>>##     Spanned = 0         > [Ch, AS, OK]
>>##       Deep = 1          * Ch
>>##       Deep = 0          * Ch, AS, OK (review)
>>##   Dangling Ends = uni   > [Ch, As]
>>##     Spanned = 1         > [Ch, AS]
>>##       QvsSub = 1        * Ch
>>##       QvsSub = 0        * AS
>>##     Spanned = 0         > [Ch, AS]
>>##       Deep = 1          * Ch
>>##       Deep = 0          * Ch, AS (review)
>>##   Dangling Ends = bi    > [Ch, !]
>>##     Spanned = 1         > [Ch, !]
>>##       QvsSub = 1        * Ch
>>##       QvsSub = 0        * ! (review) ##not sure about this one....should it be OK
>>##     Spanned = 0         * Ch
>>############################################################    
>>
>>##return values: 0=OK, 1=chimera, 2=review no dangling, 3=review 1 dangling, 4=review 2 dangling ends.
>>##add to ichimera obj...
>>sub detectChimera{
>>  my($self,$slop) = @_;					##$slop is the amount of slop in the breakpoint de to blast
>>  ##loop through each possible chimera breakpoint and test if chimera
>>  my $qvssub = 1;
>>  my @span;
>>  foreach my $c ($self->findChimeraBreakpoints($slop)) {
>>    
>>    @span = $self->findSubjectsSpanningChimeraBreakpoint($c->getBreakpoint(),$slop);
>>    $c->{"numberSpanned"} = scalar(@span);
>>
>>    ##first dangling == 0
>>    if ($c->getDangling($slop) == 0) { ##dangle by at least $slop
>>      if (scalar(@span) > 0) {
>>        $c->{"isChimera"} = 0;	##spans and no dangling ends therefore OK
>>      } else {
>>        if ($c->getDepth() == 1) {
>>          $c->{"isChimera"} = 1;
>>        } else {
>>          $c->{"isChimera"} = 2; ##will requiere review
>>        }
>>      }
>>      
>>      ##dangling on only one end...
>>    } elsif ($c->getDangling($slop) == 1) {
>>      if (scalar(@span) > 0) {
>>        $qvssub = 1;						##are very similar....looking for one that isn't...
>>        foreach my $sp (@span) {
>>          if ($sp->compareQueryVsSubject() == 0) {
>>            ##good subject spanning....looks lke should be alternative splicing
>>            $c->{"isChimera"} = 0;
>>            $qvssub = 0;
>>          }
>>        }
>>        $c->{"isChimera"} = 1 if $qvssub == 1; 
>>      } else {
>>        if ($c->getDepth() == 1) {
>>          $c->{"isChimera"} = 1;
>>        } else {								#will need to review....
>>          $c->{"isChimera"} = 3;
>>        }
>>      }
>>      
>>      ##dangling on two ends...
>>    } elsif ($c->getDangling($slop) == 2) {
>>      
>>      if (scalar(@span) > 0) {
>>        $qvssub = 1;						##are very similar....looking for one that isn't...
>>        foreach my $sp (@span) {
>>          if ($sp->compareQueryVsSubject() == 0) {
>>            ##unclear here....need to review:
>>            $c->{"isChimera"} = 4;
>>            $qvssub = 0;
>>            ##            last;
>>          }
>>        }
>>        $c->{"isChimera"} = 1 if $qvssub == 1; 
>>      } else {
>>        $c->{"isChimera"} = 1;  ##is a chimera
>>      }
>>      
>>    } else {  
>>      print "ERROR: getDangling returned invalid value\n";
>>    }
>>  }
>>  ##now need to go through each of chimeras and output results...
>>  my $maxChim = 0;
>>  my $isChimera = 0;
>>  my $errString = "";
>>  if (exists $self->{"chimeras"}) {
>>    ##    $errString = "Chimeras for $self->{'queryName'}\n";
>>    my $cnt = 1;
>>    foreach my $chim (@{$self->{"chimeras"}}) {
>>      last if $chim eq "";			##this is just a holder that indicates that the findChimera
>>      ##method has been run and there are not chimeras...
>>      $maxChim = $chim->{'isChimera'} if $chim->{'isChimera'} > $maxChim;
>>      $isChimera = 1 if $chim->{'isChimera'} == 1;
>>      $errString .= "  Chimera $cnt: $chim->{'isChimera'}, danglingEnds=".$chim->getDangling($slop).", numberSpanned=$chim->{'numberSpanned'}, depth=".$chim->getDepth().", numberLeft=".$chim->countLeft().", numRight=".$chim->countRight().", breakpoint=".$chim->getBreakpoint()."\n";
>>      #      $errString .= $chim->toString();
>>      $cnt++;
>>    }
>>  }
>>  return ($isChimera,$maxChim,$errString);
>>}
>>
>>sub getChimeras {
>>  my($self,$slop) = @_;
>>  if(!exists $self->{'chimeras'}){
>>    $self->detectChimera($slop);
>>  }
>>  return @{$self->{'chimeras'}};
>>}
>>
>>## returns array of Chimera objs
>>sub findChimeraBreakpoints{
>>  my($self,$slop) = @_;
>>  if ($self->{'findChimera'} != 1) {
>>    my @l = $self->sortSubjectsByMaxQueryEnd();
>>    my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want right one reversed
>>    my $lindex = 0;
>>    my $rindex = 0;
>>    for (my $li = $lindex;$li < scalar(@l);$li++) {
>>      last if $l[$li]->getMaxQueryEnd() - $r[$rindex]->getMinQueryStart() > $slop;
>>      for (my $ri = $rindex;$ri < scalar(@r);$ri++) {
>>        last if $l[$li]->getMaxQueryEnd() - $r[$ri]->getMinQueryStart() > $slop;
>>        if (abs($l[$li]->getMaxQueryEnd() - $r[$ri]->getMinQueryStart()) <= $slop) {
>>          ##create new chimera object and pass arrays into extendBrkpt()
>>          ($li,$ri) = $self->buildChimera($li,$ri,$slop); ##pass back the index of last l and r
>>          last;
>>        }
>>      }
>>    }
>>    $self->{'findChimera'} = 1;	##indicates that have done the analysis
>>  }
>>  if (!exists $self->{'chimeras'}) {
>>    return;
>>  }
>>  return @{$self->{"chimeras"}};
>>}
>>
>>sub buildChimera{
>>  my($self,$lindex,$rindex,$slop) = @_;
>>  my @l = $self->sortSubjectsByMaxQueryEnd();
>>  my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want right one reversed
>>  my $chim = CBIL::Bio::Blast::Chimera->new();
>>  ##now add the current and any others to the chimera
>>  ## the first time in loop will add the current one to $chim
>>  ##first do left
>>  my $ledge = $l[$lindex]->getMaxQueryEnd();
>>  for (my $li = $lindex;$li < scalar(@l);$li++) {
>>    if ($l[$li]->getMaxQueryEnd() - $ledge <= $slop) {
>>      $chim->addLeftSubject($l[$li]);
>>    } else {
>>      $lindex = $li;
>>      last;
>>    }
>>    $lindex = $li;
>>  }
>>  ##then do right
>>  my $redge = $r[$rindex]->getMinQueryStart();
>>  for (my $ri = $rindex;$ri < scalar(@r);$ri++) {
>>    if ($redge - $r[$ri]->getMinQueryStart() <= $slop) {
>>      $chim->addRightSubject($r[$ri]); 
>>    } else {
>>      $rindex = $ri;
>>      last;
>>    }
>>    $rindex = $ri;
>>  }
>>  push(@{$self->{"chimeras"}},$chim);
>>  return($lindex,$rindex);
>>}
>>
>>##returns array of subjects
>>##$slop is amount by which it needs to extend beyond the brkpt.
>>sub findSubjectsSpanningChimeraBreakpoint{
>>  my($self,$loc,$slop) = @_;
>>  my @span;
>>  foreach my $s ($self->sortSubjectsByMinQueryStart()) {
>>    last if $s->getMinQueryStart() > $loc - $slop;
>>    if ($s->getMaxQueryEnd() > $loc + $slop) {
>>      ##      print "Spans $loc: ",$s->getMinQueryStart(), "-",$s->getMaxQueryEnd(),"\n";
>>      push(@span,$s);
>>    }
>>  }
>>  return @span;
>>}
>>
>>########################################
>>## Detecting unblocked repeats...
>>## If No stack - OK
>>## 
>>## If Stack
>>## 
>>##    If No overhanging ends in query then OK (query is subsumed by subjects)
>>##       -will block the query sequences and throw out sequences that do not have
>>##             >50 bp of un-blocked sequence
>>## 
>>##    If Overhanging ends in query (may be only one if repeat is at end)
>>## 
>>##       If 1 HSP then likely REPEAT => ignore all matches that don't extend
>>##          outside this region.
>>##          
>>##          Overhanging ends in subjects = 1 => definitely REPEAT
>>## 
>>##          Overhanging ends in subjects = 0 => ??very unlikely scenario..not clear
>>## 
>>##       If >1 HSP (could be multiple copies of same repeat or perhaps alt. splicing
>>##         OR could it be a valid match that also has a repeat in it...no should in
>>##         this case match entirely and have just 1 HSP)
>>## 
>>##          If subject locations overlap/superimpose then REPEAT 
>>##             =ignore all matches that do not extend outside these locations
>>##                     between them would be "outside" the locations
>>##          
>>##          If subject locations distinct then likely alternative splicing
>>##             OR alternatively the query and subject could contain two distinct
>>##               different repeats (I will be blocking for repeats so the likelihood
>>##               that there are two different repeats that are both novel is small).
>>## 
>>##             ?do I want to do further analysis to detemine consistency with
>>##                 alternative splicing model before ruling OK...?
>>## ########################################
>>
>>##this method is the call to find repeats...should return an array of subjects to
>>##ignore...
>>
>>## find 5' edge (minQueryStart) and gather up then check to see if there is a 
>>## correlation with 3' (maxQueryEnd)
>>## edge...if so is potential repeat and need to check the above rules..
>>
>>#what to return here....region to ignore when determining good matches...
>>
>>sub detectRepeats {
>>  my($self,$slop) = @_;
>>
>>  if ($self->getSubjectCount() < 5 ) { ##need at least 5 to detecxt repeat...
>>    return;
>>  }
>>
>>  my @rep = $self->findRepeats($slop); ##array of repeats....
>>
>>  if ($debug == 1) {
>>    print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n";
>>    foreach my $r (@rep) {
>>      print STDERR $r->toString(1),"\n";
>>    }
>>  }
>>
>>  ##want to merge repeats if they are overlapping...
>>  my @sort = sort{$a->getRepeatStart() <=> $b->getRepeatStart()} @rep;
>>  undef @rep;										##use for merged ones...
>>  my $merge = 0;
>>  my $nr;
>>  for (my $i = 0;$i<scalar(@sort) - 1;$i++) {
>>    if ($sort[$i+1]->getRepeatStart() < $sort[$i]->getRepeatEnd()) {
>>      ##are overlapping...merge
>>      if ($merge == 0) {
>>        $nr = CBIL::Bio::Blast::Repeat->new();
>>        $merge = 1;
>>      }
>>      foreach my $s ($sort[$i]->getSubjects()) {
>>        $nr->addSubject($s);
>>      }
>>    } elsif ($merge == 1) {
>>      $merge = 0;
>>      foreach my $s ($sort[$i]->getSubjects()) {
>>        $nr->addSubject($s);
>>      }
>>      push(@rep,$nr);
>>    } else {
>>      ##does not overlap so just add to rep
>>      push(@rep,$sort[$i]);
>>    }
>>  }
>>  ##need to do last if $merge == 1;
>>  if ($merge == 1) {						##need to add the i+1 (last) case
>>    foreach my $s ($sort[-1]->getSubjects()) {
>>      $nr->addSubject($s);
>>    }
>>    push(@rep,$nr);
>>  }
>>  ##repeats merged and should be non-overlapping
>>
>>  if ($debug == 1) {
>>    print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n";
>>    foreach my $r (@rep) {
>>      print STDERR $r->toString(1),"\n";
>>    }
>>  }
>>  return @rep;
>>}
>>
>>##finds all repeat stacks..
>># populates array of subjects that are repeats ... @{$self->{'repeats'}}
>>
>>sub findRepeats{
>>  my($self,$slop) = @_;
>>  my $rep;
>>  if (!exists $self->{'repeats'}) {
>>    my $firstElement = 1;
>>    my $inRepeat = 0;
>>    my @s = $self->sortSubjectsByMinQueryStart();
>>    ###NOTE:  want to only take repeat from a single point rather than each minQuery start
>>    my $lEdge = $s[0]->getMinQueryStart(); ##this is the first edge
>>    for (my  $i = 0;$i < scalar(@s) - 1;$i++) { 
>>      if ($s[$i+1]->getMinQueryStart() - $lEdge <= $slop) {
>>        if ($firstElement == 1) {
>>          $rep = CBIL::Bio::Blast::Repeat->new(); ##create new Repeat object..
>>          $firstElement = 0;
>>          $inRepeat = 1;
>>        }
>>        print STDERR "REPEAT: adding subject ",$s[$i]->getID(),"\n" if $debug == 1;
>>        $rep->addSubject($s[$i]);
>>        if ($i == scalar(@s) - 2) {	##need to add next one as am in last loop
>>          $rep->addSubject($s[$i+1]);
>>          my $verified = $rep->verifyRepeat($slop); ##does all the verification that this is a repeat
>>          push(@{$self->{'repeats'}},$rep) if $verified == 1;
>>        }
>>        
>>      } elsif ($inRepeat == 1) { ##need this to get the $i+1 as am working one ahead
>>        $lEdge = $s[$i]->getMinQueryStart(); ##new left edge
>>        $firstElement = 1;			##now looking for the first element
>>        $inRepeat = 0;
>>        ##add the last one nd then verify repeats....Repeat method..
>>        ##if really a repeat then add to @$self->{'repeats'}
>>        $rep->addSubject($s[$i]);
>>        print STDERR "Repeat contains ",$rep->countSubjects(), " subjects\n" if $debug == 1;
>>        $rep->verifyRepeat($slop); ##does all the verification that this is a repeat
>>        push(@{$self->{'repeats'}},$rep) if($rep->countSubjects() > 1 && $rep->getVerified() == 1);
>>      } else {
>>        $lEdge = $s[$i]->getMinQueryStart(); ##new left edge
>>        $firstElement = 1;			##now looking for the first element
>>      }
>>    }
>>    
>>    if (!exists $self->{'repeats'}) {	##there aren't any....make 0
>>      $self->{'repeats'}->[0] = "";
>>    }
>>  }
>>  if ($self->{'repeats'}->[0] eq "") {
>>    return;
>>  }
>>  return @{$self->{'repeats'}};
>>}
>>
>>##NOTE:  the repeat object is initially populated with things that have a common
>>## left edge.  It is the job of the Repeat to then verify that this constitutes
>>## a repeat and remove any subjects that don't meet ehe rules for a repeat stack above
>>#3will need a 'verified' bit to know if the repeat bj is verified...else verify
>>## before doing anyhing else.
>>
>>
>>sub getQuery5endMis{
>>  my $self = shift;
>>  if (!exists $self->{"query5pMis"}) {
>>    $self->{"query5pMis"} = 10000; ##arbritrarily large as want smallest
>>    foreach my $s ($self->getSubjects()) {
>>      if ($self->{"query5pMis"} > $s->getQuery5endMis()) {
>>        $self->{"query5pMis"} = $s->getQuery5endMis();
>>      } 
>>    }
>>  }
>>  return $self->{"query5pMis"};
>>}
>>
>>sub getQuery3endMis{
>>  my $self = shift;
>>  if (!exists $self->{"query3pMis"}) {
>>    $self->{"query3pMis"} = 10000; ##arbritrarily large as want smallest
>>    foreach my $s ($self->getSubjects()) {
>>      if ($self->{"query3pMis"} > $s->getQuery3endMis()) {
>>        $self->{"query3pMis"} = $s->getQuery3endMis();
>>      } 
>>    }
>>  }
>>  return $self->{"query3pMis"};
>>}
>>
>>##returns 0 if leftover at ends and 1 if covered to ends
>>sub getCoveredBySubjects{
>>  my $self = shift;
>>  my $left = shift;							##amount to be leftover....
>>  if (! $left) { $left = 20; }
>>  if ($self->getQuery5endMis() <= $left && $self->getQuery3endMis() <= $left) {
>>    return 1;
>>  }
>>  return 0;
>>}
>>
>>sub analyzeSubjectsBySubject{
>>  my $self = shift;
>>  return if $self->getSubjectCount() == 0;
>>  my @singletons;
>>  my %multiples;
>>  my $haveImmuno = 0;						##keeps track if is immunoglobulin gene...
>>  my $order = 0;								##order
>>  my $orientation = 0;					##orientation
>>  my $subOver = 0;							##number of subject overlaps
>>  my $subGaps = 0;							##number of subject Gaps
>>  my @multiples;
>>  foreach my $s ($self->getSubjects()) {
>>    my $id = $s->getID();
>>    #    if($id <= 100009){
>>    #      $haveImmuno = 1;
>>    #      next;  ##don't want to include these in the analysis
>>    #    }
>>    if ($s->countHSPs() == 1) {	##don't print out singletons HSPS
>>      push(@singletons,$id);
>>      next;
>>    }
>>    push(@multiples,$id);
>>    print "\n\>DT.",$id,", Length=",$s->getLength(),", Num HSPs=",$s->countHSPs(),($s->checkConsistency() ? " Order and Orient Consistent" : " Order or Orient InConsistent"),"\n" if $debug == 1;
>>    ($subOver,$subGaps) = $s->checkSubjectGaps();
>>    print "\tSubject Overlaps = $subOver, SubjectGaps = $subGaps\n" if $debug == 1;
>>    $order = $s->checkOrderConsistency();
>>    $orientation = $s->checkOrientationConsistency();
>>    print STDERR $s->toString() if $order + $orientation != 2;
>>    $multiples{$id} = [$s->countHSPs(),$s->countContainedQueryHSPs(),$order,$orientation,$subOver == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0]; ##this is the data for multiples!
>>  }
>>  print "\nSingletons: ",scalar(@singletons),"\n" if $debug == 1;
>>
>>  #  if($haveImmuno == 1){print "MATCHES immunoglobulin genes\n";}
>>  return(\@singletons,\%multiples,$haveImmuno);
>>}
>>
>>sub getStats{
>>  my $self = shift;
>>  my $minNum = shift;						##minimum number of HSPs
>>  my $ret;
>>  #  $minNum eq "" ? $minNum = 1 : $minNum;
>>  print "QueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>>  if ($self->{"countSubjects"} == 0) {
>>    $ret = "\tNo Hits\n";
>>    return $ret;
>>  }
>>  foreach my $s ($self->getSubjects()) {
>>    next if $s->countHSPs() < $minNum; ##don't print out singletons HSPS
>>    $ret .= $s->toString();
>>  }
>>  return $ret;
>>}
>>
>>sub printStats {
>>  my($self,$minNum) = @_;
>>  print $self->getStats($minNum);
>>}
>>
>>sub addSubject{
>>  my($self,$sbjct) = @_;
>>  push(@{$self->{subjects}},$sbjct);
>>  $self->{"countSubjects"}++;
>>}
>>
>>sub getSubjectCount{
>>  my $self = shift;
>>  # return $self->{"countSubjects"};
>>  
>>  my $queryId = shift;
>>  my $hits_by_queryId = $self->{hits_by_queryId};
>>  my $hits = $hits_by_queryId->{$queryId};
>>  return $hits->{countSubjects};
>>}
>>
>>sub getSubjects{
>>  my $self = shift;
>>  # return if $self->{"countSubjects"} == 0;
>>  # return @{$self->{subjects}};
>>  
>>  my $queryId = shift;
>>  my $hits_by_queryId = $self->{hits_by_queryId};
>>  my $hits     = $hits_by_queryId->{$queryId};
>>  my $count    = $hits->{countSubjects};
>>  my $subjects = $hits->{subjects};
>>  
>>  return if ($count == 0);  
>>  return @$subjects;
>>}
>>
>>sub getQueryID{
>>  my $self = shift;  
>>  # return $self->{"queryName"};
>>  
>>  my $index = $self->{queryIndex};
>>  my $hits_by_queryId = $self->{hits_by_queryId};
>>  my @queryIds = keys (%$hits_by_queryId);
>>  if ($index < @queryIds) {
>>    $self->{queryIndex} = $index + 1;
>>    my $queryId = $queryIds[$index];
>>    return $queryId;
>>  }
>>  else {
>>    return undef;
>>  }  
>>}
>>
>>sub printThreshHoldHits {
>>  my $self = shift;
>>  return if $self->{"countSubjects"} == 0;
>>  foreach my $s ($self->getSubjects()) {
>>    print $self->getQueryID(), "\t",$s->getID(),"\n";
>>  }
>>}
>>
>>sub printGeneModels{
>>  my $self = shift;
>>  my @cgm;
>>  print "QueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>>  if ($self->{"countSubjects"} == 0) {
>>    print "\tNo Hits\n";
>>    return;
>>  }
>>  foreach my $s ($self->getSubjects()) {
>>    print "\n",$s->toString();
>>    @cgm = $s->getConsistentGeneModel();
>>    print "  CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : "No"), ", ",scalar(@cgm), " HSPs\n";
>>    foreach my $h (@cgm) {
>>      print "    ",$h->toString();
>>    }
>>  }
>>}
>>
>>sub compareGeneModelWithTotCons{
>>  my $self = shift;
>>  #  print "\nQueryName = $self->{queryName}, QueryLength = $self->{queryLength}\n";
>>  my $cgmgt1 = 0;
>>  my $corCGM = 0;
>>  my $totCor = 0;
>>  my $totSubs = 0;
>>  my @vals;
>>  foreach my $s ($self->getSubjects()) {
>>    next if $s->countHSPs() < 2; ##don't do singletons
>>    $totSubs++;
>>    #    print "\n",$s->toString();
>>    #    print $s->getID(),": ",$s->countHSPs()," HSPs\n";
>>    #    print "  CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : "No"), ", ",scalar($s->getConsistentGeneModel()), " HSPs\n";
>>    my($subOver,$subGaps) = $s->checkSubjectGaps();
>>    @vals = ($s->checkOrderConsistency(),$s->checkOrientationConsistency(),$subOver == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0);
>>    #    print "  Old Analysis: Order=$vals[0], Orient=$vals[1], Over=$vals[2], Gaps=$vals[3]\n";
>>    $totCor++ if $vals[0]+$vals[1]+$vals[2]+$vals[3] == 4;
>>    $corCGM += $s->checkGoodCGM();
>>    $cgmgt1++ if scalar($s->getConsistentGeneModel()) > 1;
>>  }
>>  return($totSubs,$corCGM,$totCor,$cgmgt1);
>>}
>>
>>
>>1;
>>  
>>