[Gusdev-gusdev] Re: LoadBlastSimFast issue

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

arnaud-

ok, would you like to do the upgrade to LoadBlastSimFast?

if i understand correctly this would avoid having to modify the schema, 
is that right?

steve

Arnaud Kerhornou wrote:

>
> Steve Fischer wrote:
>
>> Arnaud-
>>
>> ok, i've looked at LoadBlastSimFast.pm.     I see the addition of the 
>> logic to use name to get the query object (and i know that we 
>> discussed this in mail back in august).
>>
>> I am having some second thoughts about that change as it stands. 
>> The original intent of the plugin was that the sequences submitted to 
>> the blast process have been extracted from the database and therefore 
>> have the primary key in their definition line.
>>
>> I think I understand that it would be useful to be able to skip that 
>> step, ie, blast sequences using their native identifiers, and then 
>> have the plugin discover what their internal primary key is.    
>> That's what you want to do right?
>>
> that's right.
>
>> Does anybody know of any reason why that would not be ok?
>>
>> Assuming that nobody has any objections, maybe the best solution 
>> would be to improve the plugin to take optional arguments that 
>> specify the name of the query and/or subject identifier 
>> attributes?    For example:  -queryIdAttr source_id.   This would 
>> give us full flexibility (and also avoid the slightly risky 
>> assumption that a digits-only identifier must be the primary key)
>>
> that sounds sensible.
>
>> steve
>>
>> Arnaud Kerhornou wrote:
>>
>>>
>>> Steve Fischer wrote:
>>>
>>>> Arnaud-
>>>>
>>>> see below.
>>>>
>>>> steve
>>>>
>>>> Arnaud Kerhornou wrote:
>>>>
>>>>> Hi everyone
>>>>>
>>>>> To be able to reproduce the OrthoMCL method, I would like to raise 
>>>>> two issues we've got:
>>>>>
>>>>> * The first issue relates to the view where are stored the protein 
>>>>> sequences. I was thinking to use the TranslatedAASequence view as 
>>>>> this one contains the translated sequences of our gene models. The 
>>>>> problem I have is that it is missing a name attribute so I can not 
>>>>> match the blast output query and subject names with the data into 
>>>>> GUS (I didn't want to use the TranslatedAASequence primary keys as 
>>>>> the identifiers of my proteins of interest).
>>>>> Could we add a name attribute to this view ?
>>>>
>>>>
>>>>
>>>>
>>>> hmm.   not quite following.   what would this name be, where would 
>>>> it be derived from?
>>>
>>>
>>>
>>> By default we assign the systematic id of the corresponding CDS to 
>>> the protein name.
>>>
>>>>   why not use source_id and/or secondary_identifier? 
>>>
>>>
>>>
>>> We could do that, but in any case that would involve to modify the 
>>> code of the loading BLAST output plugin (LoadBlastSimFast.pm) to get 
>>> the sequences entries. At the moment the match is made on the 
>>> primary key (which I want to avoid) or the name attribute. The 
>>> source_id attribute would do instead of the name attribute. It must 
>>> work for any blast (DNA Vs DNA or Protein Vs Protein) with the 
>>> various potential GUS sequence objects we want to attach similarity 
>>> data to. As far as I can see the source_id attribute is present in 
>>> all of them (AASequenceImp and NASequenceImp tables).
>>>
>>>>   or, presumably this translated sequence has a relationship back 
>>>> to its na sequence (although i don't immediately see that in the 
>>>> schema browser), so couldn't you get a name or source_id from there?
>>>>
>>> That would require a more sophisticate query to get the sequence entry.
>>>
>>>>>
>>>>> * The second issue relates to the BLAST output parsing, done by a 
>>>>> module called BlastAnal.pm in the CBIL package. This module seems 
>>>>> to parse BLAST output file with only one query sequence. I have 
>>>>> more than one query sequence reported so I had to change the code 
>>>>> of this module to allow more than one query sequence. Can my code 
>>>>> be integrated  to CBIL package ? Note that I didn't change the 
>>>>> interface of this module so it doesn't affect the scripts that are 
>>>>> using it, I'm thinking in particular of 
>>>>> parseBlastFilesForSimilarity.pl
>>>>
>>>>
>>>>
>>>> this sounds ok.   how about we just take a quick look at this 
>>>> together while you are visiting?   then we can fold it into the 
>>>> code base.   do you want to send it by mail?
>>>>
>>> That's fine, the module is attached.
>>>
>>>>>
>>>>> cheers
>>>>> Arnaud
>>>>>
>>>>>
>>>>
>>> ------------------------------------------------------------------------ 
>>>
>>>
>>> #!/usr/bin/perl
>>>
>>> ## perl module for dealing with blast results...
>>> ## will parse both  --noseqs and normal blast.
>>> ## does not parse out the alignments, rather meta information for 
>>> each subject and hsp
>>>
>>> ## Brian Brunk 5/17/99
>>>
>>> package CBIL::Bio::Blast::BlastAnal;
>>>
>>> use CBIL::Bio::Blast::Subject;
>>> use CBIL::Bio::Blast::HSP;
>>> use CBIL::Bio::Blast::Chimera;
>>> use CBIL::Bio::Blast::Repeat;
>>>
>>> use strict;
>>>
>>> my $debug = 0;
>>>
>>> ##simple constructor...need to make more robust if inheriting for 
>>> other purposes
>>> sub new {
>>>  my $self = {};
>>>  my $class = shift;
>>>  my $d = shift;
>>>  bless $self,$class;
>>>  $debug = defined $d ? $d : 0;
>>>  print STDERR "DEBUGGING ON\n" if $debug == 1;
>>>  $self->{"countSubjects"} = 0;
>>>
>>>  # Added a hashtable storing the hits for each query sequence
>>>  my %tmp;
>>>  $self->{hits_by_queryId} = \%tmp;
>>>  $self->{queryIndex} = 0;
>>>
>>>  return $self;
>>> }
>>>
>>>
>>>
>>>
>>> ##DataStructure returned by parseBlast...must have -noseqs on 
>>> commandline
>>> ##new datastructure....
>>> ##{queryLength},{queryName}
>>> ##{subjects} = array of Subject objects...
>>>
>>> ## takes in a blast result as an array and builds the entire 
>>> datastructure
>>> sub parseBlast{
>>>  my $self = shift;
>>>  my($minLength,$minPercent,$minPvalue,$regex,$blast,$remMaskedFromLen,$rpsblast) 
>>> = @_;
>>>  my %blast;
>>>  my 
>>> ($strand,$dir,$qStart,$qEnd,$sStart,$sEnd,$pValue,$matchLength,$matchPercent); 
>>>
>>>  my $sbjct;
>>>  my $queryMatch;
>>>  my $haveQStart = 0;
>>>  my $haveSStart = 0;
>>>  my $parseType;
>>>  my $frame;
>>>  my $score;
>>>  my $identities;
>>>  my $positives;
>>>  
>>>  die 'these are the args to 
>>> parseBlastNoSeqs($minLength,$minPercent,$minPvalue,$regex,$blast)\n' 
>>> if scalar(@{$blast}) <= 10;
>>>    foreach (@{$blast}) {
>>>    #    print STDERR $_;
>>>    if (/^(\S*BLAST\S+)/){   ##gets the algorighm used for the query...
>>>      my $algo = $1;
>>>      print STDERR "algorithm = $algo\n" if $debug;
>>>      if($algo =~ /BLASTX/i){
>>>        $parseType = 1;
>>>      }else{
>>>        $parseType = 0;
>>>      }
>>>      print STDERR "ParseType='$parseType'\n" if $debug;
>>>    }
>>>    if (/^Query=\s*(\S+)/) {
>>>         # Storing Hits/HSPs for previous query Id
>>>         if (defined $self->{queryName}) {
>>>        my $countSubjects = $self->{countSubjects};
>>>        my $subjects       = $self->{subjects};
>>>        my %hits = (
>>>                    'countSubjects' => $countSubjects,
>>>                    'subjects'      => $subjects
>>>                   );
>>>                          my $hits_by_queryId = 
>>> $self->{hits_by_queryId};                  
>>> $hits_by_queryId->{$self->{queryName}} = \%hits;
>>>      }
>>>         $self->{queryName} = $1;    ##query name/id
>>>
>>>      # Reinitialisation
>>>      $self->{"countSubjects"} = 0 ;
>>>      my @tmp = ();
>>>      $self->{subjects} = \@tmp;                                  } 
>>> elsif (/^\s+\((\S+)\sletters\)/) {
>>>      $self->{queryLength} = $1; ##query length
>>>      $self->{queryLength} =~ s/,//g;
>>>    }
>>>    if (/^\>$regex/ || /ctxfactor/ || ($rpsblast && /^Lambda/)) { 
>>> ##ctxfactor catches the last one...Lambda last of rpsblast
>>>      my $sbjctId;
>>>      if (/^\>$regex/){
>>>        $sbjctId = $1 ? $1 : $2;
>>>      }else{
>>>        print STDERR "$self->{queryName}: Unable to match subject 
>>> using regex '$regex' on:\n  $_" unless (/ctxfactor/ || /^Lambda/);
>>>      }
>>>      print STDERR "Matching subjectId = $sbjctId\n" if $debug;
>>>      ##      print STDERR "REGEX matched $_";
>>>      if ($haveQStart) {
>>>        print STDERR "Have last QStart:", $sbjct->getID()," 
>>> SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, 
>>> Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1;
>>>        ##have the complete HSP...
>>>        ##want to add only if meets minimum reqs....do this on an HSP 
>>> basis...
>>>        #    print $sbjct->getID()," $sStart,$sEnd: 
>>> Length=$matchLength, Percent=$matchPercent, pValue=$pValue\n";
>>>        if ($matchLength >= $minLength && $matchPercent >= 
>>> $minPercent && $pValue <= $minPvalue) {
>>>          print STDERR "Match meets reqs\n" if $debug;
>>>          if($remMaskedFromLen){  ##want to remove Xs from the match 
>>> length....
>>> #            print STDERR "removing X from match\n";
>>> #            print STDERR "Before: $queryMatch\n";
>>>            $queryMatch =~ s/X//g;
>>> #            print STDERR "After: $queryMatch\n";
>>>            if(length($queryMatch) < 10){ ##don't remove if final 
>>> length < 10
>>>              print STDERR "RemoveMaskedResiduesFromLength: 
>>> error...length = ".length($queryMatch)." following removal of X's\n";
>>>            }else{
>>> #              print "match before $matchLength\n";
>>>              $matchLength = length($queryMatch);
>>> #              print "match after $matchLength\n";
>>>            }
>>>          }
>>>          
>>> $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, 
>>> $pValue, $dir, $frame, $score,$identities,$positives) if $sbjct;
>>>        }        $haveQStart = 0;                ##reset for next one...
>>>        $haveSStart = 0;
>>>      }      if (defined $sbjct && $sbjct->countHSPs() >= 1 && 
>>> $sbjct->getID() ne $self->{"queryName"}) {
>>>        $self->addSubject($sbjct);
>>>      }
>>>      ##return heree if end...
>>>      return if (/ctxfactor/ || /^Lambda/);
>>>           ##      print STDERR "New Subject $sbjctId\n";
>>>      $sbjct = CBIL::Bio::Blast::Subject->new($sbjctId) if $sbjctId;
>>>      ##      $sbjct->setID($sbjctId);
>>>         }
>>>    if (/^\s+Length\s=\s(\S+)/) {
>>>      if($sbjct){
>>>        my $sbjctLength = $1;
>>>        $sbjctLength =~ s/,//g;
>>>        $sbjct->setLength($sbjctLength);
>>>        $sbjct->setQueryLength($self->{"queryLength"});    ##set the 
>>> query length for use in      }
>>>      ##end remaining calculations
>>>    }
>>>    if (/^\s*Score\s=\s+(\d+).*=\s(\S+)$/ || ($rpsblast && 
>>> /^\s*Score\s*=\s*\d+\sbits\s\((\d+).*=\s(\S+)$/)) {
>>>      my $tmpScore = $1;
>>>      my $tmpValue = $2;
>>>
>>>      ##need to add here if $haveQStart != 0;
>>>      if ($haveQStart) {
>>>        ##have the complete HSP...
>>>        ##want to add only if meets minimum reqs....do this on an HSP 
>>> basis...
>>>        print STDERR "Adding Sbjct (score) ",$sbjct->getID()," 
>>> SS:$sStart,SE:$sEnd, QS:$qStart, QE:$qEnd, Length=$matchLength, 
>>> Percent=$matchPercent, pValue=$pValue, Frame=$frame\n" if $debug == 1;
>>>        if ($matchLength >= $minLength && $matchPercent >= 
>>> $minPercent && $pValue <= $minPvalue && $sbjct) {
>>>          print STDERR "Adding Sbjct\n" if $debug;
>>>          if($remMaskedFromLen){  ##want to remove Xs from the match 
>>> length....
>>> #            print STDERR "removing X from match\n";
>>> #            print STDERR "Before: $queryMatch\n";
>>>            $queryMatch =~ s/X//g;
>>> #            print STDERR "After: $queryMatch\n";
>>>            if(length($queryMatch) < 10){ ##don't remove if final 
>>> length < 10
>>>              print STDERR "RemoveMaskedResiduesFromLength: 
>>> error...length = ".length($queryMatch)." following removal of X's\n";
>>>            }else{
>>> #              print "match before $matchLength\n";
>>>              $matchLength = length($queryMatch);
>>> #              print "match after $matchLength\n";
>>>            }
>>>          }
>>>          
>>> $sbjct->addHSP($qStart,$qEnd,$sStart,$sEnd,$matchLength,$matchPercent, 
>>> $pValue, $dir, $frame, $score,$identities,$positives);        }
>>>        $haveQStart = 0;                ##reset for next one...
>>>        $haveSStart = 0;
>>>      }
>>>      $pValue = $tmpValue;
>>>      $score = $tmpScore;
>>>    }
>>>    ##get the strand in the Identities line as the following messes 
>>> things up!!!
>>>    ##get the strand...will work for all blast types...
>>> #    if (/^\s*(Minus|Plus)\sStrand/){
>>> #      $strand = $1;
>>> #    }
>>>    ##following specific to blastx
>>>    if ($parseType && 
>>> /^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Frame\s=\s.*(.\d)$/) 
>>> {
>>>      $identities = $1; $matchLength = $2; $matchPercent = $3; 
>>> $positives = $4; $frame = $5;
>>>      $dir = $frame =~ /^\+/ ? 1 : 0;
>>>      ##following specific to blastn
>>>    }elsif 
>>> (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+).*Strand\s=\s(\w+)/) 
>>> {
>>>      $identities = $1; $matchLength = $2; $matchPercent = $3; 
>>> $positives = $4; $strand = $5;
>>>      $dir = $strand eq 'Minus' ? 0 : 1;
>>>      ##following for blastp the default if others not matched...
>>>    }elsif 
>>> (/^\s*Identities\s=\s(\d+)\/(\d+)\s\((\d+)\%\),\sPositives\s=\s(\d+)/){
>>>      $identities = $1; $matchLength = $2; $matchPercent = $3; 
>>> $positives = $4;
>>>      $dir = 1;  ##always 1 for blastp which is what this catches...
>>>    }elsif (/^\s*Frame\s=\s(.\d)/){ ##frame for rpsblast
>>>      $frame = $1;
>>>      $dir = $frame =~ /^\+/ ? 1 : 0;
>>>    }
>>>
>>>    if (/^Query:\s+(\d+)\s(.*)\s(\d+)\s*$/) {
>>>      print STDERR "Matching query: $1 $3\n" if $debug;
>>>      if ($haveQStart == 0) {
>>>        $queryMatch = "";
>>>        $qStart = $1;        $haveQStart = 1;
>>>      }
>>>      $qEnd = $3;
>>>      my $tmpQMatch = $2;
>>>      $tmpQMatch =~ s/\s//g;  ##gets rid of any spaces
>>>      $queryMatch .= $tmpQMatch;
>>>    } elsif (/^Sbjct:\s+(\d+)\s.*\s(\d+)\s*$/) {
>>>      if ($haveSStart == 0) {
>>>        $sStart = $1;        $haveSStart = 1;
>>>      }
>>>      $sEnd = $2;
>>>    }
>>>  }
>>>  
>>>  # Storing Hits/HSPs for the last query Id
>>>     if (defined $self->{queryName}) {
>>>    my $countSubjects = $self->{countSubjects};
>>>    my $subjects       = $self->{subjects};
>>>    my %hits = (
>>>                'countSubjects' => $countSubjects,
>>>                'subjects'      => $subjects
>>>               );
>>>    my $hits_by_queryId = $self->{hits_by_queryId};              
>>> $hits_by_queryId->{$self->{queryName}} = \%hits;
>>>  }
>>>  
>>>  return;
>>> } 
>>> sub setQueryLength {
>>>  my($self,$l) = @_;
>>>  $self->{queryLength} = $l;
>>> }
>>>
>>> sub getQueryLength {
>>>  my($self) = @_;
>>>  return $self->{queryLength};
>>> }
>>>
>>> sub sortSubjectsByMinQueryStart{
>>>  my $self = shift;
>>>  if (!exists $self->{"sortByQStart"}) {
>>>    @{$self->{"sortByQStart"}} = sort{$a->getMinQueryStart() <=> 
>>> $b->getMinQueryStart()} $self->getSubjects();
>>>  }
>>>  return @{$self->{"sortByQStart"}};
>>> }
>>>
>>> sub sortSubjectsByMaxQueryEnd{
>>>  my $self = shift;
>>>  if (!exists $self->{"sortByQEnd"}) {
>>>    @{$self->{"sortByQEnd"}} = sort{$a->getMaxQueryEnd() <=> 
>>> $b->getMaxQueryEnd()} $self->getSubjects();
>>>  }
>>>  return @{$self->{"sortByQEnd"}};
>>> }
>>>
>>> ##detecting chimeras
>>> ############################################################
>>> ## Breakpoint = 0          * OK
>>> ## Breakpoint = 1          > [Ch, AS, OK, !]
>>> ##   Dangling Ends = none  > [Ch, AS, OK] ##     Spanned = 1         
>>> * AS, OK
>>> ##     Spanned = 0         > [Ch, AS, OK]
>>> ##       Deep = 1          * Ch
>>> ##       Deep = 0          * Ch, AS, OK (review)
>>> ##   Dangling Ends = uni   > [Ch, As]
>>> ##     Spanned = 1         > [Ch, AS]
>>> ##       QvsSub = 1        * Ch
>>> ##       QvsSub = 0        * AS
>>> ##     Spanned = 0         > [Ch, AS]
>>> ##       Deep = 1          * Ch
>>> ##       Deep = 0          * Ch, AS (review)
>>> ##   Dangling Ends = bi    > [Ch, !]
>>> ##     Spanned = 1         > [Ch, !]
>>> ##       QvsSub = 1        * Ch
>>> ##       QvsSub = 0        * ! (review) ##not sure about this 
>>> one....should it be OK
>>> ##     Spanned = 0         * Ch
>>> ############################################################   
>>> ##return values: 0=OK, 1=chimera, 2=review no dangling, 3=review 1 
>>> dangling, 4=review 2 dangling ends.
>>> ##add to ichimera obj...
>>> sub detectChimera{
>>>  my($self,$slop) = @_;                    ##$slop is the amount of 
>>> slop in the breakpoint de to blast
>>>  ##loop through each possible chimera breakpoint and test if chimera
>>>  my $qvssub = 1;
>>>  my @span;
>>>  foreach my $c ($self->findChimeraBreakpoints($slop)) {
>>>       @span = 
>>> $self->findSubjectsSpanningChimeraBreakpoint($c->getBreakpoint(),$slop); 
>>>
>>>    $c->{"numberSpanned"} = scalar(@span);
>>>
>>>    ##first dangling == 0
>>>    if ($c->getDangling($slop) == 0) { ##dangle by at least $slop
>>>      if (scalar(@span) > 0) {
>>>        $c->{"isChimera"} = 0;    ##spans and no dangling ends 
>>> therefore OK
>>>      } else {
>>>        if ($c->getDepth() == 1) {
>>>          $c->{"isChimera"} = 1;
>>>        } else {
>>>          $c->{"isChimera"} = 2; ##will requiere review
>>>        }
>>>      }
>>>           ##dangling on only one end...
>>>    } elsif ($c->getDangling($slop) == 1) {
>>>      if (scalar(@span) > 0) {
>>>        $qvssub = 1;                        ##are very 
>>> similar....looking for one that isn't...
>>>        foreach my $sp (@span) {
>>>          if ($sp->compareQueryVsSubject() == 0) {
>>>            ##good subject spanning....looks lke should be 
>>> alternative splicing
>>>            $c->{"isChimera"} = 0;
>>>            $qvssub = 0;
>>>          }
>>>        }
>>>        $c->{"isChimera"} = 1 if $qvssub == 1;      } else {
>>>        if ($c->getDepth() == 1) {
>>>          $c->{"isChimera"} = 1;
>>>        } else {                                #will need to review....
>>>          $c->{"isChimera"} = 3;
>>>        }
>>>      }
>>>           ##dangling on two ends...
>>>    } elsif ($c->getDangling($slop) == 2) {
>>>           if (scalar(@span) > 0) {
>>>        $qvssub = 1;                        ##are very 
>>> similar....looking for one that isn't...
>>>        foreach my $sp (@span) {
>>>          if ($sp->compareQueryVsSubject() == 0) {
>>>            ##unclear here....need to review:
>>>            $c->{"isChimera"} = 4;
>>>            $qvssub = 0;
>>>            ##            last;
>>>          }
>>>        }
>>>        $c->{"isChimera"} = 1 if $qvssub == 1;      } else {
>>>        $c->{"isChimera"} = 1;  ##is a chimera
>>>      }
>>>         } else {       print "ERROR: getDangling returned invalid 
>>> value\n";
>>>    }
>>>  }
>>>  ##now need to go through each of chimeras and output results...
>>>  my $maxChim = 0;
>>>  my $isChimera = 0;
>>>  my $errString = "";
>>>  if (exists $self->{"chimeras"}) {
>>>    ##    $errString = "Chimeras for $self->{'queryName'}\n";
>>>    my $cnt = 1;
>>>    foreach my $chim (@{$self->{"chimeras"}}) {
>>>      last if $chim eq "";            ##this is just a holder that 
>>> indicates that the findChimera
>>>      ##method has been run and there are not chimeras...
>>>      $maxChim = $chim->{'isChimera'} if $chim->{'isChimera'} > 
>>> $maxChim;
>>>      $isChimera = 1 if $chim->{'isChimera'} == 1;
>>>      $errString .= "  Chimera $cnt: $chim->{'isChimera'}, 
>>> danglingEnds=".$chim->getDangling($slop).", 
>>> numberSpanned=$chim->{'numberSpanned'}, depth=".$chim->getDepth().", 
>>> numberLeft=".$chim->countLeft().", numRight=".$chim->countRight().", 
>>> breakpoint=".$chim->getBreakpoint()."\n";
>>>      #      $errString .= $chim->toString();
>>>      $cnt++;
>>>    }
>>>  }
>>>  return ($isChimera,$maxChim,$errString);
>>> }
>>>
>>> sub getChimeras {
>>>  my($self,$slop) = @_;
>>>  if(!exists $self->{'chimeras'}){
>>>    $self->detectChimera($slop);
>>>  }
>>>  return @{$self->{'chimeras'}};
>>> }
>>>
>>> ## returns array of Chimera objs
>>> sub findChimeraBreakpoints{
>>>  my($self,$slop) = @_;
>>>  if ($self->{'findChimera'} != 1) {
>>>    my @l = $self->sortSubjectsByMaxQueryEnd();
>>>    my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want 
>>> right one reversed
>>>    my $lindex = 0;
>>>    my $rindex = 0;
>>>    for (my $li = $lindex;$li < scalar(@l);$li++) {
>>>      last if $l[$li]->getMaxQueryEnd() - 
>>> $r[$rindex]->getMinQueryStart() > $slop;
>>>      for (my $ri = $rindex;$ri < scalar(@r);$ri++) {
>>>        last if $l[$li]->getMaxQueryEnd() - 
>>> $r[$ri]->getMinQueryStart() > $slop;
>>>        if (abs($l[$li]->getMaxQueryEnd() - 
>>> $r[$ri]->getMinQueryStart()) <= $slop) {
>>>          ##create new chimera object and pass arrays into extendBrkpt()
>>>          ($li,$ri) = $self->buildChimera($li,$ri,$slop); ##pass back 
>>> the index of last l and r
>>>          last;
>>>        }
>>>      }
>>>    }
>>>    $self->{'findChimera'} = 1;    ##indicates that have done the 
>>> analysis
>>>  }
>>>  if (!exists $self->{'chimeras'}) {
>>>    return;
>>>  }
>>>  return @{$self->{"chimeras"}};
>>> }
>>>
>>> sub buildChimera{
>>>  my($self,$lindex,$rindex,$slop) = @_;
>>>  my @l = $self->sortSubjectsByMaxQueryEnd();
>>>  my @r = reverse($self->sortSubjectsByMinQueryStart()); ##want right 
>>> one reversed
>>>  my $chim = CBIL::Bio::Blast::Chimera->new();
>>>  ##now add the current and any others to the chimera
>>>  ## the first time in loop will add the current one to $chim
>>>  ##first do left
>>>  my $ledge = $l[$lindex]->getMaxQueryEnd();
>>>  for (my $li = $lindex;$li < scalar(@l);$li++) {
>>>    if ($l[$li]->getMaxQueryEnd() - $ledge <= $slop) {
>>>      $chim->addLeftSubject($l[$li]);
>>>    } else {
>>>      $lindex = $li;
>>>      last;
>>>    }
>>>    $lindex = $li;
>>>  }
>>>  ##then do right
>>>  my $redge = $r[$rindex]->getMinQueryStart();
>>>  for (my $ri = $rindex;$ri < scalar(@r);$ri++) {
>>>    if ($redge - $r[$ri]->getMinQueryStart() <= $slop) {
>>>      $chim->addRightSubject($r[$ri]);    } else {
>>>      $rindex = $ri;
>>>      last;
>>>    }
>>>    $rindex = $ri;
>>>  }
>>>  push(@{$self->{"chimeras"}},$chim);
>>>  return($lindex,$rindex);
>>> }
>>>
>>> ##returns array of subjects
>>> ##$slop is amount by which it needs to extend beyond the brkpt.
>>> sub findSubjectsSpanningChimeraBreakpoint{
>>>  my($self,$loc,$slop) = @_;
>>>  my @span;
>>>  foreach my $s ($self->sortSubjectsByMinQueryStart()) {
>>>    last if $s->getMinQueryStart() > $loc - $slop;
>>>    if ($s->getMaxQueryEnd() > $loc + $slop) {
>>>      ##      print "Spans $loc: ",$s->getMinQueryStart(), 
>>> "-",$s->getMaxQueryEnd(),"\n";
>>>      push(@span,$s);
>>>    }
>>>  }
>>>  return @span;
>>> }
>>>
>>> ########################################
>>> ## Detecting unblocked repeats...
>>> ## If No stack - OK
>>> ## ## If Stack
>>> ## ##    If No overhanging ends in query then OK (query is subsumed 
>>> by subjects)
>>> ##       -will block the query sequences and throw out sequences 
>>> that do not have
>>> ##             >50 bp of un-blocked sequence
>>> ## ##    If Overhanging ends in query (may be only one if repeat is 
>>> at end)
>>> ## ##       If 1 HSP then likely REPEAT => ignore all matches that 
>>> don't extend
>>> ##          outside this region.
>>> ##          ##          Overhanging ends in subjects = 1 => 
>>> definitely REPEAT
>>> ## ##          Overhanging ends in subjects = 0 => ??very unlikely 
>>> scenario..not clear
>>> ## ##       If >1 HSP (could be multiple copies of same repeat or 
>>> perhaps alt. splicing
>>> ##         OR could it be a valid match that also has a repeat in 
>>> it...no should in
>>> ##         this case match entirely and have just 1 HSP)
>>> ## ##          If subject locations overlap/superimpose then REPEAT 
>>> ##             =ignore all matches that do not extend outside these 
>>> locations
>>> ##                     between them would be "outside" the locations
>>> ##          ##          If subject locations distinct then likely 
>>> alternative splicing
>>> ##             OR alternatively the query and subject could contain 
>>> two distinct
>>> ##               different repeats (I will be blocking for repeats 
>>> so the likelihood
>>> ##               that there are two different repeats that are both 
>>> novel is small).
>>> ## ##             ?do I want to do further analysis to detemine 
>>> consistency with
>>> ##                 alternative splicing model before ruling OK...?
>>> ## ########################################
>>>
>>> ##this method is the call to find repeats...should return an array 
>>> of subjects to
>>> ##ignore...
>>>
>>> ## find 5' edge (minQueryStart) and gather up then check to see if 
>>> there is a ## correlation with 3' (maxQueryEnd)
>>> ## edge...if so is potential repeat and need to check the above rules..
>>>
>>> #what to return here....region to ignore when determining good 
>>> matches...
>>>
>>> sub detectRepeats {
>>>  my($self,$slop) = @_;
>>>
>>>  if ($self->getSubjectCount() < 5 ) { ##need at least 5 to detecxt 
>>> repeat...
>>>    return;
>>>  }
>>>
>>>  my @rep = $self->findRepeats($slop); ##array of repeats....
>>>
>>>  if ($debug == 1) {
>>>    print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n";
>>>    foreach my $r (@rep) {
>>>      print STDERR $r->toString(1),"\n";
>>>    }
>>>  }
>>>
>>>  ##want to merge repeats if they are overlapping...
>>>  my @sort = sort{$a->getRepeatStart() <=> $b->getRepeatStart()} @rep;
>>>  undef @rep;                                        ##use for merged 
>>> ones...
>>>  my $merge = 0;
>>>  my $nr;
>>>  for (my $i = 0;$i<scalar(@sort) - 1;$i++) {
>>>    if ($sort[$i+1]->getRepeatStart() < $sort[$i]->getRepeatEnd()) {
>>>      ##are overlapping...merge
>>>      if ($merge == 0) {
>>>        $nr = CBIL::Bio::Blast::Repeat->new();
>>>        $merge = 1;
>>>      }
>>>      foreach my $s ($sort[$i]->getSubjects()) {
>>>        $nr->addSubject($s);
>>>      }
>>>    } elsif ($merge == 1) {
>>>      $merge = 0;
>>>      foreach my $s ($sort[$i]->getSubjects()) {
>>>        $nr->addSubject($s);
>>>      }
>>>      push(@rep,$nr);
>>>    } else {
>>>      ##does not overlap so just add to rep
>>>      push(@rep,$sort[$i]);
>>>    }
>>>  }
>>>  ##need to do last if $merge == 1;
>>>  if ($merge == 1) {                        ##need to add the i+1 
>>> (last) case
>>>    foreach my $s ($sort[-1]->getSubjects()) {
>>>      $nr->addSubject($s);
>>>    }
>>>    push(@rep,$nr);
>>>  }
>>>  ##repeats merged and should be non-overlapping
>>>
>>>  if ($debug == 1) {
>>>    print STDERR "detectRepeats returns ",scalar(@rep), " repeats:\n";
>>>    foreach my $r (@rep) {
>>>      print STDERR $r->toString(1),"\n";
>>>    }
>>>  }
>>>  return @rep;
>>> }
>>>
>>> ##finds all repeat stacks..
>>> # populates array of subjects that are repeats ... 
>>> @{$self->{'repeats'}}
>>>
>>> sub findRepeats{
>>>  my($self,$slop) = @_;
>>>  my $rep;
>>>  if (!exists $self->{'repeats'}) {
>>>    my $firstElement = 1;
>>>    my $inRepeat = 0;
>>>    my @s = $self->sortSubjectsByMinQueryStart();
>>>    ###NOTE:  want to only take repeat from a single point rather 
>>> than each minQuery start
>>>    my $lEdge = $s[0]->getMinQueryStart(); ##this is the first edge
>>>    for (my  $i = 0;$i < scalar(@s) - 1;$i++) {      if 
>>> ($s[$i+1]->getMinQueryStart() - $lEdge <= $slop) {
>>>        if ($firstElement == 1) {
>>>          $rep = CBIL::Bio::Blast::Repeat->new(); ##create new Repeat 
>>> object..
>>>          $firstElement = 0;
>>>          $inRepeat = 1;
>>>        }
>>>        print STDERR "REPEAT: adding subject ",$s[$i]->getID(),"\n" 
>>> if $debug == 1;
>>>        $rep->addSubject($s[$i]);
>>>        if ($i == scalar(@s) - 2) {    ##need to add next one as am 
>>> in last loop
>>>          $rep->addSubject($s[$i+1]);
>>>          my $verified = $rep->verifyRepeat($slop); ##does all the 
>>> verification that this is a repeat
>>>          push(@{$self->{'repeats'}},$rep) if $verified == 1;
>>>        }
>>>             } elsif ($inRepeat == 1) { ##need this to get the $i+1 
>>> as am working one ahead
>>>        $lEdge = $s[$i]->getMinQueryStart(); ##new left edge
>>>        $firstElement = 1;            ##now looking for the first 
>>> element
>>>        $inRepeat = 0;
>>>        ##add the last one nd then verify repeats....Repeat method..
>>>        ##if really a repeat then add to @$self->{'repeats'}
>>>        $rep->addSubject($s[$i]);
>>>        print STDERR "Repeat contains ",$rep->countSubjects(), " 
>>> subjects\n" if $debug == 1;
>>>        $rep->verifyRepeat($slop); ##does all the verification that 
>>> this is a repeat
>>>        push(@{$self->{'repeats'}},$rep) if($rep->countSubjects() > 1 
>>> && $rep->getVerified() == 1);
>>>      } else {
>>>        $lEdge = $s[$i]->getMinQueryStart(); ##new left edge
>>>        $firstElement = 1;            ##now looking for the first 
>>> element
>>>      }
>>>    }
>>>       if (!exists $self->{'repeats'}) {    ##there aren't any....make 0
>>>      $self->{'repeats'}->[0] = "";
>>>    }
>>>  }
>>>  if ($self->{'repeats'}->[0] eq "") {
>>>    return;
>>>  }
>>>  return @{$self->{'repeats'}};
>>> }
>>>
>>> ##NOTE:  the repeat object is initially populated with things that 
>>> have a common
>>> ## left edge.  It is the job of the Repeat to then verify that this 
>>> constitutes
>>> ## a repeat and remove any subjects that don't meet ehe rules for a 
>>> repeat stack above
>>> #3will need a 'verified' bit to know if the repeat bj is 
>>> verified...else verify
>>> ## before doing anyhing else.
>>>
>>>
>>> sub getQuery5endMis{
>>>  my $self = shift;
>>>  if (!exists $self->{"query5pMis"}) {
>>>    $self->{"query5pMis"} = 10000; ##arbritrarily large as want smallest
>>>    foreach my $s ($self->getSubjects()) {
>>>      if ($self->{"query5pMis"} > $s->getQuery5endMis()) {
>>>        $self->{"query5pMis"} = $s->getQuery5endMis();
>>>      }    }
>>>  }
>>>  return $self->{"query5pMis"};
>>> }
>>>
>>> sub getQuery3endMis{
>>>  my $self = shift;
>>>  if (!exists $self->{"query3pMis"}) {
>>>    $self->{"query3pMis"} = 10000; ##arbritrarily large as want smallest
>>>    foreach my $s ($self->getSubjects()) {
>>>      if ($self->{"query3pMis"} > $s->getQuery3endMis()) {
>>>        $self->{"query3pMis"} = $s->getQuery3endMis();
>>>      }    }
>>>  }
>>>  return $self->{"query3pMis"};
>>> }
>>>
>>> ##returns 0 if leftover at ends and 1 if covered to ends
>>> sub getCoveredBySubjects{
>>>  my $self = shift;
>>>  my $left = shift;                            ##amount to be 
>>> leftover....
>>>  if (! $left) { $left = 20; }
>>>  if ($self->getQuery5endMis() <= $left && $self->getQuery3endMis() 
>>> <= $left) {
>>>    return 1;
>>>  }
>>>  return 0;
>>> }
>>>
>>> sub analyzeSubjectsBySubject{
>>>  my $self = shift;
>>>  return if $self->getSubjectCount() == 0;
>>>  my @singletons;
>>>  my %multiples;
>>>  my $haveImmuno = 0;                        ##keeps track if is 
>>> immunoglobulin gene...
>>>  my $order = 0;                                ##order
>>>  my $orientation = 0;                    ##orientation
>>>  my $subOver = 0;                            ##number of subject 
>>> overlaps
>>>  my $subGaps = 0;                            ##number of subject Gaps
>>>  my @multiples;
>>>  foreach my $s ($self->getSubjects()) {
>>>    my $id = $s->getID();
>>>    #    if($id <= 100009){
>>>    #      $haveImmuno = 1;
>>>    #      next;  ##don't want to include these in the analysis
>>>    #    }
>>>    if ($s->countHSPs() == 1) {    ##don't print out singletons HSPS
>>>      push(@singletons,$id);
>>>      next;
>>>    }
>>>    push(@multiples,$id);
>>>    print "\n\>DT.",$id,", Length=",$s->getLength(),", Num 
>>> HSPs=",$s->countHSPs(),($s->checkConsistency() ? " Order and Orient 
>>> Consistent" : " Order or Orient InConsistent"),"\n" if $debug == 1;
>>>    ($subOver,$subGaps) = $s->checkSubjectGaps();
>>>    print "\tSubject Overlaps = $subOver, SubjectGaps = $subGaps\n" 
>>> if $debug == 1;
>>>    $order = $s->checkOrderConsistency();
>>>    $orientation = $s->checkOrientationConsistency();
>>>    print STDERR $s->toString() if $order + $orientation != 2;
>>>    $multiples{$id} = 
>>> [$s->countHSPs(),$s->countContainedQueryHSPs(),$order,$orientation,$subOver 
>>> == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0]; ##this is the data for multiples!
>>>  }
>>>  print "\nSingletons: ",scalar(@singletons),"\n" if $debug == 1;
>>>
>>>  #  if($haveImmuno == 1){print "MATCHES immunoglobulin genes\n";}
>>>  return(\@singletons,\%multiples,$haveImmuno);
>>> }
>>>
>>> sub getStats{
>>>  my $self = shift;
>>>  my $minNum = shift;                        ##minimum number of HSPs
>>>  my $ret;
>>>  #  $minNum eq "" ? $minNum = 1 : $minNum;
>>>  print "QueryName = $self->{queryName}, QueryLength = 
>>> $self->{queryLength}\n";
>>>  if ($self->{"countSubjects"} == 0) {
>>>    $ret = "\tNo Hits\n";
>>>    return $ret;
>>>  }
>>>  foreach my $s ($self->getSubjects()) {
>>>    next if $s->countHSPs() < $minNum; ##don't print out singletons HSPS
>>>    $ret .= $s->toString();
>>>  }
>>>  return $ret;
>>> }
>>>
>>> sub printStats {
>>>  my($self,$minNum) = @_;
>>>  print $self->getStats($minNum);
>>> }
>>>
>>> sub addSubject{
>>>  my($self,$sbjct) = @_;
>>>  push(@{$self->{subjects}},$sbjct);
>>>  $self->{"countSubjects"}++;
>>> }
>>>
>>> sub getSubjectCount{
>>>  my $self = shift;
>>>  # return $self->{"countSubjects"};
>>>  
>>>  my $queryId = shift;
>>>  my $hits_by_queryId = $self->{hits_by_queryId};
>>>  my $hits = $hits_by_queryId->{$queryId};
>>>  return $hits->{countSubjects};
>>> }
>>>
>>> sub getSubjects{
>>>  my $self = shift;
>>>  # return if $self->{"countSubjects"} == 0;
>>>  # return @{$self->{subjects}};
>>>  
>>>  my $queryId = shift;
>>>  my $hits_by_queryId = $self->{hits_by_queryId};
>>>  my $hits     = $hits_by_queryId->{$queryId};
>>>  my $count    = $hits->{countSubjects};
>>>  my $subjects = $hits->{subjects};
>>>  
>>>  return if ($count == 0);   return @$subjects;
>>> }
>>>
>>> sub getQueryID{
>>>  my $self = shift;   # return $self->{"queryName"};
>>>  
>>>  my $index = $self->{queryIndex};
>>>  my $hits_by_queryId = $self->{hits_by_queryId};
>>>  my @queryIds = keys (%$hits_by_queryId);
>>>  if ($index < @queryIds) {
>>>    $self->{queryIndex} = $index + 1;
>>>    my $queryId = $queryIds[$index];
>>>    return $queryId;
>>>  }
>>>  else {
>>>    return undef;
>>>  }  }
>>>
>>> sub printThreshHoldHits {
>>>  my $self = shift;
>>>  return if $self->{"countSubjects"} == 0;
>>>  foreach my $s ($self->getSubjects()) {
>>>    print $self->getQueryID(), "\t",$s->getID(),"\n";
>>>  }
>>> }
>>>
>>> sub printGeneModels{
>>>  my $self = shift;
>>>  my @cgm;
>>>  print "QueryName = $self->{queryName}, QueryLength = 
>>> $self->{queryLength}\n";
>>>  if ($self->{"countSubjects"} == 0) {
>>>    print "\tNo Hits\n";
>>>    return;
>>>  }
>>>  foreach my $s ($self->getSubjects()) {
>>>    print "\n",$s->toString();
>>>    @cgm = $s->getConsistentGeneModel();
>>>    print "  CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? "Yes" : 
>>> "No"), ", ",scalar(@cgm), " HSPs\n";
>>>    foreach my $h (@cgm) {
>>>      print "    ",$h->toString();
>>>    }
>>>  }
>>> }
>>>
>>> sub compareGeneModelWithTotCons{
>>>  my $self = shift;
>>>  #  print "\nQueryName = $self->{queryName}, QueryLength = 
>>> $self->{queryLength}\n";
>>>  my $cgmgt1 = 0;
>>>  my $corCGM = 0;
>>>  my $totCor = 0;
>>>  my $totSubs = 0;
>>>  my @vals;
>>>  foreach my $s ($self->getSubjects()) {
>>>    next if $s->countHSPs() < 2; ##don't do singletons
>>>    $totSubs++;
>>>    #    print "\n",$s->toString();
>>>    #    print $s->getID(),": ",$s->countHSPs()," HSPs\n";
>>>    #    print "  CONSISTENT GENE MODEL: ",($s->checkGoodCGM() ? 
>>> "Yes" : "No"), ", ",scalar($s->getConsistentGeneModel()), " HSPs\n";
>>>    my($subOver,$subGaps) = $s->checkSubjectGaps();
>>>    @vals = 
>>> ($s->checkOrderConsistency(),$s->checkOrientationConsistency(),$subOver 
>>> == 0 ? 1 : 0,$subGaps == 0 ? 1 : 0);
>>>    #    print "  Old Analysis: Order=$vals[0], Orient=$vals[1], 
>>> Over=$vals[2], Gaps=$vals[3]\n";
>>>    $totCor++ if $vals[0]+$vals[1]+$vals[2]+$vals[3] == 4;
>>>    $corCGM += $s->checkGoodCGM();
>>>    $cgmgt1++ if scalar($s->getConsistentGeneModel()) > 1;
>>>  }
>>>  return($totSubs,$corCGM,$totCor,$cgmgt1);
>>> }
>>>
>>>
>>> 1;
>>>  
>>>