Menu

#2 Avoid loading genome into memory by using indexed fasta

1.0
closed
None
1
2014-05-27
2014-05-24
No

Hi there,
Here's a way to speed up Scalpel, especially when there are fewer regions to consider. This is by way of samtools and an indexed genome. Should also lower memory consumption.

In FindVariants.pl remove lines:
:::perl

my %genome;

loadGenomeFasta($REF, \%genome);

die "Undefined sequence ($chr)\n" if (!exists($genome{$chr}));

my $seq = substr($genome{$chr}->{seq}, $left-1, $right-$left+1);

for my $k (keys %genome) { delete $genome{$k}; }

Add these lines to replace the above my $seq
:::perl

my ($header, $seq) = split(/\n/, samtools faidx $REF $chr:$left-$right, 2);

seq =~ s/[\n\r\s]+//g;

(this would be easier on Github with pull requests!)

Discussion

  • Giuseppe Narzisi

    Thank you for your feedback!
    Yes, this edits to the code will produce a speed up when working on very few regions. However, this might not be advisable for a very large number of regions (~millions).
    Also it requires "samtools" to be installed and available at command line by all the users.
    I might add this patch in the future as an optional feature/parameter...

     
  • Giuseppe Narzisi

    • status: open --> accepted
     
  • Miika Ahdesmaki

    Miika Ahdesmaki - 2014-05-27

    Thanks Giuseppe,
    Bcbio-nextgen parallelises variant calling by splitting the bam files and bed regions and then only submits small bits and pieces to the individual callers (multiple times in multiple threads) so I'll keep this edit in my fork for now.

     
  • Giuseppe Narzisi

    • status: accepted --> closed
     

Log in to post a comment.