#2 Avoid loading genome into memory by using indexed fasta

Milestone: 1.0

Status: closed

Owner: Giuseppe Narzisi

Labels: None

Priority: 1

Updated: 2014-05-27

Created: 2014-05-24

Creator: Miika Ahdesmaki

Private: No

Hi there,
Here's a way to speed up Scalpel, especially when there are only a few regions to consider: fetch each region with samtools from an indexed genome instead of loading the whole reference into memory. This should also lower memory consumption.

In FindVariants.pl remove lines:
    :::perl
    my %genome;
    loadGenomeFasta($REF, \%genome);
    die "Undefined sequence ($chr)\n" if (!exists($genome{$chr}));
    my $seq = substr($genome{$chr}->{seq}, $left-1, $right-$left+1);
    for my $k (keys %genome) { delete $genome{$k}; }

Add these lines in place of the `my $seq` assignment above:

    :::perl
    my ($header, $seq) = split(/\n/, `samtools faidx $REF $chr:$left-$right`, 2);
    $seq =~ s/\s+//g;

(this would be easier on Github with pull requests!)
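The replacement above can be sketched as a small set of helpers with some error checking. This is only a sketch under assumptions: the sub names `region_string`, `parse_faidx_output` and `fetch_region` are hypothetical and not part of Scalpel, and `samtools` is assumed to be on the PATH. It treats empty output or an empty sequence as a missing region, mirroring the original "Undefined sequence" check.

```perl
use strict;
use warnings;

# Build a samtools region string from 1-based, inclusive coordinates.
sub region_string {
    my ($chr, $left, $right) = @_;
    return "$chr:$left-$right";
}

# Split raw `samtools faidx` output into its header line and sequence.
# The sequence comes back as one string with all whitespace removed.
sub parse_faidx_output {
    my ($raw) = @_;
    $raw = '' unless defined $raw;
    my ($header, $seq) = split(/\n/, $raw, 2);
    die "Unexpected samtools faidx output\n"
        if !defined $header || $header !~ /^>/;
    $seq = '' unless defined $seq;
    $seq =~ s/\s+//g;
    return ($header, $seq);
}

# Hypothetical wrapper: run samtools and return the sequence of one region.
sub fetch_region {
    my ($ref, $chr, $left, $right) = @_;
    my $region = region_string($chr, $left, $right);
    # The list form of open does not pass the arguments through a shell.
    open(my $fh, '-|', 'samtools', 'faidx', $ref, $region)
        or die "Cannot run samtools: $!\n";
    my $raw = do { local $/; <$fh> };
    close($fh) or die "samtools faidx failed for $region\n";
    my (undef, $seq) = parse_faidx_output($raw);
    # An empty sequence is treated as a missing region.
    die "Undefined sequence ($chr)\n" if $seq eq '';
    return $seq;
}
```

With these in place, the single line in FindVariants.pl could read `my $seq = fetch_region($REF, $chr, $left, $right);`.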

Discussion

Giuseppe Narzisi - 2014-05-27

Thank you for your feedback!
Yes, these edits to the code will produce a speed-up when working on very few regions. However, they might not be advisable for a very large number of regions (~millions).
They also require "samtools" to be installed and available on the command line for all users.
I might add this patch in the future as an optional feature/parameter...
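One way such an optional feature could be gated is sketched below. Everything here is an assumption for illustration: `$want_faidx` stands for a hypothetical user option (not an existing Scalpel parameter), and the sub names are made up. The idea is to use the indexed lookup only when the user asks for it and `samtools` is actually found on the PATH, and otherwise fall back to loading the genome as before.

```perl
use strict;
use warnings;
use File::Spec;

# Return the full path of an executable found on PATH, or undef if absent.
sub find_in_path {
    my ($prog) = @_;
    for my $dir (File::Spec->path()) {
        my $candidate = File::Spec->catfile($dir, $prog);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

# Decide whether the samtools-based lookup should be used.
# $want_faidx is a hypothetical user-supplied switch.
sub use_indexed_lookup {
    my ($want_faidx) = @_;
    return 0 unless $want_faidx;
    return 0 unless defined find_in_path('samtools');
    return 1;
}
```

Since the samtools approach starts one external process per region, keeping the in-memory path as the default would preserve the current behaviour for runs over a very large number of regions.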


Giuseppe Narzisi - 2014-05-27

status: open --> accepted

Miika Ahdesmaki - 2014-05-27

Thanks Giuseppe,
Bcbio-nextgen parallelises variant calling by splitting the BAM files and BED regions and then submitting only small pieces to the individual callers (multiple times, in multiple threads), so I'll keep this edit in my fork for now.


Giuseppe Narzisi - 2014-05-27

status: accepted --> closed
