Since I only have 1 reference sequence (bacterial), I am trying to parallelize lofreq call (2.0.0-rc-1) by providing BED regions (100k each) to different instances. When I check the results against a single instance, though, they are slightly different. Here are some examples of unique variants to each type:
Single instance: gi|16120353|ref|NC_003143.1| . A G 100 PASS DP=4895;AF=0.006129;SB=38;DP4=3252,1612,29,1
With BED file (region 1-100000): gi|16120353|ref|NC_003143.1| . G A 67 PASS DP=5261;AF=0.003421;SB=9;DP4=1609,3634,9,9
In total, the merged BED calls have about 15% more variants than the single instance, and the differences don't seem to be anywhere near the region borders. Both commands used lofreq call with default parameters, except the BED file. Is there some computation that depends on having all the reads? Or maybe something to do with mates? I guess an alternative is to divide up the reference and use call-parallel, but that would require post-processing.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
please don't implement parallelization manually. This will mess up the
automatically applied Bonferroni correction. Use 'call-parallel'
instead. Both is explained in detail below:
The effect you see is because the automatic determined Bonferroni
factor that goes into the SNV quality filtering differs between the
two types of runs you execute: one run sees the whole chromosome (all
regions) and therefore uses a higher Bonferroni factor compared to the
several split runs using small regions and therefore using smaller
Bonferroni factors independently (which is why you see more calls).
The easiest way to achieve parallelization is to use 'call-parallel'.
This now also works on single chromosomes and supports the same
options as 'call' and requires to set the number of threads with
--pp-threads.
Having said all this, I have to admit that call-parallel in 2.0.0-rc-1
has an embarrassing bug that will make it crash when using chromosomes
with e.g. pipe characters, as in your case. I'll upload a fixed
version tomorrow. If you're impatient you could just change the
following line in src/scripts/lofreq2_call_pparallel.py
cmd += " -r %s -o %s/%d.vcf > %s/%d.log 2>&1" % (
to
cmd += ' -r "%s" -o %s/%d.vcf > %s/%d.log 2>&1' % (
Hope that helps - and if it doesn't, please don't hesitate and let me know!
Andreas
Since I only have 1 reference sequence (bacterial), I am trying to
parallelize lofreq call (2.0.0-rc-1) by providing BED regions (100k each) to
different instances. When I check the results against a single instance,
though, they are slightly different. Here are some examples of unique
variants to each type:
Single instance:
gi|16120353|ref|NC_003143.1| . A G 100 PASS
DP=4895;AF=0.006129;SB=38;DP4=3252,1612,29,1
With BED file (region 1-100000):
gi|16120353|ref|NC_003143.1| . G A 67 PASS
DP=5261;AF=0.003421;SB=9;DP4=1609,3634,9,9
In total, the merged BED calls have about 15% more variants than the single
instance, and the differences don't seem to be anywhere near the region
borders. Both commands used lofreq call with default parameters, except the
BED file. Is there some computation that depends on having all the reads? Or
maybe something to do with mates? I guess an alternative is to divide up the
reference and use call-parallel, but that would require post-processing.
Thanks, that makes sense. Is single-chromosome parallelization in 2.0.0-rc-1? For me it is still exiting with a warning that there is no point because there is only one sequence header.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
You're right: parallel calls on single chromosomes were not
implemented in RC-1. This was implemented for RC-2 end of May, but I
never pushed this to sourceforge. Please check again later today for a
new release. Sorry about the confusion.
Thanks, that makes sense. Is single-chromosome parallelization in
2.0.0-rc-1? For me it is still exiting with a warning that there is no point
because there is only one sequence header.
Thanks for posting the release. Version 2.0.0 (Linux) does get past the previous error, but the individual jobs now fail with "Need exactly one BAM file as last argument" in their tmp logs.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Hi Brian,
this is very likely caused by an error in the argument list, i.e. wrong
usage. Could you check again or post it here?
Thanks, Andreas
On 26 Jul 2014 01:14, "Brian Ondov" ondovb@users.sf.net wrote:
Thanks for posting the release. Version 2.0.0 (Linux) does get past the
previous error, but the individual jobs now fail with "Need exactly one BAM
file as last argument" in their tmp logs.
thanks for pointing this out! Parsing of the "--cons-as-ref" option
was indeed broken (LoFreq complains about a missing argument while it
really doesn't need one). I fixed this, but wouldn't recommend it's
use anyway, since it's a rather experimental flag that wasn't
benchmarked in a while...
Since I only have 1 reference sequence (bacterial), I am trying to parallelize
lofreq call
(2.0.0-rc-1) by providing BED regions (100k each) to different instances. When I check the results against a single instance, though, they are slightly different. Here are some examples of unique variants to each type:Single instance:
gi|16120353|ref|NC_003143.1| . A G 100 PASS DP=4895;AF=0.006129;SB=38;DP4=3252,1612,29,1
With BED file (region 1-100000):
gi|16120353|ref|NC_003143.1| . G A 67 PASS DP=5261;AF=0.003421;SB=9;DP4=1609,3634,9,9
In total, the merged BED calls have about 15% more variants than the single instance, and the differences don't seem to be anywhere near the region borders. Both commands used
lofreq call
with default parameters, except the BED file. Is there some computation that depends on having all the reads? Or maybe something to do with mates? I guess an alternative is to divide up the reference and usecall-parallel
, but that would require post-processing.Hi Brian,
please don't implement parallelization manually. This will mess up the
automatically applied Bonferroni correction. Use 'call-parallel'
instead. Both is explained in detail below:
The effect you see is because the automatic determined Bonferroni
factor that goes into the SNV quality filtering differs between the
two types of runs you execute: one run sees the whole chromosome (all
regions) and therefore uses a higher Bonferroni factor compared to the
several split runs using small regions and therefore using smaller
Bonferroni factors independently (which is why you see more calls).
The easiest way to achieve parallelization is to use 'call-parallel'.
This now also works on single chromosomes and supports the same
options as 'call' and requires to set the number of threads with
--pp-threads.
Having said all this, I have to admit that call-parallel in 2.0.0-rc-1
has an embarrassing bug that will make it crash when using chromosomes
with e.g. pipe characters, as in your case. I'll upload a fixed
version tomorrow. If you're impatient you could just change the
following line in src/scripts/lofreq2_call_pparallel.py
cmd += " -r %s -o %s/%d.vcf > %s/%d.log 2>&1" % (
to
cmd += ' -r "%s" -o %s/%d.vcf > %s/%d.log 2>&1' % (
Hope that helps - and if it doesn't, please don't hesitate and let me know!
Andreas
On 24 July 2014 22:06, Brian Ondov ondovb@users.sf.net wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
Thanks, that makes sense. Is single-chromosome parallelization in 2.0.0-rc-1? For me it is still exiting with a warning that there is no point because there is only one sequence header.
You're right: parallel calls on single chromosomes were not
implemented in RC-1. This was implemented for RC-2 end of May, but I
never pushed this to sourceforge. Please check again later today for a
new release. Sorry about the confusion.
Andreas
On 25 July 2014 00:21, Brian Ondov ondovb@users.sf.net wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
Thanks for posting the release. Version 2.0.0 (Linux) does get past the previous error, but the individual jobs now fail with "Need exactly one BAM file as last argument" in their tmp logs.
Hi Brian,
this is very likely caused by an error in the argument list, i.e. wrong
usage. Could you check again or post it here?
Thanks, Andreas
On 26 Jul 2014 01:14, "Brian Ondov" ondovb@users.sf.net wrote:
It actually seems to have to do with the
--cons-as-ref
option. Without that, it works. I was running as:lofreq call-parallel --cons-as-ref -f ../ref.fna -o lofreq.par.vcf --pp-threads 16 merged.bam
Hi Brian,
thanks for pointing this out! Parsing of the "--cons-as-ref" option
was indeed broken (LoFreq complains about a missing argument while it
really doesn't need one). I fixed this, but wouldn't recommend it's
use anyway, since it's a rather experimental flag that wasn't
benchmarked in a while...
Best,
Andreas
On 28 July 2014 22:54, Brian Ondov ondovb@users.sf.net wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC