LoFreq / Discussion / General Discussion: Differing results with BED regions

Brian Ondov - 2014-07-24

Since I only have 1 reference sequence (bacterial), I am trying to parallelize lofreq call (2.0.0-rc-1) by providing BED regions (100k each) to different instances. When I check the results against a single instance, though, they are slightly different. Here are some examples of unique variants to each type:

Single instance:
gi|16120353|ref|NC_003143.1| . A G 100 PASS DP=4895;AF=0.006129;SB=38;DP4=3252,1612,29,1

With BED file (region 1-100000):
gi|16120353|ref|NC_003143.1| . G A 67 PASS DP=5261;AF=0.003421;SB=9;DP4=1609,3634,9,9

In total, the merged BED calls have about 15% more variants than the single instance, and the differences don't seem to be anywhere near the region borders. Both commands used lofreq call with default parameters, except the BED file. Is there some computation that depends on having all the reads? Or maybe something to do with mates? I guess an alternative is to divide up the reference and use call-parallel, but that would require post-processing.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Andreas Wilm - 2014-07-24
  
  Hi Brian,
  
  please don't implement parallelization manually. This will mess up the
  automatically applied Bonferroni correction. Use 'call-parallel'
  instead. Both is explained in detail below:
  
  The effect you see is because the automatic determined Bonferroni
  factor that goes into the SNV quality filtering differs between the
  two types of runs you execute: one run sees the whole chromosome (all
  regions) and therefore uses a higher Bonferroni factor compared to the
  several split runs using small regions and therefore using smaller
  Bonferroni factors independently (which is why you see more calls).
  
  The easiest way to achieve parallelization is to use 'call-parallel'.
  This now also works on single chromosomes and supports the same
  options as 'call' and requires to set the number of threads with
  --pp-threads.
  
  Having said all this, I have to admit that call-parallel in 2.0.0-rc-1
  has an embarrassing bug that will make it crash when using chromosomes
  with e.g. pipe characters, as in your case. I'll upload a fixed
  version tomorrow. If you're impatient you could just change the
  following line in src/scripts/lofreq2_call_pparallel.py
  cmd += " -r %s -o %s/%d.vcf > %s/%d.log 2>&1" % (
  to
  cmd += ' -r "%s" -o %s/%d.vcf > %s/%d.log 2>&1' % (
  
  Hope that helps - and if it doesn't, please don't hesitate and let me know!
  Andreas
  
  On 24 July 2014 22:06, Brian Ondov ondovb@users.sf.net wrote:
  
  Since I only have 1 reference sequence (bacterial), I am trying to
  parallelize lofreq call (2.0.0-rc-1) by providing BED regions (100k each) to
  different instances. When I check the results against a single instance,
  though, they are slightly different. Here are some examples of unique
  variants to each type:
  
  Single instance:
  gi|16120353|ref|NC_003143.1| . A G 100 PASS
  DP=4895;AF=0.006129;SB=38;DP4=3252,1612,29,1
  
  With BED file (region 1-100000):
  gi|16120353|ref|NC_003143.1| . G A 67 PASS
  DP=5261;AF=0.003421;SB=9;DP4=1609,3634,9,9
  
  In total, the merged BED calls have about 15% more variants than the single
  instance, and the differences don't seem to be anywhere near the region
  borders. Both commands used lofreq call with default parameters, except the
  BED file. Is there some computation that depends on having all the reads? Or
  maybe something to do with mates? I guess an alternative is to divide up the
  reference and use call-parallel, but that would require post-processing.
  
  Differing results with BED regions
  
  Sent from sourceforge.net because you indicated interest in
  https://sourceforge.net/p/lofreq/discussion/general/
  
  To unsubscribe from further messages, please visit
  https://sourceforge.net/auth/subscriptions/
  
  --
  Andreas Wilm
  andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Brian Ondov - 2014-07-24
    
    Thanks, that makes sense. Is single-chromosome parallelization in 2.0.0-rc-1? For me it is still exiting with a warning that there is no point because there is only one sequence header.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Andreas Wilm - 2014-07-25
      
      You're right: parallel calls on single chromosomes were not
      implemented in RC-1. This was implemented for RC-2 end of May, but I
      never pushed this to sourceforge. Please check again later today for a
      new release. Sorry about the confusion.
      
      Andreas
      
      On 25 July 2014 00:21, Brian Ondov ondovb@users.sf.net wrote:
      
      Thanks, that makes sense. Is single-chromosome parallelization in
      2.0.0-rc-1? For me it is still exiting with a warning that there is no point
      because there is only one sequence header.
      
      Differing results with BED regions
      
      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/lofreq/discussion/general/
      
      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/
      
      --
      Andreas Wilm
      andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Brian Ondov - 2014-07-25
        
        Thanks for posting the release. Version 2.0.0 (Linux) does get past the previous error, but the individual jobs now fail with "Need exactly one BAM file as last argument" in their tmp logs.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Andreas Wilm - 2014-07-26
        
        Hi Brian,
        this is very likely caused by an error in the argument list, i.e. wrong
        usage. Could you check again or post it here?
        Thanks, Andreas
        On 26 Jul 2014 01:14, "Brian Ondov" ondovb@users.sf.net wrote:
        
        Thanks for posting the release. Version 2.0.0 (Linux) does get past the
        previous error, but the individual jobs now fail with "Need exactly one BAM
        file as last argument" in their tmp logs.
        
        Differing results with BED regions
        https://sourceforge.net/p/lofreq/discussion/general/thread/34b06833/?limit=25#44ff/11b3/3de0/9636/7572
        
        Sent from sourceforge.net because you indicated interest in
        https://sourceforge.net/p/lofreq/discussion/general/
        
        To unsubscribe from further messages, please visit
        https://sourceforge.net/auth/subscriptions/
        
        alternate
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Brian Ondov - 2014-07-28

It actually seems to have to do with the --cons-as-ref option. Without that, it works. I was running as:

lofreq call-parallel --cons-as-ref -f ../ref.fna -o lofreq.par.vcf --pp-threads 16 merged.bam

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Andreas Wilm - 2014-07-29
  
  Hi Brian,
  
  thanks for pointing this out! Parsing of the "--cons-as-ref" option
  was indeed broken (LoFreq complains about a missing argument while it
  really doesn't need one). I fixed this, but wouldn't recommend it's
  use anyway, since it's a rather experimental flag that wasn't
  benchmarked in a while...
  
  Best,
  Andreas
  
  On 28 July 2014 22:54, Brian Ondov ondovb@users.sf.net wrote:
  
  It actually seems to have to do with the --cons-as-ref option. Without that,
  it works. I was running as:
  
  lofreq call-parallel --cons-as-ref -f ../ref.fna -o lofreq.par.vcf
  --pp-threads 16 merged.bam
  
  Differing results with BED regions
  
  Sent from sourceforge.net because you indicated interest in
  https://sourceforge.net/p/lofreq/discussion/general/
  
  To unsubscribe from further messages, please visit
  https://sourceforge.net/auth/subscriptions/
  
  --
  Andreas Wilm
  andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Differing results with BED regions

Fast and sensitive variant-calling from sequencing data

Forums

Help

Differing results with BED regions

Differing results with BED regions

Fast and sensitive variant-calling from sequencing data

Forums

Help

Differing results with BED regions document.SUBSCRIPTION_OPTIONS = { "thing": "topic", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

Differing results with BED regions