Menu

Differing results with BED regions

2014-07-24
2014-07-29
  • Brian Ondov

    Brian Ondov - 2014-07-24

    Since I only have 1 reference sequence (bacterial), I am trying to parallelize lofreq call (2.0.0-rc-1) by providing BED regions (100k each) to different instances. When I check the results against a single instance, though, they are slightly different. Here are some examples of unique variants to each type:

    Single instance:
    gi|16120353|ref|NC_003143.1| . A G 100 PASS DP=4895;AF=0.006129;SB=38;DP4=3252,1612,29,1

    With BED file (region 1-100000):
    gi|16120353|ref|NC_003143.1| . G A 67 PASS DP=5261;AF=0.003421;SB=9;DP4=1609,3634,9,9

    In total, the merged BED calls have about 15% more variants than the single instance, and the differences don't seem to be anywhere near the region borders. Both commands used lofreq call with default parameters, except the BED file. Is there some computation that depends on having all the reads? Or maybe something to do with mates? I guess an alternative is to divide up the reference and use call-parallel, but that would require post-processing.

     
    • Andreas Wilm

      Andreas Wilm - 2014-07-24

      Hi Brian,

      please don't implement parallelization manually. This will mess up the
      automatically applied Bonferroni correction. Use 'call-parallel'
      instead. Both is explained in detail below:

      The effect you see is because the automatic determined Bonferroni
      factor that goes into the SNV quality filtering differs between the
      two types of runs you execute: one run sees the whole chromosome (all
      regions) and therefore uses a higher Bonferroni factor compared to the
      several split runs using small regions and therefore using smaller
      Bonferroni factors independently (which is why you see more calls).

      The easiest way to achieve parallelization is to use 'call-parallel'.
      This now also works on single chromosomes and supports the same
      options as 'call' and requires to set the number of threads with
      --pp-threads.

      Having said all this, I have to admit that call-parallel in 2.0.0-rc-1
      has an embarrassing bug that will make it crash when using chromosomes
      with e.g. pipe characters, as in your case. I'll upload a fixed
      version tomorrow. If you're impatient you could just change the
      following line in src/scripts/lofreq2_call_pparallel.py
      cmd += " -r %s -o %s/%d.vcf > %s/%d.log 2>&1" % (
      to
      cmd += ' -r "%s" -o %s/%d.vcf > %s/%d.log 2>&1' % (

      Hope that helps - and if it doesn't, please don't hesitate and let me know!
      Andreas

      On 24 July 2014 22:06, Brian Ondov ondovb@users.sf.net wrote:

      Since I only have 1 reference sequence (bacterial), I am trying to
      parallelize lofreq call (2.0.0-rc-1) by providing BED regions (100k each) to
      different instances. When I check the results against a single instance,
      though, they are slightly different. Here are some examples of unique
      variants to each type:

      Single instance:
      gi|16120353|ref|NC_003143.1| . A G 100 PASS
      DP=4895;AF=0.006129;SB=38;DP4=3252,1612,29,1

      With BED file (region 1-100000):
      gi|16120353|ref|NC_003143.1| . G A 67 PASS
      DP=5261;AF=0.003421;SB=9;DP4=1609,3634,9,9

      In total, the merged BED calls have about 15% more variants than the single
      instance, and the differences don't seem to be anywhere near the region
      borders. Both commands used lofreq call with default parameters, except the
      BED file. Is there some computation that depends on having all the reads? Or
      maybe something to do with mates? I guess an alternative is to divide up the
      reference and use call-parallel, but that would require post-processing.


      Differing results with BED regions


      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/lofreq/discussion/general/

      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/

      --
      Andreas Wilm
      andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC

       
      • Brian Ondov

        Brian Ondov - 2014-07-24

        Thanks, that makes sense. Is single-chromosome parallelization in 2.0.0-rc-1? For me it is still exiting with a warning that there is no point because there is only one sequence header.

         
        • Andreas Wilm

          Andreas Wilm - 2014-07-25

          You're right: parallel calls on single chromosomes were not
          implemented in RC-1. This was implemented for RC-2 end of May, but I
          never pushed this to sourceforge. Please check again later today for a
          new release. Sorry about the confusion.

          Andreas

          On 25 July 2014 00:21, Brian Ondov ondovb@users.sf.net wrote:

          Thanks, that makes sense. Is single-chromosome parallelization in
          2.0.0-rc-1? For me it is still exiting with a warning that there is no point
          because there is only one sequence header.


          Differing results with BED regions


          Sent from sourceforge.net because you indicated interest in
          https://sourceforge.net/p/lofreq/discussion/general/

          To unsubscribe from further messages, please visit
          https://sourceforge.net/auth/subscriptions/

          --
          Andreas Wilm
          andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC

           
          • Brian Ondov

            Brian Ondov - 2014-07-25

            Thanks for posting the release. Version 2.0.0 (Linux) does get past the previous error, but the individual jobs now fail with "Need exactly one BAM file as last argument" in their tmp logs.

             
  • Brian Ondov

    Brian Ondov - 2014-07-28

    It actually seems to have to do with the --cons-as-ref option. Without that, it works. I was running as:


    lofreq call-parallel --cons-as-ref -f ../ref.fna -o lofreq.par.vcf --pp-threads 16 merged.bam

     
    • Andreas Wilm

      Andreas Wilm - 2014-07-29

      Hi Brian,

      thanks for pointing this out! Parsing of the "--cons-as-ref" option
      was indeed broken (LoFreq complains about a missing argument while it
      really doesn't need one). I fixed this, but wouldn't recommend it's
      use anyway, since it's a rather experimental flag that wasn't
      benchmarked in a while...

      Best,
      Andreas

      On 28 July 2014 22:54, Brian Ondov ondovb@users.sf.net wrote:

      It actually seems to have to do with the --cons-as-ref option. Without that,
      it works. I was running as:

      lofreq call-parallel --cons-as-ref -f ../ref.fna -o lofreq.par.vcf
      --pp-threads 16 merged.bam


      Differing results with BED regions


      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/lofreq/discussion/general/

      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/

      --
      Andreas Wilm
      andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC

       

Log in to post a comment.