Menu

Quality filtering

2014-03-31
2014-04-02
  • jessica preston

    jessica preston - 2014-03-31

    Hi,

    I have been calling rare variants with lofreq and it seems to be working great. However, I am having trouble figuring out how to set a quality threshold for calling the variants. The data I have is very high quality, with average base quality scores above Q60. (Data is Illumina 1.8+) This is a higher quality than the raw Illumina output; the high quality scores are generated by the program SeqPrep.

    I want to use lofreq to call high quality, rare SNPs. I've tried using these options with the lofreq_snpcaller.py:

    -noncons-filter-qual 60 --noncons-default-qual 60

    However I am still getting lots of variants called with normal quality data (Q30). Is this something that can be done with Lofreq or should I quality filter before running Lofreq? Am I using the right encoding for the quality scores?

    Any help is greatly appreciated!!

    Thanks.

     
    • Andreas Wilm

      Andreas Wilm - 2014-04-01

      Hi Jessica,

      getting average base quality scores above Q60 in FastQs file doesn't
      sound right. The base qualities were probably Illumina 1.3-1.7 encoded
      (ASCII+64) originally and mistakenly interpreted as Ilumina 1.8 (i.e.
      Sanger encoded; ASCII+33). In other words the values are 31 too high.
      My guess is that would have to convince SeqPrep that your files had a
      different encoding (option -6 I think but refer to SeqPrep's manual).
      You can confirm the encoding by running FastQC on your original FastQ
      files.

      As a direct consequence LoFreq will probably overpredict SNVs. Once
      the base qualities are corrected, there should be no need to change
      any base filtering settings: they work just fine for Illumina and
      there's also no need to filter SNVs later-on since they are already
      filtered based on multiple testing corrected p-values.

      Hope that helps,
      Andreas

      On 1 April 2014 07:50, jessica preston jpreston555@users.sf.net wrote:

      Hi,

      I have been calling rare variants with lofreq and it seems to be working
      great. However, I am having trouble figuring out how to set a quality
      threshold for calling the variants. The data I have is very high quality,
      with average base quality scores above Q60. (Data is Illumina 1.8+) This is
      a higher quality than the raw Illumina output; the high quality scores are
      generated by the program SeqPrep.

      I want to use lofreq to call high quality, rare SNPs. I've tried using these
      options with the lofreq_snpcaller.py:

      -noncons-filter-qual 60 --noncons-default-qual 60

      However I am still getting lots of variants called with normal quality data
      (Q30). Is this something that can be done with Lofreq or should I quality
      filter before running Lofreq? Am I using the right encoding for the quality
      scores?

      Any help is greatly appreciated!!

      Thanks.


      Quality filtering


      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/lofreq/discussion/general/

      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/

      --
      Andreas Wilm
      andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC

       
      • Andreas Wilm

        Andreas Wilm - 2014-04-01

        Actually, let me take back what I just wrote. The quality encoding is
        probably correct and SeqPrep merges base-qualities in overlapping
        reads, thus producing high qualities. You kind of said that already,
        but I didn't read your email thoroughly enough. Sorry.

        If you really want to remove non-reference bases below those high
        averages values then you can use the special settings you mention. If
        you in addition also want to ignore reference bases below that value,
        then you would also have to use --ignore-bases-below-q 60.

        Technically, neither of these settings will be necessary though,
        because LoFreq interprets qualities as what they are, i.e. error
        probabilities and those are build into its prediction model. LoFreq
        should work fine without these special settings.

        When you say you get "lots of variants called with normal quality data
        (Q30)" you mean the SNV quality or base quality Q30? Getting a SNVs
        with quality 30 is not necessarily unusual even with high quality
        bases.

        Andreas

        On 1 April 2014 09:43, Andreas Wilm andreas.wilm@gmail.com wrote:

        Hi Jessica,

        getting average base quality scores above Q60 in FastQs file doesn't
        sound right. The base qualities were probably Illumina 1.3-1.7 encoded
        (ASCII+64) originally and mistakenly interpreted as Ilumina 1.8 (i.e.
        Sanger encoded; ASCII+33). In other words the values are 31 too high.
        My guess is that would have to convince SeqPrep that your files had a
        different encoding (option -6 I think but refer to SeqPrep's manual).
        You can confirm the encoding by running FastQC on your original FastQ
        files.

        As a direct consequence LoFreq will probably overpredict SNVs. Once
        the base qualities are corrected, there should be no need to change
        any base filtering settings: they work just fine for Illumina and
        there's also no need to filter SNVs later-on since they are already
        filtered based on multiple testing corrected p-values.

        Hope that helps,
        Andreas

        On 1 April 2014 07:50, jessica preston jpreston555@users.sf.net wrote:

        Hi,

        I have been calling rare variants with lofreq and it seems to be working
        great. However, I am having trouble figuring out how to set a quality
        threshold for calling the variants. The data I have is very high quality,
        with average base quality scores above Q60. (Data is Illumina 1.8+) This is
        a higher quality than the raw Illumina output; the high quality scores are
        generated by the program SeqPrep.

        I want to use lofreq to call high quality, rare SNPs. I've tried using these
        options with the lofreq_snpcaller.py:

        -noncons-filter-qual 60 --noncons-default-qual 60

        However I am still getting lots of variants called with normal quality data
        (Q30). Is this something that can be done with Lofreq or should I quality
        filter before running Lofreq? Am I using the right encoding for the quality
        scores?

        Any help is greatly appreciated!!

        Thanks.


        Quality filtering


        Sent from sourceforge.net because you indicated interest in
        https://sourceforge.net/p/lofreq/discussion/general/

        To unsubscribe from further messages, please visit
        https://sourceforge.net/auth/subscriptions/

        --
        Andreas Wilm
        andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC

        --
        Andreas Wilm
        andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC

         
      • jessica preston

        jessica preston - 2014-04-01

        Thanks for responding so quickly!
        I didn't explain well enough what I am doing. I am developing a low-error sequencing method where I use paired-end sequencing to generate completely overlapping short sequence reads in order to reduce error rates. This way a sequencing error is obvious because it appears in only one overlapping read, and therefore error rates are exponentially reduced. Seqprep is a program that uses the overlap information to determine a new quality score based on the overlap information. It was originally used to generate longer reads and I am applying it to do low-error sequencing.
        The quality scores are definitely phred33 to begin with. After overlapping the reads the fastq files have quality scores that are mostly ]]]].
        I want to filter quality scores at various thresholds just to compare the method to other methods, to show that this method has more SNPs with very high quality.

        The problem I am having is that when I use the options above (-noncons-filter-qual 60 --noncons-default-qual 60) on normal (non-overlapped) data with normal quality scores, I still get quite a lot of SNPs called. This shouldn't be possible as the reads max out at Q40. Does it make sense why this would happen? Am I using the options incorrectly?

        Thanks so much for your help.

         
  • jessica preston

    jessica preston - 2014-04-01

    Oh sorry I just saw your second reply. Thanks for your help.
    I wasn't clear about the SNV quality vs base quality. I want to filter by base quality. Is SNV quality what the options refer to, and is that why I am getting such high-quality SNVs?
    Thanks again.

     
  • jessica preston

    jessica preston - 2014-04-01

    One more question, when you say that the option -ignore-bases-below-q 60 removes reference bases below Q60, do you mean only if the base is in the reference sequence? I want to remove all SNVs below Q60.
    Thanks!

     
    • Andreas Wilm

      Andreas Wilm - 2014-04-02

      Hi Jessica,

      SNV quality and base quality are two different things. Base quality
      tells you high likely it is, that a certain base is an error. SNV
      quality is a measure how likely it is that a certain SNV (made up by a
      bunch of variant bases with varying base qualities and in the context
      of reference bases in the same column) is actually not a SNV.

      If you want to remove any base (reference or variant) below Q60, then
      the following arguments to lofreq_snpcaller.py should work (see also
      my previous email):
      --noncons-filter-qual 60 --noncons-default-qual 60 --ignore-bases-below-q 60
      LoFreq will then ignore those bases during SNV prediction.

      If you however want to remove any predicted SNVs with a quality below
      60, then you can use the following as argument to lofreq_filter.py:
      --snp-phred 60

      Technically you don't have to remove any bases, because the defaults
      should work just fine and SNV quality filtering is done automatically
      within LoFreq.

      I hope this answers your questions better,
      Andreas

      On 2 April 2014 03:12, jessica preston jpreston555@users.sf.net wrote:

      One more question, when you say that the option -ignore-bases-below-q 60
      removes reference bases below Q60, do you mean only if the base is in the
      reference sequence? I want to remove all SNVs below Q60.
      Thanks!


      Quality filtering


      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/lofreq/discussion/general/

      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/

      --
      Andreas Wilm
      andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC

       

Log in to post a comment.