LoFreq / Discussion / General Discussion: Quality filtering

jessica preston - 2014-03-31

Hi,

I have been calling rare variants with lofreq and it seems to be working great. However, I am having trouble figuring out how to set a quality threshold for calling the variants. The data I have is very high quality, with average base quality scores above Q60. (Data is Illumina 1.8+) This is a higher quality than the raw Illumina output; the high quality scores are generated by the program SeqPrep.

I want to use lofreq to call high quality, rare SNPs. I've tried using these options with the lofreq_snpcaller.py:

-noncons-filter-qual 60 --noncons-default-qual 60

However I am still getting lots of variants called with normal quality data (Q30). Is this something that can be done with Lofreq or should I quality filter before running Lofreq? Am I using the right encoding for the quality scores?

Any help is greatly appreciated!!

Thanks.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Andreas Wilm - 2014-04-01
  
  Hi Jessica,
  
  getting average base quality scores above Q60 in FastQs file doesn't
  sound right. The base qualities were probably Illumina 1.3-1.7 encoded
  (ASCII+64) originally and mistakenly interpreted as Ilumina 1.8 (i.e.
  Sanger encoded; ASCII+33). In other words the values are 31 too high.
  My guess is that would have to convince SeqPrep that your files had a
  different encoding (option -6 I think but refer to SeqPrep's manual).
  You can confirm the encoding by running FastQC on your original FastQ
  files.
  
  As a direct consequence LoFreq will probably overpredict SNVs. Once
  the base qualities are corrected, there should be no need to change
  any base filtering settings: they work just fine for Illumina and
  there's also no need to filter SNVs later-on since they are already
  filtered based on multiple testing corrected p-values.
  
  Hope that helps,
  Andreas
  
  On 1 April 2014 07:50, jessica preston jpreston555@users.sf.net wrote:
  
  Hi,
  
  I have been calling rare variants with lofreq and it seems to be working
  great. However, I am having trouble figuring out how to set a quality
  threshold for calling the variants. The data I have is very high quality,
  with average base quality scores above Q60. (Data is Illumina 1.8+) This is
  a higher quality than the raw Illumina output; the high quality scores are
  generated by the program SeqPrep.
  
  I want to use lofreq to call high quality, rare SNPs. I've tried using these
  options with the lofreq_snpcaller.py:
  
  -noncons-filter-qual 60 --noncons-default-qual 60
  
  However I am still getting lots of variants called with normal quality data
  (Q30). Is this something that can be done with Lofreq or should I quality
  filter before running Lofreq? Am I using the right encoding for the quality
  scores?
  
  Any help is greatly appreciated!!
  
  Thanks.
  
  Quality filtering
  
  Sent from sourceforge.net because you indicated interest in
  https://sourceforge.net/p/lofreq/discussion/general/
  
  To unsubscribe from further messages, please visit
  https://sourceforge.net/auth/subscriptions/
  
  --
  Andreas Wilm
  andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Andreas Wilm - 2014-04-01
    
    Actually, let me take back what I just wrote. The quality encoding is
    probably correct and SeqPrep merges base-qualities in overlapping
    reads, thus producing high qualities. You kind of said that already,
    but I didn't read your email thoroughly enough. Sorry.
    
    If you really want to remove non-reference bases below those high
    averages values then you can use the special settings you mention. If
    you in addition also want to ignore reference bases below that value,
    then you would also have to use --ignore-bases-below-q 60.
    
    Technically, neither of these settings will be necessary though,
    because LoFreq interprets qualities as what they are, i.e. error
    probabilities and those are build into its prediction model. LoFreq
    should work fine without these special settings.
    
    When you say you get "lots of variants called with normal quality data
    (Q30)" you mean the SNV quality or base quality Q30? Getting a SNVs
    with quality 30 is not necessarily unusual even with high quality
    bases.
    
    Andreas
    
    On 1 April 2014 09:43, Andreas Wilm andreas.wilm@gmail.com wrote:
    
    Hi Jessica,
    
    getting average base quality scores above Q60 in FastQs file doesn't
    sound right. The base qualities were probably Illumina 1.3-1.7 encoded
    (ASCII+64) originally and mistakenly interpreted as Ilumina 1.8 (i.e.
    Sanger encoded; ASCII+33). In other words the values are 31 too high.
    My guess is that would have to convince SeqPrep that your files had a
    different encoding (option -6 I think but refer to SeqPrep's manual).
    You can confirm the encoding by running FastQC on your original FastQ
    files.
    
    As a direct consequence LoFreq will probably overpredict SNVs. Once
    the base qualities are corrected, there should be no need to change
    any base filtering settings: they work just fine for Illumina and
    there's also no need to filter SNVs later-on since they are already
    filtered based on multiple testing corrected p-values.
    
    Hope that helps,
    Andreas
    
    On 1 April 2014 07:50, jessica preston jpreston555@users.sf.net wrote:
    
    Hi,
    
    I have been calling rare variants with lofreq and it seems to be working
    great. However, I am having trouble figuring out how to set a quality
    threshold for calling the variants. The data I have is very high quality,
    with average base quality scores above Q60. (Data is Illumina 1.8+) This is
    a higher quality than the raw Illumina output; the high quality scores are
    generated by the program SeqPrep.
    
    I want to use lofreq to call high quality, rare SNPs. I've tried using these
    options with the lofreq_snpcaller.py:
    
    -noncons-filter-qual 60 --noncons-default-qual 60
    
    However I am still getting lots of variants called with normal quality data
    (Q30). Is this something that can be done with Lofreq or should I quality
    filter before running Lofreq? Am I using the right encoding for the quality
    scores?
    
    Any help is greatly appreciated!!
    
    Thanks.
    
    Quality filtering
    
    Sent from sourceforge.net because you indicated interest in
    https://sourceforge.net/p/lofreq/discussion/general/
    
    To unsubscribe from further messages, please visit
    https://sourceforge.net/auth/subscriptions/
    
    --
    Andreas Wilm
    andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
    
    --
    Andreas Wilm
    andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - jessica preston - 2014-04-01
    
    Thanks for responding so quickly!
    I didn't explain well enough what I am doing. I am developing a low-error sequencing method where I use paired-end sequencing to generate completely overlapping short sequence reads in order to reduce error rates. This way a sequencing error is obvious because it appears in only one overlapping read, and therefore error rates are exponentially reduced. Seqprep is a program that uses the overlap information to determine a new quality score based on the overlap information. It was originally used to generate longer reads and I am applying it to do low-error sequencing.
    The quality scores are definitely phred33 to begin with. After overlapping the reads the fastq files have quality scores that are mostly ]]]].
    I want to filter quality scores at various thresholds just to compare the method to other methods, to show that this method has more SNPs with very high quality.
    
    The problem I am having is that when I use the options above (-noncons-filter-qual 60 --noncons-default-qual 60) on normal (non-overlapped) data with normal quality scores, I still get quite a lot of SNPs called. This shouldn't be possible as the reads max out at Q40. Does it make sense why this would happen? Am I using the options incorrectly?
    
    Thanks so much for your help.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

jessica preston - 2014-04-01

Oh sorry I just saw your second reply. Thanks for your help.
I wasn't clear about the SNV quality vs base quality. I want to filter by base quality. Is SNV quality what the options refer to, and is that why I am getting such high-quality SNVs?
Thanks again.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

jessica preston - 2014-04-01

One more question, when you say that the option -ignore-bases-below-q 60 removes reference bases below Q60, do you mean only if the base is in the reference sequence? I want to remove all SNVs below Q60.
Thanks!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Andreas Wilm - 2014-04-02
  
  Hi Jessica,
  
  SNV quality and base quality are two different things. Base quality
  tells you high likely it is, that a certain base is an error. SNV
  quality is a measure how likely it is that a certain SNV (made up by a
  bunch of variant bases with varying base qualities and in the context
  of reference bases in the same column) is actually not a SNV.
  
  If you want to remove any base (reference or variant) below Q60, then
  the following arguments to lofreq_snpcaller.py should work (see also
  my previous email):
  --noncons-filter-qual 60 --noncons-default-qual 60 --ignore-bases-below-q 60
  LoFreq will then ignore those bases during SNV prediction.
  
  If you however want to remove any predicted SNVs with a quality below
  60, then you can use the following as argument to lofreq_filter.py:
  --snp-phred 60
  
  Technically you don't have to remove any bases, because the defaults
  should work just fine and SNV quality filtering is done automatically
  within LoFreq.
  
  I hope this answers your questions better,
  Andreas
  
  On 2 April 2014 03:12, jessica preston jpreston555@users.sf.net wrote:
  
  One more question, when you say that the option -ignore-bases-below-q 60
  removes reference bases below Q60, do you mean only if the base is in the
  reference sequence? I want to remove all SNVs below Q60.
  Thanks!
  
  Quality filtering
  
  Sent from sourceforge.net because you indicated interest in
  https://sourceforge.net/p/lofreq/discussion/general/
  
  To unsubscribe from further messages, please visit
  https://sourceforge.net/auth/subscriptions/
  
  --
  Andreas Wilm
  andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Quality filtering

Fast and sensitive variant-calling from sequencing data

Forums

Help

Quality filtering

Quality filtering

Fast and sensitive variant-calling from sequencing data

Forums

Help

Quality filtering document.SUBSCRIPTION_OPTIONS = { "thing": "topic", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

Quality filtering