I have been calling rare variants with lofreq and it seems to be working great. However, I am having trouble figuring out how to set a quality threshold for calling the variants. The data I have is very high quality, with average base quality scores above Q60. (Data is Illumina 1.8+) This is a higher quality than the raw Illumina output; the high quality scores are generated by the program SeqPrep.
I want to use lofreq to call high quality, rare SNPs. I've tried using these options with the lofreq_snpcaller.py:
-noncons-filter-qual 60 --noncons-default-qual 60
However I am still getting lots of variants called with normal quality data (Q30). Is this something that can be done with Lofreq or should I quality filter before running Lofreq? Am I using the right encoding for the quality scores?
Any help is greatly appreciated!!
Thanks.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
getting average base quality scores above Q60 in FastQs file doesn't
sound right. The base qualities were probably Illumina 1.3-1.7 encoded
(ASCII+64) originally and mistakenly interpreted as Ilumina 1.8 (i.e.
Sanger encoded; ASCII+33). In other words the values are 31 too high.
My guess is that would have to convince SeqPrep that your files had a
different encoding (option -6 I think but refer to SeqPrep's manual).
You can confirm the encoding by running FastQC on your original FastQ
files.
As a direct consequence LoFreq will probably overpredict SNVs. Once
the base qualities are corrected, there should be no need to change
any base filtering settings: they work just fine for Illumina and
there's also no need to filter SNVs later-on since they are already
filtered based on multiple testing corrected p-values.
I have been calling rare variants with lofreq and it seems to be working
great. However, I am having trouble figuring out how to set a quality
threshold for calling the variants. The data I have is very high quality,
with average base quality scores above Q60. (Data is Illumina 1.8+) This is
a higher quality than the raw Illumina output; the high quality scores are
generated by the program SeqPrep.
I want to use lofreq to call high quality, rare SNPs. I've tried using these
options with the lofreq_snpcaller.py:
-noncons-filter-qual 60 --noncons-default-qual 60
However I am still getting lots of variants called with normal quality data
(Q30). Is this something that can be done with Lofreq or should I quality
filter before running Lofreq? Am I using the right encoding for the quality
scores?
Actually, let me take back what I just wrote. The quality encoding is
probably correct and SeqPrep merges base-qualities in overlapping
reads, thus producing high qualities. You kind of said that already,
but I didn't read your email thoroughly enough. Sorry.
If you really want to remove non-reference bases below those high
averages values then you can use the special settings you mention. If
you in addition also want to ignore reference bases below that value,
then you would also have to use --ignore-bases-below-q 60.
Technically, neither of these settings will be necessary though,
because LoFreq interprets qualities as what they are, i.e. error
probabilities and those are build into its prediction model. LoFreq
should work fine without these special settings.
When you say you get "lots of variants called with normal quality data
(Q30)" you mean the SNV quality or base quality Q30? Getting a SNVs
with quality 30 is not necessarily unusual even with high quality
bases.
getting average base quality scores above Q60 in FastQs file doesn't
sound right. The base qualities were probably Illumina 1.3-1.7 encoded
(ASCII+64) originally and mistakenly interpreted as Ilumina 1.8 (i.e.
Sanger encoded; ASCII+33). In other words the values are 31 too high.
My guess is that would have to convince SeqPrep that your files had a
different encoding (option -6 I think but refer to SeqPrep's manual).
You can confirm the encoding by running FastQC on your original FastQ
files.
As a direct consequence LoFreq will probably overpredict SNVs. Once
the base qualities are corrected, there should be no need to change
any base filtering settings: they work just fine for Illumina and
there's also no need to filter SNVs later-on since they are already
filtered based on multiple testing corrected p-values.
I have been calling rare variants with lofreq and it seems to be working
great. However, I am having trouble figuring out how to set a quality
threshold for calling the variants. The data I have is very high quality,
with average base quality scores above Q60. (Data is Illumina 1.8+) This is
a higher quality than the raw Illumina output; the high quality scores are
generated by the program SeqPrep.
I want to use lofreq to call high quality, rare SNPs. I've tried using these
options with the lofreq_snpcaller.py:
-noncons-filter-qual 60 --noncons-default-qual 60
However I am still getting lots of variants called with normal quality data
(Q30). Is this something that can be done with Lofreq or should I quality
filter before running Lofreq? Am I using the right encoding for the quality
scores?
Thanks for responding so quickly!
I didn't explain well enough what I am doing. I am developing a low-error sequencing method where I use paired-end sequencing to generate completely overlapping short sequence reads in order to reduce error rates. This way a sequencing error is obvious because it appears in only one overlapping read, and therefore error rates are exponentially reduced. Seqprep is a program that uses the overlap information to determine a new quality score based on the overlap information. It was originally used to generate longer reads and I am applying it to do low-error sequencing.
The quality scores are definitely phred33 to begin with. After overlapping the reads the fastq files have quality scores that are mostly ]]]].
I want to filter quality scores at various thresholds just to compare the method to other methods, to show that this method has more SNPs with very high quality.
The problem I am having is that when I use the options above (-noncons-filter-qual 60 --noncons-default-qual 60) on normal (non-overlapped) data with normal quality scores, I still get quite a lot of SNPs called. This shouldn't be possible as the reads max out at Q40. Does it make sense why this would happen? Am I using the options incorrectly?
Thanks so much for your help.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Oh sorry I just saw your second reply. Thanks for your help.
I wasn't clear about the SNV quality vs base quality. I want to filter by base quality. Is SNV quality what the options refer to, and is that why I am getting such high-quality SNVs?
Thanks again.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
One more question, when you say that the option -ignore-bases-below-q 60 removes reference bases below Q60, do you mean only if the base is in the reference sequence? I want to remove all SNVs below Q60.
Thanks!
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
SNV quality and base quality are two different things. Base quality
tells you high likely it is, that a certain base is an error. SNV
quality is a measure how likely it is that a certain SNV (made up by a
bunch of variant bases with varying base qualities and in the context
of reference bases in the same column) is actually not a SNV.
If you want to remove any base (reference or variant) below Q60, then
the following arguments to lofreq_snpcaller.py should work (see also
my previous email):
--noncons-filter-qual 60 --noncons-default-qual 60 --ignore-bases-below-q 60
LoFreq will then ignore those bases during SNV prediction.
If you however want to remove any predicted SNVs with a quality below
60, then you can use the following as argument to lofreq_filter.py:
--snp-phred 60
Technically you don't have to remove any bases, because the defaults
should work just fine and SNV quality filtering is done automatically
within LoFreq.
I hope this answers your questions better,
Andreas
One more question, when you say that the option -ignore-bases-below-q 60
removes reference bases below Q60, do you mean only if the base is in the
reference sequence? I want to remove all SNVs below Q60.
Thanks!
Hi,
I have been calling rare variants with lofreq and it seems to be working great. However, I am having trouble figuring out how to set a quality threshold for calling the variants. The data I have is very high quality, with average base quality scores above Q60. (Data is Illumina 1.8+) This is a higher quality than the raw Illumina output; the high quality scores are generated by the program SeqPrep.
I want to use lofreq to call high quality, rare SNPs. I've tried using these options with the lofreq_snpcaller.py:
-noncons-filter-qual 60 --noncons-default-qual 60
However I am still getting lots of variants called with normal quality data (Q30). Is this something that can be done with Lofreq or should I quality filter before running Lofreq? Am I using the right encoding for the quality scores?
Any help is greatly appreciated!!
Thanks.
Hi Jessica,
getting average base quality scores above Q60 in FastQs file doesn't
sound right. The base qualities were probably Illumina 1.3-1.7 encoded
(ASCII+64) originally and mistakenly interpreted as Ilumina 1.8 (i.e.
Sanger encoded; ASCII+33). In other words the values are 31 too high.
My guess is that would have to convince SeqPrep that your files had a
different encoding (option -6 I think but refer to SeqPrep's manual).
You can confirm the encoding by running FastQC on your original FastQ
files.
As a direct consequence LoFreq will probably overpredict SNVs. Once
the base qualities are corrected, there should be no need to change
any base filtering settings: they work just fine for Illumina and
there's also no need to filter SNVs later-on since they are already
filtered based on multiple testing corrected p-values.
Hope that helps,
Andreas
On 1 April 2014 07:50, jessica preston jpreston555@users.sf.net wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
Actually, let me take back what I just wrote. The quality encoding is
probably correct and SeqPrep merges base-qualities in overlapping
reads, thus producing high qualities. You kind of said that already,
but I didn't read your email thoroughly enough. Sorry.
If you really want to remove non-reference bases below those high
averages values then you can use the special settings you mention. If
you in addition also want to ignore reference bases below that value,
then you would also have to use --ignore-bases-below-q 60.
Technically, neither of these settings will be necessary though,
because LoFreq interprets qualities as what they are, i.e. error
probabilities and those are build into its prediction model. LoFreq
should work fine without these special settings.
When you say you get "lots of variants called with normal quality data
(Q30)" you mean the SNV quality or base quality Q30? Getting a SNVs
with quality 30 is not necessarily unusual even with high quality
bases.
Andreas
On 1 April 2014 09:43, Andreas Wilm andreas.wilm@gmail.com wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
Thanks for responding so quickly!
I didn't explain well enough what I am doing. I am developing a low-error sequencing method where I use paired-end sequencing to generate completely overlapping short sequence reads in order to reduce error rates. This way a sequencing error is obvious because it appears in only one overlapping read, and therefore error rates are exponentially reduced. Seqprep is a program that uses the overlap information to determine a new quality score based on the overlap information. It was originally used to generate longer reads and I am applying it to do low-error sequencing.
The quality scores are definitely phred33 to begin with. After overlapping the reads the fastq files have quality scores that are mostly ]]]].
I want to filter quality scores at various thresholds just to compare the method to other methods, to show that this method has more SNPs with very high quality.
The problem I am having is that when I use the options above (-noncons-filter-qual 60 --noncons-default-qual 60) on normal (non-overlapped) data with normal quality scores, I still get quite a lot of SNPs called. This shouldn't be possible as the reads max out at Q40. Does it make sense why this would happen? Am I using the options incorrectly?
Thanks so much for your help.
Oh sorry I just saw your second reply. Thanks for your help.
I wasn't clear about the SNV quality vs base quality. I want to filter by base quality. Is SNV quality what the options refer to, and is that why I am getting such high-quality SNVs?
Thanks again.
One more question, when you say that the option -ignore-bases-below-q 60 removes reference bases below Q60, do you mean only if the base is in the reference sequence? I want to remove all SNVs below Q60.
Thanks!
Hi Jessica,
SNV quality and base quality are two different things. Base quality
tells you high likely it is, that a certain base is an error. SNV
quality is a measure how likely it is that a certain SNV (made up by a
bunch of variant bases with varying base qualities and in the context
of reference bases in the same column) is actually not a SNV.
If you want to remove any base (reference or variant) below Q60, then
the following arguments to lofreq_snpcaller.py should work (see also
my previous email):
--noncons-filter-qual 60 --noncons-default-qual 60 --ignore-bases-below-q 60
LoFreq will then ignore those bases during SNV prediction.
If you however want to remove any predicted SNVs with a quality below
60, then you can use the following as argument to lofreq_filter.py:
--snp-phred 60
Technically you don't have to remove any bases, because the defaults
should work just fine and SNV quality filtering is done automatically
within LoFreq.
I hope this answers your questions better,
Andreas
On 2 April 2014 03:12, jessica preston jpreston555@users.sf.net wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC