Hello,
Do you mind explaining how QUAL is calculated by LoFreq? I am using the new version LoFreq-star.
Another question: does LoFreq-star have an option to filter SNVs by base quality, i.e. like --ignore-bases in the previous version?
Thanks much!!
Hi Jessica,
sorry for the late reply. I was on leave over Easter without internet access.
LoFreq's SNV qualities are Phred-scaled p-values, which describe how
likely a reported SNV is a false positive, i.e. not actually a SNV.
Basically, LoFreq models SNVs as a coin-tossing experiment, where the
error probability changes at each coin toss (i.e. for each base in a
pileup column). As sources of error, it takes base qualities, mapping
qualities etc. into account. LoFreq will only report SNVs with a
p-value smaller than 5% (i.e. a quality of 20) after multiple testing
correction. Please also refer to the LoFreq manuscript
(http://www.ncbi.nlm.nih.gov/pubmed/23066108) for more details.
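To make the coin-tossing picture concrete, here is a minimal Python sketch (not LoFreq's actual code): each base in a pileup column is a coin with its own error probability, and the p-value is the chance of seeing at least the observed number of non-reference bases from errors alone. The function name and the example numbers are made up for illustration, and multiple testing correction is left out.

```python
def poisson_binomial_tail(error_probs, k):
    """Return P(X >= k), where X is the number of errors among
    independent bases that each have their own error probability."""
    # dist[j] = probability of exactly j errors among the bases seen so far
    dist = [1.0]
    for p in error_probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1.0 - p)   # this base is correct
            nxt[j + 1] += mass * p       # this base is an error
        dist = nxt
    return sum(dist[k:])


# Hypothetical column: 200 bases, each with error probability 0.001,
# of which 5 differ from the reference.
pvalue = poisson_binomial_tail([0.001] * 200, 5)
print(pvalue)
```

A small p-value here means that sequencing errors alone are unlikely to explain the observed non-reference bases.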
Two 'unusual' values are possible:
Dot: LoFreq has the notion of consensus variants, which are
positions where more than 50% of bases differ from the reference. In
such cases LoFreq cannot calculate a probability using its model,
which is why the corresponding quality is set to 'not available',
which is represented by the dot character in VCF format.
2147483647: This corresponds to a p-value close to zero, i.e. a
highly significant SNV. The reason is this: to prevent taking the log
of zero, older versions of LoFreq (<version 2.0 RC1) set the Phred
score to the maximum integer (2147483647) if the corresponding p-value
was almost zero (<DBL_EPSILON).
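The guard described above can be sketched in Python as follows. This only illustrates the idea and is not taken from the LoFreq source: the function name is hypothetical, sys.float_info.epsilon stands in for C's DBL_EPSILON, and rounding to an integer is an assumption.

```python
import math
import sys

INT_MAX = 2**31 - 1  # 2147483647, the largest 32-bit signed integer


def pvalue_to_phred(pvalue):
    """Phred-scale a p-value, capping at INT_MAX when it is almost zero."""
    if pvalue < sys.float_info.epsilon:
        # log10(0) is undefined, so report the maximum integer instead
        return INT_MAX
    return round(-10.0 * math.log10(pvalue))


print(pvalue_to_phred(1e-4))  # an ordinary p-value
print(pvalue_to_phred(0.0))   # hits the cap
```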
Regarding your filtering question: yes, LoFreq version >2 can also
ignore bases below a certain base quality threshold. Have a look at
the 'Base-call quality' section in the usage generated by simply
calling 'lofreq call'. The easiest would be to use:
-q | --min-bq INT Skip any base with baseQ smaller than INT [6]
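As a rough sketch of how this could look when scripted from Python: the snippet below only builds and prints a command line and does not run anything. The file names and the threshold of 20 are placeholders, and -f / -o are assumed here to be the reference and output options, so please check the usage printed by 'lofreq call' for your version.

```python
import shlex

min_bq = 20  # placeholder threshold; bases with baseQ below this are skipped

cmd = [
    "lofreq", "call",
    "-f", "ref.fa",        # assumed: reference FASTA (placeholder name)
    "-o", "vars.vcf",      # assumed: output VCF (placeholder name)
    "--min-bq", str(min_bq),
    "aln.bam",             # placeholder input BAM
]

# Print the command so it can be inspected before running it by hand
print(shlex.join(cmd))
```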
Andreas
On 18 April 2014 04:33, jessica preston jpreston555@users.sf.net wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
Small correction: I incorrectly said "LoFreq will only report SNVs
with a p-value smaller than 5% (i.e. a quality of 20)". However, the
Phred-value corresponding to 5% is 13, not 20.
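The arithmetic behind this correction can be checked with a couple of lines of Python (the helper name is just for illustration): a quality of 20 corresponds to a p-value of 1%, while 5% comes out at roughly 13.

```python
import math


def phred_to_pvalue(quality):
    """Convert a Phred-scaled quality back to a probability."""
    return 10 ** (-quality / 10)


print(phred_to_pvalue(20))       # p-value behind a quality of 20
print(-10 * math.log10(0.05))    # Phred value for a p-value of 5%
```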
Andreas
On 21 April 2014 11:13, Andreas Wilm onde@users.sf.net wrote:
Thanks for this. However I was wondering if there was a more thorough explanation of each of the values that are used in calculation of the 'QUAL' score values that are output in the VCF? I did not see it covered in the publication (maybe I missed it?) and wasn't able to figure out what was going on in the source code.
Last edit: Steve 2018-05-03
Hi Steve,
sure. The basics are explained in the NAR paper (Wilm, 2012): we compute a
Poisson-binomial distribution, taking the error probabilities at each pileup
site into consideration, and derive a p-value from that. Error probabilities
were originally just converted base qualities (because that's what they
are). In later LoFreq versions we merged base alignment, mapping and base
quality into one error probability per base. The logic goes like this:
either the read is misaligned (mapping quality), or if not, the base might
be misaligned, or if neither is the case, the base itself might be
wrong, i.e.
    P_m + (1-P_m)*P_a + (1-P_m)*(1-P_a)*P_b,
where
    P_m is the mapping error probability,
    P_a is the base alignment error probability (BAQ), and
    P_b is the base error probability.
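In Python, the merged per-base error probability from the formula above could be sketched like this. The function names and the example qualities are hypothetical, and all three inputs are assumed to be Phred-scaled.

```python
def phred_to_prob(quality):
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-quality / 10)


def merged_error_prob(map_q, baq, base_q):
    """Combine mapping, base alignment and base quality into one error probability."""
    p_m = phred_to_prob(map_q)   # read is misaligned
    p_a = phred_to_prob(baq)     # base is misaligned
    p_b = phred_to_prob(base_q)  # base call itself is wrong
    return p_m + (1 - p_m) * p_a + (1 - p_m) * (1 - p_a) * p_b


# Hypothetical base: mapping quality 40, BAQ 30, base quality 20
print(merged_error_prob(40, 30, 20))
```

This is the same as one minus the probability that none of the three errors occurred, i.e. 1 - (1-P_m)*(1-P_a)*(1-P_b).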
Hope this makes sense,
Andreas
On 4 May 2018 at 03:32, Steve stevekm@users.sourceforge.net wrote: