I want to report what I believe is a bug in the VCF output of VarScan's somatic caller. I am getting VCF files with REF alleles containing slashes. I know this bug has been reported previously, but I'm still seeing it in version 2.3.6. An example is
The REF allele should be T,G. Also, I've looked at the alignments and at this position the normal sample has a read depth of 54 with 24 G and 30 T. The tumour sample has a read depth of 68 with 38 G and 30 T. The reference base is C, therefore, the genotypes of both samples should be 1/2 and not 0/1 as reported in the VCF file.
This causes problems for downstream tools as many refuse to run on an invalid VCF file, but also when annotating variant taking the genotype of 0/1 will give incorrect changes if the genotype is actually 1/2.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
There are other breakages in the VCF output of 2.3.6: one being genotypes like "CT/-T", where aside the "/" instead of a comma, it is not clear what the "-T" would represent.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Thank you for bringing this to my attention, it seems like an edge case that still is not being formatted correctly. If you can provide example mpileups for the CT/- situation, that would help immensely!
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I'll make sure to post the pileups once I review the files. I put this problem aside because I found a much more serious issue in somatic output: multiple REF records.
Basically, I have a record in a VCF (produced by VarScan somatic) like
Notice the GAA/A in the REF field (which causes software like the GATK to complain very loudly).
The mpileups for the tumor and normal are
chr17 7572154 g 30 .+1A.-1A*....-1A.-1A.-1A.-1A.-2AA.-3AAA.-3AAA..-1A.-1A.-3AAA.-2AA.-4AAAA*,-4aaaa,-3aaa,-2aa,-1a,-1a,-1a,,,+1a, 88;8888888998899888</0/00011/1
chr17 7572154 g 6 .-2AA.-3AAA.-2AA,-3aaa,-2aa,-2aa 887/:.
In both cases there is a deletion, but of different length. Why does VarScan report the REF base as GAA/A? I'm asking so I can fix it by "hand" while the problem is investigated.
EDIT: Further investigation with VarScan's own output showed that in this case we have two events:
normal: deletion of two bases (GAA -> G)
tumor: deletion of one base (GA -> G)
For reference, here's VarScan's output (called with --min-coverage 5 and --p-value 0.98):
EDIT2: The correct representation, would be, as far as I understand the VCF spec
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr17 7572154 . GAAA G,GA . PASS DP=36;SS=5;SSC=0;GPV=1E0;SPV=1,6725E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:6:0:4:100%:0,0,2,2 0/1:.:30:7:10:58,82%:4,3,7,3
Of course however the genotypes are wrong in this snippet: the first should be 0/1 and the second 0/2 (since they're both heterozygous for their own indel).
Last edit: Luca Beltrame 2013-10-09
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Following up from the older post: basically all "Unknown" (SS=5) results from somatic calling have some form of broken VCF output, mostly in the REF field, for example GA/+A.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I want to report what I believe is a bug in the VCF output of VarScan's somatic caller. I am getting VCF files with REF alleles containing slashes. I know this bug has been reported previously, but I'm still seeing it in version 2.3.6. An example is
chr22 50468907 . C T/G . PASS DP=109;SS=5;SSC=0;GPV=1E0;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:49:0:21:100%:0,0,19,9 0/1:.:60:0:33:100%:0,0,28,5
The REF allele should be T,G. Also, I've looked at the alignments and at this position the normal sample has a read depth of 54 with 24 G and 30 T. The tumour sample has a read depth of 68 with 38 G and 30 T. The reference base is C, therefore, the genotypes of both samples should be 1/2 and not 0/1 as reported in the VCF file.
This causes problems for downstream tools as many refuse to run on an invalid VCF file, but also when annotating variant taking the genotype of 0/1 will give incorrect changes if the genotype is actually 1/2.
Jonathan,
Thanks for bringing this to my attention; I'll look into it ASAP. Would you mind sending me the pileup for that position on chr22?
Dan,
The pileup for the normal sample is
chr22 50468907 C 49 TTGGTtTGTTTGGttGTTGTGTGGTTGgGtGggTGtGTtgtTTTGGttT 5ABADAFADEA@BBAADDAC5:@CC@G9?HFC@C@DBHCCCC>=B<B
and the tumour is
chr22 50468907 C 60 TGGTTTTGGGGTTGtTTTGTGTGtTTTGTGGtgtGGGGGGgGgtgGGttGGGtGtGTGgG @BCGFFG;@BBF7@B0E<AE8EACDBC C="=DGD?===?=F?GCG">@EA>>=E?E@F>G<
Thanks for looking into this.
There are other breakages in the VCF output of 2.3.6: one being genotypes like "CT/-T", where aside the "/" instead of a comma, it is not clear what the "-T" would represent.
Luca,
Thank you for bringing this to my attention, it seems like an edge case that still is not being formatted correctly. If you can provide example mpileups for the CT/- situation, that would help immensely!
Hello Dan,
I'll make sure to post the pileups once I review the files. I put this problem aside because I found a much more serious issue in somatic output: multiple REF records.
Basically, I have a record in a VCF (produced by VarScan somatic) like
Notice the GAA/A in the REF field (which causes software like the GATK to complain very loudly).
The mpileups for the tumor and normal are
In both cases there is a deletion, but of different length. Why does VarScan report the REF base as GAA/A? I'm asking so I can fix it by "hand" while the problem is investigated.
EDIT: Further investigation with VarScan's own output showed that in this case we have two events:
For reference, here's VarScan's output (called with --min-coverage 5 and --p-value 0.98):
EDIT2: The correct representation, would be, as far as I understand the VCF spec
Of course however the genotypes are wrong in this snippet: the first should be 0/1 and the second 0/2 (since they're both heterozygous for their own indel).
Last edit: Luca Beltrame 2013-10-09
Following up from the older post: basically all "Unknown" (SS=5) results from somatic calling have some form of broken VCF output, mostly in the REF field, for example GA/+A.
hi luca,
i know it has been a while since this last post. but, did you solve the problem?
(apart from excluding all "Unknown"?)
This issue should be addressed in the latest VarScan release, v2.4.2.