Menu

Bug in VarScan 2.3.6 somatic VCF output

Help
2013-08-23
2016-05-26
  • Jonathan Ellis

    Jonathan Ellis - 2013-08-23

    I want to report what I believe is a bug in the VCF output of VarScan's somatic caller. I am getting VCF files with REF alleles containing slashes. I know this bug has been reported previously, but I'm still seeing it in version 2.3.6. An example is

    chr22 50468907 . C T/G . PASS DP=109;SS=5;SSC=0;GPV=1E0;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:49:0:21:100%:0,0,19,9 0/1:.:60:0:33:100%:0,0,28,5

    The REF allele should be T,G. Also, I've looked at the alignments and at this position the normal sample has a read depth of 54 with 24 G and 30 T. The tumour sample has a read depth of 68 with 38 G and 30 T. The reference base is C, therefore, the genotypes of both samples should be 1/2 and not 0/1 as reported in the VCF file.

    This causes problems for downstream tools as many refuse to run on an invalid VCF file, but also when annotating variant taking the genotype of 0/1 will give incorrect changes if the genotype is actually 1/2.

     
  • Dan Koboldt

    Dan Koboldt - 2013-08-26

    Jonathan,

    Thanks for bringing this to my attention; I'll look into it ASAP. Would you mind sending me the pileup for that position on chr22?

     
  • Jonathan Ellis

    Jonathan Ellis - 2013-08-27

    Dan,

    The pileup for the normal sample is

    chr22 50468907 C 49 TTGGTtTGTTTGGttGTTGTGTGGTTGgGtGggTGtGTtgtTTTGGttT 5ABADAFADEA@BBAADDAC5:@CC@G9?HFC@C@DBHCCCC>=B<B

    and the tumour is

    chr22 50468907 C 60 TGGTTTTGGGGTTGtTTTGTGTGtTTTGTGGtgtGGGGGGgGgtgGGttGGGtGtGTGgG @BCGFFG;@BBF7@B0E<AE8EACDBC C="=DGD?===?=F?GCG">@EA>>=E?E@F>G<

    Thanks for looking into this.

     
  • Luca Beltrame

    Luca Beltrame - 2013-10-08

    There are other breakages in the VCF output of 2.3.6: one being genotypes like "CT/-T", where aside the "/" instead of a comma, it is not clear what the "-T" would represent.

     
  • Dan Koboldt

    Dan Koboldt - 2013-10-08

    Luca,

    Thank you for bringing this to my attention, it seems like an edge case that still is not being formatted correctly. If you can provide example mpileups for the CT/- situation, that would help immensely!

     
  • Luca Beltrame

    Luca Beltrame - 2013-10-09

    Hello Dan,

    I'll make sure to post the pileups once I review the files. I put this problem aside because I found a much more serious issue in somatic output: multiple REF records.

    Basically, I have a record in a VCF (produced by VarScan somatic) like

    chr17   7572154 .   GAA/A   G   .   PASS    DP=36;SS=5;SSC=0;GPV=1E0;SPV=1.6725E-1  GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:6:0:4:1.0:0,0,2,2 0/1:.:30:7:10:0.5882:4,3,7,3
    

    Notice the GAA/A in the REF field (which causes software like the GATK to complain very loudly).

    The mpileups for the tumor and normal are

    chr17   7572154 g   30  .+1A.-1A*....-1A.-1A.-1A.-1A.-2AA.-3AAA.-3AAA..-1A.-1A.-3AAA.-2AA.-4AAAA*,-4aaaa,-3aaa,-2aa,-1a,-1a,-1a,,,+1a,  88;8888888998899888</0/00011/1
    chr17   7572154 g   6   .-2AA.-3AAA.-2AA,-3aaa,-2aa,-2aa    887/:.
    

    In both cases there is a deletion, but of different length. Why does VarScan report the REF base as GAA/A? I'm asking so I can fix it by "hand" while the problem is investigated.

    EDIT: Further investigation with VarScan's own output showed that in this case we have two events:

    • normal: deletion of two bases (GAA -> G)
    • tumor: deletion of one base (GA -> G)

    For reference, here's VarScan's output (called with --min-coverage 5 and --p-value 0.98):

    chrom   position    ref var normal_reads1   normal_reads2   normal_var_freq normal_gt   tumor_reads1    tumor_reads2    tumor_var_freq  tumor_gt    somatic_status  variant_p_value somatic_p_value tumor_reads1_plus   tumor_reads1_minus  tumor_reads2_plus   tumor_reads2_minus  normal_reads1_plus  normal_reads1_minus normal_reads2_plus  normal_reads2_minus
    chr17   7572154 G   -AA/-A  0   4   100%    */-AA   7   10  58,82%  */-A    Unknown 1.0 0.1672514619883056  4   3   7   3   0   0   2   2
    

    EDIT2: The correct representation, would be, as far as I understand the VCF spec

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
    chr17   7572154 .   GAAA    G,GA    .   PASS    DP=36;SS=5;SSC=0;GPV=1E0;SPV=1,6725E-1  GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:6:0:4:100%:0,0,2,2    0/1:.:30:7:10:58,82%:4,3,7,3
    

    Of course however the genotypes are wrong in this snippet: the first should be 0/1 and the second 0/2 (since they're both heterozygous for their own indel).

     

    Last edit: Luca Beltrame 2013-10-09
  • Luca Beltrame

    Luca Beltrame - 2013-10-09

    Following up from the older post: basically all "Unknown" (SS=5) results from somatic calling have some form of broken VCF output, mostly in the REF field, for example GA/+A.

     
  • CHinze

    CHinze - 2014-10-31

    hi luca,

    i know it has been a while since this last post. but, did you solve the problem?

    (apart from excluding all "Unknown"?)

     
  • Dan Koboldt

    Dan Koboldt - 2016-05-26

    This issue should be addressed in the latest VarScan release, v2.4.2.

     

Log in to post a comment.