Menu

VarScan.v2.3.7 mpileup2snp

Help
Ania
2015-01-06
2015-01-06
  • Ania

    Ania - 2015-01-06

    I'm trying to find variants in a bam file from paired-end HiSeq illumina reads.
    I'm working with a hexaploid organism.
    I then want to visualise the variants in IGV.

    I'm using samtools to create the mpileup file, and then piping into VarScan, with the output modification to get a vcf format, which I index and load in IGV.

    Here is my code:

    ~~~~~~~
    samtools mpileup -f $REF/interval.fasta bowtie2_B135_short_mrna_gsort.bam | /software/jre-6.0.25/x86_64/bin/java -jar $HOME/VarScan.v2.3.7.jar mpileup2snp --output-vcf 1 > short_B135.vcf
    ~~~~~~

    I wanted to rank my SNPs by the quality score. For some reason, this comes out as '.', but in IGV viewer the quality score is taken as the second entry in the 'Sample1' column e.g. 255 for the first entry shown below. What is happening with this? Why don't I get a QUAL output in the 'QUAL' column?

    My other question is regarding the call of SampleHet/SampleHom. How is this being called? Are the more robust SNPs 'hom'?

    A subset of my output, without the top '##' rows is as follows:
    I've also attached the full .vcf file.

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1

    mrna126380 84 . A T . PASS ADP=391;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:398:391:204:187:47.83%:6.4318E-70:36:34:201:3:179:8
    mrna126380 90 . G A . PASS ADP=447;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:448:447:234:213:47.65%:1.7774E-79:35:35:221:13:199:14
    mrna126380 101 . C T . PASS ADP=468;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:469:468:249:219:46.79%:2.4749E-81:36:34:227:22:205:14
    mrna126380 117 . A C . PASS ADP=513;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:515:513:262:250:48.73%:9.4249E-94:30:32:213:49:217:33
    mrna126380 118 . C A . PASS ADP=526;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:530:526:283:243:46.2%:6.7648E-90:31:32:232:51:211:32
    mrna126380 120 . A C . PASS ADP=504;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:518:504:250:254:50.4%:4.7743E-96:30:32:202:48:218:36
    mrna126380 141 . T C . PASS ADP=590;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:593:590:286:304:51.53%:1.5393E-115:28:32:207:79:226:78
    mrna126380 147 . T C . PASS ADP=610;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:615:610:294:314:51.48%:2.134E-119:28:32:205:89:221:93
    mrna126380 193 . C T . PASS ADP=520;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:521:520:242:274:52.69%:4.1028E-105:33:32:88:154:155:119

     

    Last edit: Ania 2015-01-06
  • Dan Koboldt

    Dan Koboldt - 2015-01-06

    Ania,

    VarScan does not populate the QUAL column of the VCF file; instead, it provides the qualities on a per-sample basis (i.e. the GQ value). The reason the GQ value is not in the QUAL column is that mpileup (the SAMtools command) and mpileup2snp (the VarScan command) accommodate multi-sample calling, and it would not be obvious how to represent quality scores across multiple samples in a single value.

    SamplesHet and SamplesHom are very simple values telling you the number of samples called heterozygous (0/1) or homozygous-variant (1/1/), respectively. If you have one sample in your input/output, it's not terribly useful to you. When a variant is called, VarScan determines the zygosity based on the observed variant allele frequency. If it's above a user-configurable value (--min-freq-for-hom), the variant will be called homozygous. Otherwise, it will be called heterozygous.

    In terms of confidence, homozygous variants have more support (specifically higher VAFs) than heterozygous variants, so they're less likely to be false positives. However, the GQ and/or Fisher's Exact Test p-value (from which GQ is derived) are probably better indicators. In your example, however, I should point out that you have extremely low p-values due to the high depth of coverage (400-500x) so the scores all look good. If I were you, I might additionally sort by variant allele frequency to get high-confidence calls.

     
  • Ania

    Ania - 2015-01-06

    Thank you so much for this detailed response.

     

Log in to post a comment.