|
From: John M. <jm...@sa...> - 2016-07-26 16:04:15
|
On 22 Jul 2016, at 21:49, Annie Cowell <ann...@gm...> wrote: > samtools mpileup -C50 -Bug -t AD -Q10 -f my.fasta sample.bam | bcftools call -mv -Ov > my.vcf > > However, I am getting unusual bases for a couple of indels in the reference field, such as W (as below) and R. > > AAKM01000065 8883 . GGW G 228 . INDEL;IDV=40;IMF=0.888889;DP=45;VDB=0.880501;SGB=-0.692976;MQSB=0.114162;MQ0F=0;AC=2;AN=2;DP4=0,0,8,18;MQ=49 GT:PL:AD 1/1:255,78,0:0,26 As Thomas guessed, these are indeed ambiguity codes coming from the reference genome. From http://www.ncbi.nlm.nih.gov/nuccore/AAKM01000065 we can see the GGW bases at location 8883 onwards: 8881 aaggwtataa agaacgcata tatgcctttt gtcccctgcc cgcgtctgga tactcgctac The issue of IUPAC ambiguity codes in VCF REF fields came up a while ago [1], and the consensus expressed in the VCF v4.3 spec is that they shouldn't appear -- instead being simplified to a non-ambiguous base, in this case GGA. So this is an mpileup and/or bcftools bug, and apparently it should be suppressing ambiguity codes by default. John [1] https://github.com/samtools/hts-specs/issues/54 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |