From: jbr950 <jb...@gm...> - 2014-10-01 16:50:13
|
Hi Petr, Thanks for your time thus far. I'm trying to query some bam files at specified locations to have a look at the allele frequencies. The motivation is to 'validate' variants called in one dataset, in a novel dataset using a more simple approach. Ideally, I'd get the raw counts of the number of reads mapping to A,T,C,G, and/or indel (if detected). I tried mpileup two ways (with -cg in bcftools view to call variants, and without), and neither seems to provide me the data I would like in the corresponding output: samtools mpileup -BQ0 -d10000000 -l $variantBed -uf ${ref} $bamFile | $bcftools view - > no_cg.vcf samtools mpileup -BQ0 -d10000000 -l $variantBed -uf ${ref} $bamFile | $bcftools view -cg - > cg.vcf Then I check the context around 2 known variants: Variant 1 grep -C 1 105213319 cg.vcf 14 105213318 . A . 117 . DP=29;AF1=0;AC1=0;DP4=14,15,0,0;MQ=20;FQ=-114 PL 0 14 105213319 . G C 25 . DP=28;VDB=0.0398;AF1=0.5;AC1=1;DP4=10,9,4,5;MQ=20;FQ=28;PV4=1,1,1,1 GT:PL:GQ 0/1:55,0,128:58 14 105213320 . T . 114 . DP=28;AF1=0;AC1=0;DP4=14,14,0,0;MQ=20;FQ=-111 PL 0 Here, the variant was called and the DP4 field tells me the number of reference and non-reference (presumably "C") alleles. I am assuming that there are 0 T's and A's. grep -C 1 105213319 no_cg.vcf 14 105213318 . A X 0 . DP=29;I16=14,15,0,0,979,34421,0,0,580,11600,0,0,511,11223,0,0 PL 0,87,198 14 105213319 . G C,T,X 0 . DP=28;I16=10,9,4,5,645,23047,338,12742,380,7600,180,3600,339,7223,177,4179;VDB=0.0398 PL 55,0,128,96,140,195,112,152,198,201 14 105213320 . T X 0 . DP=28;I16=14,14,0,0,924,32326,0,0,560,11200,0,0,520,11560,0,0 PL 0,84,195 Here, I see that that the non-ref alleles consist of C's ant T's, although I don't see any way to determine the relative proportion or absolute number of each. Trying Variant 2: grep -C 1 154574018 cg.vcf 1 154574017 . T . 283 . DP=274;VDB=0.0365;AF1=0;AC1=0;DP4=146,128,0,0;MQ=20;FQ=-282 PL 0 1 154574018 . C . 113 . DP=273;VDB=0.0365;AF1=0;AC1=0;DP4=110,100,31,32;MQ=20;FQ=-110;PV4=0.67,1,1,1 PL 0 1 154574019 . T . 283 . DP=278;VDB=0.0365;AF1=0;AC1=0;DP4=142,136,0,0;MQ=20;FQ=-282 PL 0 This variant was not called at all. I may be willing to "assume" that the 63 'non-ref' reads reported in the DP4 field correspond to my variant, but it would be nice to be sure. grep -C 1 154574018 no_cg.vcf 1 154574017 . T X 0 . DP=274;I16=146,128,0,0,10042,376816,0,0,5463,109209,0,0,5159,117099,0,0;VDB=0.0365 PL 0,255,255 1 154574018 . C G,X 0 . DP=273;I16=110,100,31,32,7610,283160,2311,86963,4183,83609,1260,25200,3918,88780,1242,28290;VDB=0.0365 PL 0,83,117,255,255,220 1 154574019 . T X 0 . DP=278;I16=142,136,0,0,10104,376490,0,0,5543,110809,0,0,5167,116999,0,0;VDB=0.0365 PL 0,255,255 Here, the C to G is reported, but I again I only have total read depth via the DP field, no breakdown by nucleotide (or am I missing it?). It's true I'm using Samtools 0.1.18. If upgrading to the latest would give me a better option, I'm happy to do that, but after scouring the available documentation, I don't see a way to do this. In other words, I'd like to inquire at many positions the proportion of A's, T's, C's, G's and/or detected indels. Mpileup (at least how I'm currently using it) gets me close but not all the way there. Did it make sense? I'd appreciate any suggestions. Best regards, Jonathan On Tue, Sep 23, 2014 at 11:47 AM, Petr Danecek <pd...@sa...> wrote: > Hi Jonathan, > > On Tue, 2014-09-23 at 11:05 -0400, jbr950 wrote: > > Hi Petr, > > The workflow documentation looks great - this effort is undoubtedly > > going to help a lot of researchers. Thanks for doing it. > > the credits for the workflow should not go to me, the htslib.org page > has been put together by Martin Pollard, Thomas Keane and others. > :-) > > > In this instance, I actually don't need statistics and the raw pileups > > are fine. Which brings me back to the original issue. With the new > > version, the vcf outputs from mpileup, using the same .bed as input, > > are still of varying lengths. Do you know of a way to force mpileup > > to report data at every position in the .bed file for every run? Or > > alternatively, do you know why some positions report a DP=0 and other > > positions are omitted altogether? > > These DP=0 sites are emitted at places where there is an aligned read > with an indel. The indel may not appear in the output when the mpileup > machinery decides it is not significant enough. > > I hope this helps, > Petr > > > > Thanks again, > > Jonathan > > > > > > > > On Tue, Sep 23, 2014 at 9:26 AM, Petr Danecek <pd...@sa...> > > wrote: > > There is a slowly growing documentation on the new website, > > please check > > the variant calling workflow > > http://www.htslib.org/workflow/ > > > > You need to use `bcftools call`, not `bcftools view`. > > > > petr > > > > > > On Tue, 2014-09-23 at 09:25 -0400, jbr950 wrote: > > > Good catch - I was trying to use an old version of bcftools > > with the > > > new version of samtools.. > > > > > > > > > So I changed that to point to the new version of bcftools, > > and now: > > > > > > > > > samtools-bcftools-htslib-1.0_x64-linux/bin/samtools mpileup > > -BQ0 > > > -d10000000 -l $variantBed -uf ${ref} $bamFile > test.bcf > > > > > > > > > > > > So far so good.. > > > > > > > > > And then: > > > cat test.bcf | > > samtools-bcftools-htslib-1.0_x64-linux/bin/bcftools > > > view -cg - > > > Error: Could not parse --min-ac g > > > > > > > > > > > > On the other hand, calling: . > > > cat test.bcf | > > samtools-bcftools-htslib-1.0_x64-linux/bin/bcftools > > > view- > > > > > > > > > > > > does create a nice output, so I'm inclined to let it go with > > the > > > defaults. I don't see equivalent arguments for the -c and -g > > flags and > > > I'm not particularly attached to the use of Bayesian > > inference, > > > although it would be good to understand what differences I > > can expect > > > between the old method (using -c) and the new (sans -c)? > > > > > > > > > Thanks for your time and assistance! > > > > > > > > > Jonathan > > > > > > > > > > > > > > > > > > On Tue, Sep 23, 2014 at 2:40 AM, Petr Danecek > > <pd...@sa...> > > > wrote: > > > This does not look like output from samtools 1.0 - > > it outputs > > > VCFv4.2 > > > > > > petr > > > > > > On Mon, 2014-09-22 at 14:53 -0400, jbr950 wrote: > > > > Hi Petr, > > > > Thanks for your reply. I grabbed the > > > > samtools-bcftools-htslib-1.0_x64-linux binary and > > tried > > > again. > > > > > > > > > > > > When I'd used Samtools 0.1.18, the issue I emailed > > about was > > > that the > > > > number of lines in output varied by .bam file, and > > I didn't > > > understand > > > > why lines were being ommitted, and why not in a > > common > > > manner. > > > > > > > > > > > > Using version 1.0 and the same command (I checked > > on the > > > older > > > > samtools and it is producing output), I get no > > output at > > > all. Mpileup > > > > writes a header and then stops. I have copied and > > pasted > > > the output. > > > > > > > > > > > > > > > > My .bed is 3 columns: chr start stop > > > > with no header. > > > > > > > > > > > > Any advice appreciated, thanks! > > > > > > > > > > > > Jonathan > > > > > > > > > > > > > > > > > > > > > > > > > > > > [mpileup] 1 samples in 1 input files > > > > (mpileup) Max depth is above 1M. Potential memory > > hog! > > > > [bcf_sync] incorrect number of fields (0 != 5) at > > 0:0 > > > > [afs] 0:0.000 > > > > ##fileformat=VCFv4.1 > > > > > > ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read > > > depth"> > > > > > > ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# > > > high-quality > > > > ref-forward bases, ref-reverse, alt-forward and > > alt-reverse > > > bases"> > > > > > > > > > > ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square > > > > mapping quality of covering reads"> > > > > > > ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred > > > probability of > > > > all samples being the same"> > > > > > > > > > ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood > > > > estimate of the first ALT allele frequency > > (assuming HWE)"> > > > > > > > > > ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood > > > > estimate of the first ALT allele count (no HWE > > assumption)"> > > > > ##INFO=<ID=G3,Number=3,Type=Float,Description="ML > > estimate > > > of genotype > > > > frequencies"> > > > > > > ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based > > > HWE test > > > > P-value based on G3"> > > > > > > ##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio > > > of > > > > genotype likelihoods with and without the > > constraint"> > > > > > > ##INFO=<ID=UGT,Number=1,Type=String,Description="The most > > > probable > > > > unconstrained genotype configuration in the trio"> > > > > > > ##INFO=<ID=CGT,Number=1,Type=String,Description="The most > > > probable > > > > constrained genotype configuration in the trio"> > > > > > > ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for > > > strand > > > > bias, baseQ bias, mapQ bias and tail distance > > bias"> > > > > > > ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates > > > that the > > > > variant is an INDEL."> > > > > > > ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred > > > probability of > > > > the nonRef allele frequency in group1 samples > > being larger > > > (,smaller) > > > > than in group2."> > > > > > > ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior > > > weighted > > > > chi^2 P-value for testing the association between > > group1 and > > > group2 > > > > samples."> > > > > > > ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred > > > scaled > > > > PCHI2."> > > > > ##INFO=<ID=PR,Number=1,Type=Integer,Description="# > > > permutations > > > > yielding a smaller PCHI2."> > > > > > > ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant > > > Distance > > > > Bias"> > > > > > > ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> > > > > > > ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype > > > Quality"> > > > > > > ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods > > > for > > > > RR,RA,AA genotypes (R=ref,A=alt)"> > > > > > > ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# > > > high-quality > > > > bases"> > > > > > > > > > ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled > strand > > > > bias P-value"> > > > > > > ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of > > > > Phred-scaled genotype likelihoods"> > > > > #CHROM POS ID REF ALT QUAL > > FILTER INFO > > > FORMAT > > > > > > > > > > > > > > > > On Thu, Sep 4, 2014 at 5:20 AM, Petr Danecek > > > <pd...@sa...> wrote: > > > > Hi Jonathan, > > > > > > > > these are good questions. Could you please > > try with > > > the latest > > > > release, > > > > I am happy to answer any remaining issues > > not solved > > > by the > > > > upgrade. > > > > > > > > Cheers, > > > > Petr > > > > > > > > On Sun, 2014-08-17 at 01:34 -0400, Jo R > > wrote: > > > > > Hello, > > > > > I'm attemping to get nucleotide > > frequency > > > information at > > > > a number > > > > > of positions across a number of samples, > > and am > > > having > > > > difficulty > > > > > interpreting some output. Any insights > > would be > > > > appreciated. > > > > > > > > > > > > > > > I'm running the following command: > > > > > > > > > > > > > > > samtools mpileup -BQ0 -d10000000 -l > > > VariantBed.bed -uf > > > > $refFile $bam > > > > > | bcftools view -bcg - | bcftools view - > > > > > > > > ${sampleName}_validation.vcf > > > > > > > > > > > > > > > > > > > > I notice that this command creates an > > output file > > > with > > > > > an unpredictable number of rows. > > Running the > > > command using > > > > the same > > > > > bed file on a set of different .bam > > files creates > > > a set of > > > > output vcf > > > > > files with a wide distribution in > > numbers of rows. > > > > > > > > > > > > > > > I presumed that the difference in row > > numbers > > > means that > > > > some > > > > > positions drop out on some .bam files > > because > > > those samples > > > > lacked > > > > > coverage where other samples had > > coverage. > > > > > > > > > > > > > > > If that's the case, though, I don't know > > what to > > > make of > > > > lines like > > > > > the following one: > > > > > > > > > > > > > > > 1 2160881 . G . > > 28.2 . > > > > > DP=0;VDB=0.0003;;AC1=2;FQ=-30 PL > > 0 > > > > > > > > > > > > > > > > > > > > here, it looks like DP=0, but this > > position still > > > got > > > > reported in the > > > > > vcf output. I also don't see AC1 in the > > legend > > > for INFO > > > > tags in the > > > > > samtools specification page, so I don't > > know what > > > to make of > > > > a value > > > > > of 2. > > > > > > > > > > > > > > > So, I am confused. Positions with a > > positive > > > value of DP > > > > and DP4 make > > > > > sense to me. But why are some positions > > > completely ommitted > > > > from the > > > > > vcf output, and other positions > > reporting a DP=0? > > > > > > > > > > > > > > > Thanks for any advice. > > > > > > > > > > > > > > > Best regards, > > > > > Jonathan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------------ > > > > > > > _______________________________________________ > > > > > Samtools-help mailing list > > > > > Sam...@li... > > > > > > > > > > https://lists.sourceforge.net/lists/listinfo/samtools-help > > > > > > > > > > > > > > > > > > > > -- > > > > The Wellcome Trust Sanger Institute is > > operated by > > > Genome > > > > Research > > > > Limited, a charity registered in England > > with > > > number 1021457 > > > > and a > > > > company registered in England with number > > 2742969, > > > whose > > > > registered > > > > office is 215 Euston Road, London, NW1 > > 2BE. > > > > > > > > > > > > > > > > > > > > > > > -- > > > The Wellcome Trust Sanger Institute is operated by > > Genome > > > Research > > > Limited, a charity registered in England with > > number 1021457 > > > and a > > > company registered in England with number 2742969, > > whose > > > registered > > > office is 215 Euston Road, London, NW1 2BE. > > > > > > > > > > > > > > > > > > > -- > > The Wellcome Trust Sanger Institute is operated by Genome > > Research > > Limited, a charity registered in England with number 1021457 > > and a > > company registered in England with number 2742969, whose > > registered > > office is 215 Euston Road, London, NW1 2BE. > > > > > > > > > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > |