|
From: Adam A. <ada...@gm...> - 2012-09-24 13:58:07
|
Yes - you will need GT fields to calculate heterozygosity. Kind regards, Adam On 24 September 2012 04:37, Andre Moura <av...@gm...> wrote: > Hi Adam, > > Thanks a lot for your swift reply. Yes, your explanation helps an awful > lot, as it helps me understand the workings of vcftools as well. So, if I > understand correctly, vcftool will retrieve information from the FORMAT > field and not from the INFO field. I was assuming that it would be able to > read from the INFO field, hence my confusion. Given this, to calculate > Heterozigosity and nucelotide diversity, then I would need to have > information such as GT in the FORMAT field as well, is this correct? > Rebuilding the vcf file to include the necessary information is no problem, > so as long as I know which fields need to be there I can repeat the > analysis as appropriate. > > Best regards, > > Andre > > On 24/09/2012 00:00, Adam Auton wrote: > > Hello Andre, > > Thank you for the clear description of your problem. The issue is that your VCF file does not contain depth information for each sample (specifically, there is no DP entry in the FORMAT field). Vcftools calculates the depth using the per-sample data, and hence can't find the required information in your file. > > However, I do note that your VCF file contains depth information in the INFO field. Maybe this is what you need? If so, then you can use something like the following command to extract that field: > > vcftools --vcf in.vcf --get-INFO DP > > I hope that helps. > > Adam > > -- > Adam Auton > Assistant Professor, > Department of Genetics, > Albert Einstein College of Medicine, > 1301 Morris Park Avenue, > Price Center 320, > Bronx, New York 10461 > > Tel: +1 (718) 678 1038 > > On Sep 23, 2012, at 6:30 PM, Andre Moura <av...@gm...> <av...@gm...> wrote: > > > Hi everyone, > > I've just started trying to use vcftools on a genome I've been working, > but encounter a problem whose origin is not immediately clear to me. > I've been trying to get some basic statistics such as average coverage > (i.e. depth) and heterozigosity using the vcftools binary, but even > though the program seems to run fine I only get the word "nan" where I > expected to have results. Here's an example regarding average coverage. > This was obtained with the command "vcftools --vcf Kraken_RT.vcf --out > Kraken_RT --depth", and the following is the content of the .idepth file: > > INDV N_SITES MEAN_DEPTH > Kraken_RT_bamhi.sorted.bam 0 nan > Kraken_RT_eag1.sorted.bam 0 nan > Kraken_RT_ecori.sorted.bam 0 nan > > > The resulting .log file also appears to give me no useful information: > > VCFtools - v0.1.9.0 > (C) Adam Auton 2009 > > Parameters as interpreted: > --vcf Kraken_RT.vcf > --depth > --out Kraken_RT > > Reading Index file. > File contains 104138656 entries and 3 individuals. > Applying Required Filters. > After filtering, kept 3 out of 3 Individuals > After filtering, kept 104138656 out of a possible 104138656 Sites > Outputting Mean Depth by Individual > Run Time = 773.00 seconds > > The first few lines of my vcf file are at the end of this message. It's > a vcf v4.1 file, which is what I thought was the cause of it initially, > but then thought that there are no differences in the way the different > versions of vcf store information regarding depth (I might have gotten > this part wrong though). Any help would be very much appreciated, as > there doesn't seem to be many alternatives to vcf tools to get all the > information I need from my genome. > > Thanks! > > ##fileformat=VCFv4.1 > ##samtoolsVersion=0.1.18 (r982:295) > ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth"> > ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality > ref-forward bases, ref-reverse, alt-forward and alt-reverse bases"> > ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square > mapping quality of covering reads"> > ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all > samples being the same"> > ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate > of the first ALT allele frequency (assuming HWE)"> > ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate > of the first ALT allele count (no HWE assumption)"> > ##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype > frequencies"> > ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test > P-value based on G3"> > ##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype > likelihoods with and without the constraint"> > ##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable > unconstrained genotype configuration in the trio"> > ##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable > constrained genotype configuration in the trio"> > ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand > bias, baseQ bias, mapQ bias and tail distance bias"> > ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the > variant is an INDEL."> > ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of > the nonRef allele frequency in group1 samples being larger (,smaller) > than in group2."> > ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted > chi^2 P-value for testing the association between group1 and group2 > samples."> > ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2."> > ##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding > a smaller PCHI2."> > ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias"> > ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> > ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> > ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for > RR,RA,AA genotypes (R=ref,A=alt)"> > ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases"> > ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand > bias P-value"> > ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled > genotype likelihoods"> > #CHROM POS ID REF ALT QUAL > FILTER INFO FORMAT > Kraken_RT_bamhi.sorted.bam Kraken_RT_eag1.sorted.bam > Kraken_RT_ecori.sorted.bam > 1 3568 . T . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3569 . A . 27.8 . > DP=2;AF1=0;AC1=0;DP4=0,1,0,0;MQ=0;FQ=-27 PL > 0 0 0 > 1 3570 . T . 27.8 . > DP=2;AF1=0;AC1=0;DP4=0,1,0,0;MQ=0;FQ=-27 PL > 0 0 0 > 1 3571 . A . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3572 . T . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3573 . A . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3574 . T . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3575 . G . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3576 . T . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3577 . G . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3578 . T . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3579 . T . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3580 . A . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3581 . G . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3582 . C . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3583 . A . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3584 . T . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3585 . A . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3586 . C . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3587 . G . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3588 . G . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3589 . T . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3590 . A . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > 1 3591 . T . 28.9 . > DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL > 0 0 0 > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today:http://ad.doubleclick.net/clk;258768047;13503038;j?http://info.appdynamics.com/FreeJavaPerformanceDownload.html > _______________________________________________ > Vcftools-help mailing lis...@li...://lists.sourceforge.net/lists/listinfo/vcftools-help > > -- Adam Auton Assistant Professor, Department of Genetics, Albert Einstein College of Medicine, 1301 Morris Park Avenue, Price Center 320, Bronx, New York 10461 Tel: +1 (718) 678 1038 |