|
From: Andre M. <av...@gm...> - 2012-09-26 13:45:09
|
Dear Adam,
Following up on this, I can now calculate depth just fine, but still
having problems with heterozigosity. Here's the content of the .het file:
INDV O(HOM) E(HOM) N_SITES F
FullMerge.bam 0 0.0 1942564 0.00000
The log file doesn't appear to show anything wrong (at the end of the
email), but I noticed that my vcf file only has the GT information in
the FORMAT field for the variable positions relative to the reference.
I'm wondering if that might be a problem and is thus causing my results.
I'm also wondering about the meaning of the "Warning - file contains
entries with the same position. These entries will be processed
separately." message in the log file and what the implication of it are
for subsequent analysis.
Thanks,
Andre
VCFtools - v0.1.9.0
(C) Adam Auton 2009
Parameters as interpreted:
--vcf FullMerge-Test2.vcf
--het
--out FullMerge-Test2
--site-pi
Reading Index file.
Building new index file.
Scanning Chromosome: 1
Warning - file contains entries with the same position. These
entries will be processed separately.
Scanning Chromosome: 2
Scanning Chromosome: 3
Scanning Chromosome: 4
etc..
Writing Index file.
File contains 2142048890 entries and 1 individuals.
Applying Required Filters.
After filtering, kept 1 out of 1 Individuals
After filtering, kept 2142048890 out of a possible 2142048890 Sites
Outputting Individual Heterozygosity
Individual Heterozygosity: Only using biallelic SNPs.
Outputting Per-Site Nucleotide Diversity Statistics...
sitePi: Only using biallelic sites.
Run Time = 44816.00 seconds
~
On 24/09/2012 14:57, Adam Auton wrote:
> Yes - you will need GT fields to calculate heterozygosity.
>
> Kind regards,
>
> Adam
>
>
> On 24 September 2012 04:37, Andre Moura <av...@gm...
> <mailto:av...@gm...>> wrote:
>
> Hi Adam,
>
> Thanks a lot for your swift reply. Yes, your explanation helps an
> awful lot, as it helps me understand the workings of vcftools as
> well. So, if I understand correctly, vcftool will retrieve
> information from the FORMAT field and not from the INFO field. I
> was assuming that it would be able to read from the INFO field,
> hence my confusion. Given this, to calculate Heterozigosity and
> nucelotide diversity, then I would need to have information such
> as GT in the FORMAT field as well, is this correct? Rebuilding the
> vcf file to include the necessary information is no problem, so as
> long as I know which fields need to be there I can repeat the
> analysis as appropriate.
>
> Best regards,
>
> Andre
>
> On 24/09/2012 00:00, Adam Auton wrote:
>
> Hello Andre,
>
> Thank you for the clear description of your problem. The issue is that your VCF file does not contain depth information for each sample (specifically, there is no DP entry in the FORMAT field). Vcftools calculates the depth using the per-sample data, and hence can't find the required information in your file.
>
> However, I do note that your VCF file contains depth information in the INFO field. Maybe this is what you need? If so, then you can use something like the following command to extract that field:
>
> vcftools --vcf in.vcf --get-INFO DP
>
> I hope that helps.
>
> Adam
>
> --
> Adam Auton
> Assistant Professor,
> Department of Genetics,
> Albert Einstein College of Medicine,
> 1301 Morris Park Avenue,
> Price Center 320,
> Bronx, New York 10461
>
> Tel:+1 (718) 678 1038 <tel:%2B1%20%28718%29%20678%201038>
>
> On Sep 23, 2012, at 6:30 PM, Andre Moura<av...@gm...> <mailto:av...@gm...> wrote:
>
>
>> Hi everyone,
>>
>> I've just started trying to use vcftools on a genome I've been working,
>> but encounter a problem whose origin is not immediately clear to me.
>> I've been trying to get some basic statistics such as average coverage
>> (i.e. depth) and heterozigosity using the vcftools binary, but even
>> though the program seems to run fine I only get the word "nan" where I
>> expected to have results. Here's an example regarding average coverage.
>> This was obtained with the command "vcftools --vcf Kraken_RT.vcf --out
>> Kraken_RT --depth", and the following is the content of the .idepth file:
>>
>> INDV N_SITES MEAN_DEPTH
>> Kraken_RT_bamhi.sorted.bam 0 nan
>> Kraken_RT_eag1.sorted.bam 0 nan
>> Kraken_RT_ecori.sorted.bam 0 nan
>>
>>
>> The resulting .log file also appears to give me no useful information:
>>
>> VCFtools - v0.1.9.0
>> (C) Adam Auton 2009
>>
>> Parameters as interpreted:
>> --vcf Kraken_RT.vcf
>> --depth
>> --out Kraken_RT
>>
>> Reading Index file.
>> File contains 104138656 entries and 3 individuals.
>> Applying Required Filters.
>> After filtering, kept 3 out of 3 Individuals
>> After filtering, kept 104138656 out of a possible 104138656 Sites
>> Outputting Mean Depth by Individual
>> Run Time = 773.00 seconds
>>
>> The first few lines of my vcf file are at the end of this message. It's
>> a vcf v4.1 file, which is what I thought was the cause of it initially,
>> but then thought that there are no differences in the way the different
>> versions of vcf store information regarding depth (I might have gotten
>> this part wrong though). Any help would be very much appreciated, as
>> there doesn't seem to be many alternatives to vcf tools to get all the
>> information I need from my genome.
>>
>> Thanks!
>>
>> ##fileformat=VCFv4.1
>> ##samtoolsVersion=0.1.18 (r982:295)
>> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
>> ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality
>> ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
>> ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square
>> mapping quality of covering reads">
>> ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all
>> samples being the same">
>> ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate
>> of the first ALT allele frequency (assuming HWE)">
>> ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate
>> of the first ALT allele count (no HWE assumption)">
>> ##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype
>> frequencies">
>> ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^^2 based HWE test
>> P-value based on G3">
>> ##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype
>> likelihoods with and without the constraint">
>> ##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable
>> unconstrained genotype configuration in the trio">
>> ##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable
>> constrained genotype configuration in the trio">
>> ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand
>> bias, baseQ bias, mapQ bias and tail distance bias">
>> ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the
>> variant is an INDEL.">
>> ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of
>> the nonRef allele frequency in group1 samples being larger (,smaller)
>> than in group2.">
>> ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted
>> chi^^2 P-value for testing the association between group1 and group2
>> samples.">
>> ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
>> ##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding
>> a smaller PCHI2.">
>> ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
>> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
>> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
>> ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for
>> RR,RA,AA genotypes (R=ref,A=alt)">
>> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
>> ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand
>> bias P-value">
>> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled
>> genotype likelihoods">
>> #CHROM POS ID REF ALT QUAL
>> FILTER INFO FORMAT
>> Kraken_RT_bamhi.sorted.bam Kraken_RT_eag1.sorted.bam
>> Kraken_RT_ecori.sorted.bam
>> 1 3568 . T . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3569 . A . 27.8 .
>> DP=2;AF1=0;AC1=0;DP4=0,1,0,0;MQ=0;FQ=-27 PL
>> 0 0 0
>> 1 3570 . T . 27.8 .
>> DP=2;AF1=0;AC1=0;DP4=0,1,0,0;MQ=0;FQ=-27 PL
>> 0 0 0
>> 1 3571 . A . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3572 . T . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3573 . A . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3574 . T . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3575 . G . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3576 . T . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3577 . G . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3578 . T . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3579 . T . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3580 . A . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3581 . G . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3582 . C . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3583 . A . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3584 . T . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3585 . A . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3586 . C . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3587 . G . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3588 . G . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3589 . T . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3590 . A . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>> 1 3591 . T . 28.9 .
>> DP=2;AF1=0;AC1=0;DP4=0,2,0,0;MQ=0;FQ=-27.7 PL
>> 0 0 0
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Everyone hates slow websites. So do we.
>> Make your web apps faster with AppDynamics
>> Download AppDynamics Lite for free today:
>> http://ad.doubleclick.net/clk;258768047;13503038;j?
>> http://info.appdynamics.com/FreeJavaPerformanceDownload.html
>> _______________________________________________
>> Vcftools-help mailing list
>> Vcf...@li... <mailto:Vcf...@li...>
>> https://lists.sourceforge.net/lists/listinfo/vcftools-help
>
>
>
>
> --
> Adam Auton
> Assistant Professor,
> Department of Genetics,
> Albert Einstein College of Medicine,
> 1301 Morris Park Avenue,
> Price Center 320,
> Bronx, New York 10461
>
> Tel: +1 (718) 678 1038
>
|