Hi,
I think there is a bug in version 20150910 when producing the VCF file. The allele frequencies as calculated from the .vcf file often differ from the ones indicated in the popFreq (or in the standard sfs code output files).
Suggestion to debug
Replacing the 3 instances of
storage->chrs[j] > gpars.P*i+p
by
storage->chrs[j] > (pop+1)+gpars.P*i+p
at lines from 2071 to 2094 in btfuncs.c seems to improve the issue. While it prevents from very large differences in allele frequencies between the popFreq and the .vcf files, there are still slight mismatch in allele frequency of the order of 1/SS, where SS is the sample size.
Code to witness the bug
Below is the code I used to investigate this issue. Run the code with version 20150910 first to witness the issue. Then, repeat the whole code with the debug version of sfs_code (new btfunc.c and executable attached) to test that everything went fine.
### Executable. Please edit the path.
sfs_code="/path/to/SFS_code"
### Number of populations
nbpops=4
### Run Simulation.
${sfs_code} 4 1 -TS 0 0 1 -TS 0 0 2 -TS 0 0 3 -o out.vcf --VCF --popFreq popFreq.txt -L 1 999 -N 1000 -n 500 --seed 684
### Loop over all population
for d in $(seq 0 $((${nbpops} - 1)));do
### Create a file with the names of individuals that should be kept to calculate allele frequency from the .vcf
eval echo N${d}.{0..$(( 500 - 1 ))} | tr " " "\n" > ${d}_pop.txt
### Calculate allele frequency from .vcf (needs `vcftools` installed)
vcftools --vcf out.vcf --freq --keep ${d}_pop.txt --out ${d}_vcffreq
### Remove unnecessary information on the output of vcftools to keep only the frequency of the allele of interest
tail -n +2 ${d}_vcffreq.frq > tmp.txt && mv tmp.txt ${d}_vcffreq.frq
cat ${d}_vcffreq.frq | awk -F "\t" '{print $6}' | sed 's/.://g' > tmp.txt && mv tmp.txt ${d}_vcffreq.frq
done
### Run the R script that will compare the frequencies from the 'popFreq.txt' file and the allele frequencies calculated from the '.frq' files (outputs of vcftools)
Rscript compareFreqs.r ${nbpops} _vcffreq.frq popFreq.txt
Attachment
I attached:
- compareFreqs.r (used to witness the bug)
- a version of the btfuncs.c (called btfuncs_DEBUG_REMI.c) in which I replaced instances of storage->chrs[j] > gpars.P*i+p by storage->chrs[j] > (pop+1)+gpars.P*i+p as explained above.
- an executable (compiled on Mac OS X) called sfs.code_DEBUG_REMI.c
Thank you,
Remi
Anonymous