This repository contains code for calculating neutrality tests from the site frequency spectrum. A site frequency spectrum is a standard way to describe the pattern of genomic variation (mutations) in a sample of individuals. Its shape is influenced by selection, demography and other factors, and this is exploited to infer aspects of evolutionary history from present-day genetic data. Of particular interest are genomic regions that experienced selection in the recent past of a species. Multiple tests have been developed to distinguish between selection and "neutrality". Some of them are implemented in the main tool "ntx", which can also adapt those tests to specific demographies.
The frequency spectrum is expected as integers on a single line, where the i-th integer is the number of mutations carried by exactly i individuals of the sample. For a sample of size n=6 with 5 mutations this might look like:
2 1 1 0 1
(Two mutations are each present in a single individual, one in two, one in three, none in four and one in five individuals.)
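To illustrate how such a line is interpreted, the following Python sketch parses an unfolded spectrum and computes the number of segregating sites, two common estimators of theta and Tajima's D with the textbook normalisation for constant population size. The function names are invented for this illustration and the code is not taken from ntx, so its output need not match the tool digit for digit.

```python
from math import sqrt

def parse_spectrum(line):
    """Turn a line such as '2 1 1 0 1' into (n, xi), where xi[i-1] is the
    number of mutations carried by exactly i of the n sampled individuals."""
    xi = [int(tok) for tok in line.split()]
    n = len(xi) + 1  # an unfolded spectrum has n-1 frequency classes
    return n, xi

def tajimas_d(n, xi):
    """Textbook Tajima's D for constant population size (illustrative only)."""
    s = sum(xi)  # number of segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = s / a1  # Watterson's estimator
    # mean number of pairwise differences, written as a function of the spectrum
    theta_pi = sum(i * (n - i) * x for i, x in enumerate(xi, start=1)) / (n * (n - 1) / 2.0)
    if s == 0:
        return s, theta_w, theta_pi, float("nan")  # D is undefined without mutations
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (theta_pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))
    return s, theta_w, theta_pi, d

n, xi = parse_spectrum("2 1 1 0 1")
s, theta_w, theta_pi, d = tajimas_d(n, xi)
print(n, s, theta_w, theta_pi, d)
```

For the example line this gives n=6 and 5 segregating sites; the slightly negative D reflects that the pairwise estimate of theta is a little smaller than Watterson's.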
External moments can be specified in a file that contains the first moments on its first line, followed by an empty line and the matrix sigma (second moments, except for the diagonal elements; cf. Fu 1995). The tool "ntmom" analytically calculates these values for the standard case of constant demography. The tools "ms2fs" and "fs2mom" can be used to estimate the moments from "ms" output.
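A reader for this layout could be sketched in Python as follows. It assumes whitespace-separated numbers, one matrix row per line and a full square matrix; these details are assumptions for the illustration, not a specification of what ntx accepts. The function name and the numbers in the example are made-up placeholders.

```python
def read_moments(text):
    """Parse the first moments (first non-empty line) and the matrix sigma
    (all following non-empty lines). Blank lines are simply skipped."""
    rows = [[float(tok) for tok in line.split()]
            for line in text.splitlines() if line.split()]
    if not rows:
        raise ValueError("no numbers found")
    first, sigma = rows[0], rows[1:]
    k = len(first)
    # assumption: sigma is square with one row and column per frequency class
    if len(sigma) != k or any(len(row) != k for row in sigma):
        raise ValueError("sigma must be a %d x %d matrix" % (k, k))
    return first, sigma

# placeholder values for a sample of size 3 (two frequency classes)
example = "1.0 0.5\n\n0.10 -0.02\n-0.02 0.05\n"
first, sigma = read_moments(example)
print(first, sigma)
```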
Unpack the file with "tar xvfz ntx-version.tar.gz"
Type "make" to compile the programs.
The utility "ms2fs" can (after appropriate changes in the Makefile) be compiled so that it uses "libsequence" (and the corresponding tools), which can be obtained from http://molpopgen.org/software/libsequence.html
This library in turn depends on the "boost" libraries and (since version 1.8.0) on "z" (zlib).
ntx
reads one or more frequency spectra from standard input and calculates Tajima's D and related tests. These tests were originally defined for constant population size, but the program accepts arbitrary first and second moments to account for demographic effects.
It outputs the estimated value of theta and several test statistics: Tajima's D, Fu & Li's D2, Fu & Li's F, Fay & Wu's H, Zeng's E and Achaz's Y.
Usage: ntx [-f nsam] [-p precision] [-u] [-x file] [-t theta] [-tw file] [-mom] [-delta]
Options:
-f nsam    input spectrum is folded and derives from sample of size nsam
-p         precision (digits) of output
-x file    where the file contains the expected frequency spectrum under the null hypothesis followed by the matrix sigma
-t theta   use this theta (instead of estimated) for variance computation
-tw file   the file containing weights used for theta estimation
-u         do not normalize the test
-delta     print delta-spectrum
-mom       print out expected moments
Examples:
echo 2 1 1 0 1 | ntx
echo 3 1 1 | ntx -f 6
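The two example inputs describe the same data if a folded spectrum is obtained by adding the classes i and n-i. This is the usual folding convention and is assumed here rather than read from the ntx source. A minimal Python sketch, with an invented function name:

```python
def fold_spectrum(xi):
    """Fold an unfolded spectrum xi (length n-1) into floor(n/2) classes by
    adding class i and class n-i; the middle class of an even n is kept once."""
    n = len(xi) + 1
    eta = []
    for i in range(1, n // 2 + 1):
        j = n - i
        eta.append(xi[i - 1] if i == j else xi[i - 1] + xi[j - 1])
    return eta

print(fold_spectrum([2, 1, 1, 0, 1]))  # prints [3, 1, 1]
```

Because the folded line no longer reveals the sample size, it has to be passed separately, which is what the argument of -f does.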
ntmom
calculates the exact first and second moments under the assumption of a population of constant size. The formulas were derived by Fu (1995).
Usage: ntmom [-f] nsam [-p precision (6)] [-t theta (1.0)]
Examples:
ntmom 6
ntmom 6 -t 5
ntmom 6 -f
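For orientation, the first moments in the constant-size case have a simple closed form: the expected number of mutations carried by i individuals is theta/i. The Python sketch below covers only these first moments, optionally folded under the assumption that classes i and n-i are added; the second-moment formulas of Fu (1995) are not reproduced here. The function name and its parameters are invented for the illustration and are not the interface of ntmom.

```python
def expected_spectrum(nsam, theta=1.0, folded=False):
    """Expected site frequency spectrum under the standard neutral model
    with constant population size: E[xi_i] = theta / i for i = 1..nsam-1."""
    xi = [theta / i for i in range(1, nsam)]
    if not folded:
        return xi
    eta = []
    for i in range(1, nsam // 2 + 1):
        j = nsam - i
        # assumed folding convention: add class i and class nsam-i
        eta.append(xi[i - 1] if i == j else xi[i - 1] + xi[j - 1])
    return eta

print(expected_spectrum(6))
print(expected_spectrum(6, theta=5.0))
print(expected_spectrum(6, folded=True))
```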
fs2mom
estimates the second moments E[xi_i xi_j] from a series of measured (or simulated) frequency spectra.
Usage: fs2mom [-cov] [-f nsam] [-n] [-ntx] [-p precision] [-sigma] [-t theta] [-v] [-x file]
Options:
-cov        output covariance matrix instead of E[xi_i xi_j]
-f nsam     input spectrum is folded and derives from sample of size nsam
-p          precision (digits after decimal point) of output
-t [theta]  specify theta value (e.g. known if the data are simulated)
-n          normalize by estimated theta
-ntx        output the estimated spectrum together with matrix sigma (for ntx input)
-sigma      output sigma matrix instead of E[xi_i xi_j]
-v          print theta estimate
-x [file]   read first moments from file (e.g. known if the data are simulated), used for theta estimation
Examples:
ms 6 1000 -t 5 | ms2fs | fs2mom
ms 6 1000 -t 5 | ms2fs | fs2mom -t 5
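The estimation itself amounts to averaging over replicates. The following Python sketch computes the mean spectrum, the raw second moments E[xi_i xi_j] and the covariance matrix from a list of spectra. It is an illustration with invented names, uses the plain 1/R average over R replicates, and does not apply the theta-dependent scaling that turns covariances into the matrix sigma.

```python
def estimate_moments(spectra):
    """Return (mean, second, cov) estimated from equally long spectra.
    second[i][j] estimates E[xi_i xi_j]; cov[i][j] = second[i][j] - mean[i]*mean[j]."""
    r = len(spectra)
    k = len(spectra[0])
    if any(len(s) != k for s in spectra):
        raise ValueError("all spectra must have the same length")
    mean = [sum(s[i] for s in spectra) / float(r) for i in range(k)]
    second = [[sum(s[i] * s[j] for s in spectra) / float(r) for j in range(k)]
              for i in range(k)]
    cov = [[second[i][j] - mean[i] * mean[j] for j in range(k)] for i in range(k)]
    return mean, second, cov

mean, second, cov = estimate_moments([[2, 0], [0, 2]])
print(mean, second, cov)
```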
ms2fs
calculates the site frequency spectrum from the output of "ms", a widely used program for simulating the evolution of genetic sequences: http://home.uchicago.edu/rhudson1/source/mksamples.html
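The conversion can be sketched in a few lines of Python. The sketch assumes the plain "ms" output layout, in which every replicate starts with a line "//" and lists one 0/1 string per sampled sequence, and it takes the sample size as an explicit argument. The function name is invented and the code is independent of the actual ms2fs implementation.

```python
def ms_to_spectra(text, nsam):
    """Return one unfolded spectrum (list of length nsam-1) per ms replicate."""
    spectra = []
    for block in text.split("//")[1:]:  # everything before the first '//' is header
        haps = [ln.strip() for ln in block.splitlines()
                if ln.strip() and set(ln.strip()) <= {"0", "1"}]
        if haps and len(haps) != nsam:
            raise ValueError("expected %d haplotypes, found %d" % (nsam, len(haps)))
        xi = [0] * (nsam - 1)
        nsites = len(haps[0]) if haps else 0  # a replicate without mutations has no haplotype lines
        for col in range(nsites):
            count = sum(1 for h in haps if h[col] == "1")  # carriers of the derived allele
            if 0 < count < nsam:
                xi[count - 1] += 1
        spectra.append(xi)
    return spectra

sample = """ms 4 2 -t 5
1 2 3

//
segsites: 3
positions: 0.1 0.5 0.9
110
100
101
000

//
segsites: 0
"""
spectra = ms_to_spectra(sample, 4)
for xi in spectra:
    print(" ".join(str(x) for x in xi))
```

Each printed line has the format described at the top of this README, so the result can be passed on to fs2mom or ntx.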
References
Achaz, G., 2008 Testing for neutrality in samples with sequencing errors. Genetics 179 (3): 1409–1424.
Fay, J. C. and C.-I. Wu, 2000 Hitchhiking under positive Darwinian selection. Genetics 155 (3): 1405–1413.
Fu, Y.-X. and W.-H. Li, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693–709.
Fu, Y.-X., 1995 Statistical properties of segregating sites. Theoretical Population Biology 48: 172–197.
Rafajlović, M., et al., 2014 Demography-adjusted tests of neutrality based on genome-wide SNP data. Theoretical Population Biology (in press).
Tajima, F., 1989 The effect of change in population size on DNA polymorphism. Genetics 123 (3): 597–601.
Zeng, K., Y.-X. Fu, S. Shi, and C.-I. Wu, 2006 Statistical tests for detecting positive selection by utilizing high-frequency variants. Genetics 174 (3): 1431–1439.
Last edit: A. Klassmann 2017-01-30