File | Date | Author | Commit |
---|---|---|---|
examples | 2014-10-30 |
![]() |
[673c9a] Initial commit |
src | 2015-12-03 |
![]() |
[e3c7b3] Fixing Makefile to enable compilation with clan... |
Makefile | 2017-03-24 |
![]() |
[8ebf66] Removing -fopenmp from CXXFLAGS |
README.md | 2015-12-03 |
![]() |
[68a0f6] Makefile detects OS and changes the install nam... |
fastNeutrino
(fast Ne and mutation rate estimation using analytic optimization) is an inference program that fits a demographic model to the sample frequency spectrum (SFS) observed at a collection of loci. It also infers per-locus mutation rates if locus lengths are provided. The inference is performed under Kingman's coalescent model of evolution with the infinite sites model of mutation. fastNeutrino
can support the inference of parameters of piecewise-exponential demographic models, and uses the Ipopt
optimizer along with the ADOL-C
automatic differentiation library for performing optimization. The details of the algorithm can be found in this paper:
Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data
Bhaskar A., Wang Y.X.R, Song Y.S.
Genome Research, 25(2): 268–279, 2015
For questions and bug reports, please contact Anand Bhaskar (anand.bhaskar@gmail.com).
fastNeutrino
depends on the following libraries: Ipopt
, ADOL-C
, ExprTk
, and Boost
.
Boost
, please follow the instructions at boost.org. You only need to compile the Boost
program_options
library for use with fastNeutrino
. You might find it helpful to look at the section Easy Build and Install instructions. To install Boost
program_options
, you would do something like:$ ./bootstrap.sh --prefix=path/to/install/boost --with-libraries=program_options
$ ./b2 install
If you are using Mac OS X, you can also install Boost
using Homebrew
.
After installing Boost
, edit the BOOST_DIR
Makefile variable to point to the directory where you installed Boost
(i.e. the directory passed to the --prefix
above). If you installed Boost
using Homebrew
, it was most likely installed in the /usr/local
directory.
ADOL-C
can be downloaded here. fastNeutrino
has been tested with version 2.3.0
of ADOL-C
. After downloading, extract the files into some directory. You don't need to compile or install ADOL-C
, since fastNeutrino
only uses the header file adouble.h
in ADOL-C
. Edit the fastNeutrino
Makefile variable ADOL_C_DIR
to point to the directory where ADOL-C
was extracted to.
ExprTk
can be downloaded here. After downloading, edit the fastNeutrino
Makefile variable EXPRTK_DIR
to point to the directory where ExprTk
was extracted to.
Ipopt
: Download the Ipopt
source here, and extract it. fastNeutrino
has been tested with version 3.12.4
of Ipopt
. If you download a more recent version of Ipopt
and face compilation problems, please use version 3.12.4
. Edit the Makefile variable IPOPT_DIR
in the fastNeutrino
Makefile to point to the directory where Ipopt
was extracted to.
In order to use Ipopt
, you will also need to download the COIN-HSL
archive here. This archive contains various Fortran linear solvers that are used by Ipopt
. Place the downloaded COIN-HSL
archive inside the ThirdParty/HSL
directory of Ipopt
, and do the following:
$ cd ThirdParty/HSL
$ tar -zxf coinhsl-x.y.z.tar.gz
$ ln -s coinhsl-x.y.z coinhsl
In the above commands, replace coinhsl-x.y.z
with the directory name of the unzipped COIN-HSL
archive.
To make Ipopt
, go to the directory where fastNeutrino
is located and do:
$ make ipopt
Finally, to compile fastNeutrino
, do:
$ make
The first line of the input SFS data file has the sample size n (in haploids) and the number of loci, L. Each of the next n + 1 lines has L columns, with the i-th line giving the number of sites for each locus with i alleles of one type and n - i of the other type. The inference can be done on this data or on the folded version where the counts in lines i and n - i are merged to get minor allele counts. Also, the first and last lines (which are counts of the number of monomorphic sites) only matter for the sake of determining the total length of the locus in basepairs in order to estimate the mutation rate in physical units. An example SFS (which is also the exome-sequencing dataset we analyzed in [1]) is given in the examples
directory.
The first line of the model file specifies the ancestral population size. Each subsequent line has one of these 2 types depending on whether the epoch is an exponential phase or a constant population size:
N
t
r
: this says that at time t
(before present, which is time 0), the population size instantaneously changes to a value x
such that it grows at rate r
(i.e. r
* 100%) per generation until it becomes N
at the next demographic event (or time 0, whichever happens first).N
t
: at time t
, the population size instantly becomes N
.All times are in units of generations before present, where the present time is 0 and time increases going back in the past. All population sizes are specified in haploids.
You can infer any parameter by simply replacing it with a ?
symbol and giving it a variable name. However, you have to specify the ancestral population size, since there is a degree of freedom in the scaling in the coalescent.
For an exponential growth epoch, if you wish the starting population size at the onset of growth to be the same as the population size an epsilon amount of time before (i.e. to preserve continuity of the population size function), then the N
parameter for epoch should be replaced with a *
symbol. See the file model1ExponentialEpoch.txt for an example.
The next line is an integer indicating the number of free parameters which have a box constraint on them, followed by a line per constraint with format param lb ub
. This constrains param
to be between lb
and ub
during the optimization.
The next line is an integer indicating the number of dependent parameters, followed by a line per dependent parameter of the form param expr
. expr
can be pretty much any mathematical expression involving only independent parameters that describes a way to calculate the value of the dependent parameter param
using the free parameters. For example, see the file model1ExponentialEpoch1BottleneckFixedDuration.txt where the bottleneck duration is fixed to 100 generations by setting the time of the start of the bottleneck to be 100 generations further in the past than the time of the end of the bottleck.
In the examples
folder, you can find some sample demographic models.
model1ExponentialEpoch.txt
: Two parameters are being inferred here. The t
parameter is the number of generations in the past when an exponential epoch phase began, and r
is the rate of exponential expansion. The *
in the line corresponding to the exponential epoch denotes that the population sizes just before and after the exponential epoch are the same (i.e. the population size function remains continuous at that time).
model1ExponentialEpoch1Bottleneck.txt
: In addition to the two parameters of the exponential epoch being estimated above, there is also a bottleneck of size NB
which starts at time t3
in the past, lasts until time t2
at which time the population size becomes N1
. Then at time t
, the population expands/contracts at rate r
.
model1ExponentialEpoch1BottleneckFixedDuration.txt
: Similar to the file model1ExponentialEpoch1Bottleneck.txt
, but now the bottleneck is constrained to last for 100 generations since t3
is set to t2 + 100
in the dependent parameter expression in the last line of the file.
In all these examples, there are no bound constraints set on the free parameters. However, there are several implicit constraints set in the program. The population sizes have to be non-negative, the epoch onset times are ordered, and the expansion/contraction rate is at most 40% per generation.
For a full list of program options, use
fastNeutrino --help
```.
The files used in these examples are available in the `examples` folder.
* ```
fastNeutrino --maxB 20 --modelFile model1ExponentialEpoch.txt --inferredModelOutputFile model1ExponentialEpochInferredParams.txt --maxRandomRestarts 10 < sfsNelsonEtAl.txt
This infers the free parameters of the demographic model specified in file model1ExponentialEpoch.txt, on the SFS stored in file sfsNelsonEtAl.txt. The inference is done from 10 random initialization points, and the best demographic model is output to model1ExponentialEpochInferredParams.txt. All the segregating sites which have derived allele count >= 20 are collapsed into one bin for the inference.
Same as in the previous example but using the folded spectrum computed from the full SFS specified in sfsNelsonEtAl.txt.
* ```
fastNeutrino --adaptiveMaxB 0.95 --modelFile model1ExponentialEpoch.txt --inferredModelOutputFile model1ExponentialEpochInferredParams.txt --maxRandomRestarts 10 < sfsNelsonEtAl.txt
Same as in the first example, but now all the segregating sites which fall in the bottom 5% in derived allele count are binned together for the inference.
fastNeutrino --maxB 20 --modelFile model1ExponentialEpoch.txt --inferredModelOutputFile model1ExponentialEpochInferredParams.txt --maxRandomRestarts 10 --startPoint 100,0.05 < sfsNelsonEtAl.txt
?
symbols in model1ExponentialEpoch.txt) are replaced by the comma-separated list of numbers given after the --startPoint argument (in that order of appearance) for the starting point of the optimization.