fastNeutrino Code

FAST Ne and mUTatation Rate Inference using aNalytic Optimization

Brought to you by: anand-bhaskar, yssong

Tree [8ebf66] master /

History

HTTPS access

File	Date	Author	Commit
examples	2014-10-30	bhaskar	[673c9a] Initial commit
src	2015-12-03	Anand Bhaskar	[e3c7b3] Fixing Makefile to enable compilation with clan...
Makefile	2017-03-24	Anand Bhaskar	[8ebf66] Removing -fopenmp from CXXFLAGS
README.md	2015-12-03	Anand Bhaskar	[68a0f6] Makefile detects OS and changes the install nam...

Read Me

About

fastNeutrino (fast Ne and mutation rate estimation using analytic optimization) is an inference program that fits a demographic model to the sample frequency spectrum (SFS) observed at a collection of loci. It also infers per-locus mutation rates if locus lengths are provided. The inference is performed under Kingman's coalescent model of evolution with the infinite sites model of mutation. fastNeutrino can support the inference of parameters of piecewise-exponential demographic models, and uses the Ipopt optimizer along with the ADOL-C automatic differentiation library for performing optimization. The details of the algorithm can be found in this paper:

Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data
Bhaskar A., Wang Y.X.R, Song Y.S.
Genome Research, 25(2): 268–279, 2015

For questions and bug reports, please contact Anand Bhaskar (anand.bhaskar@gmail.com).

Installation

fastNeutrino depends on the following libraries: Ipopt, ADOL-C, ExprTk, and Boost.

To download and install Boost, please follow the instructions at boost.org. You only need to compile the Boost program_options library for use with fastNeutrino. You might find it helpful to look at the section Easy Build and Install instructions. To install Boost program_options, you would do something like:

$ ./bootstrap.sh --prefix=path/to/install/boost --with-libraries=program_options
$ ./b2 install

If you are using Mac OS X, you can also install Boost using Homebrew.
After installing Boost, edit the BOOST_DIR Makefile variable to point to the directory where you installed Boost (i.e. the directory passed to the --prefix above). If you installed Boost using Homebrew, it was most likely installed in the /usr/local directory.

ADOL-C can be downloaded here. fastNeutrino has been tested with version 2.3.0 of ADOL-C. After downloading, extract the files into some directory. You don't need to compile or install ADOL-C, since fastNeutrino only uses the header file adouble.h in ADOL-C. Edit the fastNeutrino Makefile variable ADOL_C_DIR to point to the directory where ADOL-C was extracted to.
ExprTk can be downloaded here. After downloading, edit the fastNeutrino Makefile variable EXPRTK_DIR to point to the directory where ExprTk was extracted to.
Ipopt : Download the Ipopt source here, and extract it. fastNeutrino has been tested with version 3.12.4 of Ipopt. If you download a more recent version of Ipopt and face compilation problems, please use version 3.12.4. Edit the Makefile variable IPOPT_DIR in the fastNeutrino Makefile to point to the directory where Ipopt was extracted to.
In order to use Ipopt, you will also need to download the COIN-HSL archive here. This archive contains various Fortran linear solvers that are used by Ipopt. Place the downloaded COIN-HSL archive inside the ThirdParty/HSL directory of Ipopt, and do the following:

$ cd ThirdParty/HSL
$ tar -zxf coinhsl-x.y.z.tar.gz
$ ln -s coinhsl-x.y.z coinhsl

In the above commands, replace coinhsl-x.y.z with the directory name of the unzipped COIN-HSL archive.

To make Ipopt, go to the directory where fastNeutrino is located and do:

$ make ipopt

Finally, to compile fastNeutrino, do:

$ make

Usage

Input SFS

The first line of the input SFS data file has the sample size n (in haploids) and the number of loci, L. Each of the next n + 1 lines has L columns, with the i-th line giving the number of sites for each locus with i alleles of one type and n - i of the other type. The inference can be done on this data or on the folded version where the counts in lines i and n - i are merged to get minor allele counts. Also, the first and last lines (which are counts of the number of monomorphic sites) only matter for the sake of determining the total length of the locus in basepairs in order to estimate the mutation rate in physical units. An example SFS (which is also the exome-sequencing dataset we analyzed in [1]) is given in the examples directory.

Demographic model

The first line of the model file specifies the ancestral population size. Each subsequent line has one of these 2 types depending on whether the epoch is an exponential phase or a constant population size:

e N t r : this says that at time t (before present, which is time 0), the population size instantaneously changes to a value x such that it grows at rate r (i.e. r * 100%) per generation until it becomes N at the next demographic event (or time 0, whichever happens first).
c N t : at time t, the population size instantly becomes N.

All times are in units of generations before present, where the present time is 0 and time increases going back in the past. All population sizes are specified in haploids.

You can infer any parameter by simply replacing it with a ? symbol and giving it a variable name. However, you have to specify the ancestral population size, since there is a degree of freedom in the scaling in the coalescent.
For an exponential growth epoch, if you wish the starting population size at the onset of growth to be the same as the population size an epsilon amount of time before (i.e. to preserve continuity of the population size function), then the N parameter for epoch should be replaced with a * symbol. See the file model1ExponentialEpoch.txt for an example.

The next line is an integer indicating the number of free parameters which have a box constraint on them, followed by a line per constraint with format param lb ub. This constrains param to be between lb and ub during the optimization.

The next line is an integer indicating the number of dependent parameters, followed by a line per dependent parameter of the form param expr. expr can be pretty much any mathematical expression involving only independent parameters that describes a way to calculate the value of the dependent parameter param using the free parameters. For example, see the file model1ExponentialEpoch1BottleneckFixedDuration.txt where the bottleneck duration is fixed to 100 generations by setting the time of the start of the bottleneck to be 100 generations further in the past than the time of the end of the bottleck.

Example demographic models

In the examples folder, you can find some sample demographic models.

model1ExponentialEpoch.txt : Two parameters are being inferred here. The t parameter is the number of generations in the past when an exponential epoch phase began, and r is the rate of exponential expansion. The * in the line corresponding to the exponential epoch denotes that the population sizes just before and after the exponential epoch are the same (i.e. the population size function remains continuous at that time).
model1ExponentialEpoch1Bottleneck.txt : In addition to the two parameters of the exponential epoch being estimated above, there is also a bottleneck of size NB which starts at time t3 in the past, lasts until time t2 at which time the population size becomes N1. Then at time t, the population expands/contracts at rate r.
model1ExponentialEpoch1BottleneckFixedDuration.txt : Similar to the file model1ExponentialEpoch1Bottleneck.txt, but now the bottleneck is constrained to last for 100 generations since t3 is set to t2 + 100 in the dependent parameter expression in the last line of the file.

In all these examples, there are no bound constraints set on the free parameters. However, there are several implicit constraints set in the program. The population sizes have to be non-negative, the epoch onset times are ordered, and the expansion/contraction rate is at most 40% per generation.

Example usage

For a full list of program options, use

fastNeutrino --help
```.

The files used in these examples are available in the `examples` folder.

* ```
fastNeutrino --maxB 20 --modelFile model1ExponentialEpoch.txt --inferredModelOutputFile model1ExponentialEpochInferredParams.txt --maxRandomRestarts 10 < sfsNelsonEtAl.txt

This infers the free parameters of the demographic model specified in file model1ExponentialEpoch.txt, on the SFS stored in file sfsNelsonEtAl.txt. The inference is done from 10 random initialization points, and the best demographic model is output to model1ExponentialEpochInferredParams.txt. All the segregating sites which have derived allele count >= 20 are collapsed into one bin for the inference.

```
fastNeutrino --maxB 20 --foldSpectrum --modelFile model1ExponentialEpoch.txt --inferredModelOutputFile model1ExponentialEpochInferredParams.txt --maxRandomRestarts 10 < sfsNelsonEtAl.txt

    Same as in the previous example but using the folded spectrum computed from the full SFS specified in sfsNelsonEtAl.txt.


* ```
fastNeutrino --adaptiveMaxB 0.95 --modelFile model1ExponentialEpoch.txt --inferredModelOutputFile model1ExponentialEpochInferredParams.txt --maxRandomRestarts 10 < sfsNelsonEtAl.txt

Same as in the first example, but now all the segregating sites which fall in the bottom 5% in derived allele count are binned together for the inference.

fastNeutrino --maxB 20 --modelFile model1ExponentialEpoch.txt --inferredModelOutputFile model1ExponentialEpochInferredParams.txt --maxRandomRestarts 10 --startPoint 100,0.05 < sfsNelsonEtAl.txt
The first run of optimization (out of the 10 restarts) starts at the specified point. Each of the free parameters (the variables preceded by the ? symbols in model1ExponentialEpoch.txt) are replaced by the comma-separated list of numbers given after the --startPoint argument (in that order of appearance) for the starting point of the optimization.