File | Date | Author | Commit |
---|---|---|---|
files | 2015-06-08 |
![]() |
[43f6ae] Version 1.0 |
src | 2015-06-08 |
![]() |
[43f6ae] Version 1.0 |
.gitignore | 2015-06-08 |
![]() |
[43f6ae] Version 1.0 |
LICENSE | 2015-06-08 |
![]() |
[4b1dce] Initial commit |
README.md | 2015-06-08 |
![]() |
[d4cedf] Update README.md |
Inference of divergence times under pure drift and isolation, using the beta with spikes approximation
SpikeyTree infers divergence times between populations that share a common
history described by a tree. SpikeyTree assumes that the populations have
been evolving under isolation, according to the Wright-Fisher with pure drift.
SpikeyTree can use the Beta or the Beta with Spikes approximation to the allele
frequency distribution under the Wright-Fisher.
More details can be found in
Tataru et al. Inference under a Wright-Fisher model using an accurate beta approximation
If using this software in a publication, please cite the above article.
Paula Tataru, paula@cs.au.dk
SpikeTree is implemented in python 2.7 and uses third-party python packages.
These should be downloaded and installed accordingly
- SciPy http://sourceforge.net/projects/scipy/files/scipy/
- NumPy http://sourceforge.net/projects/numpy/files/NumPy/
- newick http://users-birc.au.dk/mailund/newick.html
SpikeyTree can be used to
- simulate data on a tree under the Wright-Fisher (src/wf.py)
- infer divergence times on simulated/real data (src/optimize.py)
Running
python src/wf.py
will display a brief help:
Usage: wf.py [options]
Options:
-h, --help show this help message and exit
-f FILE file with tree to simulate on
-s SHAPES shape parameters for root distribution [default: 1,1]
-n SAMPLE sample sizes for populations [default: 100]
-l LOCI number of loci/sites to simulate [default: 100]
-o OUTPUT file to write simulation to
FILE should contain a tree in Newick format, containing both the number of
generations per branch, but also the population size N. For example,
files/test.txt contains ((C 500 : 44, E 500 : 132) 250 : 14, W 250 : 300);
This says that population C has been evolving for 44 generations since the
split from the common ancestor with population E, and its size during these
44 generations was 500 individuals.
The simulation draws the allele frequency at the root from a beat distribution
with shape parameters given by SHAPES (separated by comma).
SAMPLE specifies how many sequences are sampled from the populations in the
present (i.e. at the leaves). If only one number is given, then all populations
have the same sample size. Otherwise, different sample sizes are separated by
commas. The number of the samples should equal the number of populations in the
present.
LOCI determines the number of simulated (unlinked) loci.
The result of the simulation will be written to OUTPUT. The corresponding
scaled tree (i.e. with branch lengths given by #gen / 2N) is written to
tee_OUTPUT.
wf.py simulates only polymorphic data by using rejection samples: simulated
loci that that are fixed or lost in all samples are rejected.
For example, running
python src\wf.py -f files\test.txt -s 0.01,0.008 -n 10,15,7 -l 300 -o files\sim.txt
will produce two files
- files\sim.txt contain the simulated data
((C, E), W);
the first line shows the topology (tree without branch lengths) in Newick format
10 15 7
the second line shows the sample sizes for each population
C E W
the third line is a header with the population names for
the lines that follow, containing the counts for each locus
- files\tree_sim.txt contains
((C: 0.04400, E: 0.13200): 0.02800, W: 0.60000);
the tree in Newick format with scaled branches followed by the
shape parameters used in the simulation and the branch
lengths in the depth first order;
the branch above population C is given by 44 / (2*500) = 0.044
Running
python src/optimize.py
will display a brief help:
Usage: optimize.py [options]
Options:
-h, --help show this help message and exit
-f FILE input file containing data and tree
-o OUTPUT output file to write optimization result
-T HEIGHT tree height [default: 30]
-K BINS number of bins [default: 20]
-B run beta; otherwise, run beta with spikes [default: False]
-r REP number of repetitions [default: 1]
-t THREADS number of threads to run [default: 1]
FILE contains the input data, in the same format as the simulated data
described before.
The results of the optimization will be written to OUTPUT.
The tree height and the number of bins used in the discretization are given
using HEIGHT and BINS. For details of this, see the Supplementary Material of
Tataru et al. Inference under a Wright-Fisher model using an accurate beta approximation
The option -B toggles between the beta and beta with spikes approximations. By
default, the beta with spikes approximation is used.
The numerical optimization needs a starting point. The starting point can
affect the resulting optimum found. Therefore, the optimization can be run
multiple times, each with a different starting point, specified through REP.
The first point is automatically calculated from Fst obtained from the data,
while the remaining REP-1 runs will have random generated starting points.
The code can run the multiple runs in parallel, and the number of desired
parallel runs is specified in THREADS.
For example, the command for running the optimization on the previously
simulated data using 4 repetitions on 4 threads is
python src\optimize.py -f files\sim.txt -o files\out -r 4 -t 4
The program will print to screen the progress of the optimization runs. Each
line is of the format
Iteration x. y. z: lk
where x corresponds to the current run (from 1 to REP), y is the iteration
in the optimization call, and z is the number of calls used to the likelihood
function. lk contains the current log likelihood. The result will be one file,
files\out_spikes.txt. Running the same command with option -B will produce a
file named files\out.txt
The output files contain a summary for all REP runs:
- the maximum log likelihood found
- the starting tree with the scaled branch lengths and shape parameters
- the optimized tree and the shape parameters
- the branch lengths in depth first order of the optimzied tree
See the file LICENSE.txt for information on terms & conditions for usage,
and a DISCLAIMER OF ALL WARRANTIES.