A. Klassmann - 2014-11-05

This package contains implemented formulas for the calculation of the second and third moments of mutations described by a coalescent. Several utilities parse the output of the population-genetic simulation program "ms".
The options listed here are not exhaustive. Further options are shown by calling the tools with option -h.

nt2mom

Prints the expected values of the second moments: E[xi_i xi_j].
Example: nt2mom 5
Comparison with simulations: ms 5 10000 -t 1 | msfs | fs2mom

Optionally prints expection values conditional on a mutation of size k.
Example: nt2mom 5 3
Comparison with simulations: ms 5 10000 -t 1 | mscondfs 3 | colmean

This can be further restricted to nested or disjoint mutations with the options -n and -d

nt3mom

Prints the expected values of the third moments: E[xi_h xi_i xi_j].
Example: nt3mom 5
Comparison with simulations: ms 5 10000 -t 1 | msfs | fs3mom

Optionally prints second moments conditional on a mutation of size k.
Example: nt3mom 5 3
Comparison with simulations: ms 5 10000 -t 1 | mscondfs 3 | fs2mom

This can be further restricted to nested or disjoint mutations with the options -nn, -nd and -dd

nt3mom_slow

Only for comparison with nt3mom. Uses the unsimplified version of the formulas for the third moments and is hence very slow; yields the same results within rounding errors.

colmean

averages columns over consecutive rows.
Example: echo -e "10 20 \n5 8" | colmean

matmean

averages element-wise over consecutive square matrices.
Example: echo -e "1 1\n2 2\n2 1\n2 3" | matmean

msfs

calculates the standard frequency spectrum out of ms output.
Example: ms 20 1 -t 1 | msfs

ms2loci

calculates the 2-loci frequency spectrum from output of 'ms'.
Example: ms 6 1 -t 1 | ms2loci
This can be further restricted to nested or disjoint mutations with the options -nested respective -disjoint

mscondfs

calculates the frequency spectrum from output of 'ms', conditional on a (possible multiple) existence of a mutation of size k
Example: ms 6 10 -t 1 | mscondfs 3
This can be further restricted to nested or disjoint mutations with options -n and -d

Compare:
ms 6 100000 -t 1 | mscondfs 3 | colmean
with
nt2mom 6 3

mscond

filters output of 'ms'. (re)prints (part of) the ms output conditional on mutations of size k.
Example: ms 6 10 -t 1 | mscond 3
This can be further restricted to nested or disjoint mutations with the options -nested respective -disjoint.
'mscond k | msfs' yields the same output as 'mscondfs k'

fs3mom

calculates third moments for tabular data (e.g. frequency spectra).
Example: ms 5 1000 -t 1 | msfs | fs3mom
Option -mu yields the central moments.
Compare e.g. 'ms 5 10000 -t 1 | msfs | fs3mom' with 'nt3mom 5'

There is an analogous 'fs2mom' for the second moments in package 'ntx'.

M2fs

prints the analytical expected spectrum of exactly two nested mutations in the limit of vanishing theta using the formulas of Hobolth/Wiuf 2009.

msM2fs

filters ms-output and prints the frequency spectrum for every sample that contains exactly 2 nested mutations.

Smom

prints the second and third moments of the number of segregating sites S for given sample size n and theta (default 1).
Example: Smom 5 -t 10

testbias

computes the bias and skewness for 5 neutrality tests (like Tajima's D) using the approximating formulas given in the 3rd moments paper.
Example: testbias 10 -t 1

The analytical skewness can be compared with simulated data by using the sourceforge package 'ntx' (which computes the neutrality tests from 'ms' output).
Provided theta is known (and not estimated from the data, hence the additional parameter for ntx), it should be equal to the analytical result:
e.g. for Tajima's D
ms 10 100000 -t 1 | msfs | ntx -t 1 | cut -f 4 | fs3mom -skewness
or for Fay and Wu's H and theta = 100
ms 10 100000 -t 100 | msfs | ntx -t 100 | cut -f 10 | fs3mom -skewness

m2diff

compares two square matrices of same size.
Example:
echo -e "1 1\n2 2\n2 1\n2 3" > m1
echo -e "1 1\n2 1\n2 1\n2 3" > m2
m2diff m1 m2

m3diff

compares two 3d-arrays.

m2elsum

computes the sum of matrix elements
Example:
nt2mom 5 | m2elsum

m3elsum

computes the sum of 3-d array elements
Example:
nt3mom 5 | m3elsum

mscut

filters ms-output. Retains only those segregating sites which positions fall into a specified interval
Example:
ms 10 1 -t 5 | mscut 0.3 0.5

mssub

subsets ms-output. Retains only the first n chromosomes.
Example:
ms 10 1 -t 5 | mssub 5

msTsub

filters or extracts subtrees of size k (needs ms with option -T) [experimental status].

ms2tikzhaplo

translates ms-output into a latex tikz-picture: aligned sequences as lines and derived mutations as balls on top.
Example:
ms 20 1 -t 10 | ms2tikzhaplo -h > mshaplo.tex
pdflatex mshaplo.tex

ms2tikztree

translates ms-output (option -T needed!) into a latex tikz-picture: draws tree with derived mutations on top.
Example:
ms 20 1 -t 10 -T | ms2tikztree -h > mstree.tex
pdflatex mstree.tex

ms2vcf

convert output from ms to vcf. The REF and ALT fields are filled by random nucleotides.

Note that ms uses the "infinite sites model", while a vcf file accommodates a sequence with finite sites. The position numbers of ms (which lie in the interval [0,1]) are multiplied by the "length" of the chromosome (and rounded subsequently) to yield the integer position numbers required by the vcf. The precision of the positions in ms output should be increased to its maximum value by setting "-p 16" in order to reduce the number of identical positions due to rounding.

Different segregating sites placed at the same position in the vcf constitute a violation of the "infinite sites model" on which ms simulation is based and it is unclear how to handle them correctly. The tool implements a quick-and-dirty work-around : in case of multiple SNPs obtaining the same position in the vcf, only the first one (lowest position in ms file) is retained.

Example:
ms 10 2 -t 10 -p 16 | ms2vcf -length 1000 > msout.vcf

Type ms2vcf -h for other options.

ms2SF

convert output from ms to SweepFinder format. That format is a table with four columns,
containing the position of (bi-allelic) SNPs, the absolute number of "second" alleles in the sample and a "folded" flag which is 0 when the "second" allele can be regarded as ancestral and 1 if not.

Example:
ms 10 2 -t 10 -p 16 | ms2SF -length 1000 > msout.SF

ms2fastphase

convert output from ms to input for fastPHASE. Both formats are quite similar. Each two consecutive chromosomes of ms output constitute a diploid individual. The tool has an option to randomize the genotypes at each position (swaping the allele between the two chromosomes with probability 0.5), hence erasing phase information.

Example:
ms 10 2 -t 10 | ms2fastphase -r

 

Last edit: A. Klassmann 2020-08-12