This package contains implemented formulas for the calculation of the second and third moments of mutations described by a coalescent. Several utilities parse the output of the population-genetic simulation program "ms".
The options listed here are not exhaustive. Further options are shown by calling the tools with option -h.
nt2mom
Prints the expected values of the second moments: E[xi_i xi_j].
Example: nt2mom 5
Comparison with simulations: ms 5 10000 -t 1 | msfs | fs2mom
Optionally prints expection values conditional on a mutation of size k.
Example: nt2mom 5 3
Comparison with simulations: ms 5 10000 -t 1 | mscondfs 3 | colmean
This can be further restricted to nested or disjoint mutations with the options -n and -d
nt3mom
Prints the expected values of the third moments: E[xi_h xi_i xi_j].
Example: nt3mom 5
Comparison with simulations: ms 5 10000 -t 1 | msfs | fs3mom
Optionally prints second moments conditional on a mutation of size k.
Example: nt3mom 5 3
Comparison with simulations: ms 5 10000 -t 1 | mscondfs 3 | fs2mom
This can be further restricted to nested or disjoint mutations with the options -nn, -nd and -dd
nt3mom_slow
Only for comparison with nt3mom. Uses the unsimplified version of the formulas for the third moments and is hence very slow; yields the same results within rounding errors.
calculates the standard frequency spectrum out of ms output.
Example: ms 20 1 -t 1 | msfs
ms2loci
calculates the 2-loci frequency spectrum from output of 'ms'.
Example: ms 6 1 -t 1 | ms2loci
This can be further restricted to nested or disjoint mutations with the options -nested respective -disjoint
mscondfs
calculates the frequency spectrum from output of 'ms', conditional on a (possible multiple) existence of a mutation of size k
Example: ms 6 10 -t 1 | mscondfs 3
This can be further restricted to nested or disjoint mutations with options -n and -d
Compare:
ms 6 100000 -t 1 | mscondfs 3 | colmean
with
nt2mom 6 3
mscond
filters output of 'ms'. (re)prints (part of) the ms output conditional on mutations of size k.
Example: ms 6 10 -t 1 | mscond 3
This can be further restricted to nested or disjoint mutations with the options -nested respective -disjoint.
'mscond k | msfs' yields the same output as 'mscondfs k'
fs3mom
calculates third moments for tabular data (e.g. frequency spectra).
Example: ms 5 1000 -t 1 | msfs | fs3mom
Option -mu yields the central moments.
Compare e.g. 'ms 5 10000 -t 1 | msfs | fs3mom' with 'nt3mom 5'
There is an analogous 'fs2mom' for the second moments in package 'ntx'.
M2fs
prints the analytical expected spectrum of exactly two nested mutations in the limit of vanishing theta using the formulas of Hobolth/Wiuf 2009.
msM2fs
filters ms-output and prints the frequency spectrum for every sample that contains exactly 2 nested mutations.
Smom
prints the second and third moments of the number of segregating sites S for given sample size n and theta (default 1).
Example: Smom 5 -t 10
testbias
computes the bias and skewness for 5 neutrality tests (like Tajima's D) using the approximating formulas given in the 3rd moments paper.
Example: testbias 10 -t 1
The analytical skewness can be compared with simulated data by using the sourceforge package 'ntx' (which computes the neutrality tests from 'ms' output).
Provided theta is known (and not estimated from the data, hence the additional parameter for ntx), it should be equal to the analytical result:
e.g. for Tajima's D
ms 10 100000 -t 1 | msfs | ntx -t 1 | cut -f 4 | fs3mom -skewness
or for Fay and Wu's H and theta = 100
ms 10 100000 -t 100 | msfs | ntx -t 100 | cut -f 10 | fs3mom -skewness
m2diff
compares two square matrices of same size.
Example:
echo -e "1 1\n2 2\n2 1\n2 3" > m1
echo -e "1 1\n2 1\n2 1\n2 3" > m2
m2diff m1 m2
m3diff
compares two 3d-arrays.
m2elsum
computes the sum of matrix elements
Example:
nt2mom 5 | m2elsum
m3elsum
computes the sum of 3-d array elements
Example:
nt3mom 5 | m3elsum
mscut
filters ms-output. Retains only those segregating sites which positions fall into a specified interval
Example:
ms 10 1 -t 5 | mscut 0.3 0.5
mssub
subsets ms-output. Retains only the first n chromosomes.
Example:
ms 10 1 -t 5 | mssub 5
msTsub
filters or extracts subtrees of size k (needs ms with option -T) [experimental status].
ms2tikzhaplo
translates ms-output into a latex tikz-picture: aligned sequences as lines and derived mutations as balls on top.
Example:
ms 20 1 -t 10 | ms2tikzhaplo -h > mshaplo.tex
pdflatex mshaplo.tex
ms2tikztree
translates ms-output (option -T needed!) into a latex tikz-picture: draws tree with derived mutations on top.
Example:
ms 20 1 -t 10 -T | ms2tikztree -h > mstree.tex
pdflatex mstree.tex
ms2vcf
convert output from ms to vcf. The REF and ALT fields are filled by random nucleotides.
Note that ms uses the "infinite sites model", while a vcf file accommodates a sequence with finite sites. The position numbers of ms (which lie in the interval [0,1]) are multiplied by the "length" of the chromosome (and rounded subsequently) to yield the integer position numbers required by the vcf. The precision of the positions in ms output should be increased to its maximum value by setting "-p 16" in order to reduce the number of identical positions due to rounding.
Different segregating sites placed at the same position in the vcf constitute a violation of the "infinite sites model" on which ms simulation is based and it is unclear how to handle them correctly. The tool implements a quick-and-dirty work-around : in case of multiple SNPs obtaining the same position in the vcf, only the first one (lowest position in ms file) is retained.
convert output from ms to SweepFinder format. That format is a table with four columns,
containing the position of (bi-allelic) SNPs, the absolute number of "second" alleles in the sample and a "folded" flag which is 0 when the "second" allele can be regarded as ancestral and 1 if not.
convert output from ms to input for fastPHASE. Both formats are quite similar. Each two consecutive chromosomes of ms output constitute a diploid individual. The tool has an option to randomize the genotypes at each position (swaping the allele between the two chromosomes with probability 0.5), hence erasing phase information.
Example:
ms 10 2 -t 10 | ms2fastphase -r
Last edit: A. Klassmann 2020-08-12
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
This package contains implemented formulas for the calculation of the second and third moments of mutations described by a coalescent. Several utilities parse the output of the population-genetic simulation program "ms".
The options listed here are not exhaustive. Further options are shown by calling the tools with option -h.
nt2mom
Prints the expected values of the second moments: E[xi_i xi_j].
Example: nt2mom 5
Comparison with simulations: ms 5 10000 -t 1 | msfs | fs2mom
Optionally prints expection values conditional on a mutation of size k.
Example: nt2mom 5 3
Comparison with simulations: ms 5 10000 -t 1 | mscondfs 3 | colmean
This can be further restricted to nested or disjoint mutations with the options -n and -d
nt3mom
Prints the expected values of the third moments: E[xi_h xi_i xi_j].
Example: nt3mom 5
Comparison with simulations: ms 5 10000 -t 1 | msfs | fs3mom
Optionally prints second moments conditional on a mutation of size k.
Example: nt3mom 5 3
Comparison with simulations: ms 5 10000 -t 1 | mscondfs 3 | fs2mom
This can be further restricted to nested or disjoint mutations with the options -nn, -nd and -dd
nt3mom_slow
Only for comparison with nt3mom. Uses the unsimplified version of the formulas for the third moments and is hence very slow; yields the same results within rounding errors.
colmean
averages columns over consecutive rows.
Example: echo -e "10 20 \n5 8" | colmean
matmean
averages element-wise over consecutive square matrices.
Example: echo -e "1 1\n2 2\n2 1\n2 3" | matmean
msfs
calculates the standard frequency spectrum out of ms output.
Example: ms 20 1 -t 1 | msfs
ms2loci
calculates the 2-loci frequency spectrum from output of 'ms'.
Example: ms 6 1 -t 1 | ms2loci
This can be further restricted to nested or disjoint mutations with the options -nested respective -disjoint
mscondfs
calculates the frequency spectrum from output of 'ms', conditional on a (possible multiple) existence of a mutation of size k
Example: ms 6 10 -t 1 | mscondfs 3
This can be further restricted to nested or disjoint mutations with options -n and -d
Compare:
ms 6 100000 -t 1 | mscondfs 3 | colmean
with
nt2mom 6 3
mscond
filters output of 'ms'. (re)prints (part of) the ms output conditional on mutations of size k.
Example: ms 6 10 -t 1 | mscond 3
This can be further restricted to nested or disjoint mutations with the options -nested respective -disjoint.
'mscond k | msfs' yields the same output as 'mscondfs k'
fs3mom
calculates third moments for tabular data (e.g. frequency spectra).
Example: ms 5 1000 -t 1 | msfs | fs3mom
Option -mu yields the central moments.
Compare e.g. 'ms 5 10000 -t 1 | msfs | fs3mom' with 'nt3mom 5'
There is an analogous 'fs2mom' for the second moments in package 'ntx'.
M2fs
prints the analytical expected spectrum of exactly two nested mutations in the limit of vanishing theta using the formulas of Hobolth/Wiuf 2009.
msM2fs
filters ms-output and prints the frequency spectrum for every sample that contains exactly 2 nested mutations.
Smom
prints the second and third moments of the number of segregating sites S for given sample size n and theta (default 1).
Example: Smom 5 -t 10
testbias
computes the bias and skewness for 5 neutrality tests (like Tajima's D) using the approximating formulas given in the 3rd moments paper.
Example: testbias 10 -t 1
The analytical skewness can be compared with simulated data by using the sourceforge package 'ntx' (which computes the neutrality tests from 'ms' output).
Provided theta is known (and not estimated from the data, hence the additional parameter for ntx), it should be equal to the analytical result:
e.g. for Tajima's D
ms 10 100000 -t 1 | msfs | ntx -t 1 | cut -f 4 | fs3mom -skewness
or for Fay and Wu's H and theta = 100
ms 10 100000 -t 100 | msfs | ntx -t 100 | cut -f 10 | fs3mom -skewness
m2diff
compares two square matrices of same size.
Example:
echo -e "1 1\n2 2\n2 1\n2 3" > m1
echo -e "1 1\n2 1\n2 1\n2 3" > m2
m2diff m1 m2
m3diff
compares two 3d-arrays.
m2elsum
computes the sum of matrix elements
Example:
nt2mom 5 | m2elsum
m3elsum
computes the sum of 3-d array elements
Example:
nt3mom 5 | m3elsum
mscut
filters ms-output. Retains only those segregating sites which positions fall into a specified interval
Example:
ms 10 1 -t 5 | mscut 0.3 0.5
mssub
subsets ms-output. Retains only the first n chromosomes.
Example:
ms 10 1 -t 5 | mssub 5
msTsub
filters or extracts subtrees of size k (needs ms with option -T) [experimental status].
ms2tikzhaplo
translates ms-output into a latex tikz-picture: aligned sequences as lines and derived mutations as balls on top.
Example:
ms 20 1 -t 10 | ms2tikzhaplo -h > mshaplo.tex
pdflatex mshaplo.tex
ms2tikztree
translates ms-output (option -T needed!) into a latex tikz-picture: draws tree with derived mutations on top.
Example:
ms 20 1 -t 10 -T | ms2tikztree -h > mstree.tex
pdflatex mstree.tex
ms2vcf
convert output from ms to vcf. The REF and ALT fields are filled by random nucleotides.
Note that ms uses the "infinite sites model", while a vcf file accommodates a sequence with finite sites. The position numbers of ms (which lie in the interval [0,1]) are multiplied by the "length" of the chromosome (and rounded subsequently) to yield the integer position numbers required by the vcf. The precision of the positions in ms output should be increased to its maximum value by setting "-p 16" in order to reduce the number of identical positions due to rounding.
Different segregating sites placed at the same position in the vcf constitute a violation of the "infinite sites model" on which ms simulation is based and it is unclear how to handle them correctly. The tool implements a quick-and-dirty work-around : in case of multiple SNPs obtaining the same position in the vcf, only the first one (lowest position in ms file) is retained.
Example:
ms 10 2 -t 10 -p 16 | ms2vcf -length 1000 > msout.vcf
Type ms2vcf -h for other options.
ms2SF
convert output from ms to SweepFinder format. That format is a table with four columns,
containing the position of (bi-allelic) SNPs, the absolute number of "second" alleles in the sample and a "folded" flag which is 0 when the "second" allele can be regarded as ancestral and 1 if not.
Example:
ms 10 2 -t 10 -p 16 | ms2SF -length 1000 > msout.SF
ms2fastphase
convert output from ms to input for fastPHASE. Both formats are quite similar. Each two consecutive chromosomes of ms output constitute a diploid individual. The tool has an option to randomize the genotypes at each position (swaping the allele between the two chromosomes with probability 0.5), hence erasing phase information.
Example:
ms 10 2 -t 10 | ms2fastphase -r
Last edit: A. Klassmann 2020-08-12