mimicree Wiki

Brought to you by: rokofler

ManualGeneral

Authors: Anonymous

Introduction
Specifications
- Considered
- Not considered
File formats
- Haplotypes (--haplotypes-g0)
- Inversions (--inversions-g0)
- Section One: Inversion definition
- Section Two: Population descritpion
- Recombination rate (--recombination-rate)
- Chromosome definition (--chromosome-definition)
- Additive fitness coefficients (--additive)
- Heterozygous effect in detail
- Epistasis (--epistasis)
- Syntax for the definition
- Syntax for a SNP
- Scoring
- Threads (--threads)
- Output mode (--output-mode)
Details about the algorithm
- Fitness function
- Mating function
- Recombination function

Introduction

The following section introduces the core functionality of MimcrEE and the used file formats

Specifications

Considered

Additive genetic effects
Dominance (including over and underdominance)
Epistasis (between haplotypes)
Haplotypes
Recombination rate
Inversions
Snapshots of the evolving population at defined time points (generations)
n-replicate runs of the evolving population

Not considered

sex of the fly (assuming hermaphrodites but excluding selfing)
fluctuating population size
overlapping life cycles
the following genetic variation: TE insertions, indels, microsatellites, CNV
novel mutations
gene conversion (biased)
epistasis within haplotypes; interactions of epistasis with dominance effects

File formats

Haplotypes (--haplotypes-g0)

The haplotypes at generation zero of the simulations, for example:

2L      686891    G      G/A    AG GG GG GG GG
2L      936681    A      A/G    GG AA AA AA AA
2L      861026    T      A/T    TT AT AA AA TT
2L      966618    C      T/C    TC TC TT TT TT
2R      134298    A      A/C    AC AC CC CC CC

The file consists of exactly five columns which are separated by a 'tab'!

col1: chromosome
col2: position
col3: reference character
col4: allelels of the SNP, in the foramt "major-allele/minor-allele"
col5: a space separated list of genotypes. The first genotype refers to the first specimen, the second to the second specimen and so on..

Note: the popualtion size will be infered from col5, i.e: the population size equals the number of provided genotypes

Note: haplotypes will be inferred from the order of the alleles in the genotpes, i.e.: the first allele always referes to the first haplotype and consequently the second to the second haplotype. For example the first specimen at chromosome 2L has the two haplotypes {{AGTT}} & {{GGTC}}

Inversions (--inversions-g0)

The inversion frequencies at generation zero of the simulations. The file consists of two parts, the first part is the definition of the inversions and the second part indicate the absence/presence of the inversions for each specimen. For example:

> Inversion definition
In2 = 3R:1023334-11454555
InP = 3R:14000232-18000455
In3 = 2L:1200000-1500000
> Population description
1     In2,InP:In2
2
3     In3:InP,In3
4     In2:InP
5     In2,InP:In2,InP
6     In3,In2,InP:In3,In2,InP
8     -:-
9     InP:-

The two parts of the inversion file are initiated by the symbol {{>}}. The first section will always be treated as {{Inversion definition}} and the second as definition of the Population

Section One: Inversion definition

Definition of the inversions must be in the format: {{name=chr:start-end}} where space will be ignored.

name: the name of the inversion, only use alphanumeric characters and numbers
chr: the chromosome of the inversion, only use alphanumeric characters
start: the start position of the inversion, only use digits
end: the end position of the inversion, only use digits

Section Two: Population descritpion

For every specimen the presence of the previously defined inversions need to be provided. If nothing is provided the default of absence of inversions will be assumed. This section consists of two tab-separted columns:

col1: the number of specimen. This number refers to the ordering in the haplotype file (--haplotype-g0), e.g.: the specimen with number one will refer to the first specimen in the haplotype file (in col5 the first row).
col2: a list of inversions present in the specimen. All separted list of inversions being present in the specimen; As diploid specimen will be simulated it is nesessary to provide two inversion 'haplotypes', which are separated by the symbol ':'; The symbols '-' signifies absence of a inversion, which is the default state (thus '-:-' is redundant). The default state also applies to individuals not provided in the population description (e.g.: 7)

*Note Every inversion used in the {{population description}} must have been previously defined in {{Inversion definition}}. Note Inversions within inversions are not allowed. Thus overlapping inversions will result in an error. Note* The ordering of the haplotypes is not considered. For every specimen a single inversion genotype should be specified. However in case multiple definitions are present for the same specimen (eg number 8), only the last definition will be used. A inversion genotype may also be entirely missing, in which case MimicrEE assumes no inversions.

Recombination rate (--recombination-rate)

A file of the following format can be obtained from: http://petrov.stanford.edu/cgi-bin/recombination-rates_updateR5.pl

2L:0..100000            0.00            0.00            0.00
2L:100000..200000       0.00            0.00            0.00
2L:200000..300000       0.00            0.00            1.89

col1: genomic locus in the {{form chromosome:start..end}}
col2: recombination rate at the beginning of the window
col3: recombination rate in the middle of the window
col4: recombination rate in the end of the window

Note: Mimicree will only read the recombination rate in the middle of the window and than assume an uniform recombination rate for the window. However, to increase the resolution you may decrease the window size. Note: MimicrEE operates with 1/2 the recombination rate provided in the above file, as males do not recombine. (MimicrEE just operates with hermaphrodites)

Chromosome definition (--chromosome-definition)

This parameter is necessary to accurately simulate the recombination landscape. In nature all chromosomes show random assortment, however to simplify analysis of the data some chromosomes have been artificially split. For example chromosome {{2}} of //D. melanogaster// has been split into {{2L}} and {{2R}}

Chromosome definitions have to be provided in the following format: {{new_chr=chr_part_1+chr_part_2,newchr=chr_part1+chr_part_2,...}} For example: {{2=2L+2R,3=3L+3R}}.

new_chr: name of the novel chromosome, ie: the full length entity
chr_part_1: name of the first part
chr_part_2: name of the second part
chr_part_3: name of the third part and so on

Note: It is essential that you provide the chromosome parts in the correct order. For example {{2=2R+2L}} would be wrong, this would concatenate the telomeric region of {{2R}} with the centromeric region of {{2L}}.

Note: It is possible to provide more than two parts of chromosomes eg: {{3=3L+3M+3R}}

Additive fitness coefficients (--additive)

Additive fitness effects of the genotypes of SNPs will be modeled as relative viabilities ({{w}}). Viability can be imagined as the probability of survival. The additive effects are based on the following equations

w_{11} = 1

w_{12} = 1 + hs

w_{22} = 1 + s

Where {{s}} is the selection coefficient, {{h}} the heterozygous effect and {{w}} the relative viabilities (of homozygous or heterozygous individuals). This information needs to be provided for non-neutral SNPs in the following format

2L     12     A     0.5     0.5
2L     17     C     1.0     0.0
2L     22     C     -0.5     0.5
2R     55     T     -0.2     -1.2

col1: chromosome
col2: position
col3: nucleotide of which the homozygous state constitutes {{w11}}
col4: selection coefficient ({{s}})
col5: heterozygous effect ({{h}})

Note The selection coeffiecient must be between -1 and +infinite, whereas the heterozygous effect may be larger or smaller than one.

Note The nucleotide provided in {{col3}} must be either the major or the minor allele provided in {{--haplotypes-g0}}. This will be strictly enforced.

Note Per default a selection coefficient of s=0 will be assumed for all SNPs not present in this file

Heterozygous effect in detail

~ h
~ effect

0
A1 is dominant

1
A2 is dominant

-1
underdominance

0<h<1
intermediate effect

2
overdominance

Epistasis (--epistasis)

Similar to the additive effects a selection coefficient ({{s}}) needs to be specified for epistatic effects. The following definition only allow to specify synergistic and antagonistic epistasis. In a two-locus interaction this type of epistasis would result in a phenotypic ratio of 9:7, other phenotypic ratios than this can not be modeled.

w_e=1+s

Following an example of an input file for epistatic effects

>e1:-0.2
2L     102     T
2L     117     C
>epistatic_effect2:0.9
2R     1024     T
2R     8988     G
3L     7811     A

For every epistatic effect the file has two parts, similar to a fasta sequence. First the definition of the epsistatic effect starting with a {{>}} and second the individual SNPs contributing to the epistatic effect.

Syntax for the definition

The definition line of an epistatic effect has the following syntax {{>name:s}}

name: assign some name or number to the epistatic effect
s: selection coefficient; just like for additive SNPs. If the epistatic effect is found the specimen gets a fitness of {{1-s}} assigned, otherwise a fitness of {{1}} will be assigned. Note that a negative {{s}} is thus actually a positive fitness effect.

Syntax for a SNP

The definition of a SNP contributing to an epistatic effect has three columns:

col1: chromosome
col2: position
col3: allele contributing to the epistatic effect

Scoring

An epistatic effect will be assumed to be present when the alleles of the epistatic SNPs are present at least once in a diploid genome. This epistatic effects are thus assumed to be dominant. Using this definition of epistasis it is possible to model synergistic and antagonistic epistasis.

Threads (--threads)

Number of replicate runs that should be processed in parallele; Memory consumption increases linearly with increasing number of replicate runs. Furthermore setting threads >1 when only a single replicate run has been chosen will not result in an increased processing speed.

Output mode (--output-mode)

May either be:

a coma separated list of output generations, e.g.: 10,20,30,40,50,60,70,80,90,100; MimicrEE will save the haplotypes and related information at these generations
or the term 'fixselected' followed by a number, e.g.: fixselected10; MimicrEE will proceed with the simulations until all selected SNPs (additive or epistatic) are fixed and the output will be stored all 'n' (eg 10) generations. The last generation will be stored as well.

Details about the algorithm

Fitness function

Fitness is calculated as the result of additive effects and epistatic effects

The additive fitness effect is

w_a=\prod_{i}w_{ai}

The epistatic fitness effect is treated as an additional layer on top of the additive effects

w_e=\prod_{i}w_{ei}

the fitness in total is calculated as

w=w_{a}*w_{e}

Mating function

The fitness will directly reflect the probability of mating. There is a linear relationship between the probability of mating and the fitness. Self-matings are excluded and all individuals are hermaphrodites, i.e.: they may mate with any other individual.

Recombination function

Note: MimicrEE operates with 1/2 the recombination rate provided in the file. The recombination rate in D.mel is for females. Males are not recombining. In MimicrEE we are just simulating hermaphrodites. The issue of no recombination in males is thus addressed by multiplying the recombination rate by 1/2.

The Kosambi function will be used to calculate 'r' (recombinaton fraction) for a given window. Online resources usually provide the recombination rate in centi-Morgen per million base pairs (cM/Mb). It is thus necessary to translate this recombination fraction per million base pairs into the recombination fraction per window (of the given size). Note that simply dividing the recombination rate per Mb by the number of windows per Mb will not work as this approach does not consider double cross overs. The conversion from 'rMb' to 'rwin' is done in three steps.

the map distance per Mb is calculated (dMb) with the inverse of the mapping function (Kosambi function)
the map distance per window is calculated
the recombination fraction per window is calculated with the mapping function (Kosambi function)

d_{Mb}=\frac{1}{4}*ln( \frac{1+2r_{Mb}}{1-2r_{Mb}})

d_{win}=\frac{d_{Mb}}{c_{win}}

r_{win}=\frac{e^{4d_{win}}-1}{2*(e^{4d_{win}}+1)}
dMb: map distance per mega base pair (Mb)
rMb: recombination fraction per Mb
cwin: number of windows per Mp (=1,000,000/windowsize)
dwin: map distance for one window
rwin: recombination fraction for one window

We tested this approach with simulations. The following table lists the number of recombination events for a window of the given size with the given recombination rate. 10,000 simulations have been performed. Note that 1cM corresponds to 1% of recombinants, which is 100 for 10,000 simulations.

~ win.size
~ rec.rate (cM/Mbp)
~ recombination events

1000k
0
0

1000k
4
403

100k
4
43

1000k
20
2018

100k
20
217

10k
20
19

1000k
40
3968

1000k
50
0! error

Note: max. recombinaton rate is 45 cM/Mbp!

Wiki: Manual