To check whether drift is correctly simulated in MimicrEE2 we performed simulations with 10,000 neutrally evolving loci.
Every locus was simulated on a distinct chromosome, hence all loci are unlinked. The starting allele frequency of all loci was 0.5
Simulations were performed for 50 generations (first figure) and 120 generations (second figure) using a population size of N=250. In the following figures the green histogram shows the observed allele frequencies and the the dark, dashed line to theoretical expectations using a Markov chain approach.
This demonstrates that drift is accurately simulated with MimicrEE2
As expected the allele frequencies diverge as we proceed forward in time and the proportion of fixed/lost alleles is increasing.
The following pipeline was used to make the plots:
First simulations with MimicrEE2 were performed (all input files can be found here):
For 50 generations (first plot):
:::bash
java -jar mim2.jar qt --haplotypes-g0 250inds_basepop_10000loci --recombination-rate recombination_10000chrom --effect-size loci_effect_size --ve 0.00 --snapshots 50 --replicate-runs 1 --output-sync simulations_drift_50gens --selection-regime selection_regime --threads 6
For 120 generations (second plot):
:::bash
java -jar mim2.jar qt --haplotypes-g0 250inds_basepop_10000loci --recombination-rate recombination_10000chrom --effect-size loci_effect_size --ve 0.00 --snapshots 120 --replicate-runs 1 --output-sync simulations_drift_120gens --selection-regime selection_regime --threads 6
Then the following bash script was used to convert the output sync file to allele count proportions in the population (one sync file at a time).
Copy and paste the following commands in a file with .sh suffix and run it using bash.
:::bash
#!/bash/bin
##input file is the sync output file from mimicrEE2 (in zipped format)
sync_file="$1"
##output file is the proportion of allele counts in the population (only the frequency changes of one of the two alleles is measured)
outfile="$2"
gzip -dc "$sync_file" | cut -f 5 -d $'\t' | cut -f 1 -d ":" | sort | uniq -c | sort -n -k2 | awk '{print $1/10000, $2}' > "$outfile"
bash myfile.sh inputfile outputfile
The R script used to compute drift expectations using a Markov chain model
Note: an error will be returned if:
(in order to make the comparison the output file from the bash script above has to be provided):
:::R
library(ggplot2)
##function for estimating probabilty matrix
drift.prob.nexttgen<- function(proportion=c(), popsize){
copies_A<-popsize
copies_a<-popsize
total_gametes=copies_A+copies_a
#initialize matrix
probabilities_matr<- matrix(NA,nrow=total_gametes+1, ncol=total_gametes+1)
probabilities_matr[1,1]<- 1
probabilities_matr[nrow(probabilities_matr),ncol(probabilities_matr)]<-1
probabilities=c()
#run binomial distribution to estimate probablities for every possible change
for (j in 1:total_gametes-1){
p=as.numeric(j/total_gametes)
q=1-p
if(p!=0){
for (y in 0:total_gametes){
prob<-choose(total_gametes,y) * p^y * q^(total_gametes-y)
probabilities[y+1]<- abs(prob)
}
}
if(length(probabilities)>0){
probabilities_matr[j+1,]<-probabilities
}
}
napos2<- which(is.na(probabilities_matr))
probabilities_matr[napos2]<-0
return(probabilities_matr)
}
##run function for estimating probabilities for xxx population size and xxx generations
## make sure that the pop_size is less than or equal to 500, otherwise NAs will be returned
pop_size=250
total_allele_counts=2*pop_size
generations=50
allele_counts=total_allele_counts/2 ##e.g if pop_size=300 (total alleles=600) and denominator=10, then allele_counts=60
print(allele_counts)
allele_freq=allele_counts/total_allele_counts ##then allele_freq=60/600=0.1
print(paste("Initial frequency of A allele is", allele_freq))
##distribution is polarizes for the rising allele
m<-t(m<-drift.prob.nexttgen(c(rep(0, times=allele_counts), 1,rep(0, times=(total_allele_counts)-allele_counts)), pop_size))
v<-c(rep(0, times=allele_counts), 1,rep(0, times=(total_allele_counts)-allele_counts))
for(i in 1:generations)
{
v<-c(abs(m%*%v))
}
###plot drift distribution
drift_distr<- matrix(NA, ncol=2, nrow=length(v))
drift_distr[,1]<- v
drift_distr[,2]<- c(0:(length(v)-1))
colnames(drift_distr)<- c("Proportion", "Counts")
drift_distr<- as.data.frame(drift_distr)
##Load mimicrEE2 simulations for xx generations
setwd("/my/working/dir")
mimicree_counts<- read.table("file_output_from_bash_script")
colnames(mimicree_counts)<- c("Proportion", "Counts")
ggplot(mimicree_counts, aes(x=Counts, y=Proportion)) + geom_bar(stat="identity", color="forestgreen") + geom_line(data=drift_distr, linetype="dashed", size=1.1, color="grey6") + theme_bw()