Menu

DriftDistribution

Robert Kofler Christos Vlachos
Attachments
drift_120gens_4.png (19548 bytes)
drift_50gens_3.png (15928 bytes)

Validation for Random Genetic Drift

To check whether drift is correctly simulated in MimicrEE2 we performed simulations with 10,000 neutrally evolving loci.
Every locus was simulated on a distinct chromosome, hence all loci are unlinked. The starting allele frequency of all loci was 0.5

Simulations were performed for 50 generations (first figure) and 120 generations (second figure) using a population size of N=250. In the following figures the green histogram shows the observed allele frequencies and the the dark, dashed line to theoretical expectations using a Markov chain approach.


This demonstrates that drift is accurately simulated with MimicrEE2

As expected the allele frequencies diverge as we proceed forward in time and the proportion of fixed/lost alleles is increasing.

Materials and methods

The following pipeline was used to make the plots:

First simulations with MimicrEE2 were performed (all input files can be found here):

For 50 generations (first plot):

:::bash

java -jar mim2.jar qt --haplotypes-g0 250inds_basepop_10000loci --recombination-rate recombination_10000chrom --effect-size loci_effect_size --ve 0.00 --snapshots 50 --replicate-runs 1 --output-sync simulations_drift_50gens --selection-regime selection_regime --threads 6

For 120 generations (second plot):

:::bash
java -jar mim2.jar qt --haplotypes-g0 250inds_basepop_10000loci --recombination-rate recombination_10000chrom --effect-size loci_effect_size --ve 0.00 --snapshots 120 --replicate-runs 1 --output-sync simulations_drift_120gens --selection-regime selection_regime --threads 6

Then the following bash script was used to convert the output sync file to allele count proportions in the population (one sync file at a time).

Copy and paste the following commands in a file with .sh suffix and run it using bash.

:::bash
#!/bash/bin
##input file is the sync output file from mimicrEE2 (in zipped format)
sync_file="$1"

##output file is the proportion of allele counts in the population (only the frequency changes of one of the two alleles is measured)
outfile="$2"

gzip -dc "$sync_file" | cut -f 5 -d $'\t' | cut -f 1 -d ":" | sort | uniq -c |  sort -n -k2 | awk '{print $1/10000, $2}' > "$outfile"


bash myfile.sh inputfile outputfile

The R script used to compute drift expectations using a Markov chain model

Note: an error will be returned if:

  • the initial allele count is not an integer --> Markov chain model cannot be applied
  • the population size is too large --> the estimated probabilities are too low and R returns infinity (inf))

(in order to make the comparison the output file from the bash script above has to be provided):

:::R
library(ggplot2)


##function for estimating probabilty matrix
drift.prob.nexttgen<- function(proportion=c(), popsize){

  copies_A<-popsize
  copies_a<-popsize
  total_gametes=copies_A+copies_a

  #initialize matrix
   probabilities_matr<- matrix(NA,nrow=total_gametes+1, ncol=total_gametes+1)
   probabilities_matr[1,1]<- 1
   probabilities_matr[nrow(probabilities_matr),ncol(probabilities_matr)]<-1


   probabilities=c()

  #run binomial distribution to estimate probablities for every possible change
    for (j in 1:total_gametes-1){
      p=as.numeric(j/total_gametes)
      q=1-p

        if(p!=0){
      for (y in 0:total_gametes){


        prob<-choose(total_gametes,y) * p^y * q^(total_gametes-y)
        probabilities[y+1]<- abs(prob)

      }
        }
      if(length(probabilities)>0){
      probabilities_matr[j+1,]<-probabilities

      }


    }

  napos2<- which(is.na(probabilities_matr))
  probabilities_matr[napos2]<-0

  return(probabilities_matr)


    }


##run function for estimating probabilities for xxx population size and xxx generations
## make sure that the pop_size is less than or equal to 500, otherwise NAs will be returned

pop_size=250
total_allele_counts=2*pop_size
generations=50

allele_counts=total_allele_counts/2 ##e.g if pop_size=300 (total alleles=600) and denominator=10, then allele_counts=60
print(allele_counts)
allele_freq=allele_counts/total_allele_counts  ##then allele_freq=60/600=0.1
print(paste("Initial frequency of A allele is", allele_freq))

##distribution is polarizes for the rising allele
m<-t(m<-drift.prob.nexttgen(c(rep(0, times=allele_counts), 1,rep(0, times=(total_allele_counts)-allele_counts)), pop_size))
v<-c(rep(0, times=allele_counts), 1,rep(0, times=(total_allele_counts)-allele_counts))
for(i in 1:generations)
{
  v<-c(abs(m%*%v))
}

###plot drift distribution

drift_distr<- matrix(NA, ncol=2, nrow=length(v))
drift_distr[,1]<- v
drift_distr[,2]<- c(0:(length(v)-1))
colnames(drift_distr)<- c("Proportion", "Counts")
drift_distr<- as.data.frame(drift_distr)




##Load mimicrEE2 simulations for xx generations

setwd("/my/working/dir")
mimicree_counts<- read.table("file_output_from_bash_script")

colnames(mimicree_counts)<- c("Proportion", "Counts")

ggplot(mimicree_counts, aes(x=Counts, y=Proportion)) + geom_bar(stat="identity", color="forestgreen") + geom_line(data=drift_distr, linetype="dashed", size=1.1, color="grey6") + theme_bw()

Related

Wiki: Home

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.