Menu

Tutorial

Anonymous Robert Kofler
  • Introduction
  • Requirements
  • Data
  • Walkthrough
    • Prepare the reference genome
    • Map the reads to the reference genome
    • Remove ambiguously mapped reads
    • Create a synchronized file
    • Calculate allele frequency differences
    • Fst-values: measure differentiation between populations
    • Calculate Fst for every SNP
    • Calculate Fst values using a sliding window approach
    • Load Fst-values into IGV
    • Fisher's Exact Test: estimate the significance of allele frequency differences
    • Load the Fisher's exact test results into the IGV
    • Cochran-Mantel-Haenszel test: detect consistent allele frequency changes in several biological replicates
    • Display the cmh-test results in the IGV
    • Calculate Fst for genes
  • After PoPoolation2

Introduction

PoPoolation2 facilitates comparison of allele frequencies between different populations from pooled next generation sequencing data. In this walkthrough the main functionality of PoPoolation2 will be demonstrated by comparing the allele frequencies of two populations.

The sample data set used in this walkthrough consists of simulated fastq reads derived from the first 1.4 Mbp of chromosome 2R of D.melanogaster. Two pooled populations have been simulated each having a coverage of 100. In total 10.000 SNPs, located in unambiguous regions, have been introduced. We simulated different allele frequencies for these SNPs in the two populations, with differences ranging from 0.1 to 0.9. We also used a minimum distance of 100 bp between two successive SNPs. The simulated reads have a length of 75 bp and a sequencing error rate of 1%.

Requirements

Data

The sample data can be downloaded here: https://sourceforge.net/projects/popoolation2/files/demo-data.zip

Subsequently unzip the archive.

Walkthrough

Prepare the reference genome

Open the command line and change directory to the unziped folder containing the sample data of PoPoolation2

mkdir ref
mv 2R.chr ref
bwa index ref/2R.chr

Map the reads to the reference genome

mkdir map
bwa aln -n 0.01 -l 100 -o 1 -d 12 -e 12 -t 8 ref/2R.chr pop1.fastq > map/pop1.sai
bwa aln -n 0.01 -l 100 -o 1 -d 12 -e 12 -t 8 ref/2R.chr pop2.fastq > map/pop2.sai
bwa samse ref/2R.chr map/pop1.sai pop1.fastq > map/pop1.sam
bwa samse ref/2R.chr map/pop2.sai pop2.fastq > map/pop2.sam

Remove ambiguously mapped reads

samtools view -q 20 -bS map/pop1.sam | samtools sort - map/pop1
samtools view -q 20 -bS map/pop2.sam | samtools sort - map/pop2

Create a synchronized file

Synchronized files are the main input files for PoPoolation2. They basically contain the allele frequencies for every population at every base in the reference genome in a concise format. Note that the synchronized file format contains the allele frequencies after filtering for base quality.

samtools mpileup -B map/pop1.bam map/pop2.bam > p1_p2.mpileup
perl <popoolation2-path>/mpileup2sync.pl --fastq-type sanger --min-qual 20 --input p1_p2.mpileup --output p1_p2.sync

Synchronizing the mpileup file is quite time consuming. To remove this bottleneck we implemented 'mpileup2sync' in Java multi-threading which is about 78x faster as the implementation in perl.

java -ea -Xmx7g -jar <popoolation2-path>/mpileup2sync.jar --input p1_p2.mpileup --output p1_p2_java.sync --fastq-type sanger --min-qual 20 --threads 8

Sample of a synchronized file:

2R  2302    N   0:7:0:0:0:0 0:7:0:0:0:0
2R  2303    N   0:8:0:0:0:0 0:8:0:0:0:0
2R  2304    N   0:0:9:0:0:0 0:0:9:0:0:0
2R  2305    N   1:0:9:0:0:0 0:0:9:1:0:0
  • col1: reference contig
  • col2: position within the refernce contig
  • col3: reference character
  • col4: allele frequencies of population number 1
  • col5: allele frequencies of population number 2
  • coln: allele frequencies of population number n

The allele frequencies are in the format A:T:C:G:N:del, i.e: count of bases 'A', count of bases 'T',... and deletion count in the end (character '*' in the mpileup)

Calculate allele frequency differences

The exact allele frequency differences may be computed as follows:

perl <popoolation2-path>/snp-frequency-diff.pl --input p1_p2.sync --output-prefix p1_p2 --min-count 6 --min-coverage 50 --max-coverage 200

This script creates two output files having two different extensions:

  • _rc: this file contains the major and minor alleles for every SNP in a concise format
  • _pwc: this file contains the differences in allele frequencies for every pairwise comparision of the populations present in the synchronized file
    For details see the man pages of the script

The allele frequency differences can be found in the _pwc file, a small sample:

##chr   pos     rc      allele_count    allele_states   deletion_sum    snp_type        most_variable_allele    diff:1-2
2R      4459    N       2       C/T     0       pop     T       0.133
2R      9728    N       2       T/C     0       pop     T       0.116

The last column contains the obtained differences in allele frequencies for the allele provided in column 8. Note that in this example the last column refers to a pairwise comparision between population 1 vs 2, in case several populations are provided all pairwise comparisions will be appended in additional columns. For a comparison of the observed allele frequency differences with the expected ones please see: Validation

Fst-values: measure differentiation between populations

Calculate Fst for every SNP

perl <popoolation2-path>/fst-sliding.pl --input p1_p2.sync --output p1_p2.fst --suppress-noninformative --min-count 6 --min-coverage 50 --max-coverage 200 --min-covered-fraction 1 --window-size 1 --step-size 1 --pool-size 500

the option --suppress-noninformative suppresses output for windows not containing a SNP. When applied to windows of size one, this option suppresses output for bases that are no SNP.

For a validation of the Fst values please see Validation

Calculate Fst values using a sliding window approach

perl <popoolation2-path>/fst-sliding.pl --input p1_p2.sync --output p1_p2_w500.fst --min-count 6 --min-coverage 50 --max-coverage 200 --min-covered-fraction 1 --window-size 500 --step-size 500 --pool-size 500

Load Fst-values into IGV

First index the two bam files

samtools index map/pop1.bam
samtools index map/pop2.bam

Than convert the file containing the Fst-values into the .igv format

perl <popoolation2-path>/export/pwc2igv.pl --input p1_p2.fst --output p1_p2.igv

Finally load the results into the IGV

  • Start IGV
  • Import Genome -> Load the fasta file: 2R.fasta
  • Load the two .bam files
  • Load the .igv file containing the Fst values

Voila, the result should look something like this:

Note that the result shown above only contain the Fst for a single pairwise comparison (pop1 vs pop2). PoPoolation2 is per default calculating the Fst for all possible pairwise comparisons between populations (when several populations are present in the synchronized file), and these results may just as easily be converted into the .igv format. When loaded into IGV it may look like the following:

Fisher's Exact Test: estimate the significance of allele frequency differences

The Fishers exact test can be used to test whether any differences in allele frequencies are statistically significant. At low coverages the absolute changes of allele frequencies or the Fst values may be strongly influenced by sampling effects, therefore the Fishers exact test may be used to identify significant changes in allele frequency.

perl <popoolation2-path>/fisher-test.pl --input p1_p2.sync --output p1_p2.fet --min-count 6 --min-coverage 50 --max-coverage 200 --suppress-noninformative

Note that this script provides -log10(P-value) as output.

Load the Fisher's exact test results into the IGV

First convert the Fisher's exact test results into the .igv file format:

perl <popoolation2-path>/export/pwc2igv.pl --input p1_p2.fet --output p1_p2_fet.igv

Than start the IGV, load the .bam alignments and the file: p1_p2_fet.igv. The result should be very similar as in the following example:

Cochran-Mantel-Haenszel test: detect consistent allele frequency changes in several biological replicates

The cmh-test (http://stat.ethz.ch/R-manual/R-devel/library/stats/html/mantelhaen.test.html) is used for detecting significant and consistent changes in allele frequencies when independent measurements of the allele frequencies have been obtained (e.g.: biological replicates). For example when you are interested in the SNPs responsible for plant height, you may obtain several measurements of the allele frequencies of tall and small plants from different ecological regions. Using the cmh-test you may want to detect consistent differences (consistent between small and tall) in allele frequency for the following comparisions: small_alaska vs tall_alaska + small_norway vs tall_norway + small_sibiria vs tall_sibiria

In this walkthrough it is necessary to create a data set having a biological replicate. We are just duplicating the last two columns of the synchronized file used above:

mkdir cmh
cat p1_p2.sync|awk 'BEGIN{OFS="\t"}{print $0,$4,$5}' > cmh/p1_p2_p1_p2.sync

Here is an example of the result:

2R  2298    N   3:0:0:0:0:0 3:0:0:0:0:0 3:0:0:0:0:0 3:0:0:0:0:0
2R  2299    N   0:4:0:0:0:0 0:4:0:0:0:0 0:4:0:0:0:0 0:4:0:0:0:0
2R  2300    N   5:0:0:0:0:0 5:0:0:0:0:0 5:0:0:0:0:0 5:0:0:0:0:0

Compare the allele frequencies of popualtion 1 with population 2 and of population 3 with population 4 using the cmh-test.

perl <popoolation2-path>/cmh-test.pl --input p1_p2_p1_p2.sync --output p1_p2_p1_p2.cmh --min-count 12 --min-coverage 50 --max-coverage 200 --population 1-2,3-4

Note that it is necessary to double the --min-count as we also doubled the number of populations. For a validation of the cmh-test please see Validation

Display the cmh-test results in the IGV

First convert the results of the cmh-test into the .gwas format of IGV.

perl <popoolation2-path>/export/cmh2gwas.pl --input p1_p2_p1_p2.cmh --output p1_p2_p1_p2.gwas --min-pvalue 1.0e-20

Note that IGV has some problems displaying very low p-values, thus the option --min-pvalue was used introduced as threshold. P-values smaller than the threshold will be set to the value of the threshold.

Open the IGV as described before and load the file p1_p2_p1_p2.gwas. The results should be similar to the following example

Calculate Fst for genes

Download the annotated exons for the first 1mio bp of chromosome 2R http://popoolation2.googlecode.com/files/2R_exons.gtf

Convert the synchronized file into a gene-based synchronized file

perl <popoolation2-path>/create-genewise-sync.pl --input p1_p2.sync --gtf 2R_exons.gtf --output p1_p2_genes.sync

Calculate the Fst for every gene:

perl <popoolation2-path>/fst-sliding.pl --min-count 6 --min-coverage 50 --max-coverage 200 --pool-size 500 --min-covered-fraction 0.0 --window-size 1000000 --step-size 1000000 --input p1_p2_genes.sync --output p1_p2_genewise.fst

The output should be similar to the following example:

CG17683-RB  500000  17  0.002   100.0   1:2=0.01349929
CG40127-RA  500000  5   0.001   99.7    1:2=0.01897121
CG17665-RB  500000  27  0.003   99.9    1:2=0.03149757

The output above indicates that gene CG17683-RB has an Fst of 0.01349929, gene CG40127-RA an Fst of 0.01897121 and so on. Of course this analysis may also be conducted with more than two populations, in which case again all pairwise comparisons will be calculated.

Similarly the fisher's exact test may be performed for every gene, or the CMH-test for the SNPs of every gene.

Note: this analysis may be performed for any feature (intron, cds etc) as long as it is present in the gtf file and has a unique gene_id

After PoPoolation2

What to do now? You have successfully identified several hundreds SNPs being associated with a trait of interest (or environment) but what is their biological meaning? You may use Gowinda to test whether your SNPs show an enrichment for any GO category (or any gene set): http://code.google.com/p/gowinda/


Related

Wiki: Home
Wiki: Main
Wiki: Validation

Discussion

  • Anonymous

    Anonymous - 2013-03-07

    Originally posted by: perrica...@hotmail.com

    Hi, I have just tried to convert the synchronized file into a gene-based synchronized file, and the command runs, but the output file is empty. Any ideas? The input sync file works fine with other commands. Perhaps it is the gtf file, but looking at the example gft file on this page it looks to be fine. Thanks,

     
  • Anonymous

    Anonymous - 2013-04-30

    Originally posted by: braud.ma...@gmail.com

    There is a problem of column in the demo files. The command: samtools view -q 20 -bS map/pop1.sam | samtools sort - map/pop1 does not manage to create the bam file because of the value XA:Z: in the sam file, the error is: Parse error at line 36: missing colon in auxiliary data

    What is the problem?

     
  • Anonymous

    Anonymous - 2013-05-06

    Originally posted by: abh...@gmail.com

    This is a great tool! I wanted to know if there is any way to get the SNPs in "_rc file" in VCF format? to be used in outlier detecting programs _

     
  • Anonymous

    Anonymous - 2014-05-06

    Originally posted by: elzedliu

    I am wondering if you guys can let me know a way to prepare input file for bayescan after popoolation?

     
  • Anonymous

    Anonymous - 2014-12-02

    Originally posted by: alicebde...@gmail.com

    Hi, I'm trying to use the cmh-test above, but should something in the cmd below instead read p1_p2_p3_p4? I'm confused where the second half of the data comes into this!

    perl <popoolation2-path>/cmh-test.pl --input p1_p2_p1_p2.sync --output p1_p2_p1_p2.cmh --min-count 12 --min-coverage 50 --max-coverage 200 --population 1-2,3-4

    Thanks

     
  • Anonymous

    Anonymous - 2015-01-02

    Originally posted by: amritaya...@gmail.com

    hello, I wanted to know if there is any way to get the SNPs in rc file in VCF format?

     
  • Anonymous

    Anonymous - 2015-01-05

    Originally posted by: chengjie...@gmail.com

    Hi,I am trying the fst for SNPs between 2 populaitons using the fst-sliding.pl . I am bit confused about the "min-coverage", for example, if I have 2 populations and set the min-coverage of 5, is that means 5 reads are required for each population at this SNP, or 10 reads for the 2 populaitons but can be 8:2, 7:3 ....for each?

    Thank you!

     
  • Anonymous

    Anonymous - 2015-02-10

    Originally posted by: beatrizj...@gmail.com

    Hi everyone, Fisrt, thanks for the tool, it has been a great finding for my research. I'm trying to understand how fst_slinding.pl calculates the FST value. In the --help it says how it calculates, but I try to do it by myself and I have been not able to do it. I think it makes a correction by pool-size, but I don't know how? I'm using the script with an step size of 1. Thanks

     
  • Karol Cichewicz

    Karol Cichewicz - 2017-01-25

    Hi everyone. I'm trying to calculate fst and Fisher's p-values. These scripts take hours to run. I let the fst script go for over 20h on my university cluster, and it couldn't even get through one chromosome of data. The fisher-test.pl takes similarly long, but it doesn't save any information in its output file. My pools/bulks are composed of 45 and 44 individuals with average read depth of ~150. Is it the expected behvior or am I doing something wrong?

    Examples of my scripts:
    perl fisher-test.pl \ --input Low_vs_High_w_ref.mpileup.sync \ --output Low_vs_High-fisher.fet \ --min-count 5 --min-coverage 50 \ --step-size 1000 --window-size 1000 \ --max-coverage 800 --suppress-noninformative

    perl fst-sliding.pl \ --input /Low_vs_High_w_ref.mpileup.sync \ --output /Low_vs_High.fst \ --min-count 10 --min-coverage 50 --max-coverage 800 --window-size 1 --step-size 1 --pool-size 45:44

     
    • Robert Kofler

      Robert Kofler - 2017-01-25

      Well thats lots of data. the two scripts fst and fet compute all pairwise comparisions and thats a lot for 45 to 44 individuals... I would try it with less individuals, and once this works scale it up to more.

       
      • Karol Cichewicz

        Karol Cichewicz - 2017-01-25

        Thank your for a suggestion, but I think I should explain it better. My data contains pooled sequencing data. Aligned reads of individuals, with separate alignment for each individual, were merged by samtools merge, then realigned using GATK, and variants were called by samtools mpileup. Thus, the sync file does not contain information about variant calling in each individual, but in two pools. How would that be different from running these computations for just two samples? Should I change the --pool-size parameter to 1:1 then? I run Fisher exact test using plink on exactly the same dataset, but preserving the information about individuals, and it was much faster.

         
  • James Reeve

    James Reeve - 2018-05-29

    I'm trying to access the data for the tutorial, however the link appears to be dead. I've tried accessing it with two seperate browsers. How do I access the data?

     
  • David Rinker

    David Rinker - 2019-03-20

    Really appreciate having this software availalbe! Currently I'm using it for BSA in fruitfly.

    In computing Fst I'm confused as to what "pool-size" means. I've seen it mentioned as referring to both individuals and to number of chromosomes (and can see arguments for both). Also, what is the correct syntax for applying it when pools are unequal?--a poster above uses a ":" delimiter so is that correct? And if so am I safe in assuming that "x:y" would apply to the pools in the order in which they were passed into ".sync" file?

     
  • Maria Cortazar Chinarro

    Please, can you upload the igv png images to check whether my results look ok?
    I cannot acces to thouse pictures:

    "Voila, the result should look something like this:

    Note that the result shown above only contain the Fst for a single pairwise comparison (pop1 vs pop2). PoPoolation2 is per default calculating the Fst for all possible pairwise comparisons between populations (when several populations are present in the synchronized file), and these results may just as easily be converted into the .igv format. When loaded into IGV it may look like the following:"
    Thank you very much,

     

    Last edit: Maria Cortazar Chinarro 2021-01-03
  • Jonas A Kengne

    Jonas A Kengne - 2021-03-02

    Hi , I downloaded the online tutorial on popoopulation2 and had some problems in its execution. Could anyone assist pls?

    problem

    1. created mpileup successfully

    2. created syn file successfully

    3. fisher-test.pl fails (stops on line 9 and 10)

    Exact message from the terminal:

    poolseq_test$: perl popoolation2_1201/fisher-test.pl --input p1_p2.sync --output p1_p2.fet --min-count 6 --min-coverage 50 --max-coverage 200 --suppress-noninformative

    Can't locate Text/NSP/Measures/2D/Fisher/twotailed.pm in @INC (you may need to install the Text::NSP::Measures::2D::Fisher::twotailed module) (@INC contains: /home/Jako/Documents/poolseq_test/popoolation2_1201 /home/Jako/Documents/poolseq_test/popoolation2_1201/Modules /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .) at /home/Jako/Documents/poolseq_test/popoolation2_1201/Modules/FET.pm line 10.

    BEGIN failed--compilation aborted at /home/Jako/Documents/poolseq_test/popoolation2_1201/Modules/FET.pm line 10.

    Compilation failed in require at popoolation2_1201/fisher-test.pl line 9.

    BEGIN failed--compilation aborted at popoolation2_1201/fisher-test.pl line 9.

    Looking forward to hearing from you or any other person willing to help.

    many thanks in advance

     

    Last edit: Jonas A Kengne 2021-03-02
    • Mr Schweitzer

      Mr Schweitzer - 2021-04-04

      Hello Mr. Kengne,
      I had the same problem and I managed to get the perl script running by following this solution in the comments here:
      https://code.google.com/archive/p/popoolation2/issues/19

      I hope this helps!

       
  • Atikah

    Atikah - 2021-06-09

    Hi everyone,

    I am using Popoolation2 sofware to calculate allele frequency and Fst now. But when I calculate allele difference of my populations, it generates several locus with 1 allele count (only one allele state). Commonly SNPs is a biallele or in some cases it can contain 3 alelles or 4 alleles. And in the minimum coverage, I set it 10 but in the rc file it generates the SNP with coverage less than 10. Do you think there is an error that happening when I use the Popoolation2?
    I am really appreciate if you can help me figured out what is happening. Thank you.

     
    • Atikah

      Atikah - 2021-06-09

      I just found out that it was rc SNPs not pop SNPs type.

       
  • Hafiz Muhammad Anas

    Hello Community
    Is it possible to annotate snps from snpeff eff tool after getting variants? which file i need to annoate?

     
  • Anonymous

    Anonymous - 2023-09-13

    Hi everyone,

    I noticed that when converting an mpileup file to a sync file, positions on a genome are shifted.
    For example, position 10,923,271 in a reference genome used for mapping was shifted to 10,923,753 in a sync file.
    Does anyone know how to resolve this error?

     
  • Lautaro Ezequiel Bennardo

    Hello everyone,
    I am using popoolation2 to perform the CMH test between 4 pairs of populations. The issue is that it's giving me very high p-values in general (very low actually), less than 10e-60. Does anyone know how I should choose the p-value cutoff? Or if there's any correction I should apply to these values? Thank you!

     
    • Mr Schweitzer

      Mr Schweitzer - 2023-11-07

      Hello Mr. Bennardo,
      I think the two most common ways of adressing multiple testing in GWAS-like data is either using a Bonferroni corrected p-value threshold or by using a False Discovery Rate (FDR) approach. The documentation for the R package "qvalue" is very well written:
      https://www.bioconductor.org/packages/release/bioc/manuals/qvalue/man/qvalue.pdf

       
      👍
      1

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.