PoPoolation2 facilitates comparison of allele frequencies between different populations from pooled next generation sequencing data. In this walkthrough the main functionality of PoPoolation2 will be demonstrated by comparing the allele frequencies of two populations.
The sample data set used in this walkthrough consists of simulated fastq
reads derived from the first 1.4 Mbp of chromosome 2R of D.melanogaster. Two pooled populations have been simulated each having a coverage of 100. In total 10.000 SNPs, located in unambiguous regions, have been introduced. We simulated different allele frequencies for these SNPs in the two populations, with differences ranging from 0.1 to 0.9. We also used a minimum distance of 100 bp between two successive SNPs. The simulated reads have a length of 75 bp and a sequencing error rate of 1%.
The sample data can be downloaded here: https://sourceforge.net/projects/popoolation2/files/demo-data.zip
Subsequently unzip the archive.
Open the command line and change directory to the unziped folder containing the sample data of PoPoolation2
mkdir ref mv 2R.chr ref bwa index ref/2R.chr
mkdir map bwa aln -n 0.01 -l 100 -o 1 -d 12 -e 12 -t 8 ref/2R.chr pop1.fastq > map/pop1.sai bwa aln -n 0.01 -l 100 -o 1 -d 12 -e 12 -t 8 ref/2R.chr pop2.fastq > map/pop2.sai bwa samse ref/2R.chr map/pop1.sai pop1.fastq > map/pop1.sam bwa samse ref/2R.chr map/pop2.sai pop2.fastq > map/pop2.sam
samtools view -q 20 -bS map/pop1.sam | samtools sort - map/pop1 samtools view -q 20 -bS map/pop2.sam | samtools sort - map/pop2
Synchronized files are the main input files for PoPoolation2. They basically contain the allele frequencies for every population at every base in the reference genome in a concise format. Note that the synchronized file format contains the allele frequencies after filtering for base quality.
samtools mpileup -B map/pop1.bam map/pop2.bam > p1_p2.mpileup perl <popoolation2-path>/mpileup2sync.pl --fastq-type sanger --min-qual 20 --input p1_p2.mpileup --output p1_p2.sync
Synchronizing the mpileup file is quite time consuming. To remove this bottleneck we implemented 'mpileup2sync' in Java multi-threading which is about 78x faster as the implementation in perl.
java -ea -Xmx7g -jar <popoolation2-path>/mpileup2sync.jar --input p1_p2.mpileup --output p1_p2_java.sync --fastq-type sanger --min-qual 20 --threads 8
Sample of a synchronized file:
2R 2302 N 0:7:0:0:0:0 0:7:0:0:0:0 2R 2303 N 0:8:0:0:0:0 0:8:0:0:0:0 2R 2304 N 0:0:9:0:0:0 0:0:9:0:0:0 2R 2305 N 1:0:9:0:0:0 0:0:9:1:0:0
The allele frequencies are in the format A:T:C:G:N:del, i.e: count of bases 'A', count of bases 'T',... and deletion count in the end (character '*
' in the mpileup)
The exact allele frequency differences may be computed as follows:
perl <popoolation2-path>/snp-frequency-diff.pl --input p1_p2.sync --output-prefix p1_p2 --min-count 6 --min-coverage 50 --max-coverage 200
This script creates two output files having two different extensions:
_
rc: this file contains the major and minor alleles for every SNP in a concise format _
pwc: this file contains the differences in allele frequencies for every pairwise comparision of the populations present in the synchronized file The allele frequency differences can be found in the _
pwc file, a small sample:
##chr pos rc allele_count allele_states deletion_sum snp_type most_variable_allele diff:1-2 2R 4459 N 2 C/T 0 pop T 0.133 2R 9728 N 2 T/C 0 pop T 0.116
The last column contains the obtained differences in allele frequencies for the allele provided in column 8. Note that in this example the last column refers to a pairwise comparision between population 1 vs 2, in case several populations are provided all pairwise comparisions will be appended in additional columns. For a comparison of the observed allele frequency differences with the expected ones please see: Validation
perl <popoolation2-path>/fst-sliding.pl --input p1_p2.sync --output p1_p2.fst --suppress-noninformative --min-count 6 --min-coverage 50 --max-coverage 200 --min-covered-fraction 1 --window-size 1 --step-size 1 --pool-size 500
the option --suppress-noninformative
suppresses output for windows not containing a SNP. When applied to windows of size one, this option suppresses output for bases that are no SNP.
For a validation of the Fst values please see Validation
perl <popoolation2-path>/fst-sliding.pl --input p1_p2.sync --output p1_p2_w500.fst --min-count 6 --min-coverage 50 --max-coverage 200 --min-covered-fraction 1 --window-size 500 --step-size 500 --pool-size 500
First index the two bam files
samtools index map/pop1.bam samtools index map/pop2.bam
Than convert the file containing the Fst-values into the .igv
format
perl <popoolation2-path>/export/pwc2igv.pl --input p1_p2.fst --output p1_p2.igv
Finally load the results into the IGV
.bam
files .igv
file containing the Fst values Voila, the result should look something like this:
Note that the result shown above only contain the Fst for a single pairwise comparison (pop1 vs pop2). PoPoolation2 is per default calculating the Fst for all possible pairwise comparisons between populations (when several populations are present in the synchronized file), and these results may just as easily be converted into the .igv
format. When loaded into IGV it may look like the following:
The Fishers exact test can be used to test whether any differences in allele frequencies are statistically significant. At low coverages the absolute changes of allele frequencies or the Fst values may be strongly influenced by sampling effects, therefore the Fishers exact test may be used to identify significant changes in allele frequency.
perl <popoolation2-path>/fisher-test.pl --input p1_p2.sync --output p1_p2.fet --min-count 6 --min-coverage 50 --max-coverage 200 --suppress-noninformative
Note that this script provides -log10(P-value) as output.
First convert the Fisher's exact test results into the .igv
file format:
perl <popoolation2-path>/export/pwc2igv.pl --input p1_p2.fet --output p1_p2_fet.igv
Than start the IGV, load the .bam
alignments and the file: p1_p2_fet.igv
. The result should be very similar as in the following example:
The cmh-test (http://stat.ethz.ch/R-manual/R-devel/library/stats/html/mantelhaen.test.html) is used for detecting significant and consistent changes in allele frequencies when independent measurements of the allele frequencies have been obtained (e.g.: biological replicates). For example when you are interested in the SNPs responsible for plant height, you may obtain several measurements of the allele frequencies of tall and small plants from different ecological regions. Using the cmh-test you may want to detect consistent differences (consistent between small and tall) in allele frequency for the following comparisions: small_alaska vs tall_alaska + small_norway vs tall_norway + small_sibiria vs tall_sibiria
In this walkthrough it is necessary to create a data set having a biological replicate. We are just duplicating the last two columns of the synchronized file used above:
mkdir cmh cat p1_p2.sync|awk 'BEGIN{OFS="\t"}{print $0,$4,$5}' > cmh/p1_p2_p1_p2.sync
Here is an example of the result:
2R 2298 N 3:0:0:0:0:0 3:0:0:0:0:0 3:0:0:0:0:0 3:0:0:0:0:0 2R 2299 N 0:4:0:0:0:0 0:4:0:0:0:0 0:4:0:0:0:0 0:4:0:0:0:0 2R 2300 N 5:0:0:0:0:0 5:0:0:0:0:0 5:0:0:0:0:0 5:0:0:0:0:0
Compare the allele frequencies of popualtion 1 with population 2 and of population 3 with population 4 using the cmh-test.
perl <popoolation2-path>/cmh-test.pl --input p1_p2_p1_p2.sync --output p1_p2_p1_p2.cmh --min-count 12 --min-coverage 50 --max-coverage 200 --population 1-2,3-4
Note that it is necessary to double the --min-count
as we also doubled the number of populations. For a validation of the cmh-test please see Validation
First convert the results of the cmh-test into the .gwas
format of IGV.
perl <popoolation2-path>/export/cmh2gwas.pl --input p1_p2_p1_p2.cmh --output p1_p2_p1_p2.gwas --min-pvalue 1.0e-20
Note that IGV has some problems displaying very low p-values, thus the option --min-pvalue
was used introduced as threshold. P-values smaller than the threshold will be set to the value of the threshold.
Open the IGV as described before and load the file p1_p2_p1_p2.gwas
. The results should be similar to the following example
Download the annotated exons for the first 1mio bp of chromosome 2R http://popoolation2.googlecode.com/files/2R_exons.gtf
Convert the synchronized file into a gene-based synchronized file
perl <popoolation2-path>/create-genewise-sync.pl --input p1_p2.sync --gtf 2R_exons.gtf --output p1_p2_genes.sync
Calculate the Fst for every gene:
perl <popoolation2-path>/fst-sliding.pl --min-count 6 --min-coverage 50 --max-coverage 200 --pool-size 500 --min-covered-fraction 0.0 --window-size 1000000 --step-size 1000000 --input p1_p2_genes.sync --output p1_p2_genewise.fst
The output should be similar to the following example:
CG17683-RB 500000 17 0.002 100.0 1:2=0.01349929 CG40127-RA 500000 5 0.001 99.7 1:2=0.01897121 CG17665-RB 500000 27 0.003 99.9 1:2=0.03149757
The output above indicates that gene CG17683-RB
has an Fst of 0.01349929
, gene CG40127-RA
an Fst of 0.01897121
and so on. Of course this analysis may also be conducted with more than two populations, in which case again all pairwise comparisons will be calculated.
Similarly the fisher's exact test may be performed for every gene, or the CMH-test for the SNPs of every gene.
Note: this analysis may be performed for any feature (intron, cds etc) as long as it is present in the gtf file and has a unique gene_id
What to do now? You have successfully identified several hundreds SNPs being associated with a trait of interest (or environment) but what is their biological meaning? You may use Gowinda to test whether your SNPs show an enrichment for any GO category (or any gene set): http://code.google.com/p/gowinda/
Wiki: Home
Wiki: Main
Wiki: Validation
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: perrica...@hotmail.com
Hi, I have just tried to convert the synchronized file into a gene-based synchronized file, and the command runs, but the output file is empty. Any ideas? The input sync file works fine with other commands. Perhaps it is the gtf file, but looking at the example gft file on this page it looks to be fine. Thanks,
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: braud.ma...@gmail.com
There is a problem of column in the demo files. The command: samtools view -q 20 -bS map/pop1.sam | samtools sort - map/pop1 does not manage to create the bam file because of the value XA:Z: in the sam file, the error is: Parse error at line 36: missing colon in auxiliary data
What is the problem?
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: abh...@gmail.com
This is a great tool! I wanted to know if there is any way to get the SNPs in "_rc file" in VCF format? to be used in outlier detecting programs _
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: elzedliu
I am wondering if you guys can let me know a way to prepare input file for bayescan after popoolation?
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: alicebde...@gmail.com
Hi, I'm trying to use the cmh-test above, but should something in the cmd below instead read p1_p2_p3_p4? I'm confused where the second half of the data comes into this!
perl <popoolation2-path>/cmh-test.pl --input p1_p2_p1_p2.sync --output p1_p2_p1_p2.cmh --min-count 12 --min-coverage 50 --max-coverage 200 --population 1-2,3-4
Thanks
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: amritaya...@gmail.com
hello, I wanted to know if there is any way to get the SNPs in rc file in VCF format?
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: chengjie...@gmail.com
Hi,I am trying the fst for SNPs between 2 populaitons using the fst-sliding.pl . I am bit confused about the "min-coverage", for example, if I have 2 populations and set the min-coverage of 5, is that means 5 reads are required for each population at this SNP, or 10 reads for the 2 populaitons but can be 8:2, 7:3 ....for each?
Thank you!
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: beatrizj...@gmail.com
Hi everyone, Fisrt, thanks for the tool, it has been a great finding for my research. I'm trying to understand how fst_slinding.pl calculates the FST value. In the --help it says how it calculates, but I try to do it by myself and I have been not able to do it. I think it makes a correction by pool-size, but I don't know how? I'm using the script with an step size of 1. Thanks
Hi everyone. I'm trying to calculate fst and Fisher's p-values. These scripts take hours to run. I let the fst script go for over 20h on my university cluster, and it couldn't even get through one chromosome of data. The fisher-test.pl takes similarly long, but it doesn't save any information in its output file. My pools/bulks are composed of 45 and 44 individuals with average read depth of ~150. Is it the expected behvior or am I doing something wrong?
Examples of my scripts:
perl fisher-test.pl \ --input Low_vs_High_w_ref.mpileup.sync \ --output Low_vs_High-fisher.fet \ --min-count 5 --min-coverage 50 \ --step-size 1000 --window-size 1000 \ --max-coverage 800 --suppress-noninformative
perl fst-sliding.pl \ --input /Low_vs_High_w_ref.mpileup.sync \ --output /Low_vs_High.fst \ --min-count 10 --min-coverage 50 --max-coverage 800 --window-size 1 --step-size 1 --pool-size 45:44
Well thats lots of data. the two scripts fst and fet compute all pairwise comparisions and thats a lot for 45 to 44 individuals... I would try it with less individuals, and once this works scale it up to more.
Thank your for a suggestion, but I think I should explain it better. My data contains pooled sequencing data. Aligned reads of individuals, with separate alignment for each individual, were merged by samtools merge, then realigned using GATK, and variants were called by samtools mpileup. Thus, the sync file does not contain information about variant calling in each individual, but in two pools. How would that be different from running these computations for just two samples? Should I change the --pool-size parameter to 1:1 then? I run Fisher exact test using plink on exactly the same dataset, but preserving the information about individuals, and it was much faster.
I'm trying to access the data for the tutorial, however the link appears to be dead. I've tried accessing it with two seperate browsers. How do I access the data?
Really appreciate having this software availalbe! Currently I'm using it for BSA in fruitfly.
In computing Fst I'm confused as to what "pool-size" means. I've seen it mentioned as referring to both individuals and to number of chromosomes (and can see arguments for both). Also, what is the correct syntax for applying it when pools are unequal?--a poster above uses a ":" delimiter so is that correct? And if so am I safe in assuming that "x:y" would apply to the pools in the order in which they were passed into ".sync" file?
Please, can you upload the igv png images to check whether my results look ok?
I cannot acces to thouse pictures:
"Voila, the result should look something like this:
Note that the result shown above only contain the Fst for a single pairwise comparison (pop1 vs pop2). PoPoolation2 is per default calculating the Fst for all possible pairwise comparisons between populations (when several populations are present in the synchronized file), and these results may just as easily be converted into the .igv format. When loaded into IGV it may look like the following:"
Thank you very much,
Last edit: Maria Cortazar Chinarro 2021-01-03
Hi , I downloaded the online tutorial on popoopulation2 and had some problems in its execution. Could anyone assist pls?
problem
created mpileup successfully
created syn file successfully
fisher-test.pl fails (stops on line 9 and 10)
Exact message from the terminal:
poolseq_test$: perl popoolation2_1201/fisher-test.pl --input p1_p2.sync --output p1_p2.fet --min-count 6 --min-coverage 50 --max-coverage 200 --suppress-noninformative
Can't locate Text/NSP/Measures/2D/Fisher/twotailed.pm in @INC (you may need to install the Text::NSP::Measures::2D::Fisher::twotailed module) (@INC contains: /home/Jako/Documents/poolseq_test/popoolation2_1201 /home/Jako/Documents/poolseq_test/popoolation2_1201/Modules /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .) at /home/Jako/Documents/poolseq_test/popoolation2_1201/Modules/FET.pm line 10.
BEGIN failed--compilation aborted at /home/Jako/Documents/poolseq_test/popoolation2_1201/Modules/FET.pm line 10.
Compilation failed in require at popoolation2_1201/fisher-test.pl line 9.
BEGIN failed--compilation aborted at popoolation2_1201/fisher-test.pl line 9.
Looking forward to hearing from you or any other person willing to help.
many thanks in advance
Last edit: Jonas A Kengne 2021-03-02
Hello Mr. Kengne,
I had the same problem and I managed to get the perl script running by following this solution in the comments here:
https://code.google.com/archive/p/popoolation2/issues/19
I hope this helps!
Hi everyone,
I am using Popoolation2 sofware to calculate allele frequency and Fst now. But when I calculate allele difference of my populations, it generates several locus with 1 allele count (only one allele state). Commonly SNPs is a biallele or in some cases it can contain 3 alelles or 4 alleles. And in the minimum coverage, I set it 10 but in the rc file it generates the SNP with coverage less than 10. Do you think there is an error that happening when I use the Popoolation2?
I am really appreciate if you can help me figured out what is happening. Thank you.
I just found out that it was rc SNPs not pop SNPs type.
Hello Community
Is it possible to annotate snps from snpeff eff tool after getting variants? which file i need to annoate?
Hi everyone,
I noticed that when converting an mpileup file to a sync file, positions on a genome are shifted.
For example, position 10,923,271 in a reference genome used for mapping was shifted to 10,923,753 in a sync file.
Does anyone know how to resolve this error?
Hello everyone,
I am using popoolation2 to perform the CMH test between 4 pairs of populations. The issue is that it's giving me very high p-values in general (very low actually), less than 10e-60. Does anyone know how I should choose the p-value cutoff? Or if there's any correction I should apply to these values? Thank you!
Hello Mr. Bennardo,
I think the two most common ways of adressing multiple testing in GWAS-like data is either using a Bonferroni corrected p-value threshold or by using a False Discovery Rate (FDR) approach. The documentation for the R package "qvalue" is very well written:
https://www.bioconductor.org/packages/release/bioc/manuals/qvalue/man/qvalue.pdf