The software is designed for next generation sequencing data of pooled genomic DNA. Moreover at least a partial assembly of the used species should be available.
We do not recommend using the software with cDNA or unigenes as these sequences do not contain introns which will negatively affect the mapability of genomic reads. Furthermore we do not recommend to use the software on pooled RNA samples as RNA is subject to many modifications which will distort the population genetic measures, like differential gene expression, RNA editing, alternative splicing, allele specific gene expression.
PoPoolation may be obtained directly from the subversion repository. Just go to the directory where you want to install PoPoolation in the command line and enter the command:
svn checkout http://popoolation.googlecode.com/svn/trunk/ popoolation
To update your copy of PoPoolation with the latest improvements enter your PoPoolation directory and enter the command:
svn update
Alternatively PoPoolation may be downloaded from the project main page: http://code.google.com/p/popoolation/
However we recommend to use subversion as bugfixes will be immediately available in the repository.
Help for all PoPoolation scripts can be requested using the '--help' option. For example
perl trim-fastq.pl --help
The main scripts of PoPoolation contain Unit tests which can be run using the '--test' option For example:
perl trim-fastq.pl --test
Calculates Tajima's Pi or Wattersons Theta or Tajima's D along chromosomes using a sliding window approach. The variable sites (SNPs) used for calculating the requested Population Genetic measure may be output to a separate file. The user can choose a step size, a window size, the minimum base quality, the minium allele count, a minimum coverage and a maximum coverage.
Calculates Tajima's Pi or Wattersons Theta or Tajima's D for target regions. The target regions have to be provided as gtf file. Genes frequently consist of several exons, the script groups all regions which have the same "gene_id". The script is also able to handle overlapping regions. If the overlapping regions belong to the same gene (e.g.: sometimes an exon is reported for every transcripts belonging to the same gene), the overlapping region is only considered once. If the overlapping regions belong to different genes they are treated ndependently. If requested the SNPs used for calculating the Population Genetic measure may be output to distinct file. Again the user may choose a minimum base quality, a minimum allele count, a minimum coverage and a maximum coverage.
Converts the output of Variance-sliding.pl into a pdf. Thus a overview of the general pattern of polymorphism for the provided chromosomes (contigs) may be quickly created.
Converts the output of Variance-sliding.pl into a wiggle file (http://genome.ucsc.edu/goldenPath/help/wiggle.html). Wiggle files are for example accepted by the Integrative Genomics Viewer (http://www.broadinstitute.org/igv/) or by the UCSC Genome Browser (http://genome.ucsc.edu/).
Converts the output of Variance-sliding.pl into a format which is compatible with the FlyBase Genome browser.
Parses the output of the Mauve Genome Alignment Software (http://asap.ahabs.wisc.edu/mauve/). To determine the ancestral allele it is necessary to align three species where one acts as an outgroup, this may be accomplised with Mauve. This script converts a Mauve multiple genome alignment into a single tab delimited table where one genome (the first) acts as the reference.
Calculates the divergence of the species along chromosomes using a sliding window approach. This scripts requires the parsed Mauve output (or any other Multiple alignment software as long as the above mentioned tab delimited table is produced; For more information use: perl calculate-divergence.pl --help).
This script trims reads in the fastq file format and outputs again a fastq file. Reads may either be trimmed as singel end reads or paired end reads. In case of paired end reads three output files are produced one for the first read, one for the second read and one for all reads which lost their mate because of not fulfilling the requirements (i.e.: they have become single end reads). First a trimming of the character 'N' is performed and secondly a quality trimming using a modified Mott algorithm as implemented by Phred and described here (http://www.phrap.org/phredphrap/phred.html) is performed. Reads which do not have the required minimum length are discarded. If only one read of a pair has the required minimum length the paired end will be broken and the read will be reported as single end (SE).
Indels frequently cause misalignments and thus may lead to false positive SNPs. This script masks the region surrounding an indel in the sam file. The sam file may be converted into a pileup file (samtools) and used with PoPoolation
Identifies indel regions from a pileup file and outputs the coordinates into a gtf file. This gtf file may for example be used to filter regions surrounding the indel from a pileup file (see basic-pipeline/filter-pileup-by-gtf.pl).
Allows to filter a pileup file by the regions specified in a gtf file. The specified regions may either be kept or discarded. For example the output of RepeatMasker may be used directly with this script to remove repetitive regions. Or as another example exons/CDS sequences may be removed before calculating Tajima's D. Or another examples whole regions surrounding indels may be removed (see basic-pipeline/identify-genomic-indel-regions.pl). The pileup file may subsequently be use with PoPoolation
Convert between the different quality encodings of a fastq file. For example convert a illumina encoded fastq file to a sanger encoded fastq file
May be used to create a uniform coverage in a pileup file. The bases of a pileup file may be randomly sampled to reduce the coverage to the given threshold. Two methods may be used for subsampling: either random drawing without replacement or a simple down scaling.
Popoolation offers two options for calculating Tajima's D. For both options we recommend to subsample the coverage to a uniform coverage.
As a basis for the equation for calculating Tajima's D we are using the results from Achaz (2008; http://www.ncbi.nlm.nih.gov/pubmed/18562660) for individual sequencing when singletons are omitted (detailed equations please see http://popoolation.googlecode.com/files/correction_equations.pdf). The difficulty arises when calculating the variance of Tajima's D. The following simplifying assumptions are made in our variance approximation:
For a correlation between the classical Tajima's D and the corrected Tajima's D using the equation described above please see (x: classical, y: corrected). Correlation has been done with real data from Drosophila with a coverage of 12, a window size of 500 and a minimum count of 1)
What to do after you successfully used PoPoolation, having identified several hundreds (or thousands) of SNPs with unusual low (or high) Tajima's D or Watterson's Theta. You may use Gowinda to test whether your SNPs show an enrichment for any GO category (or any gene set): http://code.google.com/p/gowinda/
Wiki: Main
Wiki: PoPOOLationWalkthrough
Wiki: PoPoolationWaltkthroughDivergence
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: yyzhuyon...@gmail.com
Hello Sir,
When I run Variance-sliding.pl script, the input file is pileup format. But, you know, the new version of samtools 0.1.8 "mpileup" appears to superseded "pileup" for variant detection. So, can I use mpileup format as input file?
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: yyzhuyon...@gmail.com
Then, I did those that using mpileup format as input file, and executed Variance-sliding.pl script, but the result is totally strange (as following), I did not know what's wrong happened? Chr1 50000 0 0.000 na Chr1 150000 0 0.000 na Chr1 250000 0 0.000 na Chr1 350000 0 0.000 na Chr1 450000 0 0.000 na All the value of sliding windows is na.
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: ryan.ame...@gmail.com
Can you provide a header line for the output file for SNPs produced by variance-sliding.pl. It seems like it reports the number of reads with each base at a SNP position. If this is true it would be useful to know which columns correspond to which nucleotide.
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: robin.n....@gmail.com
Values of 0 and na in my case came from using a bam file generated from reads using phred33 scores when the default of the Variance-sliding.pl script is to use phred64 encoding. Specify "--fastq-type sanger" to use the phred33 scoring scheme.