<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to Validation</title><link>https://sourceforge.net/p/popoolation2/wiki/Validation/</link><description>Recent changes to Validation</description><atom:link href="https://sourceforge.net/p/popoolation2/wiki/Validation/feed" rel="self"/><language>en</language><lastBuildDate>Mon, 16 Mar 2015 14:31:30 -0000</lastBuildDate><atom:link href="https://sourceforge.net/p/popoolation2/wiki/Validation/feed" rel="self" type="application/rss+xml"/><item><title>Discussion for Validation page</title><link>https://sourceforge.net/p/popoolation2/wiki/Validation/</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Originally posted by: liugang...@gmail.com&lt;/p&gt;
&lt;p&gt;great, very useful &lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Anonymous</dc:creator><pubDate>Mon, 16 Mar 2015 14:31:30 -0000</pubDate><guid>https://sourceforge.net523d2adf1a1008460e9f3fada99fd2b35bbabe35</guid></item><item><title>Validation modified by Anonymous</title><link>https://sourceforge.net/p/popoolation2/wiki/Validation/</link><description>&lt;div class="markdown_content"&gt;&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Data and study design&lt;/li&gt;
&lt;li&gt;Number of SNPs&lt;/li&gt;
&lt;li&gt;Observed and expected allele frequency differences&lt;/li&gt;
&lt;li&gt;Observed and expected Fst values&lt;/li&gt;
&lt;li&gt;Observed and expected CMH-test p-values&lt;/li&gt;
&lt;li&gt;Observed and expected Fisher exact test p-values&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="introduction"&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the &lt;a class="" href="/p/popoolation2/wiki/Tutorial"&gt;Tutorial&lt;/a&gt;, a small sample data set of simulated FASTQ reads from two different populations was introduced. We introduced 10,000 SNPs into the sample and simulated varying allele frequencies for these SNPs. SNPs were only introduced in unambiguous regions of the genome, and a sequencing error rate of 1% was used. As the expected allele frequencies for these SNPs are known, we can compare them with the observed ones, which allows us to validate the functionality of PoPoolation2.&lt;/p&gt;
&lt;h1 id="data-and-study-design"&gt;Data and study design&lt;/h1&gt;
&lt;p&gt;A proper validation requires that the expected values are obtained with different scripts than the observed values. For this reason we calculated all observed values (e.g. CMH p-values, Fst values, allele frequency differences) with PoPoolation2 and all expected values with separate, independent scripts (see 'validation-scripts').&lt;/p&gt;
&lt;p&gt;Furthermore, all expected values are calculated directly from the targeted allele frequencies (&lt;a href="http://popoolation2.googlecode.com/files/expected-snp-frequencies.sync" rel="nofollow"&gt;http://popoolation2.googlecode.com/files/expected-snp-frequencies.sync&lt;/a&gt;), whereas the observed values are calculated from the results obtained after simulating reads and mapping them to the reference genome (&lt;a href="http://popoolation2.googlecode.com/files/observed.sync.zip" rel="nofollow"&gt;http://popoolation2.googlecode.com/files/observed.sync.zip&lt;/a&gt;).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The 'validation-scripts' can be found here: &lt;a href="http://popoolation2.googlecode.com/files/validation_scripts.zip" rel="nofollow"&gt;http://popoolation2.googlecode.com/files/validation_scripts.zip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The expected allele frequencies for the 10,000 SNPs can be found here: &lt;a href="http://popoolation2.googlecode.com/files/expected-snp-frequencies.sync" rel="nofollow"&gt;http://popoolation2.googlecode.com/files/expected-snp-frequencies.sync&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The observed allele frequencies of all bases, including the 10,000 SNPs, can be found here: &lt;a href="http://popoolation2.googlecode.com/files/observed.sync.zip" rel="nofollow"&gt;http://popoolation2.googlecode.com/files/observed.sync.zip&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
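&lt;p&gt;As a rough illustration of how allele frequency differences can be read from a sync file, the following sketch parses one line and compares two populations. It assumes the usual sync layout (chromosome, position, reference base, then one A:T:C:G:N:del count field per population). The function names, the choice of the largest per-allele difference as the summary, and the example line are assumptions for illustration; they are not taken from the validation scripts or the published files.&lt;/p&gt;

```python
# Order of the first four counts in a sync count field (A:T:C:G:N:del).
BASES = ("A", "T", "C", "G")


def parse_counts(field):
    """Return the A, T, C and G counts of one population column as a dict."""
    counts = [int(x) for x in field.split(":")]
    return dict(zip(BASES, counts[:4]))


def allele_frequency_difference(sync_line, pop_a=0, pop_b=1):
    """Largest absolute per-allele frequency difference between two populations.

    Assumes both populations have non-zero A/T/C/G coverage at this site.
    """
    columns = sync_line.rstrip("\n").split("\t")
    pops = [parse_counts(field) for field in columns[3:]]
    freqs = []
    for pop in (pops[pop_a], pops[pop_b]):
        coverage = sum(pop.values())
        freqs.append({base: pop[base] / coverage for base in BASES})
    return max(abs(freqs[0][base] - freqs[1][base]) for base in BASES)


# Hypothetical sync line: T/C SNP with T at 0.7 in population 1 and 0.2 in population 2.
example_line = "2R\t2302\tT\t0:7:3:0:0:0\t0:2:8:0:0:0"
print(round(allele_frequency_difference(example_line), 3))
```

&lt;p&gt;Applying the same function to the expected and the observed sync files would give two values per SNP that can then be compared position by position.&lt;/p&gt;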
&lt;h1 id="number-of-snps"&gt;Number of SNPs&lt;/h1&gt;
&lt;p&gt;First we tested whether all 10,000 simulated SNPs were recovered by PoPoolation2. We found that 9,999 were recovered and a single SNP (at position 1,120,507) was missing. This SNP is missing due to low coverage in one population (15), which caused the SNP to be ignored during filtering (minimum coverage 50). We furthermore found that for this SNP only one allelic state was present ('C') and the other allelic state ('T') was entirely missing. We can only speculate as to what is causing this problem; however, we suspect ambiguous mapping of reads with the missing allelic state ('T').&lt;/p&gt;
&lt;p&gt;We furthermore identified 211 SNPs that were not in the expected set of SNPs; these 211 SNPs are thus entirely due to sequencing errors and false alignments. However, these 'artefactual' SNPs show very small allele frequency differences (mean: 0.02386) and are thus not likely to cause signals of differentiation between the two populations. See also the following histogram of the allele frequency differences of these 211 SNPs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="http://popoolation2.googlecode.com/files/aretefactual_snps.png" rel="nofollow" /&gt;&lt;/p&gt;
&lt;h1 id="observed-and-expected-allele-frequency-differences"&gt;Observed and expected allele frequency differences&lt;/h1&gt;
&lt;p&gt;We found a strong correlation between the expected and the observed allele frequency differences (R^2=0.9979; P &amp;lt; 2.2e-16; 9,999 tested SNPs), demonstrating that PoPoolation2 highly reliably recovers allele frequency differences. See also the following graph for the detailed correlation between observed and expected allele frequency differences: &lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="http://popoolation2.googlecode.com/files/correlation.png" rel="nofollow" /&gt;&lt;/p&gt;
&lt;p&gt;The 95% quantiles of the error in the estimated allele frequency differences are -2.0% and 2.4%. That is, 95% of the estimated allele frequency differences are within -2% to 2.4% of the real allele frequency difference. A detailed distribution of the error in the estimated allele frequency difference can be found in the following graph: &lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="http://popoolation2.googlecode.com/files/errordistri.png" rel="nofollow" /&gt;&lt;/p&gt;
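&lt;p&gt;The two summary statistics used above (the squared correlation and the error quantiles) can be sketched with the Python standard library. This is only an illustration: it assumes the reported interval corresponds to the 2.5% and 97.5% quantiles of observed minus expected, and the function names are hypothetical rather than part of the validation scripts.&lt;/p&gt;

```python
import statistics


def r_squared(expected, observed):
    """Squared Pearson correlation between expected and observed values."""
    mean_e = statistics.fmean(expected)
    mean_o = statistics.fmean(observed)
    cov = sum((e - mean_e) * (o - mean_o) for e, o in zip(expected, observed))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov * cov / (var_e * var_o)


def error_interval(expected, observed):
    """2.5% and 97.5% quantiles of the error (observed minus expected)."""
    errors = [o - e for e, o in zip(expected, observed)]
    # n=40 yields 39 cut points; the first and last are the 2.5% and 97.5% quantiles.
    cuts = statistics.quantiles(errors, n=40)
    return cuts[0], cuts[-1]


# Hypothetical toy values, not taken from the validation data.
expected = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
observed = [0.11, 0.24, 0.41, 0.56, 0.69, 0.86]
print(r_squared(expected, observed))
print(error_interval(expected, observed))
```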
&lt;h1 id="observed-and-expected-fst-values"&gt;Observed and expected Fst values&lt;/h1&gt;
&lt;p&gt;We found a very strong correlation between the expected and the observed Fst values (R^2 = 0.9967; P &amp;lt; 2.2e-16). The observed Fst was calculated for every SNP as shown in the &lt;a class="" href="/p/popoolation2/wiki/Tutorial"&gt;Tutorial&lt;/a&gt;. For details see the following graph: &lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="http://popoolation2.googlecode.com/files/fst-correlation.png" rel="nofollow" /&gt;&lt;/p&gt;
&lt;p&gt;When computing the difference between observed and expected Fst we found a small bias, see graph: &lt;img alt="" src="http://popoolation2.googlecode.com/files/fst-error.png" rel="nofollow" /&gt;&lt;/p&gt;
&lt;p&gt;This distribution is shifted towards the right, which means that on average the observed Fst values are smaller than the expected ones; PoPoolation2 thus slightly underestimates Fst.&lt;/p&gt;
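&lt;p&gt;To make the per-SNP comparison concrete, here is a minimal sketch of a textbook heterozygosity-based Fst for two populations, (H_total - H_within) / H_total. It deliberately ignores pool size and coverage corrections, so it should be read as an assumption-laden illustration and not as the estimator implemented in PoPoolation2 or in the validation scripts.&lt;/p&gt;

```python
def heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) for a list of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def fst(freqs_a, freqs_b):
    """Uncorrected Fst between two populations with equal weights."""
    pooled = [(a + b) / 2.0 for a, b in zip(freqs_a, freqs_b)]
    h_total = heterozygosity(pooled)
    h_within = (heterozygosity(freqs_a) + heterozygosity(freqs_b)) / 2.0
    if h_total == 0.0:
        # Monomorphic site: no differentiation can be measured.
        return 0.0
    return (h_total - h_within) / h_total


# Hypothetical biallelic SNP with allele frequencies 0.7/0.3 and 0.2/0.8.
print(fst([0.7, 0.3], [0.2, 0.8]))
```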
&lt;h1 id="observed-and-expected-cmh-test-p-values"&gt;Observed and expected CMH-test p-values&lt;/h1&gt;
&lt;p&gt;Observed CMH p-values were computed as shown in the &lt;a class="" href="/p/popoolation2/wiki/Tutorial"&gt;Tutorial&lt;/a&gt;. We used the -log10 of the p-values and found a strong correlation between the observed and the expected p-values (Spearman's rank correlation: Rho=0.9990084, P &amp;lt; 2.2e-16; linear model: R^2=0.9978, P &amp;lt; 2.2e-16). For details see the following graph.&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="http://popoolation2.googlecode.com/files/cmh_exp_obs.png" rel="nofollow" /&gt;&lt;/p&gt;
&lt;p&gt;The following graph shows the error distribution of the -log10 transformed CMH p-values:&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="http://popoolation2.googlecode.com/files/cmh_exp_obs_hist.png" rel="nofollow" /&gt;&lt;/p&gt;
&lt;p&gt;Again a small bias can be found: in general, the observed data show slightly elevated log-transformed p-values compared to the expected ones.&lt;/p&gt;
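&lt;p&gt;For readers unfamiliar with the test, the following sketch computes a Cochran-Mantel-Haenszel p-value for a set of 2x2 allele count tables, one per replicate, using the chi-square approximation with one degree of freedom. The function name, the continuity correction and the example counts are assumptions for illustration; nothing here states how PoPoolation2 or the validation scripts configure the test.&lt;/p&gt;

```python
import math


def cmh_p_value(tables, correct=True):
    """CMH test over 2x2 tables given as (a, b, c, d) per replicate.

    a, b are the counts of allele 1 and allele 2 in population 1;
    c, d are the corresponding counts in population 2.
    """
    diff = 0.0
    variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n            # observed minus expected count
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    delta = abs(diff)
    if correct:
        delta = max(delta - 0.5, 0.0)          # simple continuity correction
    statistic = delta * delta / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical allele counts for two replicates.
replicates = [(90, 10, 10, 90), (85, 15, 20, 80)]
print(-math.log10(cmh_p_value(replicates)))
```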
&lt;h1 id="observed-and-expected-fisher-exact-test-p-values"&gt;Observed and expected Fisher exact test p-values&lt;/h1&gt;
&lt;p&gt;We calculated the significance of allele frequency differences using Fisher's exact test as described in the &lt;a class="" href="/p/popoolation2/wiki/Tutorial"&gt;Tutorial&lt;/a&gt;. We found a strong correlation (Spearman's rank correlation: Rho=0.9989923, P &amp;lt; 2.2e-16; linear model: R^2=0.9974, P &amp;lt; 2.2e-16) between the observed and expected p-values obtained with Fisher's exact test. For details see:&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="http://popoolation2.googlecode.com/files/obs_expected_fet.png" rel="nofollow" /&gt;&lt;/p&gt;
&lt;p&gt;Note that we used the -log10(p-value) for the correlation. &lt;/p&gt;
&lt;p&gt;For the distribution of the errors obtained with Fisher's exact test see the following graph:&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="http://popoolation2.googlecode.com/files/fet_errordist.png" rel="nofollow" /&gt;&lt;/p&gt;
&lt;p&gt;The p-values calculated with Fisher's exact test also show the small bias mentioned above: in general, the observed data show slightly elevated log-transformed p-values compared to the expected ones.&lt;/p&gt;
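&lt;p&gt;A two-sided Fisher's exact test on the allele counts of a single SNP can be sketched as follows. The implementation sums the hypergeometric probabilities of all tables that are no more likely than the observed one; the function name and the example counts are hypothetical, and the sketch is not taken from PoPoolation2 or the validation scripts.&lt;/p&gt;

```python
import math


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denominator

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small tolerance so that ties with the observed table are included.
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if p_observed * (1 + 1e-7) >= prob(x)
    )
    return min(total, 1.0)


# Hypothetical allele counts: 70/30 in population 1 versus 20/80 in population 2.
p_value = fisher_exact_p(70, 30, 20, 80)
print(-math.log10(p_value))
```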
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In the paragraphs above, we tested the main functionality of PoPoolation2:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;measure allele frequency differences between populations &lt;/li&gt;
&lt;li&gt;calculate pairwise Fst-values between populations &lt;/li&gt;
&lt;li&gt;use the Fisher's exact test to estimate the significance of allele frequency differences between populations &lt;/li&gt;
&lt;li&gt;compute the CMH test for estimating the significance of allele frequency differences when having several biological replicates &lt;br /&gt;
As we used separate scripts to calculate the observed and the expected values, we conclude that PoPoolation2 reproduces differences in allele frequencies between populations with high accuracy. However, small differences are still found between observed and expected values (see above), which may be caused by the simulated sequencing errors (1% error rate) or by inaccuracies during mapping of the reads. &lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Anonymous</dc:creator><pubDate>Mon, 16 Mar 2015 14:31:30 -0000</pubDate><guid>https://sourceforge.net8876a154f8b9e537fa07fdb9bd412f40707662b3</guid></item></channel></rss>