Recent changes to Walkthrough

Walkthrough modified by Robert Kofler

Robert Kofler — Thu, 07 Apr 2016 13:18:31 -0000

--- v57
+++ v58
@@ -72,7 +72,7 @@
 # here we analyse TE abundance in the three populations separately
 java -jar ~pt2/popte2.jar identifySignatures --ppileup Fra_Geo_Gha.ppileup.gz --mode separate --output quick/Fra_Geo_Gha.signatures --min-count 3
 java -jar ~pt2/popte2.jar frequency --ppileup Fra_Geo_Gha.ppileup.gz --signature quick/Fra_Geo_Gha.signatures --output quick/Fra_Geo_Gha.freqsig 
-java -jar ~pt2/popte2.jar pairupSignatures --signature quick/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier-file walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output quick/Fra_Geo_Gha.teinsertions
+java -jar ~pt2/popte2.jar pairupSignatures --signature quick/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output quick/Fra_Geo_Gha.teinsertions
 ~~~~~

 Voila, we have identified 311 TE insertions: 108 in France, 109 in Georgia and 92 in Ghana. The coverage varies among the populations and therefore these differences in TE abundance may be an artefact of the coverage. In the next section we demonstrate how to homogenize the physical coverage within and between samples, and thus to homogenize the power to identify TEs.
@@ -131,7 +131,7 @@
 :::bash
 java -jar ~pt2/popte2.jar identifySignatures --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --mode separate --output hypo1/Fra_Geo_Gha.signatures --min-count 2
 java -jar ~pt2/popte2.jar frequency --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --signature hypo1/Fra_Geo_Gha.signatures --output hypo1/Fra_Geo_Gha.freqsig 
-java -jar ~pt2/popte2.jar pairupSignatures --signature hypo1/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier-file walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output hypo1/Fra_Geo_Gha.teinsertions
+java -jar ~pt2/popte2.jar pairupSignatures --signature hypo1/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output hypo1/Fra_Geo_Gha.teinsertions
 ~~~~~

 ** Does hypothesis-1 hold?
@@ -169,7 +169,7 @@
 mkdir hypo2
 java -jar ~pt2/popte2.jar identifySignatures --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --mode joint --output hypo2/Fra_Geo_Gha.signatures --min-count 2 --signature-window minimumSampleMedian
 java -jar ~pt2/popte2.jar frequency --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --signature hypo2/Fra_Geo_Gha.signatures --output hypo2/Fra_Geo_Gha.freqsig
-java -jar ~pt2/popte2.jar pairupSignatures --signature hypo2/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier-file walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output hypo2/Fra_Geo_Gha.teinsertions
+java -jar ~pt2/popte2.jar pairupSignatures --signature hypo2/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output hypo2/Fra_Geo_Gha.teinsertions
 ~~~~~

 **Does hypothesis-2 hold?**

Walkthrough modified by Robert Kofler

Robert Kofler — Tue, 23 Feb 2016 09:47:26 -0000

--- v56
+++ v57
@@ -147,7 +147,7 @@

 ## 4.) Hypothesis-2: requires a joint analysis##
 To estimate the population frequency of a TE insertion of interest (the *accord* insertion next to Cyp6g1 gene) we need to perform a joint analysis.
-When using  *--mode separate*  a distinct insertion (with a distinct position) is reported for every sample/population, despite a TE insertion may actually occur in multiple samples. This inprecsion in the position estimates makes it difficult to identify orthologous insertions, i.e. to identify the same insertion in all samples. 
+When using  *--mode separate*  a distinct insertion (with a distinct position) is reported for every sample/population, despite a TE insertion may actually occur in multiple samples. This inprecision in the position estimates makes it difficult to identify orthologous insertions, i.e. to identify the same insertion in all samples. 

 For example, when using *--mode separate*  the following three *accord* entries are reported. These three entries probably refer to the same TE insertion  (they have similar genomic positions and family).

Walkthrough modified by Robert Kofler

Robert Kofler — Tue, 23 Feb 2016 09:46:12 -0000

--- v55
+++ v56
@@ -95,7 +95,7 @@
 ## 3.) Hypothesis-1: requires subsampling of the ppileup file ##
 In this section we aim to test the hypothesis that the population from Ghana has more TE insertions than the other two populations (for an explanation see above; section hypothesis). The physical coverage has a substantial influence on the power to identify TE insertions: the power increases with the coverage. This may not be a dramatic problem for sequencing individuals where TE abundance quickly saturates, i.e. it asymptotically approaches an upper value with increasing read numbers. However, with Pool-Seq data such an upper limit is not readily reached (Pool-Seq data are frequently unsaturated for TEs) and therefore coverage differences may be the major reason for differences in TE abundance among samples.

-To address this problem we will subsample the ppileup tracks to equal physical coverages between and within samples.
+To address this problem we will subsample the ppileup tracks to equal physical coverage between and within samples.

 ** Estimate the optimal target coverage**
 However, it will first be necessary to estimate the optimal target coverage. This optimum will be a compromise between two opposing considerations. First, the target coverage should be as high as possible, as this increases the power to identify TE insertions and the accuracy of the allele frequency estimates. Second, if it is too high we will lose a substantial fraction of sites, since PoPoolationTE2 only proceeds with sites having sufficient coverage in all samples (>=target coverage). 
@@ -147,7 +147,7 @@

 ## 4.) Hypothesis-2: requires a joint analysis##
 To estimate the population frequency of a TE insertion of interest (the *accord* insertion next to Cyp6g1 gene) we need to perform a joint analysis.
-When using  *--mode separate*  a distinct insertion (with a distinct position) is reported for every sample/population, despite a TE insertion may actually occur in multiple samples. This inpression in the position estimates makes it difficult to identify orthologous insertions, i.e. to identify the same insertion in all samples. 
+When using  *--mode separate*  a distinct insertion (with a distinct position) is reported for every sample/population, despite a TE insertion may actually occur in multiple samples. This inprecsion in the position estimates makes it difficult to identify orthologous insertions, i.e. to identify the same insertion in all samples. 

 For example, when using *--mode separate*  the following three *accord* entries are reported. These three entries probably refer to the same TE insertion  (they have similar genomic positions and family).

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 05 Feb 2016 13:12:35 -0000

--- v54
+++ v55
@@ -245,7 +245,8 @@
 # te      hobo    147
 # te      P-element       164
 ~~~~~
-Thus the P-element, popg and hobo are quit abundant in the population sample from France; For an explanation of the output file see [diverse output files]
+
+Thus the P-element, pogo and hobo are quite abundant in the population sample from France. For an explanation of the output file see [diverse output files]

 ** Number of PE fragments supporting a TE insertion **

@@ -260,11 +261,11 @@
 # te      P-element       163
 ~~~~~

-Again P-element, pogo and hobo are quite abundant. Note that our example is exceptional since we prefiltered the data for reads mapping in chromosome 2R:11Mb-13Mbp and this is also the reason why the read-statistics and the pair-statistics are quite similar in our example. For an explanation of the output file see [diverse output files]
-
-
-
-
-
-
-
+Again P-element, pogo and hobo are quite abundant. Note that our example is exceptional since we prefiltered the data for reads mapping in chromosome 2R:11Mb-13Mbp, which explains why the read-statistics and the pair-statistics are quite similar in our example. Such a high similarity would not be expected for unfiltered data! For an explanation of the output file see [diverse output files]
+
+
+
+
+
+
+

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 05 Feb 2016 13:07:10 -0000

--- v53
+++ v54
@@ -216,7 +216,7 @@

 ## 6.) Filtering TE insertions ##
-TE insertions that are overlapping with other TE insertions or structural variants may not be reliable. Additionally estimates of the population frequency may also be unreliable. To remove such unreliable TE insertions PoPoolationTE2 allows to filter TE insertions.
+TE insertions that overlap with other TE insertions or with structural variants may not be reliable. The estimates of the population frequency may also be unreliable. To remove such low-quality TE insertions, PoPoolationTE2 allows TE insertions to be filtered.

 ~~~~~
 :::bash
@@ -227,9 +227,9 @@
 ~~~~~

-## 7.) Obtaining basic statistics##
-It will always be important to double check the results of TE identification with some basic statistics.
-In our opinion the most important statistics are a.) the number of reads mapping to the different TE families and b.) the number of paired-end fragments supporting a TE insertions, i.e. PE fragments where one read maps to a reference genome and one read to a TE (these fragments are also used for generating the ppileup).
+## 7.) Creating basic statistics##
+It will **always** be important to cross-check your results with some basic statistics.
+In our opinion the most important statistics are a.) the number of reads mapping to different TE families and b.) the number of paired ends supporting TE insertions, i.e. discordantly mapped paired ends where one read maps to a reference genome and the other to a TE (these paired ends are used for generating the ppileup).

 ** Number of reads mapping to a TE**

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 05 Feb 2016 13:03:50 -0000

--- v52
+++ v53
@@ -184,12 +184,15 @@
 http://www.ncbi.nlm.nih.gov/pubmed/15245421

 However the population frequencies of the accord insertion reported here are lower than in the previous study. This is probably because the region around the accord insertion has a complex history of nested TE insertions and structural rearrangements.
-In fact upon inspection of the ppileup file we found that the accord insertion is overlapping wit  *P-element* and *HMS-Beagle* insertions - in agreement with previous works http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000998 - which accounts for the low population frequency of the *accord* insertion.
+In fact, upon inspection of the ppileup file we found that the accord insertion overlaps with *P-element* and *HMS-Beagle* insertions, again in agreement with previous work (http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000998). This overlap with other TE insertions may explain the low population frequency estimates of the *accord* insertion.

 ## 5.) Strand of TE insertions ## 
-As a new feature introduced with PoPoolationTE2 (as compared to PoPoolationTE) the strand of TE insertions may be estimated.
+As a new feature (compared to PoPoolationTE), PoPoolationTE2 allows the strand of TE insertions to be estimated.
+
+This will be done by updating the strand information in the signature file [signature file format].
+

 ~~~~
 :::bash

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 05 Feb 2016 12:59:47 -0000

--- v51
+++ v52
@@ -89,6 +89,7 @@
 2       2R      12318674        .       Tc1-2   TIR     FR              0.968
 ~~~~~

+For an explanation of the output file see [TE insertion file]

 ## 3.) Hypothesis-1: requires subsampling of the ppileup file ##
@@ -124,7 +125,7 @@

 ** Estimate TE abundance**

-TE abundance will be estimated as shown before. As only difference, we will use a *--min-count* of 2, since subsampling substantially reduced the physical coverage in the file.
+TE abundance will be estimated as shown before. The only difference is that we will use a *--min-count* of 2, since subsampling substantially reduced the physical coverage in the file.

 ~~~~~
 :::bash
@@ -146,9 +147,9 @@

 ## 4.) Hypothesis-2: requires a joint analysis##
 To estimate the population frequency of a TE insertion of interest (the *accord* insertion next to Cyp6g1 gene) we need to perform a joint analysis.
-When using  *--mode separate*  a distinct insertion (with a distinct position) is reported for every sample/population, although a TE insertion may actually occur in all samples. This may make it difficult to identify corresponding insertions, i.e. to identify the same insertion in all samples. 
-
-For example, when using *--mode separate*  the following three *accord* entries are reported. However, it is likely that these three entries refer to the same TE insertion  (since they have similar genomic positions and are of the same family).
+When using  *--mode separate*  a distinct insertion (with a distinct position) is reported for every sample/population, despite a TE insertion may actually occur in multiple samples. This inpression in the position estimates makes it difficult to identify orthologous insertions, i.e. to identify the same insertion in all samples. 
+
+For example, when using *--mode separate*  the following three *accord* entries are reported. These three entries probably refer to the same TE insertion  (they have similar genomic positions and family).

 ~~~~~
 :::bash

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 05 Feb 2016 12:54:13 -0000

--- v50
+++ v51
@@ -79,7 +79,7 @@

 ** First results **

-Following a  sample from the resulting *Fra_Geo_Gha.teinsertions* file:
+A small sample from the resulting *Fra_Geo_Gha.teinsertions* file:

 ~~~~~
 :::bash
@@ -92,13 +92,13 @@

 ## 3.) Hypothesis-1: requires subsampling of the ppileup file ##
-In this section we address the hypothesis that the population from Ghana has more TE insertions than the other two populations (for an explanation see above; section hypothesis). The physical coverage has a substantial influence on the power to identify TE insertions, where the power to identify TE insertions increases with the coverage. This may not be a dramatic problem for sequencing individuals where TE abundance quickly saturates, i.e it asymptotically approaches an upper value with increasing read numbers. However, with Pool-Seq data such an upper-limit is not readily reached ( Pool-Seq data are frequently unsaturated) and therefore coverage differences may be a major cause for differences in TE abundance between samples. 
-
-To address this problem we will subsample the ppileup tracks to equal coverages in all samples and sites.
+In this section we aim to test the hypothesis that the population from Ghana has more TE insertions than the other two populations (for an explanation see above; section hypothesis). The physical coverage has a substantial influence on the power to identify TE insertions: the power increases with the coverage. This may not be a dramatic problem for sequencing individuals where TE abundance quickly saturates, i.e. it asymptotically approaches an upper value with increasing read numbers. However, with Pool-Seq data such an upper limit is not readily reached (Pool-Seq data are frequently unsaturated for TEs) and therefore coverage differences may be the major reason for differences in TE abundance among samples. 
+
+To address this problem we will subsample the ppileup tracks to equal physical coverages between and within samples.

 ** Estimate the optimal target coverage**
-First it will be necessary to estimate the optimal target coverage. This optimum will be a compromise between two opposing considerations. First, the target coverage should be as high as possible, because this increases the power to identify TE insertion and the accuracy of the allele frequency estimates. Second if it is too high we will loose substantial numbers of sites for the analysis.  PoPoolationTE2 only proceeds with sites with sufficient coverage in all samples (>=target coverage). 
-To provide some help for identifying the optimum coverage we will first estimate the coverage distribution in all samples
+However, it will first be necessary to estimate the optimal target coverage. This optimum will be a compromise between two opposing considerations. First, the target coverage should be as high as possible, as this increases the power to identify TE insertions and the accuracy of the allele frequency estimates. Second, if it is too high we will lose a substantial fraction of sites, since PoPoolationTE2 only proceeds with sites having sufficient coverage in all samples (>=target coverage). 
+To provide some assistance in identifying a suitable target coverage we will first estimate the coverage distribution in all samples

 ~~~~~
 :::bash

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 05 Feb 2016 12:49:51 -0000

--- v49
+++ v50
@@ -75,7 +75,7 @@
 java -jar ~pt2/popte2.jar pairupSignatures --signature quick/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier-file walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output quick/Fra_Geo_Gha.teinsertions
 ~~~~~

-Voila, we have identified 311 TE insertions: 108 in France, 109 in Georgia and 92 in Ghana. The coverage between the populations varies, therefore these differences TE abundance may  be an artefact of the coverage. In the next section we demonstrate how to obtain an uniform physcial coverage for the samples.
+Voila, we have identified 311 TE insertions: 108 in France, 109 in Georgia and 92 in Ghana. The coverage varies among the populations and therefore these differences in TE abundance may be an artefact of the coverage. In the next section we demonstrate how to homogenize the physical coverage within and between samples, and thus to homogenize the power to identify TEs.

 ** First results **

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 05 Feb 2016 12:47:47 -0000

--- v48
+++ v49
@@ -7,29 +7,28 @@
 + download and unzip the reads http://sourceforge.net/projects/popoolation-te2/files/walkthrough-reads.zip/download The reads are derived from Bergman and Haddrill 2015 (http://f1000research.com/articles/4-31/v1) and map to a subset of chromosome 2R ranging from position 11Mbp to 13Mbp. These data contain Pool-Seq reads for three *D. melanogaster* populations sampled from France, Georgia and Ghana. 
 + I recommend reading the manual first [Manual]; this helps in understanding some concepts and file formats.

-**Note** for this Walkthrough we already provide the requirements for running PoPoolationTE2, i.e TE-merged-reference and a TE-hierachry [Manual]. The following Walkthroughdemonstrates how these files can be created  [WalkthroughPreparatoryWork] 
+**Note** for this Walkthrough we already provide the requirements for running PoPoolationTE2, i.e. a TE-merged-reference and a TE-hierarchy [Manual]. The following page demonstrates how these files can be created: [WalkthroughPreparatoryWork] 

 # Hypothesis that will be tested #
 In this walkthrough we aim to test two hypotheses.

-+ Hypothesis1: The population from Ghana contains more TE insertions than the populations from Georgia and France. Previous works showed that TE activity and the extend of hybrid dysgenesis is temperature sensitive. We thus want to test the hypothesis that higher temperatures lead to higher TE activity.
-+ Hypothesis2: An accord insertion next to the Cyp6g1 in *D. melanogaster*, confers resistance to insecticides, and thus quickly rose in frequency in *D. melanogaster* populations. It is likely that pesticides are less extensively used in Ghana than in France or Georgia. We thus want to test the hypothesis that the accord insetion, confering resistance to pesticides, has  a lower population frequency in Ghana than in France or Georgia.
++ Hypothesis1: The population from Ghana contains more TE insertions than the populations from Georgia as well as France. Previous works showed that TE activity and the extent of hybrid dysgenesis are temperature sensitive. We thus want to test the hypothesis that higher temperatures lead to higher TE activity.
++ Hypothesis2: An accord insertion next to the Cyp6g1 gene in *D. melanogaster* confers resistance to insecticides and therefore quickly rose in frequency in different *D. melanogaster* populations. It is likely that pesticides are less extensively used in Ghana than in France or Georgia. We thus want to test the hypothesis that the accord insertion, conferring resistance to pesticides, has a lower population frequency in Ghana than in France and Georgia.

 # Identifying TE insertions with PoPoolationTE2#
-Please proceed with the walkthrough in the given order, since files generated in earlier steps may be required at later steps.
+Please proceed with the walkthrough in the given order, since files generated in earlier steps may be required later on.

 ## 1.) Building the ppileup file##
 The ppileup file is the basis for all subsequent analyses and is thus central for identifying TEs with PoPoolationTE2.
- Map reads to the reference genome ##
-
-In this walkthrough the reads will be mapped with a local alignment algorithm bwa bwasw. In a simulation study we found that local alignment algorithm yield a higher power to identify TE insertions than semiglobal algorithm (Bowtie2 --end-to-end; BWA ALN). With bwasw both reads may either be mapped directly to a reference genome or the reads may be mapped separately and the paired-end information may be restored subsequently by using PoPoolationTE2. In a simulation study we found that the second approach has a slightly better performance and we thus demonstrate this approach.
+
+ The reads will be mapped with the local alignment algorithm bwa bwasw. Both reads of a pair will be mapped separately to the TE-merged-reference and the paired-end information will be restored subsequently with PoPoolationTE2 (*se2pe*).

 ** Map reads to the TE-combined-reference**

 ~~~~~
 :::bash
-# Note the sample data set consists of Pool-Seq data for three D. melanogaster populations
+# Note the sample data set consists of Pool-Seq reads for three D. melanogaster populations
 # France, Georgia, Ghana
 bwa index walkthrough-refgenome/2R-603-consensusTE.fasta
 mkdir map
@@ -64,7 +63,7 @@

 ## 2.) Minimum walkthrough ##
-The following example demonstrate the minimum pipeline for identifying TE insertions with PoPoolationTE2. More complexity will gradually be added later on. We need to 1.) identify signatures of TE insertions 2.) estimate the abundance of different ppileup-tracks at the signatures and 3.) pair-up signatures of TE insertions to obtain TE insertions
+The following example demonstrates the minimum pipeline for identifying TE insertions with PoPoolationTE2. More complexity will gradually be added in subsequent steps. We need to 1.) identify signatures of TE insertions, 2.) estimate population frequencies and 3.) pair up signatures of TE insertions, yielding a final list of TE insertions.

 ~~~~~