Recent changes to Manual

Manual modified by Robert Kofler

Robert Kofler — Fri, 29 Apr 2016 12:53:12 -0000

--- v87
+++ v88
@@ -220,6 +220,8 @@
 + --max-structvar-fraction: the maximum allowed frequency of structural variants; all samples need to meet this requirement; default=1.0
 + --help: display a help message

+**Note**: We strongly recommend filtering overlapping TE insertions (*--max-otherte-count*), as their frequency estimates may not be reliable.
+
 ## pairupSignatures##

 This step pairs matching signatures of TE insertions, generating the final result, a list of TE insertions.

Manual modified by Robert Kofler

Robert Kofler — Fri, 19 Feb 2016 22:37:28 -0000

--- v86
+++ v87
@@ -230,6 +230,7 @@
 java -jar popte2.jar pairupSignatures --signature topair.signatures --ref-genome temerged-reference.fasta --hier tehier.txt --output teinsertions.txt 
 ~~~~~

+
 **Parameters**

 + --signature: signatures of TE insertions that ought to be paired; mandatory
@@ -290,7 +291,7 @@
 This option allows to generate coverage statistics for the ppileup file

 ~~~~~
-::: bash
+:::bash
 # minimum parameter call
 java -jar popte2.jar stat-coverage --ppileup input.ppileup --output coverage-statistics.txt 
 ~~~~~
@@ -308,7 +309,7 @@

 ~~~~~
-::: bash
+:::bash
 # minimum parameter call
 java -jar popte2.jar stat-reads --bam input1.bam --hier tehierarchy.txt --output read-stat.txt
 ~~~~~
@@ -328,7 +329,7 @@
 This step generates statistics about mapped paired end fragments, making it possible to estimate the fraction of fragments mapped as proper pairs, as discordant pairs that support a TE insertion, and as discordant pairs that support a structural rearrangement.

 ~~~~~
-::: bash
+:::bash
 # minimum parameter call
 java -jar popte2.jar stat-pairs --bam input1.bam --hier tehierarchy.txt --output read-stat.txt
 ~~~~~

Manual modified by Robert Kofler

Robert Kofler — Fri, 19 Feb 2016 22:34:34 -0000

--- v85
+++ v86
@@ -180,7 +180,7 @@

 ## frequency##

-This step estimates the abundance  of TE insertions and rearrangements  [estimate frequency]
+This step estimates the population frequency  of TE insertions and rearrangements  [estimate frequency]

 ~~~~~~
 :::bash

Manual modified by Robert Kofler

Robert Kofler — Sun, 07 Feb 2016 16:08:43 -0000

--- v84
+++ v85
@@ -129,6 +129,7 @@
 + --dissable-zipped: per default the output is a gzipped ppileup file; by providing this option zipped output may be disabled
 + --sr-min-dist: minimum distance between paired-end reads to count as a structural rearrangement. The inner distance between paired end reads is subject to stochastic variation. However, distances exceeding *--sr-min-dist* will not be treated as stochastic variation, but rather as structural variation (e.g. inversions, rearrangements). **Note** Reads mapping to distinct reference chromosomes are always treated as structural variations (e.g. translocations).  
 + --id-up-quant: upper quantile of inner distance; If for example set to 0.01 the 1% paired end reads with the most extreme inner distance will be ignored. This step is performed after applying *--sr-min-dist*
++ --homogenize-pairs: uses an identical number of mapped paired ends for all samples, i.e. this option homogenizes the number of mapped paired ends. The algorithm first counts the number of informative pairs in all bam files (i.e. pairs supporting a TE, proper pairs, and pairs supporting structural variants), then identifies the smallest number of informative pairs among the samples, and finally subsamples the informative pairs in all bam files (on the fly) to this smallest number. The same number of paired ends will thus be used in each sample for generating the ppileup track (introduced with v1.08.02)
 + --detailed-log: provide more detailed help messages
 + --help: show help

@@ -149,6 +150,7 @@
 + --output: a physical pileup file; will be zipped per default; Mandatory
 + --target-coverage: subsample the coverage at all populations and at all sites to the given value; **Note** that sites with insufficient coverage in ANY sample/population will be ignored;  Mandatory
 + --dissable-zipped: per default the output file is zipped; unzipped output may be obtained by providing this option
++ --with-replace: sample with replacement instead of the default, sampling without replacement; we recommend the default (introduced with v1.08.02)
 + --detailed-log: mostly for troubleshooting; more detailed output can be obtained.
 + --help: show a help message
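
The sampling target described for *--homogenize-pairs* can be illustrated with a minimal sketch. It assumes the informative pairs per bam file have already been counted; the helper name and the counts below are hypothetical and not part of PoPoolationTE2.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PoPoolationTE2):
# print the smallest of the given informative-pair counts.
smallest_pair_count() {
    printf '%s\n' "$@" | sort -n | head -n 1
}

# Assumed example counts of informative pairs in three bam files.
target=$(smallest_pair_count 1200000 950000 1430000)
echo "each sample is sampled down to ${target} informative pairs"
```

PoPoolationTE2 performs the actual subsampling on the fly; the sketch only shows how the target follows from the smallest sample.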

Manual modified by Robert Kofler

Robert Kofler — Wed, 03 Feb 2016 17:10:24 -0000

--- v83
+++ v84
@@ -124,16 +124,16 @@
 + --bam: a bam file; Illumina paired end reads mapped to a TE-merged-reference (see above). At least one bam file must be provided
 + --map-qual: the minimum mapping quality for reads mapping to a reference chromosome. Reference chromosomes are recognized as sequences without corresponding entry in the TE-hierarchy (see above). **Note** this restriction does not apply to reads mapping to TE sequence. For such reads a low mapping quality is in fact expected, especially if several slightly diverged sequences are provided for a TE family. 
 + --hier: the TE hierarchy (see above)
-+ --output: the output, which will be a ppileup file; This major innovation introduced with PoPoolationTE2 facilitates an unbiased comparision of TE abundance between samples/populations  [ppileup file]
++ --output: the output, which will be a ppileup file; this innovation introduced with PoPoolationTE2 facilitates an unbiased comparison of TE abundance between samples/populations  [ppileup file]
 + --te-shortcuts: a list of shortcuts for TEs. Per default PoPoolationTE2 computes a shortcut for every TE family present in the hierarchy. This shortcut is then used in the ppileup file. However, a list of shortcuts may also be provided by the user. Such a custom-list needs to meet the following criteria: a.) a shortcut has to be provided for every family in the TE hierarchy. b.) shortcuts must be unique, i.e. no shortcut may be used for two families. Shortcuts are case insensitive. c.) shortcuts must have distinct uppercase and lowercase values. For example '4a' is a valid shortcut (4a != 4A) but '4' is not (4 = 4).   
 + --dissable-zipped: per default the output is a gzipped ppileup file; by providing this option zipped output may be disabled
-+ --sr-min-dist: minimum distance between paired-end reads as to account as structural rearrangements. The inner distance between paired end reads is subject to stochastic variation. However,  distances between reads exceeding *--sr-min-dist* will not be treated as stochastic variations, but rather as structural variation (e.g. inversions, rearangments) **Note** Reads mapping to distinct reference chromosomes are always considered as structural variations. For example translocations may lead to reads mapping to distinct reference chromosomes.  
++ --sr-min-dist: minimum distance between paired-end reads to count as a structural rearrangement. The inner distance between paired end reads is subject to stochastic variation. However, distances exceeding *--sr-min-dist* will not be treated as stochastic variation, but rather as structural variation (e.g. inversions, rearrangements). **Note** Reads mapping to distinct reference chromosomes are always treated as structural variations (e.g. translocations).  
 + --id-up-quant: upper quantile of inner distance; If for example set to 0.01 the 1% paired end reads with the most extreme inner distance will be ignored. This step is performed after applying *--sr-min-dist*
 + --detailed-log: provide more detailed help messages
 + --help: show help

 ## subsamplePpileup ##
-This step allows to subsample a ppileup-file to an uniform coverage. This leads to an uniform power to identify TE insertions **within** as well as **between** samples/populations and thus enables an unbiased comparision of TE abundance.  
+This step subsamples a ppileup-file to a uniform coverage, thus homogenizing the power to identify TE insertions **within** as well as **between** samples/populations, which in turn enables an unbiased comparison of TE abundance.  

 ~~~~
@@ -145,14 +145,14 @@

 **Parameters**

-+ --ppileup: a physical pileup file
-+ --output: a physical pileup file; will be zipped per default
-+ --target-coverage: subsample the coverage at all populations and at all sites to the provided value; **Note** that sites with insufficient coverage in ANY sample/population will be ignored; 
++ --ppileup: a physical pileup file; Mandatory
++ --output: a physical pileup file; will be zipped per default; Mandatory
++ --target-coverage: subsample the coverage at all populations and at all sites to the given value; **Note** that sites with insufficient coverage in ANY sample/population will be ignored;  Mandatory
 + --dissable-zipped: per default the output file is zipped; unzipped output may be obtained by providing this option
 + --detailed-log: mostly for troubleshooting; more detailed output can be obtained.
 + --help: show a help message

-**Note** during subsampling, the physical coverage from the forward and the reverse direction are treated separately. That is, for every genomic sites actually two subsampling steps are performed, one for the forward coverage and one for the reverse coverage. This may result in breaking up support for the absence of a TE insertions into reverse and forward support; See last section of [ppileup file]
+**Detail** During subsampling, the physical coverage from the forward and the reverse direction is treated separately. Thus, for every genomic site two subsampling steps are performed, one for the forward coverage and one for the reverse coverage. This may result in a slightly different [ppileup file]

 ## identifySignatures ##
 This step allows to identify signatures of TE insertions from the ppileup-file, as explained here [signatures of TE insertions]; Signatures will be reported in the [signature file format] 
@@ -168,17 +168,17 @@

 + --ppileup: a physical pileup file [ppileup file]
 + --output: a signature file  [signature file format] 
-+ --mode:  (separate | joint | separateRefine), PopTE2 allows to identify signatures of TE insertions with three different algorithm. With the *separate* algorithm, TEs are identified in each sample separately, independent of the other samples. With the *joint* algorithm  the ppileup tracks of all samples are merged (internally only) and signatures are identified in this merged ppileup. And finally with the *separateRefine* algorithm, signatures are first  identified in each sample separately and than, in a second step, the position of the TE insertions is refined using the joint data set. For illustrated explanations see [signatures of TE insertions]; For more explanation on the three different modes see  [signature modes]
-+ --min-count the minimum average physical coverage in the window for identifying a signature of a TE insertion; for  details see [signatures of TE insertions] 
-+ --signature-window (fixNNNN | minimumSampleMedian | maximumSampleMedian | median):  signatures of TE insertions are identified using a window based approach (see [signatures of TE insertions]). The window size may be specified with this parameter;  With the default, 'median', the median of the inner distance is used within each sample, where every sample may have a different window size.  With the other three options (*fixNNNN*, *minimumSampleMedian*, *maximumSampleMedia*) an identical window-size will be used for samples/populations. *fixNNNN* allows the user to provide a fixed custom winodw size (e.g. fix120 for a window size of 120); the maximum median is used for all samples/populations with *maximumSampleMedian* and the minimum with *minimumSampleMedian*
-+ --min-valley (fixNNNN | minimumSampleMedian | maximumSampleMedian | median) the minimum size of the valley between two consectuive TE insertions; the average coverage of the valley needs to be lower than *--min-count*; for illustrated explanation see [signatures of TE insertions];  default= the same as for *--signature-window*
-+ --chunk-distance PoPoolationTE2 processes the ppileup in chunks, that is sets of ppileup entries. This options helps to avoid excessive memory consumption as may for example arise when loading entire chromosomes into the memory. If no TE insertion is found for *--chunk-distance* multiplied with the median insert size PoPoolationTE2 proceeds with a new chunk. default=5
++ --mode:  (separate | joint), PoPoolationTE2 allows to identify signatures of TE insertions with two different algorithms. With the *separate* algorithm, TEs are identified in each sample separately, independent of the other samples. With the *joint* algorithm the ppileup tracks of all samples are merged (internally only) and signatures are identified from this merged ppileup track.  For illustrated explanations see [signatures of TE insertions]; for more explanation on the two different modes see  [signature modes]
++ --min-count: the minimum average physical coverage in the window for identifying a signature of TE insertions; for  details see [signatures of TE insertions] 
++ --signature-window (fixNNNN | minimumSampleMedian | maximumSampleMedian | median):  signatures of TE insertions are identified using a window based approach (see [signatures of TE insertions]). The window size may be specified with this parameter. With the default, 'median', the median of the inner distance is used for each sample, so every sample could have a different window size. With the other three options (*fixNNNN*, *minimumSampleMedian*, *maximumSampleMedian*) an identical window size will be used for all samples/populations. *fixNNNN* allows the user to provide a fixed custom window size (e.g. fix120 for a window size of 120); the maximum median is used for all samples/populations with *maximumSampleMedian* and the minimum with *minimumSampleMedian*
++ --min-valley (fixNNNN | minimumSampleMedian | maximumSampleMedian | median): the minimum size of the valley between two consecutive TE insertions; the average coverage of the valley needs to be lower than *--min-count*; for illustrated explanation see [signatures of TE insertions];  default=[the same as *--signature-window*]
++ --chunk-distance: to avoid excessive memory consumption by loading ppileup tracks for entire chromosomes, PoPoolationTE2 processes the ppileup track in chunks. If TE support lower than *--min-count* is found for *--chunk-distance* multiplied by the median insert size, PoPoolationTE2 proceeds with a new chunk. default=5
 + --detailed-log show a detailed log message
 + --help show help 

 ## frequency##

-This step estimates the abundance  abundance of TE insertions and rearrangements  [estimate frequency]
+This step estimates the abundance  of TE insertions and rearrangements  [estimate frequency]

 ~~~~~~
 :::bash
@@ -207,10 +207,10 @@
 **Parameters**

 + --input: the signatures to filter; mandatory [signature file format] 
-+ --output: the filtered sigantures; mandatory [signature file format] 
++ --output: the filtered signatures; mandatory [signature file format] 
 + --min-coverage: the minimum average coverage; all samples need to meet this requirement; default=0
 + --max-coverage: the maximum average coverage; all samples need to meet this requirement; default=infinite
-+ --min-count: the minimum average count of the TE; Only entries of the same family and signature direction (forward or reverse) are considered; At least one sample needs to meet this requirement; default=0
++ --min-count: the minimum average count of the given TE;  At least one sample needs to meet this requirement; default=0
 + --max-otherte-count: the maximum allowed average count of other TEs. All samples need to meet this requirement; default=infinite
 + --max-structvar-count: the maximum allowed average count of structural variants (rearrangements); All samples need to meet this requirement; default=infinite
 + --min-fraction: the minimum required frequency of the TE; only entries of the same family and signature direction (forward or reverse) are considered; at least one sample needs to meet this requirement; default=0.0
@@ -220,7 +220,7 @@

 ## pairupSignatures##

-This step pairs matching signatures of TE insertions, generating the final result, a list of identified TE insertions.
+This step pairs matching signatures of TE insertions, generating the final result, a list of TE insertions.

 ~~~~~
 :::bash
@@ -231,7 +231,7 @@
 **Parameters**

 + --signature: signatures of TE insertions that ought to be paired; mandatory
-+ --ref-genome: the temerged-reference used for mapping the reads; this is necessary as PoPoolationTE2 computes the distance between signatures of TE insertions, but poly-N tracts do not count (otherwise we would bias against reference insertions); mandatory 
++ --ref-genome: the TE-merged-reference used for mapping the reads; this is necessary as PoPoolationTE2 computes the distance between signatures of TE insertions, but poly-N tracts should not be considered (otherwise we would bias against reference insertions); mandatory 
 + --hier: the TE hierarchy; mandatory
 + --output: the final result, a list of TE insertions [TE insertion file]; mandatory
 + --min-distance: the minimum distance between valid pairs of signatures; distance is always computed as position-forward-signature minus position-reverse-signature, hence negative values are possible; default=-100
@@ -243,7 +243,7 @@
 # Secondary tasks#

 ## se2pe ##
-This step restores paired end information for separately mapped reads. For example if read_1.fastq and read_2.fastq were mapped separately with *bwa bwasw*, this tool allows to generate a paired end bam file.
+This step restores paired end information for separately mapped reads. For example if read_1.fastq and read_2.fastq were mapped separately with *bwa bwasw*, this subtask allows to generate a merged bam file with paired end information (e.g. the flags will be set properly, and the position of the mates will be updated).

 ~~~~
 :::bash
@@ -257,13 +257,13 @@
 + --fastq2: the second fastq read; may be zipped; mandatory
 + --bam1: the mapping result for the first read; may be sam or bam (not sorted!); mandatory
 + --bam2: the mapping result for the second read; may be sam or bam (not sorted!); mandatory
-+ --output: the mapping result for both reads, with paired-end information restored (e.g. the flags properly set, and the position of the mates updated); may be sam or bam; mandatory
++ --output: the mapping result for both reads, with paired end information restored (e.g. the flags properly set, and the position of the mates updated); may be sam or bam; mandatory
 + --sort: set this flag for obtaining a sorted output file; PoPoolationTE2 requires sorted files for generating the ppileup file
 + --index: Create an index for the output file
 + --help: Show a help message

 ## updateStrand##
-Per default, the PoPoolationTE2 pipeline does not estimate the strand of a TE insertion (i.e sense or antisense). If the strand information is required, this step may be used.
+Per default, the PoPoolationTE2 pipeline does not estimate the strand of a TE insertion (i.e. sense or antisense). If the strand information is desired, this step may be used.

 ~~~~~
 :::bash
@@ -273,12 +273,12 @@

 **Parameters**

-+ --bam: a  bam file of paired-end reads mapped to the TE-merged-reference; may be provided multiple times;  must be in the same order as used for generating the ppileup file mandatory
++ --bam: a bam file of paired-end reads mapped to the TE-merged-reference; may be provided multiple times; must be in the same order as was used for generating the ppileup file; mandatory
 + --signature: the signatures for which the strand of the TE insertions should be estimated
 + --output: signatures with strand information
 + --hier: the TE hierarchy
-+ --map-qual: the minimum mapping quality of reads mapping to a reference chromosome (not to a TE; here the minimum mapping quality is of course 0)
-+ --max-disagreement: different paired end fragments may disagree on the strand of the TE insertion. If the provided maximum disagreement of paired-end fragments is exceeded the strand will be unknown (character point). For example 0.1 means that at the most 10% of the reads may provide conflicting strand information. mandatory
++ --map-qual: the minimum mapping quality of reads mapping to a reference chromosome (not to a TE)
++ --max-disagreement: different paired end fragments may disagree on the strand of the TE insertion. If the provided maximum disagreement of paired end fragments is exceeded, the strand will be reported as unknown (the character '.'). For example, 0.1 means that at most 10% of the reads may provide conflicting strand information; mandatory
 + --sr-mindist: minimum inner distance for structural rearrangements; if possible provide the same value as used for generating the ppileup; default=10000
 + --id-up-quant: ignore paired-end fragments with an insert size exceeding this fraction;  if possible provide the same value as used for generating the ppileup; default=0.01
 + --detailed-log: show a more detailed logging message
@@ -302,7 +302,7 @@

 ## stat-reads##
-This step allows to generate statistics about the reads mapped to diverse TEs. For example it allows to estimate the fraction of reads mapped to each TE family.
+This step allows to generate statistics about the reads mapped to TE sequences. For example the fraction of reads mapping to each TE family may be computed.

 ~~~~~
@@ -313,32 +313,32 @@

 **Parameters**

-+ --bam:  a  bam file of paired-end reads mapped to the TE-merged-reference; only a single file can be provided; mandatory
++ --bam:  a  bam file of paired end reads mapped to the TE-merged-reference; only a single file can be provided; mandatory
 + --map-qual: the minimum mapping quality of a read mapping to a TE (!); default=0
 + --hier: the TE hierarchy; mandatory
 + --output: the statistics of reads mapping to TEs;  for the format of the output file see [diverse output files]; mandatory 
++ --detailed-log: show a more detailed logging message
++ --help: show the help
+
+
+
+## stat-pairs
+This step generates statistics about mapped paired end fragments, making it possible to estimate the fraction of fragments mapped as proper pairs, as discordant pairs that support a TE insertion, and as discordant pairs that support a structural rearrangement.
+
+~~~~~
+::: bash
+# minimum parameter call
+java -jar popte2.jar stat-pairs --bam input1.bam --hier tehierarchy.txt --output read-stat.txt
+~~~~~
+
+
+
+
+**Parameters**
+
++ --bam:  a  bam file of paired end reads mapped to the TE-merged-reference; only a single file can be provided; mandatory
++ --map-qual: the minimum mapping quality of a read mapping to a TE (!); default=0
++ --hier: the TE hierarchy; mandatory
++ --output: the statistics of paired end fragments;  for the format of the output file see [diverse output files]; mandatory 
 + --detailed-log: show a more detailed logger message
 + --help: show the help
-
-
-
-## stat-pairs
-This step generates statistics about paired-end fragments where one read maps to a TE and the other to a reference chromosome.
-
-~~~~~
-::: bash
-# minimum parameter call
-java -jar popte2.jar stat-pairs --bam input1.bam --hier tehierarchy.txt --output read-stat.txt
-~~~~~
-
-
-
-
-**Parameters**
-
-+ --bam:  a  bam file of paired-end reads mapped to the TE-merged-reference; only a single file can be provided; mandatory
-+ --map-qual: the minimum mapping quality of a read mapping to a TE (!); default=0
-+ --hier: the TE hierarchy; mandatory
-+ --output: the statistics of TE-informative paired end fragments;  for the format of the output file see [diverse output files]; mandatory 
-+ --detailed-log: show a more detailed logger message
-+ --help: show the help
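
As a worked example of the distance definition given for *--min-distance* in pairupSignatures (position of the forward signature minus position of the reverse signature), the following minimal sketch computes the distance for one pair and compares it with the documented default of -100. The helper names and the positions are hypothetical, and treating the threshold as inclusive is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the distance definition; not part of PoPoolationTE2.

# Distance = position of forward signature minus position of reverse signature.
signature_distance() {
    echo $(( $1 - $2 ))
}

# Succeeds if the distance is at least the minimum (default -100, as documented).
# Assumption: the threshold is inclusive.
passes_min_distance() {
    local min=${3:--100}
    local dist
    dist=$(signature_distance "$1" "$2")
    [ "$dist" -ge "$min" ]
}

# Assumed positions: forward signature at 1000, reverse signature at 1050.
echo "distance: $(signature_distance 1000 1050)"   # negative distances are possible
if passes_min_distance 1000 1050; then
    echo "pair passes the default --min-distance"
fi
```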

Manual modified by Robert Kofler

Robert Kofler — Wed, 03 Feb 2016 14:58:13 -0000

--- v82
+++ v83
@@ -100,18 +100,18 @@
     * se2pe restore paired end information for individually mapped reads (e.g. with bwasw)
     * updatestrand estimate strand of signatures of TE insertions
     * stat-coverage calculate physical coverage statistics; helps to decide optimal target coverage for subsampling
-    * stat-reads compute the mapping statistics; reads mapping to different reference chromosomes and TEs
-    * stat-pairs compute the paired-end statistics; reads supporting a TE insertion
+    * stat-reads compute the mapping statistics; statistics about reads mapping to different reference chromosomes and TEs
+    * stat-pairs compute the paired end statistics; statistics about reads supporting a TE insertion
     * version print the version number

 ## Workflow##
-Here is an overview of the workflow for using PopoolationTE2. Mandatory steps are shown with a full line and optional steps with a dashed line. For example one input file (.bam) is required but additional ones may be provided. Files are shown in eliptic frames and steps performed with PoPoolationTE2 in rectangular frames.
+Here is an overview of the workflow for using PoPoolationTE2. Mandatory steps are shown with a full line and optional steps with a dashed line. For example, one input file (.bam) is required but additional ones could be provided. Files are shown in elliptic frames and steps performed with PoPoolationTE2 in rectangular frames.

 [[img src=popte2_flow.png]]

 # Main task#
 ## ppileup##
-This step allows to generate a ppileup file for one or multiple samples [ppileup file]
+This step allows to generate a ppileup (physical pileup) file for one or multiple samples [ppileup file]

 ~~~~
 :::bash
@@ -122,10 +122,10 @@
 **Parameters**

 + --bam: a bam file; Illumina paired end reads mapped to a TE-merged-reference (see above). At least one bam file must be provided
-+ --map-qual: the minimum mapping quality of the reads mapping to a reference chromosome. Reference chromosomes have sequence IDs without corresponding entry in the TE-hierarchy (see above). **Note** this does not apply to reads mapping to TE sequence. For such reads a low mapping quality is in fact expected, especially if several related sequences for a single TE family have been provided. 
++ --map-qual: the minimum mapping quality for reads mapping to a reference chromosome. Reference chromosomes are recognized as sequences without corresponding entry in the TE-hierarchy (see above). **Note** this restriction does not apply to reads mapping to TE sequence. For such reads a low mapping quality is in fact expected, especially if several slightly diverged sequences are provided for a TE family. 
 + --hier: the TE hierarchy (see above)
-+ --output: the output, a ppileup file; This major innovation introduced in PopTE2 facilitates a simple identification of TEs and an unbiased comparision of TE abundance between samples/populations  [ppileup file]
-+ --te-shortcuts: a list of shortcuts for TEs. Per default PoPTE2 computes a shortcut for every TE family in the hierarchy. This shortcut is than used in the ppileup file. However, a list of shortcuts may also be provided by the user. Such a custom-list needs to meet the following criteria: a.) a shortcut has to be provided for every family in the TE hierarchy. b.) shortcuts must be unique, i.e. no shortcut may be used for two families. Shortcuts are case insensitive. c.) shortcuts must have distinct uppercase and lowercase values. For example '4a' is a valid shortcut (4a != 4A) but '4' is not (4 = 4).   
++ --output: the output, which will be a ppileup file; this major innovation introduced with PoPoolationTE2 facilitates an unbiased comparison of TE abundance between samples/populations  [ppileup file]
++ --te-shortcuts: a list of shortcuts for TEs. Per default PoPoolationTE2 computes a shortcut for every TE family present in the hierarchy. This shortcut is then used in the ppileup file. However, a list of shortcuts may also be provided by the user. Such a custom-list needs to meet the following criteria: a.) a shortcut has to be provided for every family in the TE hierarchy. b.) shortcuts must be unique, i.e. no shortcut may be used for two families. Shortcuts are case insensitive. c.) shortcuts must have distinct uppercase and lowercase values. For example '4a' is a valid shortcut (4a != 4A) but '4' is not (4 = 4).   
 + --dissable-zipped: per default the output is a gzipped ppileup file; by providing this option zipped output may be disabled
 + --sr-min-dist: minimum distance between paired-end reads as to account as structural rearrangements. The inner distance between paired end reads is subject to stochastic variation. However,  distances between reads exceeding *--sr-min-dist* will not be treated as stochastic variations, but rather as structural variation (e.g. inversions, rearangments) **Note** Reads mapping to distinct reference chromosomes are always considered as structural variations. For example translocations may lead to reads mapping to distinct reference chromosomes.  
 + --id-up-quant: upper quantile of inner distance; If for example set to 0.01 the 1% paired end reads with the most extreme inner distance will be ignored. This step is performed after applying *--sr-min-dist*
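
Criteria b.) and c.) for a custom *--te-shortcuts* list can be checked before running ppileup. The following is a minimal sketch, assuming the shortcuts are passed as plain arguments; the function name is hypothetical and the format of the shortcut file itself is not covered here. Criterion a.) (one shortcut per family) is not checked.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of PoPoolationTE2) for criteria b.) and c.).
valid_shortcuts() {
    local s upper lower seen=" "
    for s in "$@"; do
        upper=$(printf '%s' "$s" | tr '[:lower:]' '[:upper:]')
        lower=$(printf '%s' "$s" | tr '[:upper:]' '[:lower:]')
        # Criterion c: uppercase and lowercase forms must differ ('4' fails, '4a' passes).
        if [ "$upper" = "$lower" ]; then
            return 1
        fi
        # Criterion b: no shortcut may occur twice, ignoring case.
        case "$seen" in
            *" $upper "*) return 1 ;;
        esac
        seen="$seen$upper "
    done
    return 0
}

# Assumed example lists.
if valid_shortcuts 4a roo P; then echo "shortcut list accepted"; fi
if ! valid_shortcuts 4 roo; then echo "shortcut list rejected"; fi
```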

Manual modified by Robert Kofler

Robert Kofler — Wed, 03 Feb 2016 14:47:29 -0000

--- v81
+++ v82
@@ -53,29 +53,33 @@
 P-element     P-element TIR
 ~~~~~

-The following walkthrough demonstrates how the TE-merged-reference and the TE hierarch can be generated
+The following walkthrough demonstrates how the TE-merged-reference and the TE hierarchy can be generated

 **[WalkthroughPreparatoryWork]**

 ## Mapping PE reads to TE-merged-reference##
+
+We recommend using a local alignment algorithm (bwa bwasw, bwa mem, bowtie2 --local) for mapping reads to the TE-merged-reference. PoPoolationTE2 requires a **sorted** bam file as input. **Note** that a separate bam file is required for every sample (read groups are not supported). If you use *bwa mem* it is important that you provide the *-M* option, which ensures that secondary alignments are marked as such.
+
+The following walkthrough shows how these bam files may be generated [Walkthrough]

 # First steps with PoPoolationTE2 #
 ## Download ##

 PoPoolationTE2 is available as Java jar file for download here: https://sourceforge.net/projects/popoolation-te2/
-Since PoPoolationTE2 is implemented in Java it can be run on most operating systems including Windows, Mac OS X, and Linux
+Since it is implemented in Java it can be run on most operating systems including Windows, Mac OS X, and Linux.

 ## Run PoPoolationTE2

-PopTE2 supports variable tasks. Display all possible tasks by starting PopTE2 without any parameters
+PoPoolationTE2 supports various tasks. Display all possible tasks by starting PoPoolationTE2 without providing any parameters

 ~~~~
 :::bash
 java  -jar popte2.jar
 ~~~~

-Than run the task of interest by providing the ID of the task as the first argument. For example to display the version number run 
+Then run any subtask by providing the name of the task as the first argument. For example, to display the version number run:

 ~~~~
 :::bash
@@ -83,17 +87,17 @@
 ~~~~

 ## List of supported tasks##
-PoPoolationTE2 supports a several subtasks, where the name of the subtask needs to be provided as first parameter.
-In general several Main tasks and a couple of Secondary tasks are supported.
+PoPoolationTE2 supports several subtasks.  The name of the subtask needs to be provided as first parameter. We distinguish  *Main tasks* (necessary for an unbiased comparison of TE abundance) and  *Secondary tasks* (helpful, but not essential).

 + Main tasks
     * ppileup Generate a ppileup file
     * subsamplePpileup subsample ppileup files to an uniform coverage
-    * identifySignatures identify signatures of TE insertions frequency estimate population frequencies for signatures of TE insertions
+    * identifySignatures identify signatures of TE insertions
+    * frequency estimate population frequencies for signatures
     * filterSignatures filter signatures of TE insertions
     * pairupSignatures pair up signatures of TE insertions to obtain TE insertions
 + Secondary tasks
-    * se2pe obtain a paired-end bam-file for individually mapped (e.g. bwasw) output files
+    * se2pe restore paired-end information for individually mapped reads (e.g. bwasw output files)
     * updatestrand estimate strand of signatures of TE insertions
     * stat-coverage calculate physical coverage statistics; helps to decide optimal target coverage for subsampling
     * stat-reads compute the mapping statistics; reads mapping to different reference chromosomes and TEs

Manual modified by Robert Kofler

Robert Kofler — Wed, 03 Feb 2016 14:37:22 -0000

--- v80
+++ v81
@@ -38,7 +38,9 @@
 **Note** the base substitutions in the roo sequences. PoPoolationTE2 allows providing multiple sequences for every TE family, so that diverged TE copies may also be identified. The hierarchy (see below) allows assigning these different sequences to one family (*roo*).

 ## TE hierarchy##
-The TE hierarchy serves two purposes. First it allows to distinguish TE sequences from rerference chromosomes, where every sequence in the TE-merged-reference (see above) with an corresponding entry in the hierarchy is considered a TE sequence and every entry without a reference chromosome. And second, it allows to assign multiple slightly diverged sequences to one family. This is necessary as  for many TE families, the different insertions frequently have more or less diverged sequences. In PopTE2 all reads mapping to any of these diverged sequences are treated as mapping to the family, and this is achieved by the hierarchy. Using the above example any read mapping to roo_from_2L, roo_from_2R, roo_consensus is treated as mapping to the roo family if, for example, the following hierarchy is used.
+The TE hierarchy serves two purposes. First, it allows distinguishing TE sequences from reference chromosomes (see above), and second, it allows assigning multiple slightly diverged sequences to one family.
+Some TE families have highly diverged copies (e.g. INE-1 in Drosophila, with up to 10% sequence divergence), and this feature ensures that even highly diverged copies can be identified.
+Based on the hierarchy, all reads mapping to any of these diverged sequences are recognized as mapping to the same family. Using the above example, any read mapping to *roo_from_2L*, *roo_from_2R* or *roo_consensus* is treated as mapping to the *roo* family, provided the following hierarchy is used.

 ~~~~~
 id            family    order

Manual modified by Robert Kofler

Robert Kofler — Wed, 03 Feb 2016 14:29:47 -0000

--- v79
+++ v80
@@ -5,16 +5,16 @@

 # Prerequisites#

-*  Java
-* a short read mapper, like BWA SW
+* Java
+* a short read mapper, like BWA SW (we recommend a local alignment algorithm)
 * a modified reference genome, the TE-merged-reference (see below)
 * a TE hierarchy (see below)
-* Illumina paired-end reads for at least one sample, where samples could be pooled populations, tissues or sequenced individuals
+* paired-end reads for at least one sample, where samples could be pooled populations, tissues or sequenced individuals

 # Preparatory work#
-First it is necessary to create a TE-merged-reference and a TE-hierarchy. Than Illumina paired-end reads need to be mapped to the TE-merged-reference
+First it is necessary to create a TE-merged-reference and a TE-hierarchy. Next, paired-end reads need to be mapped to the TE-merged-reference.
 ## TE-merged-reference##
-The TE-merged-reference, which consists of i) the repeat-masked reference genome and ii) TE sequences.  TE sequences could either be consensus sequences of TE families (e.g. from RepBase) or the sequences that have been masked in the reference genome or both.
+The TE-merged-reference consists of i) the repeat-masked reference genome and ii) TE sequences.  TE sequences could either be consensus sequences of TE families (e.g. from RepBase) or the sequences which have been masked in the reference genome or both.
 The TE-merged-reference is in the fasta format. 
 An example:

@@ -33,9 +33,9 @@

 Here, *2L* and *2R* are reference chromosomes. *roo_from_2L* is the sequence of the roo transposable element that was masked in *2L* (*NNNNNN* in *2L*), *roo_from_2R* is the sequence of roo in *2R* and *roo_consensus* is the consensus sequence of roo.  
-PopTE2 distinguishes TEs from reference chromosomes based on the TE-hierarch (see below). Every sequence, where the ID (in the fasta header following the >) has an corresponding entry in the TE-hierarchy is considered a TE, every sequence where the ID is not found in the hierarchy is considered  a reference chromosome.
-
-**Note** the base substitutions in the three sequences of roo. PoPoolation TE2 allows to provide multiple sequences for every TE family. Therefore also more divereged copies can be identifed. The hierarchy (see below) allows to assign these different sequences to one family, in this case roo.
+PopTE2 needs to distinguish TE sequences from reference chromosomes. This is accomplished by using the TE-hierarchy (see below). Every sequence in the fasta file with a corresponding entry in the TE-hierarchy is considered a TE, while every sequence without an entry is considered a reference chromosome.
+
+**Note** the base substitutions in the roo sequences. PoPoolationTE2 allows providing multiple sequences for every TE family, so that diverged TE copies may also be identified. The hierarchy (see below) allows assigning these different sequences to one family (*roo*).

 ## TE hierarchy##
 The TE hierarchy serves two purposes. First it allows to distinguish TE sequences from rerference chromosomes, where every sequence in the TE-merged-reference (see above) with an corresponding entry in the hierarchy is considered a TE sequence and every entry without a reference chromosome. And second, it allows to assign multiple slightly diverged sequences to one family. This is necessary as  for many TE families, the different insertions frequently have more or less diverged sequences. In PopTE2 all reads mapping to any of these diverged sequences are treated as mapping to the family, and this is achieved by the hierarchy. Using the above example any read mapping to roo_from_2L, roo_from_2R, roo_consensus is treated as mapping to the roo family if, for example, the following hierarchy is used.
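
The classification rule described in this revision (a sequence with an entry in the hierarchy is a TE, any other sequence is a reference chromosome) can be illustrated with a small shell sketch. This is not part of PoPoolationTE2 and does not show how the tool implements the lookup. The file names, the toy sequences, the `classify` helper, the `LTR` order value and the tab-separated layout of the hierarchy are assumptions made for illustration; only the column names (id, family, order) and the sequence IDs come from the manual.

```shell
#!/bin/sh
# Sketch only: toy data and hypothetical file names, not PoPoolationTE2 code.
set -eu

workdir=$(mktemp -d)

# Toy TE-merged-reference: two repeat-masked chromosomes plus three roo sequences.
cat > "$workdir/temerged-reference.fasta" <<'EOF'
>2L
ACGTACGTNNNNNNACGTACGT
>2R
TTGACCGTNNNNNNGGATCAGT
>roo_from_2L
AGGTCA
>roo_from_2R
AGCTCA
>roo_consensus
AGGTCA
EOF

# Toy hierarchy with the columns id, family, order (tab-separated by assumption).
printf 'id\tfamily\torder\n'        >  "$workdir/tehier.txt"
printf 'roo_from_2L\troo\tLTR\n'    >> "$workdir/tehier.txt"
printf 'roo_from_2R\troo\tLTR\n'    >> "$workdir/tehier.txt"
printf 'roo_consensus\troo\tLTR\n'  >> "$workdir/tehier.txt"

# classify HIERARCHY FASTA
# Prints one line per fasta record: id, TE or reference, and the family (or -).
classify() {
    awk '
        # First file (hierarchy): remember the family of every id, skipping the header.
        NR == FNR { if (FNR > 1) family[$1] = $2; next }
        # Second file (fasta): look up each header id in the hierarchy.
        /^>/ {
            id = substr($1, 2)
            if (id in family) { print id "\tTE\t" family[id] } else { print id "\treference\t-" }
        }
    ' "$1" "$2"
}

classify "$workdir/tehier.txt" "$workdir/temerged-reference.fasta"
```

With this toy input, *2L* and *2R* are reported as reference chromosomes, while the three roo sequences are reported as TEs of the family *roo*. A check of this kind may help to spot IDs that are misspelled in either file before mapping.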

Manual modified by Robert Kofler

Robert Kofler — Wed, 03 Feb 2016 12:13:58 -0000

--- v78
+++ v79
@@ -9,7 +9,7 @@
 * a short read mapper, like BWA SW
 * a modified reference genome, the TE-merged-reference (see below)
 * a TE hierarchy (see below)
-* Illumina paired-end reads for a sample (e.g.: pooled population, tissue, individual specimen)
+* Illumina paired-end reads for at least one sample, where samples could be pooled populations, tissues or sequenced individuals

 # Preparatory work#
 First it is necessary to create a TE-merged-reference and a TE-hierarchy. Than Illumina paired-end reads need to be mapped to the TE-merged-reference