Recent changes to Walkthrough_species_tool_compatibility

Walkthrough_species_tool_compatibility modified by Robert Kofler

Robert Kofler — Tue, 25 Jul 2017 13:02:19 -0000

--- v14
+++ v15
@@ -3,17 +3,16 @@
 # Introduction
 Assume you want to identify TE insertions in a species of interest, say *Dunkleosteus terrelli*, but you do not know whether a given tool provides reliable results for this species or whether the available genomic resources are suitable.

-Basically the performance of an approach for TE identification will depend on three factors
+Basically the performance of an approach for TE identification will depend on three factors:

  * the reference genome (e.g. TE identification will be difficult for highly repetitive genomes)
  * the TE sequences (e.g. TE identification may be difficult when TE sequences have a high sequence similarity)
  * the tool (some algorithms are just better than others; also there may be interactions between tools and genomic resources, for example if a tool is very sensitive to repetitive regions in genomes)

-This raises the question which tool performs the best for a given combination of reference genome and TE sequences. 
-To address this question it is necessary to simulate a TE landscape with known insertions and then test which tool best reproduces the simulated TE landscape.
+Thus it is necessary to evaluate the performance of an approach for TE identification.
+To do so, it is necessary to simulate a TE landscape with known insertions (using the genomic resources of the species of interest) and then test which tool best reproduces the simulated TE landscape.

-
-In this walkthrough we demonstrate how to simulate a  TE landscape and Illumina paired end reads for a species of interest. These reads may then be used  to evaluate the performance of tools for TE identification.
+In this walkthrough we demonstrate how to simulate a TE landscape and Illumina paired end reads for a species of interest. These reads may then be used  to evaluate the performance of an approach for TE identification.

 # Walkthrough:
@@ -38,11 +37,16 @@
 ~~~~~~

 ##build the population genome
+
+Next we build the population genome:
+
 ~~~~~
 python build-population-genome.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --pgd mylandscape.pgd --output mylandscape.pg
 ~~~~~

 ## simulate Illumina paired end reads:
+
+Based on the population genome we simulate Illumina paired-end reads:

 ~~~~~
 python read_pool-seq_illumina-PE.py --pg mylandscape.pg --read-length 100 --inner-distance 100 --std-dev 20 --error-rate 0.01 --reads 100000 --fastq1 reads_1.fastq --fastq2 reads_2.fastq
@@ -50,10 +54,10 @@

 ## next steps

-The generated reads may then be used as input for tools identifying TE insertions using Pool-Seq data, such as PoPoolationTE2 or TEMP. 
-An example, demonstrating TE identification with the simulated reads, can be found here: [Validation_Pop2]
+The obtained Illumina paired-end reads may  be used as input for tools identifying TE insertions using Pool-Seq data, such as PoPoolationTE2 or TEMP. 
+An example, demonstrating TE identification with the simulated reads and a comparison between the expected and observed TE landscape, can be found here: [Validation_Pop2]

-The TE identification pipeline differs substantially among the tools for TE identification from NGS data. Moreover the pipeline may also change substantially with the version of the tool. For this reason  we refer to the manuals of the respective tool for details. For example the following tools may be used with SimulaTE data:
+The TE identification pipelines differ substantially among the tools for TE identification from NGS data. Moreover the pipeline may also change substantially with the version of the tool. Hence, we refer to the manual of the respective tool for details. Just to show a few examples, the following tools may be used with SimulaTE data:

  * PoPoolationTE2 https://sourceforge.net/projects/popoolation-te2/
  * T-LeX2 https://academic.oup.com/nar/article/43/4/e22/2410985/T-lex2-genotyping-frequency-estimation-and-re
@@ -64,10 +68,10 @@
  * Jitterbug https://www.ncbi.nlm.nih.gov/pubmed/26459856

 #Note
-In this walkthrough we simulated Illumina paired end data when sequencing the population as a pool (Pool-Seq). SimulaTE however also allows simulating
+In this walkthrough we simulated Illumina paired-end data when sequencing the population as a pool (Pool-Seq). SimulaTE however also allows simulating

-* Illumina paired end data when individuals of a population are sequenced separately
+* Illumina paired-end data for sequencing individuals of a population separately
 * PacBio data when individuals are sequenced as pools
 * PacBio data when individuals are sequenced separately
-* Illumina single end data when individuals are sequenced as pools
-* Illumina single end data when individuals are sequenced separately
+* Illumina single-end data when individuals are sequenced as pools
+* Illumina single-end data when individuals are sequenced separately
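
The comparison between the expected and the observed TE landscape mentioned under "next steps" can be sketched in a few lines of Python. This is only a minimal sketch and not the procedure used in [Validation_Pop2]. It assumes that both the simulated and the detected insertions have already been parsed into `(chromosome, position, family)` tuples; the function name `compare_landscapes`, the tuple layout and the 100 bp position tolerance are hypothetical choices made for illustration, and no output format of any particular tool is implied.

```python
def compare_landscapes(expected, observed, tolerance=100):
    """Compare simulated (expected) with detected (observed) TE insertions.

    Both arguments are iterables of (chromosome, position, family) tuples.
    An observed insertion is accepted as a match if it lies on the same
    chromosome, belongs to the same TE family and is at most `tolerance`
    bp away from an expected insertion. Each observed insertion can be
    matched only once.
    """
    expected = list(expected)
    unmatched = list(observed)
    true_pos = 0
    for chrom, pos, family in expected:
        best = None
        for cand in unmatched:
            c_chrom, c_pos, c_family = cand
            if c_chrom != chrom or c_family != family:
                continue
            if abs(c_pos - pos) > tolerance:
                continue
            # keep the closest candidate seen so far
            if best is None or abs(c_pos - pos) < abs(best[1] - pos):
                best = cand
        if best is not None:
            unmatched.remove(best)
            true_pos += 1
    false_neg = len(expected) - true_pos
    false_pos = len(unmatched)
    n_observed = true_pos + false_pos
    return {
        "true_positives": true_pos,
        "false_negatives": false_neg,
        "false_positives": false_pos,
        "sensitivity": true_pos / len(expected) if expected else float("nan"),
        "precision": true_pos / n_observed if n_observed else float("nan"),
    }


if __name__ == "__main__":
    # made-up toy data, not results of any tool
    simulated = [("2R", 1000, "roo"), ("2R", 5000, "412"), ("2R", 9000, "roo")]
    detected = [("2R", 1020, "roo"), ("2R", 5300, "412"), ("2R", 9000, "roo")]
    print(compare_landscapes(simulated, detected))
```

Running the same comparison for several tools on the same simulated reads gives comparable sensitivity and precision values, which is one way to judge which tool best reproduces the simulated TE landscape.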

Walkthrough_species_tool_compatibility modified by Robert Kofler

Robert Kofler — Tue, 25 Jul 2017 12:51:32 -0000

--- v13
+++ v14
@@ -1,13 +1,13 @@
 [TOC]

 # Introduction
-Assume you want to identify TE insertions in a species of interest, say *Dunkleosteus terrelli*, but you do not know whether a given tool provides reliable results for this species. 
+Assume you want to identify TE insertions in a species of interest, say *Dunkleosteus terrelli*, but you do not know whether a given tool provides reliable results for this species or whether the available genomic resources are suitable.

-Basically the performance of a tool for TE identification will depend on three factors
+Basically the performance of an approach for TE identification will depend on three factors

  * the reference genome (e.g. TE identification will be difficult for highly repetitive genomes)
  * the TE sequences (e.g. TE identification may be difficult when TE sequences have a high sequence similarity)
- * the tool (some algorithms are just better than others)
+ * the tool (some algorithms are just better than others; also there may be interactions between tools and genomic resources, for example if a tool is very sensitive to repetitive regions in genomes)

 This raises the question which tool performs the best for a given combination of reference genome and TE sequences. 
 To address this question it is necessary to simulate a TE landscape with known insertions and then test which tool best reproduces the simulated TE landscape.

Walkthroug_species_tool_compatibility modified by Robert Kofler

Robert Kofler — Tue, 11 Jul 2017 14:45:57 -0000

--- v12
+++ v13
@@ -53,7 +53,7 @@
 The generated reads may then be used as input for tools identifying TE insertions using Pool-Seq data, such as PoPoolationTE2 or TEMP. 
 An example, demonstrating TE identification with the simulated reads, can be found here: [Validation_Pop2]

-TE identification requires different steps with each tool, so we refer to the manuals of these tools for details. For example the following tools may be used with SimulaTE data:
+The TE identification pipeline differs substantially among the tools for TE identification from NGS data. Moreover the pipeline may also change substantially with the version of the tool. For this reason  we refer to the manuals of the respective tool for details. For example the following tools may be used with SimulaTE data:

  * PoPoolationTE2 https://sourceforge.net/projects/popoolation-te2/
  * T-LeX2 https://academic.oup.com/nar/article/43/4/e22/2410985/T-lex2-genotyping-frequency-estimation-and-re

Walkthroug_species_tool_compatibility modified by Robert Kofler

Robert Kofler — Tue, 11 Jul 2017 14:44:23 -0000

--- v11
+++ v12
@@ -53,6 +53,16 @@
 The generated reads may then be used as input for tools identifying TE insertions using Pool-Seq data, such as PoPoolationTE2 or TEMP. 
 An example, demonstrating TE identification with the simulated reads, can be found here: [Validation_Pop2]

+TE identification requires different steps with each tool, so we refer to the manuals of these tools for details. For example the following tools may be used with SimulaTE data:
+
+ * PoPoolationTE2 https://sourceforge.net/projects/popoolation-te2/
+ * T-LeX2 https://academic.oup.com/nar/article/43/4/e22/2410985/T-lex2-genotyping-frequency-estimation-and-re
+ * TEMP https://www.ncbi.nlm.nih.gov/pubmed/24753423
+ * LoRTE https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5385071/
+ * Retroseq https://www.ncbi.nlm.nih.gov/pubmed/23233656
+ * TE-Tracker https://www.ncbi.nlm.nih.gov/pubmed/25408240
+ * Jitterbug https://www.ncbi.nlm.nih.gov/pubmed/26459856
+
 #Note
 In this walkthrough we simulated Illumina paired end data when sequencing the population as a pool (Pool-Seq). SimulaTE however also allows simulating

Walkthroug_species_tool_compatibility modified by Robert Kofler

Robert Kofler — Tue, 11 Jul 2017 14:38:54 -0000

--- v10
+++ v11
@@ -48,3 +48,16 @@
 python read_pool-seq_illumina-PE.py --pg mylandscape.pg --read-length 100 --inner-distance 100 --std-dev 20 --error-rate 0.01 --reads 100000 --fastq1 reads_1.fastq --fastq2 reads_2.fastq
 ~~~~~

+## next steps
+
+The generated reads may then be used as input for tools identifying TE insertions using Pool-Seq data, such as PoPoolationTE2 or TEMP. 
+An example, demonstrating TE identification with the simulated reads, can be found here: [Validation_Pop2]
+
+#Note
+In this walkthrough we simulated Illumina paired end data when sequencing the population as a pool (Pool-Seq). SimulaTE however also allows simulating
+
+* Illumina paired end data when individuals of a population are sequenced separately
+* PacBio data when individuals are sequenced as pools
+* PacBio data when individuals are sequenced separately
+* Illumina single end data when individuals are sequenced as pools
+* Illumina single end data when individuals are sequenced separately

Walkthroug_species_tool_compatibility modified by Robert Kofler

Robert Kofler — Tue, 11 Jul 2017 14:34:45 -0000

--- v9
+++ v10
@@ -34,11 +34,17 @@
 In this walkthrough we generate a random TE landscape, with random position, family, strand and population frequency of TE insertions. For a walkthrough demonstrating how to generate custom TE landscapes see  [Walkthrough]

 ~~~~~~
-python ~/dev/simulate/define-landscape_random-insertions-freq-range.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --insert-count 1000 --min-freq 0.1 --max-freq 0.9 --min-distance 500 --N 100 --output mylandscape.pgd 
+python define-landscape_random-insertions-freq-range.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --insert-count 1000 --min-freq 0.1 --max-freq 0.9 --min-distance 500 --N 100 --output mylandscape.pgd 
 ~~~~~~

 ##build the population genome
 ~~~~~
-python ~/dev/simulate/build-population-genome.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --pgd mylandscape.pgd --output mylandscape.pg
+python build-population-genome.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --pgd mylandscape.pgd --output mylandscape.pg
 ~~~~~

+## simulate Illumina paired end reads:
+
+~~~~~
+python read_pool-seq_illumina-PE.py --pg mylandscape.pg --read-length 100 --inner-distance 100 --std-dev 20 --error-rate 0.01 --reads 100000 --fastq1 reads_1.fastq --fastq2 reads_2.fastq
+~~~~~
+

Walkthroug_species_tool_compatibility modified by Robert Kofler

Robert Kofler — Tue, 11 Jul 2017 14:04:31 -0000

Walkthroug_species_tool_compatibility modified by Robert Kofler

Robert Kofler — Tue, 11 Jul 2017 13:39:15 -0000

--- v7
+++ v8
@@ -27,14 +27,18 @@

 ~~~~~
 RepeatMasker -gccalc -s -cutoff 200 -no_is -nolow -norna -gff -u -pa 4 -lib teseq-clean-ml100noS4.fasta 2R.fasta
-python remove-N.py 2R.fasta.masked >chassis.fasta
+python remove-N.py 2R.fasta.masked >2R.clean.fasta
 ~~~~~

 ## generate a TE landscape
-In this walkthrough we generate a random TE landscape, with random position, family, strand and population frequency of TE insertions. For a demonstration on building more complex landscapes see here:
+In this walkthrough we generate a random TE landscape, with random position, family, strand and population frequency of TE insertions. For a walkthrough demonstrating how to generate custom TE landscapes see  [Walkthrough]

 ~~~~~~
 python ~/dev/simulate/define-landscape_random-insertions-freq-range.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --insert-count 1000 --min-freq 0.1 --max-freq 0.9 --min-distance 500 --N 100 --output mylandscape.pgd 
 ~~~~~~

+##build the population genome
+~~~~~
+python ~/dev/simulate/build-population-genome.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --pgd mylandscape.pgd --output mylandscape.pg
+~~~~~

Walkthroug_species_tool_compatibility modified by Robert Kofler

Robert Kofler — Tue, 11 Jul 2017 13:36:08 -0000

--- v6
+++ v7
@@ -30,4 +30,11 @@
 python remove-N.py 2R.fasta.masked >chassis.fasta
 ~~~~~

+## generate a TE landscape
+In this walkthrough we generate a random TE landscape, with random position, family, strand and population frequency of TE insertions. For a demonstration on building more complex landscapes see here:

+~~~~~~
+python ~/dev/simulate/define-landscape_random-insertions-freq-range.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --insert-count 1000 --min-freq 0.1 --max-freq 0.9 --min-distance 500 --N 100 --output mylandscape.pgd 
+~~~~~~
+
+

Walkthroug_species_tool_compatibility modified by Robert Kofler

Robert Kofler — Tue, 11 Jul 2017 13:30:23 -0000

--- v5
+++ v6
@@ -4,9 +4,10 @@
 Assume you want to identify TE insertions in a species of interest, say *Dunkleosteus terrelli*, but you do not know whether a given tool provides reliable results for this species.

 Basically the performance of a tool for TE identification will depend on three factors
-* the reference genome (e.g. TE identification will be difficult for highly repetitive genomes)
-* the TE sequences (e.g. TE identification may be difficult when TE sequences have a high sequence similarity)
-* the tool (some algorithms are just better than others)
+
+ * the reference genome (e.g. TE identification will be difficult for highly repetitive genomes)
+ * the TE sequences (e.g. TE identification may be difficult when TE sequences have a high sequence similarity)
+ * the tool (some algorithms are just better than others)

 This raises the question which tool performs the best for a given combination of reference genome and TE sequences. 
 To address this question it is necessary to simulate a TE landscape with known insertions and then test which tool best reproduces the simulated TE landscape.
@@ -22,11 +23,11 @@
 * sequences of the TEs that should be identified; this could be consensus sequences of the TE families present in the species of interest; In this walkthrough we use the consensus sequences of *Drosophila melanogaster* TEs: https://sourceforge.net/projects/simulates/files/walkthrough-species/teseq-clean-ml100noS4.fasta/download

 ## Mask the TEs
-We need to be able to build arbitrarily complex TE landscapes, and TE insertions already present in the reference genome would interfere with this process. Thus  we need to mask the TE sequences in the reference genome. We use RepeatMasker to mask all TEs with the character *N* and then a custom script to remove all *N*s from the sequence.
+We need to be able to build arbitrarily complex TE landscapes, and TE insertions already present in the reference genome would interfere with this process. Thus  we need to mask the TE sequences in the reference genome. We use RepeatMasker to mask all TEs with the character *N* and then a custom script to remove all *N*s from the sequence: https://sourceforge.net/projects/simulates/files/walkthrough-species/remove-N.py/download

 ~~~~~
 RepeatMasker -gccalc -s -cutoff 200 -no_is -nolow -norna -gff -u -pa 4 -lib teseq-clean-ml100noS4.fasta 2R.fasta
-python remove-N.py 2R.fasta.masked >2R.cleaned.fasta
+python remove-N.py 2R.fasta.masked >chassis.fasta
 ~~~~~
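
The N-removal step described in the hunk above can be illustrated with a short Python sketch. This is not the `remove-N.py` script distributed with the walkthrough, whose source is not shown in this change feed. The helper names `read_fasta`, `strip_masked_bases` and `format_fasta` are hypothetical, and the sketch assumes that masked bases appear as `N` or `n` in the RepeatMasker output and that the cleaned sequence is written with 60 bases per line.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def strip_masked_bases(sequence):
    """Return the sequence without masked bases (assumed to be N or n)."""
    return sequence.replace("N", "").replace("n", "")


def format_fasta(header, sequence, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + lines) + "\n"


if __name__ == "__main__":
    # toy input standing in for a masked reference such as 2R.fasta.masked
    masked = io.StringIO(">2R\nACGTNNNNACGT\nNNTTGG\n")
    for header, sequence in read_fasta(masked):
        print(format_fasta(header, strip_masked_bases(sequence)), end="")
```

Note that a filter of this kind removes every `N`, so assembly gaps that were already present in the reference before masking are removed together with the masked TEs.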