Recent changes to Walkthrough

Walkthrough modified by Robert Kofler

Robert Kofler — Mon, 11 Dec 2017 10:42:37 -0000

--- v28
+++ v29
@@ -9,7 +9,7 @@
 ## Download the data:

 * chassis, i.e. the sequence into which TEs will be inserted: https://sourceforge.net/projects/simulates/files/validation_pop2/chasis1M.fasta/download
-* TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/tehier-ml100noS4.txt/download
+* TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/teseq-clean-ml100noS4.fasta/download

 ## generating the population genome definition (pgd)

@@ -56,7 +56,7 @@
 ## Download the data:

 * chassis, i.e. the sequence into which TEs will be inserted: https://sourceforge.net/projects/simulates/files/validation_pop2/chasis1M.fasta/download
-* TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/tehier-ml100noS4.fasta/download
+* TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/teseq-clean-ml100noS4.fasta/download

 ## Generate an empty template of a TE landscape

Walkthrough modified by Robert Kofler

Robert Kofler — Mon, 11 Dec 2017 10:39:51 -0000

--- v27
+++ v28
@@ -9,7 +9,7 @@
 ## Download the data:

 * chassis, i.e. the sequence into which TEs will be inserted: https://sourceforge.net/projects/simulates/files/validation_pop2/chasis1M.fasta/download
-* TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/tehier-ml100noS4.fasta/download
+* TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/tehier-ml100noS4.txt/download

 ## generating the population genome definition (pgd)

Walkthrough modified by Robert Kofler

Robert Kofler — Tue, 25 Jul 2017 12:24:08 -0000

--- v26
+++ v27
@@ -3,7 +3,7 @@
 # Walkthrough 1: Random insertions; Pool-Seq; Illumina paired-end sequencing

 ## Introduction
-First a short walkthrough that demonstrates how to generate a random TE landscape and simulate Illumina paired-end reads for this landscape.
+This short walkthrough demonstrates how to generate a random TE landscape and how to simulate Illumina paired-end reads for this landscape.
 We simulate 1000 TE insertions in a population of N=100 haploid genomes. The TE insertions have a random position, family, strand and population frequency (between 0.1 and 0.9). Finally we simulate Illumina paired-end data for a pooled population (Pool-Seq).

 ## Download the data:
@@ -13,6 +13,7 @@

 ## generating the population genome definition (pgd)

+First we generate a pgd-file, i.e. a description of the TE landscape written in our DSL (for details see https://sourceforge.net/p/simulates/wiki/Home/#manual).

 ~~~~~
 python define-landscape_random-insertions-freq-range.py --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --insert-count 1000 --min-freq 0.1 --max-freq 0.9 --min-distance 500 --N 100 --output mylandscape.pgd
@@ -21,6 +22,8 @@
 **Note**: we use a minimum distance of 500 between adjacent TE insertions
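
The constraints described above (random positions at least 500 nt apart, population frequencies between 0.1 and 0.9) can be illustrated with a small sketch. This is not the code of *define-landscape_random-insertions-freq-range.py*; the function names and the rejection-sampling strategy are assumptions chosen for illustration only.

```python
import random


def sample_positions(chassis_length, insert_count, min_distance, rng):
    """Draw random positions that are pairwise at least min_distance apart.

    Hypothetical helper: uses simple rejection sampling, which may differ
    from what the actual script does.
    """
    positions = []
    attempts = 0
    max_attempts = 1000 * insert_count  # guard against infeasible settings
    while len(positions) < insert_count:
        attempts += 1
        if attempts > max_attempts:
            raise ValueError("could not place all insertions; relax min_distance")
        candidate = rng.randint(1, chassis_length)  # 1-based, inclusive
        if all(abs(candidate - p) >= min_distance for p in positions):
            positions.append(candidate)
    return sorted(positions)


def sample_frequencies(count, min_freq, max_freq, rng):
    """Draw one population frequency per insertion, uniformly within the range."""
    return [rng.uniform(min_freq, max_freq) for _ in range(count)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
positions = sample_positions(1_000_000, 1000, 500, rng)
frequencies = sample_frequencies(len(positions), 0.1, 0.9, rng)
```

The sketch only shows why a minimum distance limits how many insertions fit into a chassis of a given length: with a 1 Mb chassis and a distance of 500 nt, 1000 insertions are still comfortably placeable.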

 ## build the population genome
+
+Based on the pgd-file we generate the population genome:

 ~~~~~~
 python build-population-genome.py --pgd mylandscape.pgd --te-seqs teseq-clean-ml100noS4.fasta --chassis chasis1M.fasta --output mylandscape.pg
@@ -34,8 +37,9 @@

 ## next steps

-The generated reads may be used as input for tools identifying TE insertions using Illumina data, such as PoPoolationTE2 or TEMP. 
-The reads may be used to i) validate tools for TE identification and ii)  test whether a given tool is suitable for TE identification in a species of interest.
+The obtained reads may be used as input for tools identifying TE insertions using Illumina data, such as PoPoolationTE2 or TEMP. 
+The reads may be used to i) validate tools for TE identification and ii) test whether a given tool is suitable for TE identification in a species of interest (i.e. with a particular set of genomic resources).
+
 An example, demonstrating TE identification with the simulated reads, can be found here: [Validation_Pop2]

 * PoPoolationTE2 https://academic.oup.com/mbe/article/33/10/2759/2925581/PoPoolationTE2-Comparative-Population-Genomics-of
@@ -46,7 +50,7 @@
 #Walkthrough 2: Manual insertions; Individual sequencing; PacBio sequencing

 ## Introduction
-Next a short walkthrough that demonstrates how to generate a custom TE landscape and simulate PacBio reads for this landscape.
+Next we show how to generate a custom TE landscape and simulate PacBio reads for this landscape.
 We simulate 5 TE insertions in a population of N=100 haploid genomes. The TE insertions have a random position. We manually specify the family, the strand and the population frequency. Finally we simulate PacBio reads, sequencing all individuals separately (assuming a diploid organism).

 ## Download the data:
@@ -55,12 +59,17 @@
 * TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/tehier-ml100noS4.fasta/download

 ## Generate an empty template of a TE landscape
+
+First we generate an empty pgd-file where we manually fill in the details for the TE insertions.
+
 ~~~~~
 python define-landscape_template.py --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --N 100 --insert-count 5 --output custom-landscape.pgd
 ~~~~~

 ## Define the landscape
-Following we show the empty template landscape *custom-lanscape.pgd* 
+
+Below is the empty template landscape *custom-landscape.pgd*:
+
 ~~~~~~
 1=$1     # M14653
 2=$2     # DME9736
@@ -75,9 +84,11 @@
 298778 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 439415 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 ~~~~~~
-The first part are the definition of the TE sequences and the second part specifies TE insertions in the 100 individuals, where a star indicates no insertion in a given individual. For more details on the population genome definition file see [describing_TE_landscapes] and [describing_TE_sequences]

-Now we edit the file and specifiy TE insertions as in the following example (dont forget to safe it):
+The first part defines the TE sequences and the second part specifies TE insertions in the 100 individuals, where a star indicates no insertion in a given haploid genome. For more details on the population genome definition file see [describing_TE_landscapes] and [describing_TE_sequences]
+
+Next we edit the file (using any text editor) and specify TE insertions as in the following example (don't forget to save your changes):
+
 ~~~~~~
 # Chasis 2R; Length 1000000 nt
 mariner=$1     # M14653
@@ -96,8 +107,8 @@
 439415 * idefix * * idefix * * * mariner * idefix * * * * * * * * * * mariner * * * * * * * * * mariner * * * * * * * * * * * mariner mariner * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 ~~~~~~

-**Note** that we renammed some insertions, like *1* into *mariner*; this was just done for convenience and is not required; dealing with *mariner* may just be more intuitive than dealing with *1*
-**Note** that we generated two new insertions: *nest* is mariner  nested within OSV ; *multinest* are  several nested insertions within idefix; for details on our DSL for describing TE sequences see [describing_TE_sequences]
+**Note** that we renamed some insertions, e.g. *1* to *mariner*; this was done only for convenience and is not required; working with *mariner* is simply more intuitive than working with *1*
+**Note** that we defined two new insertions: *nest* is a mariner nested within OSV; *multinest* consists of several nested insertions within idefix; for details on our DSL for describing TE sequences see [describing_TE_sequences]

 * insertion at position 150985: five individuals have a *nest* insertion; four on the plus strand, one on the minus strand
 * insertion at position 158736: three individuals have a *multinest* insertion on the minus strand
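
To check an edited pgd-file, it can help to count how many haploid genomes carry each insertion. The following is a minimal sketch under the assumption that an insertion line consists of a position followed by one whitespace-separated entry per haploid genome, with a star meaning "no insertion" (as in the example above). The helper names are hypothetical, and other DSL features (e.g. strand annotations) are not handled.

```python
from collections import Counter


def parse_insertion_line(line):
    """Split one insertion line into its position and per-genome entries."""
    fields = line.split()
    position = int(fields[0])
    entries = fields[1:]  # one entry per haploid genome
    return position, entries


def population_frequency(entries):
    """Fraction of haploid genomes that carry any insertion at this site."""
    carriers = sum(1 for entry in entries if entry != "*")
    return carriers / len(entries)


# Shortened, made-up line with 10 haploid genomes (the walkthrough uses 100)
example = "439415 * idefix * * idefix * * * mariner *"
position, entries = parse_insertion_line(example)
frequency = population_frequency(entries)
per_family = Counter(entry for entry in entries if entry != "*")
```

For the shortened example line, three of ten genomes carry an insertion, two of them *idefix* and one *mariner*.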
@@ -110,6 +121,9 @@

 ## build population genome
+
+Based on our pgd-file we then build the population genome:
+
 ~~~~~
 python build-population-genome.py --pgd custom-landscape.pgd --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --output custom-landscape.pg
 ~~~~~
@@ -117,7 +131,7 @@
 ## simulate the reads
 PacBio reads frequently have a bimodal read length distribution. In this example we use the following read length distribution: https://sourceforge.net/projects/simulates/files/validation_reads/rld.txt/download

-The simulated reads will have the following length distribution
+This distribution is visualized here:
 [[img src=rld.png ]]

 ~~~~~

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 07 Jul 2017 13:26:34 -0000

--- v25
+++ v26
@@ -38,7 +38,7 @@
 The reads may be used to i) validate tools for TE identification and ii)  test whether a given tool is suitable for TE identification in a species of interest.
 An example, demonstrating TE identification with the simulated reads, can be found here: [Validation_Pop2]

-* PoPoolationTE2 https://sourceforge.net/projects/popoolation-te2/
+* PoPoolationTE2 https://academic.oup.com/mbe/article/33/10/2759/2925581/PoPoolationTE2-Comparative-Population-Genomics-of
 * TEMP https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066757/

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 07 Jul 2017 13:24:31 -0000

--- v24
+++ v25
@@ -37,6 +37,9 @@
 The generated reads may be used as input for tools identifying TE insertions using Illumina data, such as PoPoolationTE2 or TEMP. 
 The reads may be used to i) validate tools for TE identification and ii)  test whether a given tool is suitable for TE identification in a species of interest.
 An example, demonstrating TE identification with the simulated reads, can be found here: [Validation_Pop2]
+
+* PoPoolationTE2 https://sourceforge.net/projects/popoolation-te2/
+* TEMP https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066757/

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 07 Jul 2017 13:23:05 -0000

--- v23
+++ v24
@@ -34,7 +34,8 @@

 ## next steps

-The generated reads may be used as input for a tool identifying TE insertions, such as PoPoolationTE2 or TEMP. The reads may therefore be used to i) validate tools for TE identification and ii) to test if a tool is suitable for TE identification in a species of interest.
+The generated reads may be used as input for tools identifying TE insertions using Illumina data, such as PoPoolationTE2 or TEMP. 
+The reads may be used to i) validate tools for TE identification and ii)  test whether a given tool is suitable for TE identification in a species of interest.
 An example, demonstrating TE identification with the simulated reads, can be found here: [Validation_Pop2]

@@ -135,4 +136,6 @@
 ~~~~~

 ## next steps
+The generated reads may be used as input for tools identifying TE insertions using PacBio data, such as LoRTE https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-017-0088-x
+The reads may be used to i) validate tools for TE identification and ii) test whether a given tool is suitable for TE identification in a species of interest.

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 07 Jul 2017 13:19:09 -0000

--- v22
+++ v23
@@ -122,8 +122,16 @@

 Additionally the reads will have an error rate of 10%, where half the errors are deletions and the other half insertions (*--deletion-fraction*). We have 100 haploid genomes, thus 50 diploid individuals will be simulated (always two consecutive entries from the pgd-file). We simulate 10,000 reads for each diploid individual.

+This command will generate the following 50 read files:
+
 ~~~~~
-files...
+-rw-r--r--  1 robertkofler  staff  113067241 Jul  7 14:05 reads1.fasta
+-rw-r--r--  1 robertkofler  staff  113444056 Jul  7 14:06 reads2.fasta
+-rw-r--r--  1 robertkofler  staff  112483761 Jul  7 14:07 reads3.fasta
+-rw-r--r--  1 robertkofler  staff  113818134 Jul  7 14:08 reads4.fasta
+...
+-rw-r--r--  1 robertkofler  staff  113177013 Jul  7 14:46 reads49.fasta
+-rw-r--r--  1 robertkofler  staff  113611996 Jul  7 14:47 reads50.fasta
 ~~~~~
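
The bookkeeping behind these 50 files can be sketched as follows. This is only an illustration of the pairing rule stated above (two consecutive haploid genomes form one diploid individual); the 1-based numbering of haploid genomes and the helper name are assumptions, and the file names simply follow the listing shown above.

```python
def pair_haploids(n_haploid):
    """Group consecutive haploid genomes (1-based) into diploid individuals."""
    if n_haploid % 2 != 0:
        raise ValueError("an even number of haploid genomes is required")
    return [(i, i + 1) for i in range(1, n_haploid + 1, 2)]


reads_per_individual = 10_000
pairs = pair_haploids(100)  # [(1, 2), (3, 4), ..., (99, 100)]
read_files = [f"reads{k}.fasta" for k in range(1, len(pairs) + 1)]
total_reads = len(pairs) * reads_per_individual
```

With 100 haploid genomes this gives 50 diploid individuals, one read file each, and 500,000 simulated reads in total.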

 ## next steps

Walkthrough modified by Robert Kofler

Robert Kofler — Fri, 07 Jul 2017 07:56:50 -0000

--- v21
+++ v22
@@ -120,7 +120,7 @@
 python read_individual_pacbio.py --pg custom-landscape.pg --rld-file rld.txt --error-rate 0.1 --deletion-fraction 0.5 --reads 10000 --fasta-prefix reads
 ~~~~~

-Additionally the reads will have an error rate of 10%, where half the errors are deletions and the other half insertions (*--deletion-fraction*); 10.000 reads will be simulated for each diploid individual. Since we have 100 haploid genomes, 50 diploid individuals will be simulated (always two consecutive haploid genomes from the pgd-file).
+Additionally the reads will have an error rate of 10%, where half the errors are deletions and the other half insertions (*--deletion-fraction*). We have 100 haploid genomes, thus 50 diploid individuals will be simulated (always two consecutive entries from the pgd-file). We simulate 10,000 reads for each diploid individual.

 ~~~~~
 files...

Walkthrough modified by Robert Kofler

Robert Kofler — Thu, 06 Jul 2017 14:31:37 -0000

--- v20
+++ v21
@@ -126,4 +126,5 @@
 files...
 ~~~~~

+## next steps

Walkthrough modified by Robert Kofler

Robert Kofler — Thu, 06 Jul 2017 14:31:12 -0000

--- v19
+++ v20
@@ -115,3 +115,15 @@

 The simulated reads will have the following length distribution
 [[img src=rld.png ]]
+
+~~~~~
+python read_individual_pacbio.py --pg custom-landscape.pg --rld-file rld.txt --error-rate 0.1 --deletion-fraction 0.5 --reads 10000 --fasta-prefix reads
+~~~~~
+
+Additionally the reads will have an error rate of 10%, where half the errors are deletions and the other half insertions (*--deletion-fraction*); 10.000 reads will be simulated for each diploid individual. Since we have 100 haploid genomes, 50 diploid individuals will be simulated (always two consecutive haploid genomes from the pgd-file).
+
+~~~~~
+files...
+~~~~~
+
+