Recent changes to Defining_TE_Landscape

Defining_TE_Landscape modified by Robert Kofler

Robert Kofler — Thu, 18 May 2017 15:11:12 -0000

--- v39
+++ v40
@@ -1,368 +1,4 @@
 [TOC]
-# Intro #
-We developed a simple domain-specific language that allows users to define arbitrarily complex **TE landscapes, i.e. all TE insertions in the genomes of all individuals of a population**.
-The following key features of a TE landscape may be specified by the user:
-
-* the genomic position of TE insertions
-* the strand of TE insertions
-* the sequence of TE insertions
-* the population frequency of TE insertions
-* the target site duplication (TSD)
-* internal truncations (for example the KP element is an internally truncated P-element)
-* the sequence divergence of an insertion from the family consensus (in percent)
-* arbitrarily complex nested insertions, i.e. TE insertions within TE insertions
-
-
-
-# build-population-genome.py
-
-The script *build-population-genome.py* is the **heart** of SimulaTE. As input it requires the definition of a TE landscape, and as output it generates a multiple fasta file, where the first entry represents the genome of the first individual in a population, the second entry the genome of the second individual in the population and so on. We also call this multiple fasta file the **population genome** (as it contains the sequence of each individual within a population). The population genome serves as the basis for simulating the short reads.
-
-## parameters 
-*build-population-genome.py* takes the following parameters, where parameters within square brackets are optional.
-
-* \[--chasis\] a fasta file containing a single sequence. TEs will be inserted into this sequence; we call this sequence the **chasis**. As an example, a chromosome arm could be provided.
-* \[--te-seqs\] a fasta file that may contain one or more entries; this can be used to define the TE sequences that will be inserted into the chasis. For example, the consensus sequences of TE families could be provided.
-*  --pgd a population-genome-definition (pgd) file; the definition of the TE landscape;  see below
-*  --output the output file; will be a multiple fasta file containing the genome of each individual within a population
-
-## the population-genome-definition (pgd) file
-
-The most important parameter of *build-population-genome.py* is the population-genome-definition (pgd) file.
-The pgd-file specifies the TE landscape of a population, i.e. the positions and sequences of all TE insertions within a population.
-
-The following is an example of a pgd-file:
-
-~~~~~~
-chasis="AAAAAAAAAA"
-hobo="TTT"
-roo="CCCC"
-jockey=$1
-pelement=hobo+{2:roo}
-2 hobo hobo roo roo * *
-6 roo roo- * * * *
-9 * * * * * hobo-{1:roo+}
-~~~~~~
-
-The pgd file consists of  **three parts**:  definition of the chasis, TE sequences and insertion sites.
-
-**1.) definition of the chasis (optional)**
-The first part, the definition of the chasis, is optional. The sequence of the chasis may be  provided in the pgd-file, for example:
-
-~~~~~~
-chasis="AAAAAAAAAA"
-~~~~~~
-
-The definition requires the keyword "chasis" followed by an equals sign and the sequence of the chasis within quotation marks. If the chasis is not provided in the pgd-file, it must be provided as a separate fasta file (*build-population-genome.py --chasis*). All TEs will be inserted into the chasis. The chasis therefore acts as the "reference genome".
-
-**Note**: providing the chasis within the pgd-file mostly makes sense for small toy examples. For simulating large chromosomes we recommend providing the chasis as a fasta file.
-
-**2.) definition of TE sequences**
-
-Sequences of TEs may be defined as in the following examples:
-
-~~~~~~
-hobo="TTT"
-jockey=$1
-pelement=hobo+{2:roo}
-roo="CCCC"-3bp
-ine=$2+5bp
-~~~~~~
-
-The "=" divides the definition of TE sequences into a left and a right part. The left part is an arbitrary name for a TE sequence (like a variable name) and the right part specifies the sequence of a TE.
-The sequence of a TE (i.e. the right part) may be specified
-
-* directly (when using quotation marks, e.g. "TTT")
-* by reference to a user-provided fasta file (*build-population-genome.py --te-seqs*) which contains the sequences of all used TEs. For example, *$1* refers to the first entry in the fasta file, *$2* to the second and so on
-* by modification of already defined TE sequences using the domain-specific language described in the next chapter; this language allows, for example, specifying the strand of a TE, the TSD, internal truncations, nested insertions etc. As  quick examples *"CCCC"*
-
-
-**3.) definition of insertion sites**
-
-The following is an example of the definition of the insertion sites:
-~~~~~~
-2 hobo hobo roo roo * *
-6 roo roo- * * * *
-9 * * * * * hobo-{1:roo+}
-~~~~~~
-
-* col1: the position within the chasis
-* col2: the TE to be inserted at the given position (col1) in the first haploid genome 
-* col3: the TE to be inserted at the given position (col1) in the second haploid genome 
-* col-n: the TE to be inserted at the given position (col1) in haploid genome n-1
-
-**Note** the number of haploid genomes is the number of columns minus one 
-
-**Note** all definitions of insertion sites must have the same number of columns
-
-**Note** the star  (\*) is used to indicate absence of a TE insertion in a given individual
-
-**Note diploid genomes**
-For simulating the reads of diploid individuals consecutive pairs of haploid genomes are used. Simulating 50 diploids thus requires specification of 100 haploid genomes, where col2+col3 will form the first diploid genome, col4+col5 the second diploid genome, etc...
-
-
-
-## output: the population genome
-
-The output will be a multiple fasta file, containing the population genome, i.e. the genomes of all individuals of a population. 
-
-As an example, the following output may be generated:
-
-~~~~~
->hg1
-AATTTAAAACCCCAAAA
->hg2
-AATTTAAAACCCCAAAA
->hg3
-AACCCCAAAAAAAA
->hg4
-AACCCCAAAAAAAA
->hg5
-AAAAAAAAAA
->hg6
-AAAAAAAAAA
-~~~~~
-
-Here *hg1* is the haploid genome of the first individual in a population, *hg2* the haploid genome of the second individual in a population and so on.
-
-# Walkthrough: specifying TE landscapes with the pgd-file
-The following walkthrough demonstrates the different ways to generate population genomes from pgd-files (specifications of a TE landscape). Additionally different features of the pgd-file are explained.
-
-## required information
-The definition of a TE landscape with a pgd-file requires three pieces of information, 
-
-* a sequence into which TEs should be inserted (we call this the **chasis**);  either  provided in the pgd (see below) or in a fasta file
-* **TE sequence(s)**; either provided in the pgd (see below) or in a fasta file
-*  the **positions of TE insertions** within the chasis; must be provided in the pgd (see below)
-
-##  a simple scenario - one individual
-The simplest possible scenario is to provide the TEs, the chasis and the insertion position in the **population-genome-definition file (pgd)**, as in the following example:
-
-~~~~~~
-# note: for demonstration purposes this population only contains a single individual
-chasis="AAAAAAAAAA"
-te1="TTT"
-5 te1
-~~~~~~
-
-The chasis (the sequence into which to insert a TE) has the sequence *AAAAAAAAAA*.
-We defined a single TE with the name *te1* and the sequence *TTT*. We further define that the sequence of te1 should be inserted at position 5 of the chasis. By default a TSD of length zero is used.
-
-Use the following command to generate the population genome for this pgd-file:
-
-~~~~~~
-python simulate/build-population-genome.py --pgd simple1.pgd --output simple1.fasta
-~~~~~~
-
-we obtain the following multiple fasta file (population genome file):
-
-~~~~~~
->hg1
-AAAAATTTAAAAA
-~~~~~~
-*hg1* stands for haploid genome 1
-
-**Note**: in this toy example we defined the population-genome for a population consisting of a single individual
-
-
-## a simple scenario - multiple individuals
-In this  example we define the genomes of all individuals in a population of size 6, with the first 3 individuals having a TE insertion and the last 3 not having a TE insertion:
-
-~~~~~~
-chasis="AAAAAAAAAA"
-te1="TTT"
-5 te1 te1 te1 * * *
-~~~~~~
-
-after calling *build-population-genome.py* this will yield the following multiple fasta file
-
-~~~~~~
->hg1
-AAAAATTTAAAAA
->hg2
-AAAAATTTAAAAA
->hg3
-AAAAATTTAAAAA
->hg4
-AAAAAAAAAA
->hg5
-AAAAAAAAAA
->hg6
-AAAAAAAAAA
-~~~~~~
-
-Voila, this file contains the genomes of the six individuals within our population, haploid genome 1 (hg1) to haploid genome 6 (hg6).
-
-## multiple TE families and multiple individuals
-More complex scenarios are possible. In the following example we define the sequences of two TE families and two different insertion positions.
-
-~~~~~~
-chasis="AAAAAAAAAA"
-te1="TTT"
-roo="CCCC"
-2 te1 te1 roo roo * *
-6 roo roo * * * *
-~~~~~~
-
-We specified sequences for the TE insertions *te1* and *roo*. TEs will be inserted at positions 2 and 6 of the chasis.
-**Note**: at position 2, insertions of *roo* and *te1* occur at the same position in different individuals, i.e. the TE insertions overlap. This example will yield the following population genome:
-
-~~~~~~
->hg1
-AATTTAAAACCCCAAAA
->hg2
-AATTTAAAACCCCAAAA
->hg3
-AACCCCAAAAAAAA
->hg4
-AACCCCAAAAAAAA
->hg5
-AAAAAAAAAA
->hg6
-AAAAAAAAAA
-~~~~~~
-
-## provide the chasis in a file
-The chasis may also be provided in a fasta file containing a single entry. Providing the chasis in the pgd-file is only recommended for toy examples as in this walkthrough. For all applications using real data we recommend providing the chasis in a file.
-
-Given the following fasta file (*chasis-10A.fasta*):
-
-~~~~~~
->some_arbitrary_name
-AAAAAAAAAA
-~~~~~~
-
-and the following pgd-file (*simple.pgd*)
-
-~~~~~~
-hobo="TTT"
-5 hobo hobo * hobo * *
-~~~~~~
-
-the following command can be used to generate a population genome file:
-
-~~~~~
-python simulate/build-population-genome.py --chasis chasis-10A.fasta --pgd simple.pgd --output test.fasta
-~~~~~
-
-The output file (*test.fasta*) then contains the following:
-
-~~~~~
->hg1
-AAAAATTTAAAAA
->hg2
-AAAAATTTAAAAA
->hg3
-AAAAAAAAAA
->hg4
-AAAAATTTAAAAA
->hg5
-AAAAAAAAAA
->hg6
-AAAAAAAAAA
-~~~~~
-
-## provide TE sequences in a file
-
-Like the chasis, the TE sequences may also be provided in a separate file. This is again recommended for applications with real data.
-
-Given the following fasta file with TE sequences (*teseqs.fasta*)
-
-~~~~~~
->hobo
-TTT
->roo
-CCCC
-~~~~~~
-
-and the following pgd-file (*simple.pgd*)
-
-~~~~~
-chasis="AAAAAAAAAA"
-2 $1 $1 * * * *
-7 * * * * $2 $2
-~~~~~
-
-Remember that $1 is the sequence of the first entry in the fasta file and $2 the sequence of the second entry.
-Then the following command can be used to generate the population genome file:
-
-~~~~~~
-python simulate/build-population-genome.py --te-seqs teseqs.fasta  --pgd simple.pgd --output test.fasta 
-~~~~~~
-
-this will yield the output
-
-~~~~~
->hg1
-AATTTAAAAAAAA
->hg2
-AATTTAAAAAAAA
->hg3
-AAAAAAAAAA
->hg4
-AAAAAAAAAA
->hg5
-AAAAAAACCCCAAA
->hg6
-AAAAAAACCCCAAA
-~~~~~
-
-## redefine TE-sequences that were provided in a file
-
-The sequences from a fasta file may be used directly (using *$1* as in the previous example), or alternatively they can be redefined, either i) in the header of the pgd-file or ii) in the definition of the insertion sites. This redefinition of TE sequences allows specifying i) a TSD, ii) the strand, iii) truncations (like the KP-element), iv) sequence divergence and v) nested TE insertions. Details of the domain-specific language for specifying (redefining) complex TE sequences are provided in the next section.
-
-For example given the two TE sequences in a fasta file (teseqs.fasta):
-
-~~~~~
->arbitrary_name_1
-TTT
->arbitrary_name_2
-CCCC
-~~~~~
-
-and the following pgd-file (simple.pgd):
-
-~~~~~~
-chasis="AAAAAAAAAA"
-hobo=$1+2bp
-roo=$2-0bp
-2 hobo hobo * * * *
-7 * * * * roo $2-2bp
-~~~~~~
- 
- This pgd-file contains three redefinitions of TEs:
- 
-* *hobo* is specified as the first sequence in the fasta file; the sequence is in the forward direction and a TSD of 2bp is generated upon insertion of the TE
-* *roo* is the reverse complement of the second sequence in the fasta file; no TSD is created
-* *$2-2bp* shows that it is also feasible to redefine TEs directly at the insertion site; the reverse complement of the second sequence in the fasta file will be used, and a TSD of 2bp will be generated.
-
-when running the following command:
-
-~~~~~~
-python ~/dev/simulate/build-population-genome.py --te-seqs teseqs.fasta --pgd simple.pgd --output test.fasta
-~~~~~~
-
-we will obtain the following output file 
-~~~~~
->hg1
-AATTTAAAAAAAAAA
->hg2
-AATTTAAAAAAAAAA
->hg3
-AAAAAAAAAA
->hg4
-AAAAAAAAAA
->hg5
-AAAAAAAGGGGAAA
->hg6
-AAAAAAAGGGGAAAAA
-~~~~~
-
-## summary of the population-genome-definition (pgd) file
-A pgd file consists of two parts.
-The first part provides the definitions of TE sequences 
-
-may contain a chasis or not contain it if provided as file

 # Basic options #
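
A minimal sketch, not SimulaTE's implementation, of how the insertion-site table from the "multiple TE families and multiple individuals" walkthrough maps onto the population genome. It assumes a TSD of zero, forward-strand insertions only, and that a position counts the chasis bases preceding the insertion (as the walkthrough outputs suggest); `build_population_genome` and its arguments are hypothetical names chosen for illustration.

```python
def build_population_genome(chasis, te_seqs, sites):
    """Toy model of a pgd-file: returns one sequence per haploid genome.

    chasis  -- sequence into which TEs are inserted
    te_seqs -- dict mapping TE names to sequences
    sites   -- list of tuples (position, te_hg1, te_hg2, ...); "*" = no insertion
    """
    n_haploid = len(sites[0]) - 1  # number of columns minus one
    genomes = []
    for hap in range(n_haploid):
        seq = chasis
        # Insert from right to left so that chasis coordinates of the
        # remaining (more leftward) sites are not shifted by earlier insertions.
        for site in sorted(sites, key=lambda s: s[0], reverse=True):
            pos, te = site[0], site[hap + 1]
            if te != "*":
                seq = seq[:pos] + te_seqs[te] + seq[pos:]
        genomes.append(seq)
    return genomes


chasis = "AAAAAAAAAA"
te_seqs = {"te1": "TTT", "roo": "CCCC"}
sites = [
    (2, "te1", "te1", "roo", "roo", "*", "*"),
    (6, "roo", "roo", "*", "*", "*", "*"),
]
genomes = build_population_genome(chasis, te_seqs, sites)
for i, genome in enumerate(genomes, start=1):
    print(f">hg{i}\n{genome}")
```

The sketch reproduces the six entries of the walkthrough output for this toy input; TSDs, strand, truncations and nested insertions are deliberately left out.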

Defining_TE_Landscape modified by Robert Kofler

Robert Kofler — Thu, 18 May 2017 15:02:47 -0000

--- v38
+++ v39
@@ -344,6 +344,18 @@

 we will obtain the following output file 
 ~~~~~
+>hg1
+AATTTAAAAAAAAAA
+>hg2
+AATTTAAAAAAAAAA
+>hg3
+AAAAAAAAAA
+>hg4
+AAAAAAAAAA
+>hg5
+AAAAAAAGGGGAAA
+>hg6
+AAAAAAAGGGGAAAAA
 ~~~~~

 ## summary of the population-genome-definition (pgd) file

Defining_TE_Landscape modified by Robert Kofler

Robert Kofler — Thu, 18 May 2017 15:01:48 -0000

--- v37
+++ v38
@@ -330,11 +330,21 @@
 7 * * * * roo $2-2bp
 ~~~~~~

- This pgd-file has three redefinitions of TEs provided in a separate file
+ This pgd-file contains three redefinitions of TEs:

- 1 *hobo* is specified as the first sequence in the fasta file, the sequence is in the forward direction, with a TSD of 2bp.
-The TE roo is defined as the second sequenc
-
+* *hobo* is specified as the first sequence in the fasta file; the sequence is in the forward direction and a TSD of 2bp is generated upon insertion of the TE
+* *roo* is the reverse complement of the second sequence in the fasta file; no TSD is created
+* *$2-2bp* shows that it is also feasible to redefine TEs directly at the insertion site; the reverse complement of the second sequence in the fasta file will be used, and a TSD of 2bp will be generated.
+
+when running the following command:
+
+~~~~~~
+python ~/dev/simulate/build-population-genome.py --te-seqs teseqs.fasta --pgd simple.pgd --output test.fasta
+~~~~~~
+
+we will obtain the following output file 
+~~~~~
+~~~~~

 ## summary of the population-genome-definition (pgd) file
 A pgd file consists of two parts.

Defining_TE_Landscape modified by Robert Kofler

Robert Kofler — Thu, 18 May 2017 14:44:46 -0000

--- v36
+++ v37
@@ -307,9 +307,11 @@
 AAAAAAACCCCAAA
 ~~~~~

-## redefining TE-sequences provided in a separate fasta file
-
-The sequences from a fasta file can be used directly (using *$1* as in the previous example), but they can also be redefined in i) the header of the pgd-file or ii) the definition of the insertion sites. This redefinition of TE sequences allows specifying i) a TSD, ii) the strand, iii) truncations (like the KP-element), iv) sequence divergence and v) nested TE insertions. Details of the domain-specific language for specifying complex TE sequences are provided in the next section.
+## redefine TE-sequences that were provided in a file
+
+The sequences from a fasta file may be used directly (using *$1* as in the previous example), or alternatively they can be redefined, either i) in the header of the pgd-file or ii) in the definition of the insertion sites. This redefinition of TE sequences allows specifying i) a TSD, ii) the strand, iii) truncations (like the KP-element), iv) sequence divergence and v) nested TE insertions. Details of the domain-specific language for specifying (redefining) complex TE sequences are provided in the next section.
+
+For example given the two TE sequences in a fasta file (teseqs.fasta):

 ~~~~~
 >arbitrary_name_1
@@ -318,6 +320,7 @@
 CCCC
 ~~~~~

+and the following pgd-file (simple.pgd):

 ~~~~~~
 chasis="AAAAAAAAAA"
@@ -326,6 +329,12 @@
 2 hobo hobo * * * *
 7 * * * * roo $2-2bp
 ~~~~~~
+ 
+ This pgd-file has three redefinitions of TEs provided in a separate file
+ 
+ 1 *hobo* is specified as the first sequence in the fasta file, the sequence is in the forward direction, with a TSD of 2bp.
+The TE roo is defined as the second sequenc
+

 ## summary of the population-genome-definition (pgd) file
 A pgd file consists of two parts.
@@ -341,3 +350,8 @@

 # Advanced options #
+
+## specifying the TSD
+two options
+either before or after
+we arbitrarily picked the first

Defining_TE_Landscape modified by Robert Kofler

Robert Kofler — Thu, 18 May 2017 13:44:18 -0000

--- v35
+++ v36
@@ -26,7 +26,7 @@
 *  --pgd a population-genome-definition (pgd) file; the definition of the TE landscape;  see below
 *  --output the output file; will be a multiple fasta file containing the genome of each individual within a population

-## quick intro: population-genome-definition (pgd) file
+## the population-genome-definition (pgd) file

 The most important parameter of *build-population-genome.py* is the population-genome-definition (pgd) file.
 The pgd-file specifies the TE landscape of a population, i.e. the positions and sequences of all TE insertions within a population.
@@ -102,7 +102,7 @@

-## output 
+## output: the population genome

 The output will be a multiple fasta file, containing the population genome, i.e. the genomes of all individuals of a population. 

@@ -283,6 +283,7 @@
 7 * * * * $2 $2
 ~~~~~

+Remember that $1 is the sequence of the first entry in the fasta file and $2 the sequence of the second entry.
 Then the following command can be used to generate the population genome file:

 ~~~~~~
@@ -305,6 +306,26 @@
 >hg6
 AAAAAAACCCCAAA
 ~~~~~
+
+## redefining TE-sequences provided in a separate fasta file
+
+The sequences from a fasta file can be used directly (using *$1* as in the previous example), but they can also be redefined in i) the header of the pgd-file or ii) the definition of the insertion sites. This redefinition of TE sequences allows specifying i) a TSD, ii) the strand, iii) truncations (like the KP-element), iv) sequence divergence and v) nested TE insertions. Details of the domain-specific language for specifying complex TE sequences are provided in the next section.
+
+~~~~~
+>arbitrary_name_1
+TTT
+>arbitrary_name_2
+CCCC
+~~~~~
+
+
+~~~~~~
+chasis="AAAAAAAAAA"
+hobo=$1+2bp
+roo=$2-0bp
+2 hobo hobo * * * *
+7 * * * * roo $2-2bp
+~~~~~~

 ## summary of the population-genome-definition (pgd) file
 A pgd file consists of two parts.

Defining_TE_Landscape modified by Robert Kofler

Robert Kofler — Thu, 18 May 2017 13:30:13 -0000

--- v34
+++ v35
@@ -283,6 +283,29 @@
 7 * * * * $2 $2
 ~~~~~

+Then the following command can be used to generate the population genome file:
+
+~~~~~~
+python simulate/build-population-genome.py --te-seqs teseqs.fasta  --pgd simple.pgd --output test.fasta 
+~~~~~~
+
+this will yield the output
+
+~~~~~
+>hg1
+AATTTAAAAAAAA
+>hg2
+AATTTAAAAAAAA
+>hg3
+AAAAAAAAAA
+>hg4
+AAAAAAAAAA
+>hg5
+AAAAAAACCCCAAA
+>hg6
+AAAAAAACCCCAAA
+~~~~~
+
 ## summary of the population-genome-definition (pgd) file
 A pgd file consists of two parts.
 The first part provides the definitions of TE sequences

Defining_TE_Landscape modified by Robert Kofler

Robert Kofler — Thu, 18 May 2017 13:27:26 -0000

--- v33
+++ v34
@@ -262,7 +262,26 @@
 AAAAAAAAAA
 ~~~~~

-## three ways to define a TE sequence
+## provide TE sequences in a file
+
+Like the chasis, the TE sequences may also be provided in a separate file. This is again recommended for applications with real data.
+
+Given the following fasta file with TE sequences (*teseqs.fasta*)
+
+~~~~~~
+>hobo
+TTT
+>roo
+CCCC
+~~~~~~
+
+and the following pgd-file (*simple.pgd*)
+
+~~~~~
+chasis="AAAAAAAAAA"
+2 $1 $1 * * * *
+7 * * * * $2 $2
+~~~~~

 ## summary of the population-genome-definition (pgd) file
 A pgd file consists of two parts.

Defining_TE_Landscape modified by Robert Kofler

Robert Kofler — Thu, 18 May 2017 13:08:44 -0000

Defining_TE_Landscape modified by Robert Kofler

Robert Kofler — Thu, 18 May 2017 13:08:04 -0000

--- v31
+++ v32
@@ -245,6 +245,23 @@
 python simulate/build-population-genome.py --chasis chasis-10A.fasta --pgd simple.pgd --output test.fasta
 ~~~~~

+The output file (*test.fasta*) then contains the following:
+
+~~~~~
+>hg1
+AAAAATTTAAAAA
+>hg2
+AAAAATTTAAAAA
+>hg3
+AAAAAAAAAA
+>hg4
+AAAAATTTAAAAA
+>hg5
+AAAAAAAAAA
+>hg6
+AAAAAAAAAA
+~~~~~
+
 ## three ways to define a TE sequence

 ## summary of the population-genome-definition (pgd) file

Defining_TE_Landscape modified by Robert Kofler

Robert Kofler — Thu, 18 May 2017 13:06:24 -0000

--- v30
+++ v31
@@ -223,7 +223,7 @@
 ~~~~~~

 ## provide the chasis in a file
-The chasis may also be provided in a fasta file containing a single entry.
+The chasis may also be provided in a fasta file containing a single entry. Providing the chasis in the pgd-file is only recommended for toy examples as in this walkthrough. For all applications using real data we recommend providing the chasis in a file.

 Given the following fasta file (*chasis-10A.fasta*):