Recent changes to WalkthroughPreparatoryWork

WalkthroughPreparatoryWork modified by Robert Kofler

Robert Kofler — Fri, 05 Feb 2016 10:39:04 -0000

--- v21
+++ v22
@@ -127,6 +127,8 @@
 Unfortunately, this step cannot be automated. We need to manually (or with the help of Excel or some custom script) create a hierarchy entry for every fasta entry in the TE sequence file (i.e. teseqs.txt).

 It's probably best to first get a list of all the entries in the fasta file
+
+
 ~~~~
 cat teseqs.txt|grep '^>'|perl -pe 's/>//' > te-hierarchy.txt
 ~~~~
@@ -142,26 +144,25 @@
 ~~~~

 ##Starting with a set of consensus TE sequences using iterative mapping ##
-It has been pointed out that RepeatMasker sometimes misses TE sequences and that this could lead to problems for tools that rely on RepeatMasking for TE identification (http://nar.oxfordjournals.org/content/43/22/10655). Note that for proper PoPoolationTE2 performance it is **not** necessary that RepeatMasker identifies all TE sequences, but rather that any  ambiguous mapping between TE and reference are avoided.
-The following approach ensures that all TE-derived reads will align to a TE sequence (and not to a reference chromosome). This approach is based on artificial reads for the TE sequences and relies directly on the mapper that will be used for mapping reads to the TE-merged-reference. In short artificial TE-derived reads will be mapped to the reference genome using the appropriate mapper, and the mapping positions of the reads will be masked. This procedure is repeated until no further unmasked regions are found.
-
-We start with the reference sequence and the set of TE consensus sequences. 
-This walkthrough additionally requires
+It has been pointed out that RepeatMasker sometimes misses TE sequences and that this could lead to problems for tools that rely on RepeatMasking for TE identification (http://nar.oxfordjournals.org/content/43/22/10655). Note that for proper performance of PoPoolationTE2, it is **not** necessary that all TE sequences are identified with RepeatMasker, but rather that any ambiguous mapping between TE sequences and the masked reference genome is avoided.
+The following approach ensures (!) that all TE-derived reads will align to a TE sequence (and not to a reference chromosome). This approach is based on artificial reads for the TE sequences and relies directly on the mapper that will be used for mapping reads to the TE-merged-reference further downstream in the pipeline. In short, artificial TE-derived reads are mapped to the reference genome using the appropriate mapper (e.g. bwa bwasw), and the mapping positions of the reads are masked (replaced with 'N'). This procedure is repeated until no further unmasked regions are found.
+
+We start with the reference sequence and the set of TE consensus sequences. This walkthrough additionally requires

 + samtools
-+ a mapper (e.g. bwa)
++ a mapper (here bwa bwasw)
 + The script: http://sourceforge.net/projects/popoolation-te2/files/create-reads-for-te-sequences.py

 ** Create reads for TE sequences**
-The following scripts creates artificial single-end reads for the TE sequences. The script will generate *--boost* reads for every position in every TE seqeunce. Moreover reads will have a random 'sequencing error rate' ranging from *0.0* to *--max-error-rate*.  Note that a high boost factor will serve to reduce the number of necessary iterations.
+The following script creates artificial single-end reads for TE sequences. The script will generate *--boost* reads for every site in every TE sequence. Moreover, reads will have a random 'sequencing error rate' ranging from *0.0* to *--max-error-rate*. Note that a high boost factor will serve to reduce the number of necessary iterations.

 ~~~~~
 python create-reads-for-te-sequences.py --read-length 100 --te-sequences teseqs.txt --max-error-rate 0.05 --boost 100 --output tereads.txt
 ~~~~~

 ** Map reads to the reference genome ** 
-Use the same mapper here that you will also use for mapping your real data to the TE-merged-reference. We recommend bwa bwasw.
+Use the same mapper that will later be used to map your paired-end reads to the TE-merged-reference. We recommend bwa bwasw.

 ~~~~~
 bwa index 2R-cyp6g1.fasta
@@ -182,8 +183,8 @@
 cat 2R-cyp6g1.masked.fasta|grep -v '^>'|grep 'N'|perl -pe 's/[^N]//g'|wc -c
 # output is 6773
 ~~~~
-The output of 6773 means that 6773 bases have been masked with this approach.
-You can now repeat the procedure starting from
+The output states that 6773 bases have been masked with this approach.
+You may repeat the procedure starting from

 ~~~~~
 python create-reads-for-te-sequences.py --read-length 100 --te-sequences teseqs.txt --max-error-rate 0.05 --boost 100 --output tereads.txt
@@ -203,7 +204,8 @@

 Unfortunately, this step cannot be automated. We need to manually (or with the help of Excel or some custom script) create a hierarchy entry for every fasta entry in the TE sequence file (i.e. teseqs.txt).

-It's propably best to first get a list of all the entries in the fasta file
+It's probably best to first get a list of all the entries in the fasta file:
+
 ~~~~
 cat teseqs.txt|grep '^>'|perl -pe 's/>//' > te-hierarchy.txt
 ~~~~
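
The masked-base count that drives the "repeat the procedure" decision in this revision can also be obtained without the shell pipeline. The following is a minimal sketch, not part of PoPoolationTE2; the function name `count_masked_bases` is hypothetical, and like the `grep`/`perl` pipeline it assumes that masked positions are written as uppercase `N`.

```python
def count_masked_bases(fasta_lines):
    """Count 'N' characters in sequence lines, skipping FASTA header lines."""
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            # Header lines may contain an 'N' in the name; ignore them.
            continue
        total += line.count("N")
    return total


# Toy input (made up for illustration, not taken from 2R-cyp6g1.masked.fasta).
example = [">2R", "ACGTNNNNAC", "NNGT", ">other", "ACGT"]
masked = count_masked_bases(example)
print(masked)
```

For a real file, the lines could be supplied with `open("2R-cyp6g1.masked.fasta")`, since iterating over a file object yields one line at a time.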

WalkthroughPreparatoryWork modified by Robert Kofler

Robert Kofler — Fri, 05 Feb 2016 10:31:26 -0000

--- v20
+++ v21
@@ -13,7 +13,7 @@
 **1. download the data **
 http://sourceforge.net/projects/popoolation-te2/files/rawdata_preparatory.zip/download

-We need the fasta sequence of the chromosome and the TE annotation which is in the bed format. Following a small example of the TE annotation.
+We need the sequence of the chromosome (fasta format) and the TE annotation (bed format). The following is a small example of the TE annotation.

 ~~~~~
 2R 13240   13529   1360_1
@@ -49,6 +49,7 @@
 Unfortunately, this step cannot be automated. We need to manually (or with the help of Excel or some custom script) create a hierarchy entry for every fasta entry in the TE sequence file (i.e. 2R-cyp6g1.teseqs.fasta).

 It's probably best to first get a list of all the entries in the fasta file
+
 ~~~~
 cat 2R-cyp6g1.teseqs.fasta|grep '^>'|perl -pe 's/>//' > te-hierarchy.txt
 ~~~~
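
To make the bed-format TE annotation mentioned in this revision more concrete, here is a small sketch that parses one annotation line. It assumes four whitespace-separated columns (chromosome, start, end, TE name), as in the example line `2R 13240   13529   1360_1`; the names `BedRecord` and `parse_bed_line` are hypothetical. BED coordinates are 0-based and half-open, so the interval length is simply `end - start`.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def parse_bed_line(line):
    """Parse the first four columns of a BED line into a BedRecord."""
    fields = line.split()
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), fields[3])


record = parse_bed_line("2R 13240   13529   1360_1")
print(record.name, record.end - record.start)
```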

WalkthroughPreparatoryWork modified by Robert Kofler

Robert Kofler — Fri, 05 Feb 2016 10:29:23 -0000

--- v19
+++ v20
@@ -1,10 +1,10 @@
 [TOC]
 #Walkthrough preparatory work#
-PopoolationTE2  requires as prerequisites a TE-merged-reference and a TE hierarchy (for details see [Manual]). In this walkthrough we will create these two files using a small sample data set. The following walkthrough is based on a small  region of  *D. melanogaster* chromosome 2R, ranging from 7,500,000 to 8,500,000. 
-
-However if you want to generate this prerequisites for your data you need to have  a.) reference genome and either b1) a TE annotation for the reference genome or b2) a set of consensus sequences of TEs.
-
-## Prerequisites##
+As prerequisites, PoPoolationTE2 requires a TE-merged-reference and a TE hierarchy (for details see [Manual]). In this walkthrough we will create these two files using a small sample data set. The following walkthrough is based on a small region of *D. melanogaster* chromosome 2R, ranging from 7,500,000 to 8,500,000.
+
+If you want to generate these prerequisites for your own data, you need a) a reference genome and either b1) a TE annotation for the reference genome or b2) a set of consensus sequences of TEs.
+
+## Requirements ##
 * bedtools
 * RepeatMasker

WalkthroughPreparatoryWork modified by Robert Kofler

Robert Kofler — Thu, 04 Feb 2016 09:51:23 -0000

--- v18
+++ v19
@@ -141,8 +141,8 @@
 ~~~~

 ##Starting with a set of consensus TE sequences using iterative mapping ##
-It has been pointed out that RepeatMasker sometimes misses TE sequences and that this could lead to problems for tools that rely on RepeatMasking for TE identification (http://nar.oxfordjournals.org/content/43/22/10655).
-To avoid this problem, we introduce an additional approach for masking the genome which is based on artificial reads for the TE sequences and on the mapper that will be used for mapping reads to the TE-merged-reference. In short artificial TE-derived reads will be mapped to the reference genome using the appropriate mapper, and the mapping positions of the reads will be masked. This procedure is repeated until no further unmasked regions are found.
+It has been pointed out that RepeatMasker sometimes misses TE sequences and that this could lead to problems for tools that rely on RepeatMasking for TE identification (http://nar.oxfordjournals.org/content/43/22/10655). Note that for proper PoPoolationTE2 performance it is **not** necessary that RepeatMasker identifies all TE sequences, but rather that any ambiguous mapping between TE and reference is avoided.
+The following approach ensures that all TE-derived reads will align to a TE sequence (and not to a reference chromosome). This approach is based on artificial reads for the TE sequences and relies directly on the mapper that will be used for mapping reads to the TE-merged-reference. In short, artificial TE-derived reads are mapped to the reference genome using the appropriate mapper, and the mapping positions of the reads are masked. This procedure is repeated until no further unmasked regions are found.

 We start with the reference sequence and the set of TE consensus sequences. 
 This walkthrough additionally requires

WalkthroughPreparatoryWork modified by Robert Kofler

Robert Kofler — Tue, 26 Jan 2016 15:51:46 -0000

--- v17
+++ v18
@@ -188,7 +188,7 @@
 python create-reads-for-te-sequences.py --read-length 100 --te-sequences teseqs.txt --max-error-rate 0.05 --boost 100 --output tereads.txt
 ~~~~~

-until the count of masked N's does not increase anymore
+until the count of masked N's does not increase anymore. Note that the number of necessary iterations may be reduced by setting a high *--boost* value.

 **create the TE merged reference**
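
The stopping rule described in this revision (repeat until the count of masked N's no longer increases) can be sketched as a small driver loop. This is only an illustration under stated assumptions: `iterate_until_stable` and the `run_iteration` callback are hypothetical names, and the callback is assumed to perform one full round (create reads, map, mask) and return the resulting count of masked bases. The counts in the example are invented toy values.

```python
def iterate_until_stable(run_iteration, max_iterations=20):
    """Call run_iteration() until the masked-base count stops increasing.

    Returns the list of counts observed, one per iteration.
    """
    previous = -1
    history = []
    for _ in range(max_iterations):
        current = run_iteration()
        history.append(current)
        if current <= previous:
            # No additional bases were masked in this round: stop.
            break
        previous = current
    return history


# Toy stand-in for the real pipeline: invented counts that plateau at 150.
toy_counts = iter([100, 140, 150, 150])
history = iterate_until_stable(lambda: next(toy_counts))
print(history)
```

The `max_iterations` cap is a safety net added for the sketch; the walkthrough itself does not specify an upper limit.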

WalkthroughPreparatoryWork modified by Robert Kofler

Robert Kofler — Tue, 26 Jan 2016 15:50:13 -0000

--- v16
+++ v17
@@ -169,3 +169,44 @@

 ** Mask the mapping positions **

+~~~~
+bedtools bamtobed -i tereads.sort.bam > tereads.bed
+bedtools maskfasta -fi 2R-cyp6g1.fasta -fo 2R-cyp6g1.masked.fasta -bed tereads.bed
+~~~~
+
+** Count the masked sequence and repeat if necessary**
+
+~~~~
+:::bash
+cat 2R-cyp6g1.masked.fasta|grep -v '^>'|grep 'N'|perl -pe 's/[^N]//g'|wc -c
+# output is 6773
+~~~~
+The output of 6773 means that 6773 bases have been masked with this approach.
+You can now repeat the procedure starting from
+
+~~~~~
+python create-reads-for-te-sequences.py --read-length 100 --te-sequences teseqs.txt --max-error-rate 0.05 --boost 100 --output tereads.txt
+~~~~~
+
+until the count of masked N's does not increase anymore
+
+**create the TE merged reference**
+
+Simply concatenate the masked reference (2R-cyp6g1.masked.fasta, the output of bedtools maskfasta) with the consensus sequences of TEs
+
+~~~~
+cat 2R-cyp6g1.masked.fasta teseqs.txt > 2R-cyp6g1.temergedref.fasta
+~~~~
+
+** create the TE hierarchy**
+
+Unfortunately, this step cannot be automated. We need to manually (or with the help of Excel or some custom script) create a hierarchy entry for every fasta entry in the TE sequence file (i.e. teseqs.txt).
+
+It's probably best to first get a list of all the entries in the fasta file
+~~~~
+cat teseqs.txt|grep '^>'|perl -pe 's/>//' > te-hierarchy.txt
+~~~~
+
+Then you may manually create a hierarchy. Note that the three fields id, family and order are required. For details please see [Manual]
+
+
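
As one possible "custom script" for the hierarchy step, the following sketch turns the fasta headers into a skeleton with the three required fields (id, family, order). It is an assumption-laden illustration, not the author's tool: the function names and the `UNKNOWN` placeholder are hypothetical, the tab-separated layout is inferred from the example row `DM_ROO  roo     LTR`, and the family and order columns still have to be filled in by hand. Check the [Manual] for the exact file format, including whether a header line is expected.

```python
def hierarchy_skeleton(fasta_lines, placeholder="UNKNOWN"):
    """Return one (id, family, order) row per FASTA header line."""
    rows = []
    for line in fasta_lines:
        if line.startswith(">"):
            te_id = line[1:].strip()  # same effect as removing the leading '>'
            rows.append((te_id, placeholder, placeholder))
    return rows


def format_rows(rows):
    """Join the rows into tab-separated text, one entry per line."""
    return "\n".join("\t".join(row) for row in rows)


# Toy input; the sequences are made up, only the header names matter here.
example = [">INE1", "ACGT", ">DM_ROO", "TTGA"]
rows = hierarchy_skeleton(example)
print(format_rows(rows))
```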

WalkthroughPreparatoryWork modified by Robert Kofler

Robert Kofler — Tue, 26 Jan 2016 15:41:34 -0000

--- v15
+++ v16
@@ -149,7 +149,7 @@

 + samtools
 + a mapper (e.g. bwa)
-+ The script: 
++ The script: http://sourceforge.net/projects/popoolation-te2/files/create-reads-for-te-sequences.py

 ** Create reads for TE sequences**

WalkthroughPreparatoryWork modified by Robert Kofler

Robert Kofler — Tue, 26 Jan 2016 15:38:17 -0000

--- v14
+++ v15
@@ -144,7 +144,12 @@
 It has been pointed out that RepeatMasker sometimes misses TE sequences and that this could lead to problems for tools that rely on RepeatMasking for TE identification (http://nar.oxfordjournals.org/content/43/22/10655).
 To avoid this problem, we introduce an additional approach for masking the genome which is based on artificial reads for the TE sequences and on the mapper that will be used for mapping reads to the TE-merged-reference. In short, artificial TE-derived reads are mapped to the reference genome using the appropriate mapper, and the mapping positions of the reads are masked. This procedure is repeated until no further unmasked regions are found.

-We start with the reference sequence and the set of TE consensus sequences. Also download the script: 
+We start with the reference sequence and the set of TE consensus sequences. 
+This walkthrough additionally requires
+
++ samtools
++ a mapper (e.g. bwa)
++ The script: 

 ** Create reads for TE sequences**

WalkthroughPreparatoryWork modified by Robert Kofler

Robert Kofler — Tue, 26 Jan 2016 15:36:14 -0000

--- v13
+++ v14
@@ -159,7 +159,7 @@

 ~~~~~
 bwa index 2R-cyp6g1.fasta
-bwa bwasw 2R-cyp6g1.fasta tereads.txt |samtools view -Sb - |samtools sort tereads.sort
+bwa bwasw 2R-cyp6g1.fasta tereads.txt |samtools view -Sb - |samtools sort - tereads.sort
 ~~~~~

 ** Mask the mapping positions **

WalkthroughPreparatoryWork modified by Robert Kofler

Robert Kofler — Tue, 26 Jan 2016 15:35:24 -0000

--- v12
+++ v13
@@ -13,7 +13,7 @@
 **1. download the data **
 http://sourceforge.net/projects/popoolation-te2/files/rawdata_preparatory.zip/download

-We need the fasta sequence of the chromosome and the annotation which is in bed. Following a small sample
+We need the fasta sequence of the chromosome and the TE annotation, which is in bed format. The following is a small example of the TE annotation.

 ~~~~~
 2R 13240   13529   1360_1
@@ -46,7 +46,7 @@
 ~~~~

 **5. create the TE hierachy**
-Unfortunatelly, this step can not be automated. We need to manually (or with the help of excel, or with some custom script) create a hiearchy entry for every fasta entry in the TE sequence file (i.e. 2R-cyp6g1.teseqs.fasta).
+Unfortunately, this step cannot be automated. We need to manually (or with the help of Excel or some custom script) create a hierarchy entry for every fasta entry in the TE sequence file (i.e. 2R-cyp6g1.teseqs.fasta).

 It's probably best to first get a list of all the entries in the fasta file
 ~~~~
@@ -81,15 +81,16 @@
 ##Starting with a set of consensus TE sequences using RepeatMasker##
 **1. download the data **
 http://sourceforge.net/projects/popoolation-te2/files/rawdata_preparatory.zip/download
-In addition to the reference sequence we now need a set of consensus sequences of TEs (eg one sequence for every relevant family). 
-In this walkthrough we use the file teseqs.txt.  
-This file contains consensus sequences for four TE families in Drosophila
+We need the reference sequence and the set of TE consensus sequences. The file *teseqs.txt* contains consensus sequences for four TE families in Drosophila.

+
+Let's investigate the content of the file with

 ~~~~
 cat teseqs.txt|grep -A 1 '^>'
 ~~~~

+which yields:

 ~~~~
 >INE1
@@ -122,7 +123,7 @@

 **4. create the TE hierarchy**

-Unfortunatelly, this step can not be automated. We need to manually (or with the help of excel, or with some custom script) create a hiearchy entry for every fasta entry in the TE sequence file (i.e. teseqs.txt).
+Unfortunately, this step cannot be automated. We need to manually (or with the help of Excel or some custom script) create a hierarchy entry for every fasta entry in the TE sequence file (i.e. teseqs.txt).

 It's probably best to first get a list of all the entries in the fasta file
 ~~~~
@@ -139,17 +140,27 @@
 DM_ROO  roo     LTR
 ~~~~

-##Starting with a set of consensus TE sequences using iterative mapping (most reliable)##
+##Starting with a set of consensus TE sequences using iterative mapping ##
+It has been pointed out that RepeatMasker sometimes misses TE sequences and that this could lead to problems for tools that rely on RepeatMasking for TE identification (http://nar.oxfordjournals.org/content/43/22/10655).
+To avoid this problem, we introduce an additional approach for masking the genome which is based on artificial reads for the TE sequences and on the mapper that will be used for mapping reads to the TE-merged-reference. In short, artificial TE-derived reads are mapped to the reference genome using the appropriate mapper, and the mapping positions of the reads are masked. This procedure is repeated until no further unmasked regions are found.
+
+We start with the reference sequence and the set of TE consensus sequences. Also download the script: 

-#FAQ#
-**What's the difference between these two approaches (annotation vs. consensus sequences)**
+** Create reads for TE sequences**
+The following script creates artificial single-end reads for the TE sequences. The script will generate *--boost* reads for every position in every TE sequence. Moreover, reads will have a random 'sequencing error rate' ranging from *0.0* to *--max-error-rate*. Note that a high boost factor will serve to reduce the number of necessary iterations.

-The approach based on the annotation allows to have for every family multiple sequences, that may be slighly divereged.  
-By contrast the consensus sequence based approach provides for every family a single sequence, i.e. the consensus sequence. The annotation-based approach may thus be more sensitive and allow the identification of TE seqeunces that diverged from the consensus sequence
+~~~~~
+python create-reads-for-te-sequences.py --read-length 100 --te-sequences teseqs.txt --max-error-rate 0.05 --boost 100 --output tereads.txt
+~~~~~

-**How do I create a TE annotation which allows to proceed with the more sensitive annotation-based approach**
-Several software tools exists for annotating TE sequences. Especially Hadi Quesneville and Casey Bergman did great work about TE annotation. Here is a link to our work where we annotated the Drosophila simulans genome using a custom pipeline http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005406
+** Map reads to the reference genome ** 
+Use the same mapper here that you will also use for mapping your real data to the TE-merged-reference. We recommend bwa bwasw.

+~~~~~
+bwa index 2R-cyp6g1.fasta
+bwa bwasw 2R-cyp6g1.fasta tereads.txt |samtools view -Sb - |samtools sort tereads.sort
+~~~~~

+** Mask the mapping positions **
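
Conceptually, masking the mapping positions means overwriting every reference base covered by a mapped TE read with `N`. The sketch below illustrates that idea on a plain string; it is not how bedtools is implemented, the name `mask_intervals` is hypothetical, and intervals are assumed to be 0-based and half-open, as in the BED format.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Replace every base inside the given (start, end) intervals with mask_char."""
    bases = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        start = max(start, 0)
        end = min(end, len(bases))
        for i in range(start, end):
            bases[i] = mask_char
    return "".join(bases)


# Two overlapping toy intervals; together they cover positions 2 to 6.
masked = mask_intervals("ACGTACGTAC", [(2, 5), (4, 7)])
print(masked)
```

Overlapping intervals need no special handling here, because masking an already masked base leaves it unchanged.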