PoPoolation TE2 Wiki

Brought to you by: rokofler

WalkthroughPreparatoryWork

Walkthrough preparatory work

Walkthrough preparatory work

As prerequisites PopoolationTE2 requires a TE-merged-reference and a TE hierarchy (for details see [Manual]). In this walkthrough we will create these two files using a small sample data set. The following walkthrough is based on a small region of D. melanogaster chromosome 2R, ranging from 7,500,000 to 8,500,000.

If you want to generate these prerequisites for your data you need a.) a reference genome and either b1) a TE annotation for the reference genome or b2) a set of consensus sequences of TEs.

Requirements

bedtools
RepeatMasker

Starting with a TE annotation

**1. download the data **
http://sourceforge.net/projects/popoolation-te2/files/rawdata_preparatory.zip/download

We need the fasta sequence of the chromosome (fasta format) and the TE annotation ( bed format). Following a small example of the TE annotation.

2R  13240   13529   1360_1
2R  13098   13219   INE1_1
2R  16644   16750   INE1_2
2R  13895   14001   1360_2
2R  108473  113605  DMIFACA_1

The first column is the reference chromosome, the second the start of the TE, the third the end of the TE and the fourth is the ID of the TE. The fourth column must have unique identifiers

2. extract the TE sequences into a separate file

bedtools getfasta -s -name -fi 2R-cyp6g1.fasta -fo 2R-cyp6g1.teseqs.fasta -bed 2R-cyp6g1.teannotation.bed

3. mask the TE sequences in the reference genome

bedtools maskfasta -fi 2R-cyp6g1.fasta -fo 2R-cyp6g1.masked.fasta -bed 2R-cyp6g1.teannotation.bed

**4. create the TE-merged-reference **

The TE-merged-reference can simply be created by merging the masked reference genome and the TE sequences

cat 2R-cyp6g1.masked.fasta 2R-cyp6g1.teseqs.fasta > 2R-cyp6g1.temergedref.fasta

5. create the TE hierachy
Unfortunatelly, this step can not be automated. We need to manually (or with the help of excel, or some custom script) create a hiearchy entry for every fasta entry in the TE sequence file (i.e. 2R-cyp6g1.teseqs.fasta).

It's propably best to first get a list of all the entries in the fasta file

cat 2R-cyp6g1.teseqs.fasta|grep '^>'|perl -pe 's/>//' > te-hierarchy.txt

and than to manually create a hierachy. Note that the three fields id, family and order are required. For details please see [Manual].

id      family  order
1360_1  1360    TIR
INE1_1  Ine1    TIR
INE1_2  Ine1    TIR
1360_2  1360    TIR
DMIFACA_1       I-element       nonLTR
INE1_3  Ine1    TIR
INE1_4  Ine1    TIR
INE1_5  Ine1    TIR
DM_ROO_1        roo     LTR
INE1_6  Ine1    TIR
INE1_7  Ine1    TIR
INE1_8  Ine1    TIR
DM_ROO_2        roo     LTR
INE1_9  Ine1    TIR
INE1_10 Ine1    TIR
DMIFACA_2       I-element       nonLTR
INE1_11 Ine1    TIR
INE1_12 Ine1    TIR

Starting with a set of consensus TE sequences using RepeatMasker

1. download the data **
http://sourceforge.net/projects/popoolation-te2/files/rawdata_preparatory.zip/download
We need the reference sequence and the set of TE consensus sequences. The file teseqs.txt contains consensus sequences for four TE families in Drosophila.

Let's investigate the content of the file with

cat teseqs.txt|grep -A 1 '^>'

which yields:

>INE1
TATACCCGTTACTAGATTCGTTGAAATGAATGTAACAGGCAGAAGGAAGCGTCTTAGACC
--
>1360
CAAAGACACTAGAATAACAAGATGCGTAACGCCATACGATTTTTTGGCACACGATTTTTT
--
>DMIFACA
CATTACCACTTCAACCTCCGAAGAGATAAGTCGTGCCTCTCAGTCTAAAGCCTCGCTTCG
--
>DM_ROO
TGTTCACACATGAACACGAATATATTTAAAGACTTACAATTTTGGGCTCCGTTCATATCT

2. repeat mask the reference genome

perl RepeatMasker -gccalc -s -cutoff 200 -no_is -nolow -norna -gff -u -pa 4 -lib teseqs.txt 2R-cyp6g1.fasta

3. create the TE merged reference

Simply concatenate the RepeatMasker output (*.fasta.masked) with the consensus sequences of TEs

cat 2R-cyp6g1.fasta.masked teseqs.txt > 2R-cyp6g1.temergedref.fasta

4. create the TE hierarchy

Unfortunatelly, this step can not be automated. We need to manually (or with the help of excel, or some custom script) create a hiearchy entry for every fasta entry in the TE sequence file (i.e. teseqs.txt).

It's propably best to first get a list of all the entries in the fasta file

cat teseqs.txt|grep '^>'|perl -pe 's/>//' > te-hierarchy.txt

Than you may manually create a hierachy. Note that the three fields id, family and order are required. For details please see [Manual]

id      family  order
INE1    Ine1    TIR
1360    1360    TIR
DMIFACA I-element       nonLTR
DM_ROO  roo     LTR

Starting with a set of consensus TE sequences using iterative mapping

It has been pointed out that RepeatMasker sometimes misses TE sequences and that this could lead to problems for tools that rely on RepeatMasking for TE identification (http://nar.oxfordjournals.org/content/43/22/10655). Note that for proper performance of PoPoolationTE2, it is not necessary that all TE sequences are identified with RepeatMasker, but rather that any ambiguous mapping between TE-sequence and masked-reference-genome are avoided.
The following approach ensures (!) that all TE-derived reads will align to a TE sequence (and not to a reference chromosome). This approach is based on artificial reads for the TE sequences and relies directly on the mapper that will be used for mapping reads to the TE-merged-reference further downstream in the pipelin. In short, artificial TE-derived reads will be mapped to the reference genome using the appropriate mapper (e.g. bwa bwasw), and the mapping positions of the reads will be masked ('N'). This procedure is repeated until no further unmasked regions are found.

We start with the reference sequence and the set of TE consensus sequences. This walkthrough additionally requires

samtools
a mapper (here bwa bwasw)
The script: http://sourceforge.net/projects/popoolation-te2/files/create-reads-for-te-sequences.py

** Create reads for TE sequences
The following scripts creates artificial single-end reads for TE sequences. The script will generate --boost reads for every site in every TE sequence. Moreover, reads will have a random 'sequencing error rate' ranging from 0.0 to --max-error-rate. Note that a high boost factor will serve to reduce the number of necessary iterations.

python <path>create-reads-for-te-sequences.py --read-length 100 --te-sequences teseqs.txt --max-error-rate 0.05 --boost 100 --output tereads.txt

** Map reads to the reference genome **
Use the same mapper that will latter on be used to map your paired end reads to the TE-merged-reference. We recommend bwa bwasw.

bwa index 2R-cyp6g1.fasta
bwa bwasw 2R-cyp6g1.fasta teread.txt |samtools view -Sb - |samtools sort - tereads.sort

** Mask the mapping positions **

bedtools bamtobed -i tereads.sort.bam > tereads.bed
bedtools maskfasta -fi 2R-cyp6g1.fasta -fo 2R-cyp6g1.masked.fasta -bed tereads.bed

** Count the masked sequence and repeat if necessary**

:::bash
cat 2R-cyp6g1.masked.fasta|grep -v '^>'|grep 'N'|perl -pe 's/[^N]//g'|wc -c
# output is 6773

The output states that 6773 bases have been masked with this approach.
You may repeat the procedure starting from

python <path>create-reads-for-te-sequences.py --read-length 100 --te-sequences teseqs.txt --max-error-rate 0.05 --boost 100 --output tereads.txt

until the count of masked N's does not increase anymore. Note that the number of necessary iterations may be reduced by setting a high --boost value.

create the TE merged reference

Simply concatenate the RepeatMasker output (*.fasta.masked) with the consensus sequences of TEs

cat 2R-cyp6g1.fasta.masked teseqs.txt > 2R-cyp6g1.temergedref.fasta

** create the TE hierarchy**

It's propably best to first get a list of all the entries in the fasta file:

cat teseqs.txt|grep '^>'|perl -pe 's/>//' > te-hierarchy.txt