As prerequisites PopoolationTE2 requires a TE-merged-reference and a TE hierarchy (for details see [Manual]). In this walkthrough we will create these two files using a small sample data set. The following walkthrough is based on a small region of D. melanogaster chromosome 2R, ranging from 7,500,000 to 8,500,000.
If you want to generate these prerequisites for your data you need a.) a reference genome and either b1) a TE annotation for the reference genome or b2) a set of consensus sequences of TEs.
**1. download the data **
http://sourceforge.net/projects/popoolation-te2/files/rawdata_preparatory.zip/download
We need the fasta sequence of the chromosome (fasta format) and the TE annotation ( bed format). Following a small example of the TE annotation.
2R 13240 13529 1360_1
2R 13098 13219 INE1_1
2R 16644 16750 INE1_2
2R 13895 14001 1360_2
2R 108473 113605 DMIFACA_1
The first column is the reference chromosome, the second the start of the TE, the third the end of the TE and the fourth is the ID of the TE. The fourth column must have unique identifiers
2. extract the TE sequences into a separate file
bedtools getfasta -s -name -fi 2R-cyp6g1.fasta -fo 2R-cyp6g1.teseqs.fasta -bed 2R-cyp6g1.teannotation.bed
3. mask the TE sequences in the reference genome
bedtools maskfasta -fi 2R-cyp6g1.fasta -fo 2R-cyp6g1.masked.fasta -bed 2R-cyp6g1.teannotation.bed
**4. create the TE-merged-reference **
The TE-merged-reference can simply be created by merging the masked reference genome and the TE sequences
cat 2R-cyp6g1.masked.fasta 2R-cyp6g1.teseqs.fasta > 2R-cyp6g1.temergedref.fasta
5. create the TE hierachy
Unfortunatelly, this step can not be automated. We need to manually (or with the help of excel, or some custom script) create a hiearchy entry for every fasta entry in the TE sequence file (i.e. 2R-cyp6g1.teseqs.fasta).
It's propably best to first get a list of all the entries in the fasta file
cat 2R-cyp6g1.teseqs.fasta|grep '^>'|perl -pe 's/>//' > te-hierarchy.txt
and than to manually create a hierachy. Note that the three fields id, family and order are required. For details please see [Manual].
id family order
1360_1 1360 TIR
INE1_1 Ine1 TIR
INE1_2 Ine1 TIR
1360_2 1360 TIR
DMIFACA_1 I-element nonLTR
INE1_3 Ine1 TIR
INE1_4 Ine1 TIR
INE1_5 Ine1 TIR
DM_ROO_1 roo LTR
INE1_6 Ine1 TIR
INE1_7 Ine1 TIR
INE1_8 Ine1 TIR
DM_ROO_2 roo LTR
INE1_9 Ine1 TIR
INE1_10 Ine1 TIR
DMIFACA_2 I-element nonLTR
INE1_11 Ine1 TIR
INE1_12 Ine1 TIR
1. download the data **
http://sourceforge.net/projects/popoolation-te2/files/rawdata_preparatory.zip/download
We need the reference sequence and the set of TE consensus sequences. The file teseqs.txt contains consensus sequences for four TE families in Drosophila.
Let's investigate the content of the file with
cat teseqs.txt|grep -A 1 '^>'
which yields:
>INE1
TATACCCGTTACTAGATTCGTTGAAATGAATGTAACAGGCAGAAGGAAGCGTCTTAGACC
--
>1360
CAAAGACACTAGAATAACAAGATGCGTAACGCCATACGATTTTTTGGCACACGATTTTTT
--
>DMIFACA
CATTACCACTTCAACCTCCGAAGAGATAAGTCGTGCCTCTCAGTCTAAAGCCTCGCTTCG
--
>DM_ROO
TGTTCACACATGAACACGAATATATTTAAAGACTTACAATTTTGGGCTCCGTTCATATCT
2. repeat mask the reference genome
perl RepeatMasker -gccalc -s -cutoff 200 -no_is -nolow -norna -gff -u -pa 4 -lib teseqs.txt 2R-cyp6g1.fasta
3. create the TE merged reference
Simply concatenate the RepeatMasker output (*.fasta.masked) with the consensus sequences of TEs
cat 2R-cyp6g1.fasta.masked teseqs.txt > 2R-cyp6g1.temergedref.fasta
4. create the TE hierarchy
Unfortunatelly, this step can not be automated. We need to manually (or with the help of excel, or some custom script) create a hiearchy entry for every fasta entry in the TE sequence file (i.e. teseqs.txt).
It's propably best to first get a list of all the entries in the fasta file
cat teseqs.txt|grep '^>'|perl -pe 's/>//' > te-hierarchy.txt
Than you may manually create a hierachy. Note that the three fields id, family and order are required. For details please see [Manual]
id family order
INE1 Ine1 TIR
1360 1360 TIR
DMIFACA I-element nonLTR
DM_ROO roo LTR
It has been pointed out that RepeatMasker sometimes misses TE sequences and that this could lead to problems for tools that rely on RepeatMasking for TE identification (http://nar.oxfordjournals.org/content/43/22/10655). Note that for proper performance of PoPoolationTE2, it is not necessary that all TE sequences are identified with RepeatMasker, but rather that any ambiguous mapping between TE-sequence and masked-reference-genome are avoided.
The following approach ensures (!) that all TE-derived reads will align to a TE sequence (and not to a reference chromosome). This approach is based on artificial reads for the TE sequences and relies directly on the mapper that will be used for mapping reads to the TE-merged-reference further downstream in the pipelin. In short, artificial TE-derived reads will be mapped to the reference genome using the appropriate mapper (e.g. bwa bwasw), and the mapping positions of the reads will be masked ('N'). This procedure is repeated until no further unmasked regions are found.
We start with the reference sequence and the set of TE consensus sequences. This walkthrough additionally requires
** Create reads for TE sequences
The following scripts creates artificial single-end reads for TE sequences. The script will generate --boost reads for every site in every TE sequence. Moreover, reads will have a random 'sequencing error rate' ranging from 0.0 to --max-error-rate. Note that a high boost factor will serve to reduce the number of necessary iterations.
python <path>create-reads-for-te-sequences.py --read-length 100 --te-sequences teseqs.txt --max-error-rate 0.05 --boost 100 --output tereads.txt
** Map reads to the reference genome **
Use the same mapper that will latter on be used to map your paired end reads to the TE-merged-reference. We recommend bwa bwasw.
bwa index 2R-cyp6g1.fasta
bwa bwasw 2R-cyp6g1.fasta teread.txt |samtools view -Sb - |samtools sort - tereads.sort
** Mask the mapping positions **
bedtools bamtobed -i tereads.sort.bam > tereads.bed
bedtools maskfasta -fi 2R-cyp6g1.fasta -fo 2R-cyp6g1.masked.fasta -bed tereads.bed
** Count the masked sequence and repeat if necessary**
:::bash
cat 2R-cyp6g1.masked.fasta|grep -v '^>'|grep 'N'|perl -pe 's/[^N]//g'|wc -c
# output is 6773
The output states that 6773 bases have been masked with this approach.
You may repeat the procedure starting from
python <path>create-reads-for-te-sequences.py --read-length 100 --te-sequences teseqs.txt --max-error-rate 0.05 --boost 100 --output tereads.txt
until the count of masked N's does not increase anymore. Note that the number of necessary iterations may be reduced by setting a high --boost value.
create the TE merged reference
Simply concatenate the RepeatMasker output (*.fasta.masked) with the consensus sequences of TEs
cat 2R-cyp6g1.fasta.masked teseqs.txt > 2R-cyp6g1.temergedref.fasta
** create the TE hierarchy**
Unfortunatelly, this step can not be automated. We need to manually (or with the help of excel, or some custom script) create a hiearchy entry for every fasta entry in the TE sequence file (i.e. teseqs.txt).
It's propably best to first get a list of all the entries in the fasta file:
cat teseqs.txt|grep '^>'|perl -pe 's/>//' > te-hierarchy.txt
Than you may manually create a hierachy. Note that the three fields id, family and order are required. For details please see [Manual]