Manna Wiki

Brought to you by: filwierz, rokofler, sarahsignor

Walkthrough

Introduction

This walkthrough shows how to use Manna to compute a multiple sequence alignment of the transposable element (TE) annotations of the piRNA cluster 42AB in five D. melanogaster lines.

Requirements

* A reference library of TE sequences. The library used here is based on Quesneville 2005 (10.1371/journal.pcbi.0010022) and can be found here: https://sourceforge.net/projects/manna/files/walkthrough/resources/TE_dmel.fasta
* FASTA files of the sequences of interest. The 42AB sequences were obtained from publicly available genome assemblies with the following GenBank assembly accessions: GCA_000001215.4 (Iso1), GCA_003401745.1 (A4), GCA_004798075.2 (DGRP732), GCA_015832445.1 (Canton-S) and GCA_015852585.1 (Pi2). The 42AB sequences can be found here: https://sourceforge.net/projects/manna/files/walkthrough/fasta/
* RepeatMasker
* Python

Preparatory work

Setup directories

We store the FASTA files and the TE library in dedicated directories and create a directory for the RepeatMasker output.

mkdir resources
mv TE_dmel.fasta resources/

mkdir fasta
mv *.fasta fasta/

mkdir rm

Annotating TEs with RepeatMasker

Using the following commands we can run RepeatMasker on all FASTA files in the fasta directory.

cd fasta
for i in *.fasta; do RepeatMasker -pa 20 -no_is -s -nolow -dir ../rm/ -lib ../resources/TE_dmel.fasta "$i"; done

Once RepeatMasker has finished, we need its output files with the suffix 'fasta.out', which are stored in the rm directory.
These are the input files that are required for the Manna analysis.
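
To make the content of these 'fasta.out' files more concrete, the following is a minimal Python sketch that reads the standard RepeatMasker annotation columns (score, divergence, query position, strand, repeat name and position in the repeat). It is not Manna's own parser: the names `Hit` and `parse_rm_out` are hypothetical, and the two data rows in `EXAMPLE` are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One RepeatMasker annotation (hypothetical helper type)."""
    sw_score: int
    divergence: float
    query: str
    q_start: int
    q_end: int
    strand: str      # '+' or '-'
    te_family: str
    te_start: int
    te_end: int


def parse_rm_out(lines: Iterable[str]) -> List[Hit]:
    """Parse the whitespace-separated rows of a RepeatMasker .out file."""
    hits = []
    for line in lines:
        f = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(f) < 14 or not f[0].isdigit():
            continue
        strand = "-" if f[8] == "C" else "+"
        if strand == "+":
            # Forward strand: begin, end, (left)
            te_start, te_end = int(f[11]), int(f[12])
        else:
            # Complement strand: (left), end, begin
            te_start, te_end = int(f[13]), int(f[12])
        hits.append(Hit(int(f[0]), float(f[1]), f[4], int(f[5]), int(f[6]),
                        strand, f[9], te_start, te_end))
    return hits


# Made-up example input in the RepeatMasker .out layout (illustration only).
EXAMPLE = """\
   SW   perc perc perc  query     position in query    matching  repeat        position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family  begin end (left)  ID

  500   10.0  0.0  0.0  seq1      101   200   (800)  + TE_A      Unspecified       1  100  (900)   1
  300   20.5  1.0  0.0  seq1      301   350   (650)  C TE_B      Unspecified    (50)  450   401   2
"""

hits = parse_rm_out(EXAMPLE.splitlines())
for h in hits:
    print(h.te_family, h.q_start, h.strand, h.te_start, h.te_end)
```

For a real file, the same function can be called on an open file handle, for example `parse_rm_out(open("rm/Iso1_1.fasta.out"))`, assuming the file name used later in this walkthrough.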

Multiple sequence alignment

Using the following command we can obtain the multiple sequence alignment of the RepeatMasker annotations.

python manna-code/cluster-msa.py --gap 0.09 --mm 0.1 --match 0.2 --input-format repeatmasker --output-detail long --clusters "Iso1_1.fasta.out,Pi2_1.fasta.out,Canton-S_1.fasta.out,DGRP732_1.fasta.out,A4_1.fasta.out" --sample-IDs "Iso1,Pi2,CS,D732,A4" --cluster-ID "1" > 1.msa
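
When several clusters or sample sets are processed, it can help to assemble this command programmatically. The sketch below is a hypothetical helper (`build_manna_command` is not part of Manna) that uses only the options shown in the command above. It assumes that the entries of `--clusters` and `--sample-IDs` are paired by position, as the order in the command above suggests.

```python
import shlex
from typing import List, Sequence, Tuple


def build_manna_command(samples: Sequence[Tuple[str, str]],
                        cluster_id: str,
                        script: str = "manna-code/cluster-msa.py") -> List[str]:
    """Build the argument list for cluster-msa.py.

    samples: (sample ID, RepeatMasker .out file) pairs, kept in the same
    order for both comma-separated options.
    """
    ids = ",".join(sample_id for sample_id, _ in samples)
    files = ",".join(path for _, path in samples)
    return [
        "python", script,
        "--gap", "0.09", "--mm", "0.1", "--match", "0.2",
        "--input-format", "repeatmasker",
        "--output-detail", "long",
        "--clusters", files,
        "--sample-IDs", ids,
        "--cluster-ID", str(cluster_id),
    ]


samples = [
    ("Iso1", "Iso1_1.fasta.out"),
    ("Pi2", "Pi2_1.fasta.out"),
    ("CS", "Canton-S_1.fasta.out"),
    ("D732", "DGRP732_1.fasta.out"),
    ("A4", "A4_1.fasta.out"),
]
cmd = build_manna_command(samples, "1")
print(shlex.join(cmd))

# To actually run it and capture the alignment (not executed here):
#   import subprocess
#   with open("1.msa", "w") as fh:
#       subprocess.run(cmd, stdout=fh, check=True)
```

Keeping each sample ID next to its file in one list makes it harder to accidentally shift one of the two comma-separated arguments against the other.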

This results in a detailed output file. The first 10 lines of output are shown below.

#Score: 13034.129999999988
#Samples    D732    A4  CS  Iso1    Pi2
#ClusterID  1
#TE-fam clu_start   length  div score   'te_strand:te_start:te_end
ROXELEMENT  487.0   235.0   25.5    964.0   '-:4115..4357   ROXELEMENT  487.0   235.0   25.5    964.0   '-:4115..4357   ROXELEMENT  487.0   235.0   25.5    964.0   '-:4115..4357   ROXELEMENT  487.0   235.0   25.5    964.0   '-:4115..4357   ROXELEMENT  487.0   235.0   25.5    964.0   '-:4115..4357
INE1    722.0   52.0    11.8    232.0   '-:2..45    INE1    722.0   52.0    11.8    232.0   '-:2..45    INE1    722.0   52.0    11.8    232.0   '-:2..45    INE1    722.0   52.0    11.8    232.0   '-:2..45    INE1    722.0   52.0    11.8    232.0   '-:2..45
INE1    966.0   86.0    12.0    382.0   '+:475..566 INE1    966.0   86.0    12.0    382.0   '+:475..566 INE1    966.0   86.0    12.0    382.0   '+:475..566 INE1    966.0   86.0    12.0    382.0   '+:475..566 INE1    966.0   86.0    12.0    382.0   '+:475..566
INE1    1049.0  118.0   21.4    415.0   '-:213..338 INE1    1049.0  118.0   21.4    415.0   '-:213..338 INE1    1049.0  118.0   21.4    415.0   '-:213..338 INE1    1049.0  118.0   21.4    415.0   '-:213..338 INE1    1049.0  118.0   21.4    415.0   '-:213..338
INE1    1101.0  123.0   18.2    460.0   '-:285..395 INE1    1101.0  123.0   18.2    460.0   '-:285..395 INE1    1101.0  123.0   18.2    460.0   '-:285..395 INE1    1101.0  123.0   18.2    460.0   '-:285..395 INE1    1101.0  123.0   18.2    460.0   '-:285..395
BS3 1273.0  170.0   4.7 1455.0  '+:472..641 BS3 1273.0  170.0   4.7 1455.0  '+:472..641 BS3 1273.0  170.0   4.1 1477.0  '+:472..641 BS3 1273.0  170.0   4.1 1477.0  '+:472..641 BS3 1273.0  170.0   4.1 1477.0  '+:472..641

The header consists of the first 4 lines starting with a '#':
* line1: the alignment score of the multiple sequence alignment
* line2: ordered sample IDs
* line3: the clusterID
* line4: the columns that are printed for each sample ID. In the 'long' output, these are the TE-family name, the first position, the length, the divergence and the Smith-Waterman score of the annotation. The last column reports the orientation of the TE annotation (+/-) together with the first and last matching position in the TE sequence.

The header is followed by the resulting multiple sequence alignment in the order provided by the header.
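
The layout described above can be read back into Python with a few lines. The following is a minimal sketch, not part of Manna: `parse_manna_long` and `FIELDS` are hypothetical names, and the code assumes that fields are separated by whitespace and that every sample contributes exactly six non-empty fields per row. Rows that contain gaps may be encoded differently and would need extra handling. The example input is the header and the BS3 row from the output shown above.

```python
from typing import Dict, Iterable, List, Tuple

# Column names taken from header line 4 of the 'long' output.
FIELDS = ("te_fam", "clu_start", "length", "div", "score", "te_pos")


def parse_manna_long(lines: Iterable[str]) -> Tuple[dict, List[str], List[Dict[str, dict]]]:
    """Return (header info, ordered sample IDs, one dict per alignment row)."""
    header: dict = {}
    samples: List[str] = []
    rows: List[Dict[str, dict]] = []
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("#"):
            parts = line[1:].split()
            if parts[0] == "Score:":
                header["score"] = float(parts[1])
            elif parts[0] == "Samples":
                samples = parts[1:]
            elif parts[0] == "ClusterID":
                header["cluster_id"] = parts[1]
            continue  # the column description line is ignored
        f = line.split()
        n = len(FIELDS)
        if len(f) != n * len(samples):
            raise ValueError("unexpected number of fields: %d" % len(f))
        # Slice the row into one block of six fields per sample.
        rows.append({s: dict(zip(FIELDS, f[i * n:(i + 1) * n]))
                     for i, s in enumerate(samples)})
    return header, samples, rows


EXAMPLE = """\
#Score: 13034.129999999988
#Samples    D732    A4  CS  Iso1    Pi2
#ClusterID  1
#TE-fam clu_start   length  div score   'te_strand:te_start:te_end
BS3 1273.0  170.0   4.7 1455.0  '+:472..641 BS3 1273.0  170.0   4.7 1455.0  '+:472..641 BS3 1273.0  170.0   4.1 1477.0  '+:472..641 BS3 1273.0  170.0   4.1 1477.0  '+:472..641 BS3 1273.0  170.0   4.1 1477.0  '+:472..641
"""

header, samples, rows = parse_manna_long(EXAMPLE.splitlines())

# Report rows in which the samples do not all share the same divergence.
for row in rows:
    divs = {s: row[s]["div"] for s in samples}
    if len(set(divs.values())) > 1:
        print(row[samples[0]]["te_fam"], divs)
```

In the BS3 row, the divergence and score differ between samples (4.7 and 1455.0 in D732 and A4, 4.1 and 1477.0 in the other three), which is the kind of difference this loop reports.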
