This walkthrough shows how to use Manna to compute a multiple sequence alignment of the transposable element (TE) annotations of piRNA cluster 42AB in five D. melanogaster lines.
We store the FASTA files and the TE library in dedicated directories and create a directory for the RepeatMasker output.
    mkdir fasta resources rm
    # Move the TE library first so that the *.fasta glob below does not also match it.
    mv TE_dmel.fasta resources/.
    mv *.fasta fasta/.
The following commands change into the fasta directory and run RepeatMasker on every FASTA file in it.
    cd fasta
    for i in *.fasta; do
        RepeatMasker -pa 20 -no_is -s -nolow -dir ../rm/ -lib ../resources/TE_dmel.fasta "$i"
    done
Once RepeatMasker has finished, its output files with the suffix 'fasta.out' are stored in the rm directory.
These files are the input required for the Manna analysis.
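Before starting Manna it can help to confirm that every FASTA file produced a matching 'fasta.out' file. The following is a minimal sketch, not part of Manna; the function name `missing_rm_outputs` is hypothetical, and the directory names `fasta` and `rm` are taken from the commands above.

```python
from pathlib import Path


def missing_rm_outputs(fasta_dir, rm_dir):
    """Return the names of expected '<name>.fasta.out' files absent from rm_dir."""
    fasta_dir, rm_dir = Path(fasta_dir), Path(rm_dir)
    missing = []
    for fasta in sorted(fasta_dir.glob("*.fasta")):
        # RepeatMasker names its annotation table after the input file plus '.out'.
        expected = rm_dir / (fasta.name + ".out")
        if not expected.is_file():
            missing.append(expected.name)
    return missing


# Example call (directory names follow the walkthrough, adjust as needed):
#     for name in missing_rm_outputs("fasta", "rm"):
#         print("missing:", name)
```

An empty result means every input has an annotation file; any name that is reported should be checked in the RepeatMasker log before continuing.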
The following command computes the multiple sequence alignment of the RepeatMasker annotations.
    python manna-code/cluster-msa.py \
        --gap 0.09 --mm 0.1 --match 0.2 \
        --input-format repeatmasker \
        --output-detail long \
        --clusters "Iso1_1.fasta.out,Pi2_1.fasta.out,Canton-S_1.fasta.out,DGRP732_1.fasta.out,A4_1.fasta.out" \
        --sample-IDs "Iso1,Pi2,CS,D732,A4" \
        --cluster-ID "1" > 1.msa
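In the command above, the entries of --clusters and --sample-IDs appear to correspond by position (Iso1_1.fasta.out with Iso1, Canton-S_1.fasta.out with CS, and so on). Assuming that is how Manna pairs them, the two lists can be generated from a single list of pairs so they cannot drift apart. This is a sketch only; `build_manna_command` is a hypothetical helper, and every flag and value is copied from the command above.

```python
import shlex


def build_manna_command(samples, cluster_id, script="manna-code/cluster-msa.py"):
    """Build the cluster-msa.py call from (sample_id, repeatmasker_out_file) pairs.

    Assumption: the i-th sample ID labels the i-th file given to --clusters.
    """
    sample_ids = ",".join(sample_id for sample_id, _ in samples)
    clusters = ",".join(out_file for _, out_file in samples)
    return [
        "python", script,
        "--gap", "0.09", "--mm", "0.1", "--match", "0.2",
        "--input-format", "repeatmasker",
        "--output-detail", "long",
        "--clusters", clusters,
        "--sample-IDs", sample_ids,
        "--cluster-ID", str(cluster_id),
    ]


walkthrough_samples = [
    ("Iso1", "Iso1_1.fasta.out"),
    ("Pi2", "Pi2_1.fasta.out"),
    ("CS", "Canton-S_1.fasta.out"),
    ("D732", "DGRP732_1.fasta.out"),
    ("A4", "A4_1.fasta.out"),
]
# Print the command in a form that can be pasted into a shell.
print(shlex.join(build_manna_command(walkthrough_samples, 1)))
```

The returned list can also be passed to `subprocess.run` with `stdout` redirected to a file such as 1.msa.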
This writes a detailed output file (1.msa). The first 10 lines of the output are shown below.
    #Score: 13034.129999999988
    #Samples D732 A4 CS Iso1 Pi2
    #ClusterID 1
    #TE-fam clu_start length div score 'te_strand:te_start:te_end
    ROXELEMENT 487.0 235.0 25.5 964.0 '-:4115..4357   ROXELEMENT 487.0 235.0 25.5 964.0 '-:4115..4357   ROXELEMENT 487.0 235.0 25.5 964.0 '-:4115..4357   ROXELEMENT 487.0 235.0 25.5 964.0 '-:4115..4357   ROXELEMENT 487.0 235.0 25.5 964.0 '-:4115..4357
    INE1 722.0 52.0 11.8 232.0 '-:2..45   INE1 722.0 52.0 11.8 232.0 '-:2..45   INE1 722.0 52.0 11.8 232.0 '-:2..45   INE1 722.0 52.0 11.8 232.0 '-:2..45   INE1 722.0 52.0 11.8 232.0 '-:2..45
    INE1 966.0 86.0 12.0 382.0 '+:475..566   INE1 966.0 86.0 12.0 382.0 '+:475..566   INE1 966.0 86.0 12.0 382.0 '+:475..566   INE1 966.0 86.0 12.0 382.0 '+:475..566   INE1 966.0 86.0 12.0 382.0 '+:475..566
    INE1 1049.0 118.0 21.4 415.0 '-:213..338   INE1 1049.0 118.0 21.4 415.0 '-:213..338   INE1 1049.0 118.0 21.4 415.0 '-:213..338   INE1 1049.0 118.0 21.4 415.0 '-:213..338   INE1 1049.0 118.0 21.4 415.0 '-:213..338
    INE1 1101.0 123.0 18.2 460.0 '-:285..395   INE1 1101.0 123.0 18.2 460.0 '-:285..395   INE1 1101.0 123.0 18.2 460.0 '-:285..395   INE1 1101.0 123.0 18.2 460.0 '-:285..395   INE1 1101.0 123.0 18.2 460.0 '-:285..395
    BS3 1273.0 170.0 4.7 1455.0 '+:472..641   BS3 1273.0 170.0 4.7 1455.0 '+:472..641   BS3 1273.0 170.0 4.1 1477.0 '+:472..641   BS3 1273.0 170.0 4.1 1477.0 '+:472..641   BS3 1273.0 170.0 4.1 1477.0 '+:472..641
The header consists of the first 4 lines, each starting with a '#':
line1: the alignment score of the multiple sequence alignment
line2: ordered sample IDs
line3: the clusterID
line4: the columns that are printed for each sample ID. In the 'long' output these are the TE family name, the start position in the cluster (clu_start), the length, the divergence, and the Smith-Waterman score of the annotation. The last column reports the orientation of the TE annotation (+/-) together with the first and last matching positions in the TE sequence.
The header is followed by the resulting multiple sequence alignment, with the samples in the order given on line 2 of the header.
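For downstream analysis, the 'long' output can be read into per-sample records. The sketch below is not part of Manna. It assumes the layout seen in the excerpt: whitespace-separated fields and exactly six fields per sample in every row. How Manna prints a gap (a sample with no annotation at an aligned position) is not visible in the excerpt, so rows with a different field count are rejected rather than guessed at. The names `parse_long_msa` and `_parse_row` and the dictionary keys are hypothetical.

```python
FIELDS_PER_SAMPLE = 6  # TE-fam, clu_start, length, div, score, strand:start..end


def _parse_row(line, samples):
    """Split one alignment row into a dict keyed by sample ID."""
    fields = line.split()
    if len(fields) != FIELDS_PER_SAMPLE * len(samples):
        # Assumption: every sample has a full six-field entry in each row.
        raise ValueError(f"unexpected number of fields: {line!r}")
    row = {}
    for i, sample in enumerate(samples):
        chunk = fields[i * FIELDS_PER_SAMPLE:(i + 1) * FIELDS_PER_SAMPLE]
        family, clu_start, length, div, score, position = chunk
        # The last field looks like '-:4115..4357 (with a leading apostrophe).
        strand, span = position.lstrip("'").split(":", 1)
        te_start, te_end = span.split("..")
        row[sample] = {
            "family": family,
            "clu_start": float(clu_start),
            "length": float(length),
            "div": float(div),
            "score": float(score),
            "strand": strand,
            "te_start": int(te_start),
            "te_end": int(te_end),
        }
    return row


def parse_long_msa(lines):
    """Parse 'long' output lines into (header, rows)."""
    header = {"score": None, "samples": [], "cluster_id": None}
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#Score:"):
            header["score"] = float(line.split(":", 1)[1])
        elif line.startswith("#Samples"):
            header["samples"] = line.split()[1:]
        elif line.startswith("#ClusterID"):
            header["cluster_id"] = line.split()[1]
        elif line.startswith("#"):
            continue  # header line 4: column description
        else:
            rows.append(_parse_row(line, header["samples"]))
    return header, rows


# Example call:
#     with open("1.msa") as handle:
#         header, rows = parse_long_msa(handle)
```

Each row is a dictionary keyed by sample ID, so `rows[0]["Iso1"]["family"]` gives the TE family annotated for Iso1 at the first aligned position.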