Manna Wiki

Brought to you by: filwierz, rokofler, sarahsignor

Manual

Introduction

This is the manual for Manna, a tool for performing multiple alignments of (repeat) annotations. For example, it may be used to align annotated TE insertions in piRNA clusters.

Prerequisites and requirements

The script requires a repeat annotation for each sequence of interest; ideally the RepeatMasker output (suffix '.out') should be used.
Alternatively, to test the script with simple toy examples, a single-column input (one feature per line) is accepted ('--input-format toy').
Manna requires Python.
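
The toy format described above (one feature per line) is easy to produce by hand or with a few lines of Python. The following is a minimal sketch, not part of Manna: the function name, the file names and the feature lists are made up for illustration, and it assumes the toy files need nothing beyond one feature name per line.

```python
import tempfile
from pathlib import Path


def write_toy_input(path, features):
    """Write one feature name per line (assumed layout of the toy input)."""
    Path(path).write_text("".join(f"{name}\n" for name in features))


# Hypothetical toy samples: the second one lacks one INE1 copy.
toy_samples = {
    "toy_sample1.txt": ["ROXELEMENT", "INE1", "INE1", "BS3"],
    "toy_sample2.txt": ["ROXELEMENT", "INE1", "BS3"],
}

# A temporary directory is used here; any writable folder works.
out_dir = Path(tempfile.mkdtemp())
for filename, features in toy_samples.items():
    write_toy_input(out_dir / filename, features)
```

Files written this way could then be passed to '--clusters' together with '--input-format toy'.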

Installation

We recommend installing Manna with Subversion.
Go to the folder where you would like to install the tool and run the command provided at the code tab: https://sourceforge.net/p/manna/code

For example:

svn checkout https://svn.code.sf.net/p/manna/code/ manna-code

Alignment with manna ('cluster-msa.py')

In the example below, we use RepeatMasker outputs for piRNA cluster 1 from three different samples (e.g. Drosophila strains 1, 2 and 3) and align them with the default parameters.

example call:

python manna-code/cluster-msa.py --clusters "sample1_cluster1.fasta.out,sample2_cluster1.fasta.out,sample3_cluster1.fasta.out" --sample-IDs "sample1,sample2,sample3" --cluster-ID "cluster1" > cluster1.msa

parameters:

--clusters: input files (comma separated)
--sample-IDs: names of samples (comma separated and same order as input files)
--gap: gap score (float)
--mm: mismatch score (float)
--match: match score (float)
--max-div: maximum divergence of RepeatMasker annotations to be considered (float); features with higher divergence will be ignored
--output-detail: [short|normal|long]
--input-format: [repeatmasker|toy] repeatmasker: prefix.out; toy: 1 column with 1 feature per line
--min-len: minimum length of feature to be considered, shorter features will be ignored
--cluster-ID: name of aligned sequence
--quick-rm: this is an advanced parameter, mainly for convenience. Instead of providing separate RepeatMasker outputs, we may provide a single RepeatMasker file obtained by concatenating the sequences from multiple samples (e.g. cluster 42AB for three Drosophila strains) and performing the repeat annotation on this single file. In this case the strains are distinguished by a certain column in the out-file. When using this parameter, the two parameters --clusters and --sample-IDs need to be provided as empty strings, like this: --clusters "" --sample-IDs ""
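
When Manna is called from a pipeline, the command line can be assembled in Python instead of being typed by hand. The sketch below is only an illustration: 'build_manna_command' and 'run_manna' are hypothetical helpers, not part of Manna, and the values given for '--max-div' and '--min-len' are arbitrary examples rather than recommended settings. Only flags listed above are used.

```python
import subprocess


def build_manna_command(cluster_files, sample_ids, cluster_id, output_detail,
                        extra_options=None,
                        script="manna-code/cluster-msa.py"):
    """Assemble the argument list for cluster-msa.py (hypothetical helper)."""
    if len(cluster_files) != len(sample_ids):
        raise ValueError("need exactly one sample ID per input file")
    if output_detail not in ("short", "normal", "long"):
        raise ValueError("output_detail must be short, normal or long")
    command = [
        "python", script,
        "--clusters", ",".join(cluster_files),   # comma separated
        "--sample-IDs", ",".join(sample_ids),    # same order as the files
        "--cluster-ID", cluster_id,
        "--output-detail", output_detail,
    ]
    for flag, value in (extra_options or {}).items():
        command += [flag, str(value)]
    return command


def run_manna(command, output_path):
    """Run the command and write its standard output to output_path."""
    with open(output_path, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)


command = build_manna_command(
    ["sample1_cluster1.fasta.out", "sample2_cluster1.fasta.out",
     "sample3_cluster1.fasta.out"],
    ["sample1", "sample2", "sample3"],
    "cluster1",
    output_detail="long",
    extra_options={"--max-div": 20.0, "--min-len": 50},  # illustrative values
)
# run_manna(command, "cluster1.msa")  # uncomment once the input files exist
```

Passing the arguments as a list avoids shell quoting issues; the redirection '> cluster1.msa' from the example call is replaced by the 'stdout' handle.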

example outputs:

Depending on the '--output-detail' parameter, the first 10 lines of the final alignment written to 'cluster1.msa' look like this:

short:

#Score: 18098.060000000005
#Samples    sample1 sample2 sample3
#ClusterID  cluster1
#TE-fam
ROXELEMENT  ROXELEMENT  ROXELEMENT
INE1    INE1    -
INE1    INE1    INE1
INE1    INE1    INE1
INE1    INE1    INE1
BS3 BS3 BS3

Line1: total alignment score
Line2: sample-ids in the same order as the alignment
Line3: cluster-id
Line4: column header of the alignment: feature name
Line5 - LineN: the alignment in short format. Only the TE name is shown for each sample; a '-' marks a gap. In this example 3 samples were used. The order of the samples is the same as in Line2.
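
The line layout described above can be read back with a small parser. This is a minimal sketch, not part of Manna; it assumes that fields are separated by whitespace (tabs or spaces) and that feature names contain no whitespace. The function name 'parse_short_msa' is hypothetical.

```python
def parse_short_msa(text):
    """Split a short-format alignment into header values and alignment rows."""
    header = {}
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Score:"):
            header["score"] = float(line.split(":", 1)[1])
        elif line.startswith("#Samples"):
            header["samples"] = line.split()[1:]
        elif line.startswith("#ClusterID"):
            header["cluster_id"] = line.split()[1]
        elif line.startswith("#"):
            continue  # column header (Line4), not needed here
        else:
            rows.append(line.split())  # one entry per sample
    return header, rows


example = """#Score: 18098.060000000005
#Samples    sample1 sample2 sample3
#ClusterID  cluster1
#TE-fam
ROXELEMENT  ROXELEMENT  ROXELEMENT
INE1    INE1    -
INE1    INE1    INE1
INE1    INE1    INE1
INE1    INE1    INE1
BS3 BS3 BS3
"""

header, rows = parse_short_msa(example)

# List every gap as (row index, sample name).
gaps = [(index, sample)
        for index, row in enumerate(rows)
        for sample, name in zip(header["samples"], row)
        if name == "-"]
```

For the example above, 'gaps' contains a single entry: the second alignment row, where sample3 has no INE1 insertion.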

normal:

#Score: 18098.060000000005
#Samples    sample1 sample2 sample3
#ClusterID  cluster1
#TE-fam length  div
ROXELEMENT  235.0   25.5    ROXELEMENT  235.0   25.5    ROXELEMENT  235.0   25.5
INE1    52.0    11.8    INE1    52.0    11.8    -   -   -
INE1    70.0    14.7    INE1    70.0    14.7    INE1    70.0    14.7
INE1    115.0   21.0    INE1    115.0   21.0    INE1    115.0   23.5
INE1    123.0   18.2    INE1    123.0   18.2    INE1    123.0   18.2
BS3 170.0   4.1 BS3 170.0   4.7 BS3 168.0   3.6

Line1: total alignment score
Line2: sample-ids in the same order as the alignment
Line3: cluster-id
Line4: column header of the alignment: feature name, length of feature (in bp), divergence from reference (in %)
Line5 - LineN: the alignment in normal format. For each sample the TE name, the length of the TE insertion and the divergence (see Line4) are shown. In this example 3 samples were used. The order of the samples is the same as in Line2. With 3 samples and 3 fields per sample (Line4), each alignment row has 9 columns in total (3 x 3).
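
Because every sample contributes the same number of fields, a row in normal format can be cut into equally sized chunks, one per sample. The sketch below illustrates this under the assumption that fields are whitespace-separated; 'split_normal_row' and the key names are hypothetical and not part of Manna.

```python
NORMAL_FIELDS = ("te_fam", "length", "div")  # order taken from Line4


def split_normal_row(line, samples):
    """Return {sample: record or None} for one normal-format alignment row."""
    values = line.split()
    width = len(NORMAL_FIELDS)
    if len(values) != len(samples) * width:
        raise ValueError(
            f"expected {len(samples) * width} columns, got {len(values)}")
    row = {}
    for index, sample in enumerate(samples):
        te_fam, length, div = values[index * width:(index + 1) * width]
        if te_fam == "-":
            row[sample] = None  # gap: no feature aligned in this sample
        else:
            row[sample] = {"te_fam": te_fam,
                           "length": float(length),
                           "div": float(div)}
    return row


samples = ["sample1", "sample2", "sample3"]
gap_row = split_normal_row(
    "INE1    52.0    11.8    INE1    52.0    11.8    -   -   -", samples)
bs3_row = split_normal_row(
    "BS3 170.0   4.1 BS3 170.0   4.7 BS3 168.0   3.6", samples)
```

Here 'gap_row["sample3"]' is None, while 'bs3_row["sample3"]' holds the slightly shorter BS3 copy (168 bp) found in the third sample.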

long:

#Score: 18098.060000000005
#Samples    sample1 sample2 sample3
#ClusterID  cluster1
#TE-fam clu_start   length  div score   'te_strand:te_start:te_end
ROXELEMENT  487.0   235.0   25.5    958.0   '-:4115..4357   ROXELEMENT  487.0   235.0   25.5    958.0   '-:4115..4357   ROXELEMENT  487.0   235.0   25.5    880.0   '-:4115..4357
INE1    722.0   52.0    11.8    228.0   '-:2..45    INE1    722.0   52.0    11.8    228.0   '-:2..45    -   -   -   -   -   -
INE1    982.0   70.0    14.7    390.0   '+:499..566 INE1    982.0   70.0    14.7    390.0   '+:499..566 INE1    982.0   70.0    14.7    367.0   '+:499..566
INE1    1052.0  115.0   21.0    397.0   '-:213..335 INE1    1052.0  115.0   21.0    397.0   '-:213..335 INE1    1052.0  115.0   23.5    365.0   '-:213..335
INE1    1101.0  123.0   18.2    448.0   '-:285..395 INE1    1101.0  123.0   18.2    448.0   '-:285..395 INE1    1101.0  123.0   18.2    417.0   '-:285..395
BS3 1273.0  170.0   4.1 1447.0  '+:472..641 BS3 1273.0  170.0   4.7 1427.0  '+:472..641 BS3 1273.0  168.0   3.6 1377.0  '+:472..639

Line1: total alignment score
Line2: sample-ids in the same order as the alignment
Line3: cluster-id
Line4: column header of the alignment: feature name, start position in query (bp, 1-based), length of feature (in bp), divergence from reference repeat (in %), Smith-Waterman score, orientation and position in reference repeat
Line5 - LineN: the alignment in long format, where all information used by Manna is shown. For each sample the TE name, the length of the TE insertion, the divergence etc. (see Line4) are shown. In this example 3 samples were used. The order of the samples is the same as in Line2. With 3 samples and 6 fields per sample (Line4), each alignment row has 18 columns in total (3 x 6).
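
The last field of each sample in long format packs strand, start and end into one string such as '-:4115..4357. The following sketch shows one way to unpack it together with the other five fields. It is not part of Manna: the function and key names are hypothetical, whitespace-separated fields are assumed, and the leading apostrophe is simply treated as optional.

```python
import re

# Optional apostrophe, strand (+ or -), then start..end in the reference repeat.
TE_POSITION = re.compile(r"^'?([+-]):(\d+)\.\.(\d+)$")
LONG_WIDTH = 6  # fields per sample, as listed in Line4


def parse_te_position(field):
    """Turn "'-:4115..4357" into ('-', 4115, 4357)."""
    match = TE_POSITION.match(field)
    if match is None:
        raise ValueError(f"unexpected position field: {field!r}")
    strand, start, end = match.groups()
    return strand, int(start), int(end)


def split_long_row(line, samples):
    """Return {sample: record or None} for one long-format alignment row."""
    values = line.split()
    if len(values) != len(samples) * LONG_WIDTH:
        raise ValueError(
            f"expected {len(samples) * LONG_WIDTH} columns, got {len(values)}")
    row = {}
    for index, sample in enumerate(samples):
        chunk = values[index * LONG_WIDTH:(index + 1) * LONG_WIDTH]
        te_fam, clu_start, length, div, score, position = chunk
        if te_fam == "-":
            row[sample] = None  # gap in this sample
            continue
        strand, te_start, te_end = parse_te_position(position)
        row[sample] = {
            "te_fam": te_fam, "clu_start": float(clu_start),
            "length": float(length), "div": float(div),
            "score": float(score), "te_strand": strand,
            "te_start": te_start, "te_end": te_end,
        }
    return row


samples = ["sample1", "sample2", "sample3"]
line = ("BS3 1273.0  170.0   4.1 1447.0  '+:472..641 "
        "BS3 1273.0  170.0   4.7 1427.0  '+:472..641 "
        "BS3 1273.0  168.0   3.6 1377.0  '+:472..639")
bs3 = split_long_row(line, samples)
```

With the numeric fields converted, differences between samples (here the shorter BS3 copy in sample3, ending at position 639 instead of 641) can be compared directly.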
