MGEScan Code
Brought to you by:
wazimismail
| File | Date | Author | Commit |
|---|---|---|---|
| MGEScan_LTR | 2014-06-06 | | [b16533] Fixed memory leak in translate.c |
| MGEScan_nonLTR_v2 | 2014-06-06 | | [d7bf7e] Fixed memory leak in translate.c |
| .gitignore | 2014-05-06 | | [97cd41] Fixed a few memory issues |
| README | 2014-03-03 | | [b89065] Initial commit |
| run_MGEScan.pl | 2014-05-06 | | [97cd41] Fixed a few memory issues |
| splitMultiFasta.py | 2014-05-06 | | [97cd41] Fixed a few memory issues |
Installation
===============
To install MGEScan, follow the steps below:
1. Untar the downloaded file "MGEScan1.0.tar". This will automatically generate the
directory "MGEScan1.0".
Command: tar -xvf MGEScan1.0.tar
2. Install Tandem Repeats Finder (http://tandem.bu.edu/trf/trf.html) and add its path
to the MGEScan1.0/MGEScan_LTR/path.conf file.
3. Install the HMMER package and add the directory containing "hmmsearch" to your PATH in
a shell startup file such as .bashrc. To confirm that our program can find "hmmsearch",
type "hmmsearch" in the directories "MGEScan1.0/MGEScan_LTR" and
"MGEScan1.0/MGEScan_nonLTR_v2".
4. Install the EMBOSS package and add the directory containing "transeq" to your PATH in
a shell startup file such as .bashrc. To confirm that our program can find "transeq",
type "transeq" in the directories "MGEScan1.0/MGEScan_LTR" and
"MGEScan1.0/MGEScan_nonLTR_v2".
5. Make sure you have a Perl interpreter and a C/C++ compiler such as gcc or g++.
6. Run "make" to compile "translate" and "MGEScan".
- In the MGEScan1.0/MGEScan_LTR/MER directory
Command: make clean
Command: make all
- In the MGEScan1.0/MGEScan_nonLTR_v2 directory
Command: make clean
Command: make translate
- In the MGEScan1.0/MGEScan_nonLTR_v2/hmm directory
Command: make clean
Command: make MGEScan
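The PATH checks in steps 3 to 5 can also be scripted. The following is a minimal sketch, not part of MGEScan: the function name find_missing_tools and the tool list are assumptions made for illustration, and Tandem Repeats Finder is left out because its location is set in path.conf rather than on the PATH.

```python
import shutil

# Programs the steps above expect to be reachable from the shell PATH.
# This list is an assumption drawn from the installation steps.
REQUIRED_TOOLS = ["hmmsearch", "transeq", "perl", "make"]


def find_missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Run the script from the same shell you will use for MGEScan, so that it sees the PATH settings from your .bashrc.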
Configuration files (Only for MGEScan_LTR)
===========================================
1. Update the configuration file MGEScan1.0/MGEScan_LTR/path.conf.
a. sw_trf: path to Tandem Repeats Finder.
b. sw_rm (optional 1): path to RepeatMasker, if you want to preprocess.
c. rm_dir (optional 1): path to the directory where RepeatMasker results will be stored, if you want to preprocess.
d. scaffold (optional 2): path to the big file that has all scaffolds.
For example,
sw_trf=/home/mrho/sw/trf400.linux.exe
sw_rm=/home/mrho/sw/RepeatMasker/RepeatMasker
rm_dir=/home/mrho/genome/daphnia/rm/
scaffold=
2. Update the configuration file MGEScan1.0/MGEScan_LTR/value.conf.
a. min_dist: minimum distance(bp) between LTRs.
b. max_dist: maximum distance(bp) between LTRs.
c. min_len_ltr: minimum length(bp) of LTR.
d. max_len_ltr: maximum length(bp) of LTR.
e. ltr_sim_condition: minimum similarity(%) for LTRs in an element.
f. cluster_sim_condition: minimum similarity(%) for LTRs in a cluster.
g. len_condition: minimum length(bp) for LTRs aligned in local alignment.
For example, the default values are listed as follows.
min_dist=2000
max_dist=20000
min_len_ltr=130
max_len_ltr=2000
ltr_sim_condition=70
cluster_sim_condition=70
len_condition=70
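Both configuration files use plain "key=value" lines, so they are easy to sanity-check before a long run. The sketch below is not part of MGEScan. The function names parse_conf and check_value_conf are hypothetical, and the checks (numeric values, minimum not above maximum, similarities between 0 and 100) are assumptions inferred from the parameter descriptions above, not rules enforced by the program.

```python
def parse_conf(text):
    """Parse `key=value` lines into a dict; other lines are ignored."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_value_conf(settings):
    """Return a list of problems found in value.conf settings."""
    keys = ("min_dist", "max_dist", "min_len_ltr", "max_len_ltr",
            "ltr_sim_condition", "cluster_sim_condition", "len_condition")
    problems = []
    numeric = {}
    for key in keys:
        if key not in settings:
            problems.append(f"{key} is missing")
            continue
        try:
            numeric[key] = float(settings[key])
        except ValueError:
            problems.append(f"{key} is not a number: {settings[key]!r}")
    # Assumed rule: each minimum should not exceed its maximum.
    for low, high in (("min_dist", "max_dist"), ("min_len_ltr", "max_len_ltr")):
        if low in numeric and high in numeric and numeric[low] > numeric[high]:
            problems.append(f"{low} is greater than {high}")
    # Assumed rule: similarity thresholds are percentages.
    for key in ("ltr_sim_condition", "cluster_sim_condition"):
        if key in numeric and not 0 <= numeric[key] <= 100:
            problems.append(f"{key} must be between 0 and 100")
    return problems
```

For example, reading value.conf with open(...).read(), passing the text to parse_conf, and then calling check_value_conf on the result returns an empty list for the default values listed above.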
Running the program
====================
To run MGEScan, follow the steps below:
1. Put genome files in a directory. You can put them in any directory since you will
specify the directory when you run the program. Please make sure that the files in this
directory contain a single sequence per file (NOT A MULTIFASTA).
2. Run run_MGEScan.pl. This Perl script reads your genome files and runs the whole
process. Type the command on a single line. It takes four parameters (genome
directory, output data directory, HMMER version, and which program to run). Full paths
such as "/home/Workshop/genome/" are required.
Command: ./run_MGEScan.pl -genome=[directory that has genomes] -data=[directory where the
output will be saved] -hmmerv=[HMMER version: 2 or 3] -program=[L or N or B]
Example: ./run_MGEScan.pl -genome=/home/example/genome/ -data=/home/example/data/ -hmmerv=3 -program=B
Note: The parameter "program" takes one of three values:
- L : For running only MGEScan_LTR
- N : For running only MGEScan_nonLTR
- B : For running both programs
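Step 1 requires one sequence per file. If your genome is a multi-FASTA file, it has to be split first. The following is a minimal sketch of one way to do that. It is not the splitMultiFasta.py script shipped in this repository, whose behavior is not described here, and the function name split_multifasta, the ".fa" extension, and the use of the first header word as the file name are all assumptions.

```python
import os
import re


def split_multifasta(fasta_path, out_dir):
    """Write each record of `fasta_path` to its own file in `out_dir`.

    Returns the list of files written, in input order.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Name the file after the first word of the header,
                # replacing characters that are unsafe in file names.
                # Records with identical names would overwrite each other.
                words = line[1:].split()
                name = words[0] if words else "unnamed"
                safe = re.sub(r"[^A-Za-z0-9._-]", "_", name)
                path = os.path.join(out_dir, safe + ".fa")
                out = open(path, "w")
                written.append(path)
            # Lines before the first header are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

The output directory can then be passed to run_MGEScan.pl as the -genome directory, using its full path.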
Output
============
A. MGEScan_LTR:
Upon completion, MGEScan-LTR generates a file "ltr.out". This output file lists the
clusters and coordinates of the LTR retrotransposons identified. Each cluster of LTR
retrotransposons starts with a header line of the form "[cluster_number]---------",
followed by the LTR retrotransposons in the cluster. The columns for LTR
retrotransposons are as follows.
1. LTR_id: unique id of the LTR retrotransposon identified. It consists of two components,
the sequence file name and the id in the file. For example, chr1_2 is the second LTR
retrotransposon in the chr1 file.
2. start position of 5’ LTR.
3. end position of 5’ LTR.
4. start position of 3’ LTR.
5. end position of 3’ LTR.
6. strand: + or -.
7. length of 5’ LTR.
8. length of 3’ LTR.
9. length of the LTR retrotransposon.
10. TSD on the left side of the LTR retrotransposon.
11. TSD on the right side of the LTR retrotransposon.
12. di(tri)nucleotide on the left side of 5’ LTR.
13. di(tri)nucleotide on the right side of 5’ LTR.
14. di(tri)nucleotide on the left side of 3’ LTR.
15. di(tri)nucleotide on the right side of 3’ LTR.
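The layout described above can be read into a simple structure for downstream analysis. The sketch below rests on assumptions that this README does not confirm: that columns are separated by whitespace, that all 15 columns are present on each line, and that a cluster header is a label followed by a run of dashes. The column names and the function name parse_ltr_out are made up for illustration.

```python
import re

# Hypothetical names for the 15 columns, in the order listed above.
COLUMNS = [
    "ltr_id", "start_5ltr", "end_5ltr", "start_3ltr", "end_3ltr", "strand",
    "len_5ltr", "len_3ltr", "len_element", "tsd_left", "tsd_right",
    "nt_left_5ltr", "nt_right_5ltr", "nt_left_3ltr", "nt_right_3ltr",
]
INT_COLUMNS = ("start_5ltr", "end_5ltr", "start_3ltr", "end_3ltr",
               "len_5ltr", "len_3ltr", "len_element")

# Assumed header shape: a cluster label followed by three or more dashes.
CLUSTER_HEADER = re.compile(r"^(\S+?)-{3,}\s*$")


def parse_ltr_out(text):
    """Return {cluster_label: [record_dict, ...]} from ltr.out contents."""
    clusters = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        header = CLUSTER_HEADER.match(line)
        if header:
            current = header.group(1)
            clusters[current] = []
            continue
        # Assumed: whitespace-separated columns in the documented order.
        record = dict(zip(COLUMNS, line.split()))
        for key in INT_COLUMNS:
            if key in record:
                record[key] = int(record[key])
        clusters.setdefault(current, []).append(record)
    return clusters
```

If your ltr.out uses a different delimiter or leaves some columns empty, adjust the splitting step before relying on the column names.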
B. MGEScan_nonLTR:
Upon completion, MGEScan-nonLTR generates the directory "info" in the data directory you
specified. In this "info" directory, two subdirectories ("full" and "validation") are
generated.
- The "full" directory stores the sequences of elements. Each subdirectory in "full"
is named after a clade. Each clade directory lists the DNA sequences of the non-LTR
elements identified. Each sequence is in FASTA format. The header contains the position
information of the TE identified:
[genome_file_name]_[start position in the sequence]
For example, >chr1_333 means that this element starts at 333 bp in the "chr1" file.
- The "validation" directory stores Q values. In the files "en" and "rt", the
first column is the element name and the last column is the Q value.
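The header convention and the Q value files can be read with a few lines of code. This is a minimal sketch under stated assumptions: the "en" and "rt" files are taken to be whitespace-separated with a numeric last column, and the function names parse_element_header and read_q_values are hypothetical.

```python
def parse_element_header(header):
    """Split a header such as '>chr1_333' into ('chr1', 333)."""
    name = header.strip().lstrip(">").split()[0]
    # Split on the last underscore so that genome file names that
    # themselves contain "_" are kept whole.
    genome_file, _, start = name.rpartition("_")
    return genome_file, int(start)


def read_q_values(text):
    """Map element name (first column) to Q value (last column)."""
    q_values = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or single-column lines
        # Assumed: the last column parses as a number.
        q_values[fields[0]] = float(fields[-1])
    return q_values
```

For example, parse_element_header(">chr1_333") gives the genome file name "chr1" and the start position 333, matching the example above.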
License
============
Copyright (C) 2014 Mina Rho & Haixu Tang.
You may redistribute this software under the terms of the GNU General Public License.