--------------------------------------------------------------------------------
| metaProt |
--------------------------------------------------------------------------------
1. INTRODUCTION
2. DEPENDENCIES AND INSTALLATION
2.1 DEPENDENCIES
2.2 INSTALLATION
2.3 EXTERNAL PROGRAMS
2.4 DATABASES
2.5 DIRECTORY STRUCTURE
3. USAGE
3.1 INPUT FILES
3.2 CONFIG AND SQL FILES
3.3 OPTIONS
3.4 OUTPUT FILES
4. FUTURE WORK
1. INTRODUCTION
--------------------------------------------------------------------------------
metaProt is a Python pipeline for analyzing protein data from metagenomic
projects, with the aim of describing a sample using as much of the available
data as possible. Many informative tools exist for characterizing sequencing
data, but a single tool is often not enough, and in practice several of them
end up being used together. Pipeline scripts/programs such as metaProt
simplify and automate that process.
The default pipeline was implemented with proteins in mind. If you wish to
use nucleotide sequences, download the databases you want to use with the
integrated tools and follow the instructions below to adapt the pipeline
(see 3.2 CONFIG AND SQL FILES).
This software is released under the GNU General Public License (GPL).
2. DEPENDENCIES AND INSTALLATION
--------------------------------------------------------------------------------
2.1 DEPENDENCIES
The program is written and tested in Python 2.7. It uses standard built-in
modules, except for the ones listed below, which must be installed separately
by the user:
- Numpy (http://numpy.scipy.org/)
- Matplotlib
- Biopython
- sqlite3
2.2 INSTALLATION
Unpack the tarball to the folder of your preference.
tar xzf metaProt_XXX.tar.gz
To run the program from any directory, add the installation folder to your
PATH:
export PATH=$PATH:/installation/path/metaProt_XXX
Remember to add some databases to the databases/ directory.
Two HMMer databases are available for testing. To use them, download
databases.tar.gz and unpack it:
tar xzf databases.tar.gz
mv databases/ metaProt_XXX/
Command line:
python metaProt.py [options]
2.3 EXTERNAL PROGRAMS
metaProt uses the following external programs:
- HMMer:
A suite of HMM tools, available either from
the repositories of the major Linux distributions
or from the official website.
root@computer: apt-get install hmmer
http://hmmer.janelia.org/
- Pepstats:
General information about the composition of proteins.
You can install it from the repositories by installing the
EMBOSS suite.
root@computer: apt-get install emboss
http://emboss.sourceforge.net/apps/cvs/emboss/apps/pepstats.html
2.4 DATABASES
2.4.1 HMMer databases
You can download them already formatted or format them yourself.
All the HMMs must be in a single file, which then has to be pressed
(compressed and indexed) with hmmpress.
EX:
wget ftp.jcvi.org/pub/data/TIGRFAMs/TIGRFAMs_12.0_HMM.tar.gz
mkdir TG
tar xzf TIGRFAMs_12.0_HMM.tar.gz -C TG
cat TG/* > TIGRFAMS.hmm
hmmpress TIGRFAMS.hmm
2.5 DIRECTORY STRUCTURE
metaProt.py
  \___ Main script, where the magic is made!
Config/
  \___ Config directory where the default config files that define the
       databases and programs are stored
Lib/
  \___ Libraries and modules directory
  \___ __init__.py
       \___ Init file to import the libraries
  \___ DB.py
       \___ SQLite class. Controls the creation, maintenance and interaction
            of the program with the SQLite database.
  \___ functions
       \___ General functions used in the main() script.
  \___ handler
       \___ Handler class. Controls the flow of the program, executes the
            different associated programs / scripts and parses the outputs.
  \___ hmm.py
       \___ HMMer class. Controls the parameters, execution and parsing of
            the HMMer results.
  \___ pepstats.py
       \___ Pepstats class. Controls the execution and parsing of the
            pepstats analysis.
  \___ progress_bar.py
       \___ Library to create dynamic progress bars in terminals. Downloaded
            from
  \___ statistics.py
       \___ Library to plot data retrieved from the database in several forms
            (histograms, barplots and boxplots).
SQL/
  \___ Directory of SQL scripts to initialize the database tables
TMP/
  \___ Directory for temporary files created in the intermediate steps of
       the pipeline
databases/
  \___ Default directory where the different databases should be located.
3. USAGE
--------------------------------------------------------------------------------
3.1 INPUT FILES
metaProt takes as input a standard (multi)FASTA file with protein sequences.
Keep in mind that the program, by default, uses pepstats as part of the
pipeline. If you want to use it with nucleotide sequences, adapt the
databases and remove the pepstats line from the config.cfg file (or create
a new config file and pass it on the command line with -c/--config).
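As an illustration of the input format, the sketch below reads a (multi)FASTA
file using only the standard library. It is not metaProt's own reader:
read_fasta and the two example sequences are hypothetical. Since Biopython is
already a dependency, SeqIO.parse(handle, "fasta") would be the more usual
choice in practice.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a (multi)FASTA file handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the identifier is the first word of the header line.
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        elif identifier is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if identifier is not None:
        yield identifier, "".join(chunks)


# Hypothetical two-record protein file, kept in memory for the example.
example = io.StringIO(
    u">seq1 some description\nMKTAYIAK\nQRQISFVK\n>seq2\nMSDNE\n"
)
records = list(read_fasta(example))
```

Wrapped sequence lines are joined, so each record carries its full sequence
regardless of the line width used in the file.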
3.2 CONFIG AND SQL FILES
3.2.1 CONFIG FILES
The configuration file, found in Config/ by default, specifies which programs
the pipeline has to run and with which databases. The format is very simple,
but the names must be given without any ambiguity. The structure is as
follows:
PROGRAM_NAME: the first column specifies the name of the program to be
used. At the moment only two are available.
OPTIONS: HMMer
PEPSTATS
DATABASE_NAME: name to be used in the SQLite database. This name must
not contain any white space and has to correspond EXACTLY
to the name of an SQL file in the SQL/ directory (for
example, PFAM corresponds to SQL/PFAM.sql).
DATABASE_PATH: relative or absolute (preferred) path to the database to
be used. It refers to HMM databases, BLAST databases,
etc. Avoid white spaces.
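The three columns above could be read with a few lines of Python. This is
only a sketch under stated assumptions, not the parser metaProt ships: it
assumes the columns are separated by white space (which would explain why
names and paths must not contain any), that blank lines and lines starting
with # are ignored, and the names parse_config and Entry, as well as the
paths in the example, are hypothetical. How the PEPSTATS line is written in
the shipped config.cfg is not documented here, so the example only uses
HMMer lines.

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["program", "db_name", "db_path"])


def parse_config(text):
    """Return one Entry per non-empty line of a config file's text."""
    entries = []
    for number, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comments are skipped
        columns = line.split()
        if len(columns) > 3:
            # A fourth column usually means white space inside a name or path.
            raise ValueError("line %d: more than three columns" % number)
        # Missing trailing columns are filled with None.
        columns += [None] * (3 - len(columns))
        entries.append(Entry(*columns))
    return entries


# Hypothetical config contents; the database paths are placeholders.
sample = """
HMMer  PFAM      databases/PFAM.hmm
HMMer  TIGRFAMS  databases/TIGRFAMS.hmm
"""
entries = parse_config(sample)
```

Rejecting lines with more than three columns makes a stray space in a path
fail early instead of silently pointing the pipeline at the wrong database.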
3.2.2 SQL FILES
SQL files are used to initialize the tables used later in the pipeline.
If you want to add a new database for an existing tool, just copy the
appropriate file and change its name.
For instance, for a new HMMer database named MOTIFS, you should run:
cp SQL/PFAM.sql SQL/MOTIFS.sql
and specify MOTIFS as a database in your config.cfg file.
Two example SQL files for HMMer databases (TIGRFAMS.sql, PFAM.sql) are
provided, as well as the one for the main table and the one for the Pepstats
tool.
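The link between a database name and its SQL file can be sketched with the
sqlite3 module. This is an illustration, not the code in Lib/DB.py: the
function init_tables is hypothetical, and so is the table definition written
to MOTIFS.sql below, since the real schema lives in the shipped SQL files.

```python
import os
import shutil
import sqlite3
import tempfile


def init_tables(connection, sql_dir, database_names):
    """Run <sql_dir>/<name>.sql for every database name given."""
    for name in database_names:
        script_path = os.path.join(sql_dir, name + ".sql")
        if not os.path.isfile(script_path):
            raise IOError("no SQL script found for database %r" % name)
        with open(script_path) as handle:
            connection.executescript(handle.read())


# Example run against a temporary SQL/ directory and an in-memory database.
sql_dir = tempfile.mkdtemp()
try:
    with open(os.path.join(sql_dir, "MOTIFS.sql"), "w") as handle:
        # Hypothetical schema, for illustration only.
        handle.write(
            "CREATE TABLE IF NOT EXISTS MOTIFS "
            "(seq_id TEXT, hit TEXT, evalue REAL);"
        )
    connection = sqlite3.connect(":memory:")
    init_tables(connection, sql_dir, ["MOTIFS"])
    tables = [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    connection.close()
finally:
    shutil.rmtree(sql_dir)
```

A database name without a matching .sql file raises an error here, which
mirrors the requirement that the name in the config file corresponds exactly
to a file in SQL/.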
3.3 OPTIONS
Usage: metaProt.py [options]
Options:
-h, --help show this help message and exit
-c CONFIG, --config=CONFIG
Configuration file
-f INFILE, --file=INFILE
Sequences file (Fasta format)
-p PROJECT, --project=PROJECT
Project tag name (output directory name)
-n, --non-processing Use a database already created without reprocessing a
fasta file
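The help text above has the layout produced by Python's optparse module, so
the options can be sketched as follows. That metaProt uses optparse is an
assumption, as is the attribute name non_processing; config, infile and
project are inferred from the CONFIG, INFILE and PROJECT placeholders. The
file and project names in the example are hypothetical.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser with the four options listed in the help text."""
    parser = OptionParser(prog="metaProt.py")
    parser.add_option("-c", "--config", dest="config",
                      help="Configuration file")
    parser.add_option("-f", "--file", dest="infile",
                      help="Sequences file (Fasta format)")
    parser.add_option("-p", "--project", dest="project",
                      help="Project tag name (output directory name)")
    parser.add_option("-n", "--non-processing", dest="non_processing",
                      action="store_true", default=False,
                      help="Use a database already created without "
                           "reprocessing a fasta file")
    return parser


# Equivalent to:
#   python metaProt.py -c Config/config.cfg -f proteins.fasta -p soil_sample
options, args = build_parser().parse_args(
    ["-c", "Config/config.cfg", "-f", "proteins.fasta", "-p", "soil_sample"]
)
```

With -n the flag is stored as a boolean, so a second run on the same project
can skip the FASTA processing and reuse the existing database.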
3.4 OUTPUT FILES
The output files are written to the project directory, together with a folder
containing the plots.
Each input sequence has its analysis results in its own file. Each section
of data starts with the section name preceded by a #, followed by a line with
the column headers (also preceded by a #).
Ex:
# TIGRFAMS
# ID evalue ...
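A per-sequence result file with this layout can be read back with a short
parser. The sketch below is an assumption-laden illustration: it assumes that
columns are separated by white space and that every section has exactly one
header line. The function parse_result_file is hypothetical, and the data
rows in the example are invented and use only the two columns shown above.

```python
def parse_result_file(lines):
    """Return {section: {"columns": [...], "rows": [[...], ...]}}."""
    sections = {}
    name = None
    expect_header = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            text = line[1:].strip()
            if expect_header:
                # Second '#' line of a section: the column headers.
                sections[name]["columns"] = text.split()
                expect_header = False
            else:
                # First '#' line of a section: the section name.
                name = text
                sections[name] = {"columns": [], "rows": []}
                expect_header = True
        elif name is not None:
            sections[name]["rows"].append(line.split())
    return sections


# Invented example content; real files contain more columns.
sample = """# TIGRFAMS
# ID evalue
TIGR00001 1e-30
# PFAM
# ID evalue
PF00001 2.5e-12
"""
sections = parse_result_file(sample.splitlines())
```

Values are kept as strings; converting the evalue column with float() is left
to the caller, since the full set of columns is not listed here.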
4. FUTURE WORK
--------------------------------------------------------------------------------
+ Add new tools to the pipeline
+ Add an annotator module as described in the VMGAP
(http://standardsingenomics.org/index.php/sigen/article/view/sigs.1694706/586)
+ Refine the plotting step