metaProt Code
Pipeline to analyze coding regions in metagenomic projects
Status: Alpha
Brought to you by:
mrorejuela
File | Date | Commit |
---|---|---|
Config | 2012-03-13 | [9b44ad] Initial upload |
Lib | 2012-03-14 | [144316] Solving null results issue |
SQL | 2012-03-13 | [9b44ad] Initial upload |
test | 2012-03-14 | [81693b] Adding a new test file |
README.txt | 2012-03-14 | [44de41] Update of the README |
metaProt.py | 2012-03-13 | [9b44ad] Initial upload |
--------------------------------------------------------------------------------
|                                   metaProt                                   |
--------------------------------------------------------------------------------

1. INTRODUCTION
2. DEPENDENCIES AND INSTALLATION
    2.1 DEPENDENCIES
    2.2 INSTALLATION
    2.3 EXTERNAL PROGRAMS
    2.4 DATABASES
    2.5 DIRECTORY STRUCTURE
3. USAGE
    3.1 INPUT FILES
    3.2 CONFIG AND SQL FILES
    3.3 OPTIONS
    3.4 OUTPUT FILES
4. FUTURE WORK


1. INTRODUCTION
--------------------------------------------------------------------------------

metaProt is a Python pipeline for analyzing protein data from metagenomic
projects, with the aim of describing a sample using as much of the available
data as possible. Many informative tools exist for characterizing sequencing
data, but a single tool is often not enough, and in practice several of them
end up being combined. Pipeline scripts/programs such as metaProt simplify and
automate that process.

The default pipeline was implemented with proteins in mind. If you wish to use
nucleotide sequences, download the databases you want to use with the
integrated tools and follow the instructions in section 3.2 (CONFIG AND SQL
FILES) to adapt the pipeline.

This software is released under the GNU General Public License (GPL).


2. DEPENDENCIES AND INSTALLATION
--------------------------------------------------------------------------------

2.1 DEPENDENCIES

The program is written in and tested with Python 2.7. It uses standard built-in
modules except for the ones listed below, which the user must install
separately:

    - Numpy (http://numpy.scipy.org/)
    - Matplotlib
    - Biopython
    - sqlite3

2.2 INSTALLATION

Unpack the tarball into the folder of your choice:

    tar xzf metaProt_XXX.tar.gz

To run the program from any directory, add the installation folder to your
PATH:

    export PATH=$PATH:/installation/path/metaProt_XXX

Remember to add some databases to the databases/ directory. Two HMMer databases
are available for testing.
If you want to use them, download databases.tar.gz and unpack it:

    tar xzf databases.tar.gz
    mv databases/ metaProt_XXX/

Command line:

    python metaProt.py [options]

2.3 EXTERNAL PROGRAMS

metaProt uses the following external programs:

    - HMMer: a suite of HMM tools, available either from the repositories of
      the major Linux distributions or from the official website.

          root@computer: apt-get install hmmer
          http://hmmer.janelia.org/

    - Pepstats: general information about the composition of proteins. You
      can install it from the repositories by installing the EMBOSS suite.

          root@computer: apt-get install emboss
          http://emboss.sourceforge.net/apps/cvs/emboss/apps/pepstats.html

2.4 DATABASES

2.4.1 HMMer databases

You can download them already formatted or format them yourself. All the HMMs
must be concatenated into a single file, which is then compressed with
hmmpress.

Ex:
    wget ftp.jcvi.org/pub/data/TIGRFAMs/TIGRFAMs_12.0_HMM.tar.gz
    ls -1 TG/* | while read file; do cat $file >> TIGRFAMS.hmm; done
    hmmpress TIGRFAMS.hmm

2.5 DIRECTORY STRUCTURE

metaProt.py
    \___ Main script, where the magic is made!
Config/
    \___ Config directory where the default config files that define the
         databases and programs are stored
Lib/
    \___ Libraries and modules directory
    \___ __init__.py
        \___ Init file to invoke the libraries
    \___ DB.py
        \___ SQLite class. Controls the creation, maintenance and
             interaction of the program with the SQLite database.
    \___ functions
        \___ General functions used in the main() script.
    \___ handler
        \___ Handler class. Controls the flow of the program, executes the
             different associated programs / scripts and parses the outputs.
    \___ hmm.py
        \___ HMMer class. Controls the parameters, execution and parsing of
             the HMMer results.
    \___ pepstats.py
        \___ Pepstats class. Controls the execution and parsing of the
             pepstats analysis.
    \___ progress_bar.py
        \___ Library to create dynamic progress bars in terminals.
             Downloaded from
    \___ statistics.py
        \___ Library to plot data retrieved from the database in several
             forms (histograms, barplots and boxplots).
SQL/
    \___ Directory of SQL scripts to initialize the database tables
TMP/
    \___ Directory for temporary files created in the intermediate steps of
         the pipeline
databases/
    \___ Default directory where the different databases should be located.


3. USAGE
--------------------------------------------------------------------------------

3.1 INPUT FILES

metaProt takes as input a standard (multi)fasta file with protein sequences.
Keep in mind that, by default, the program uses pepstats as part of the
pipeline. If you want to use it with nucleotide sequences, adapt the databases
and remove the pepstats line from the config.cfg file (or create a new config
file and pass it on the command line).

3.2 CONFIG AND SQL FILES

3.2.1 CONFIG FILES

The configuration file, found in Config/ by default, specifies which programs
the pipeline has to run and with which databases. The format is very simple,
but the names must be specified without any ambiguity. The structure is as
follows:

    PROGRAM_NAME:   the first column specifies the name of the program to be
                    used. At the moment only two are available.
                    OPTIONS: HMMer
                             PEPSTATS

    DATABASE_NAME:  name to be used in the SQLite database. This name must
                    not contain any white space and has to correspond EXACTLY
                    to the name of a SQL file in the SQL/ directory.

    DATABASE_PATH:  relative or absolute (preferred) path to the database to
                    be used. It refers to HMM databases, blastable
                    databases, etc. Avoid white spaces.

3.2.2 SQL FILES

SQL files are used to initialize the tables used later in the pipeline. If you
want to add a new database for an existing tool, just copy the appropriate
file and change its name.
For instance, for a new HMMer database named MOTIFS, you should run:

    cp SQL/PFAM.sql SQL/MOTIFS.sql

and specify MOTIFS as a database in your config.cfg file.

Two examples of HMMer databases (TIGRFAMS.sql, PFAM.sql) are provided, as well
as the file for the main table and the one for the Pepstats tool.

3.3 OPTIONS

Usage: metaProt.py [options]

Options:
  -h, --help            show this help message and exit
  -c CONFIG, --config=CONFIG
                        Configuration file
  -f INFILE, --file=INFILE
                        Sequences file (Fasta format)
  -p PROJECT, --project=PROJECT
                        Project tag name (output directory name)
  -n, --non-processing  Use a database already created without reprocessing a
                        fasta file

3.4 OUTPUT FILES

The output files are written to the project directory, together with a folder
containing the plots. Each input sequence has its analysis results in its own
file. Each section of data starts with the section name preceded by a #,
followed by a line with the headers of each column.

Ex:
    # TIGRFAMS
    # ID    evalue    ...


4. FUTURE WORK
--------------------------------------------------------------------------------

    + Add new tools to the pipeline
    + Add an annotator module as described in the VMGAP
      (http://standardsingenomics.org/index.php/sigen/article/view/sigs.1694706/586)
    + Refine the plotting step
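

NOTE ON READING THE OUTPUT FILES (SECTION 3.4)

The per-sequence result files described in section 3.4 can be loaded back into
Python for further analysis. The sketch below is not part of metaProt. It
assumes that columns are separated by white space, that the section name line
and the header line both start with a #, and that the header line directly
follows the section name. The function name parse_result_text and the sample
values are hypothetical and only serve as illustration.

```python
def parse_result_text(text):
    """Return {section_name: [row_dict, ...]} for one result file.

    Assumed layout (see section 3.4): a '# NAME' line, then a
    '# col1 col2 ...' header line, then white-space separated rows.
    """
    sections = {}
    name = None            # current section name
    header = None          # column names of the current section
    expect_header = False  # True right after a section name line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            content = line.lstrip("#").strip()
            if expect_header:
                # Second '#' line in a row: the column headers.
                header = content.split()
                expect_header = False
            else:
                # First '#' line: a new section starts here.
                name = content
                header = None
                sections[name] = []
                expect_header = True
            continue
        # Data row: a header can no longer follow.
        expect_header = False
        if name is None or header is None:
            continue  # row without a usable section/header, skip it
        sections[name].append(dict(zip(header, line.split())))
    return sections


# Hypothetical file content, made up for this example.
sample = """\
# TIGRFAMS
# ID evalue
TIGR00001 1e-10
TIGR00002 0.002
# PFAM
# ID evalue
PF00001 3e-5
"""

sections = parse_result_text(sample)
for section_name, rows in sections.items():
    print(section_name, len(rows))
```

All values are kept as strings; convert columns such as evalue with float()
if you need to filter or sort on them.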