File Date Author Commit
 Config 2012-03-13 mrorejuela mrorejuela [9b44ad] Initial upload
 Lib 2012-03-14 mrorejuela mrorejuela [144316] Solving null results issue
 SQL 2012-03-13 mrorejuela mrorejuela [9b44ad] Initial upload
 test 2012-03-14 mrorejuela mrorejuela [81693b] Adding a new test file
 README.txt 2012-03-14 mrorejuela mrorejuela [44de41] Update of the README
 metaProt.py 2012-03-13 mrorejuela mrorejuela [9b44ad] Initial upload

Read Me

--------------------------------------------------------------------------------
|                                 metaProt                                     |
--------------------------------------------------------------------------------

1. INTRODUCTION
2. DEPENDENCIES AND INSTALLATION
    2.1 DEPENDENCIES
    2.2 INSTALLATION
    2.3 EXTERNAL PROGRAMS
    2.4 DATABASES
    2.5 DIRECTORY STRUCTURE
3. USAGE
    3.1 INPUT FILES
    3.2 CONFIG AND SQL FILES
    3.3 OPTIONS
    3.4 OUTPUT FILES
4. FUTURE WORK


1. INTRODUCTION
--------------------------------------------------------------------------------
metaProt is a Python pipeline for analyzing protein data from metagenomic
projects, in order to describe the sample with as much of the available data as
possible.

Many informative tools are now available for characterizing sequencing data, but
a single tool is often not enough, so several end up being used together.
Pipeline scripts and programs such as metaProt simplify and automate that
process.

The default pipeline was implemented with proteins in mind. If you wish to use
nucleotide sequences, download the databases you want to use with the integrated
tools and follow the instructions below to adapt the pipeline
(see 3.2 CONFIG AND SQL FILES).

This software is released under the GNU General Public License (GPL).

2. DEPENDENCIES AND INSTALLATION
--------------------------------------------------------------------------------
2.1 DEPENDENCIES

    The program is written for and tested with Python 2.7. It uses standard
    built-in modules, except for the ones listed below, which the user must
    install separately:

    - Numpy (http://numpy.scipy.org/)
    - Matplotlib
    - Biopython
    - sqlite3
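
    Before a first run it can help to confirm that the modules above are
    importable. The sketch below is not part of metaProt: the helper name
    missing_modules is hypothetical, and the import names used (numpy,
    matplotlib, Bio for Biopython, sqlite3) are the usual ones for these
    packages.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Import names for the dependencies listed above (Biopython imports as "Bio").
    required = ["numpy", "matplotlib", "Bio", "sqlite3"]
    absent = missing_modules(required)
    if absent:
        print("Missing modules: " + ", ".join(absent))
    else:
        print("All required modules are importable.")
```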


2.2 INSTALLATION
    
    Unpack the tarball to the folder of your preference.

        tar xzf metaProt_XXX.tar.gz

    In order to execute the program from any directory, add:

        export PATH=$PATH:/installation/path/metaProt_XXX

    Remember to add some databases to the databases/ directory.
    Two HMMer databases are available for testing. If you want to use
    them, download databases.tar.gz and unpack them:

        tar xzf databases.tar.gz
        mv databases/ metaProt_XXX/
    
    Command line:

        python metaProt.py [options]


2.3 EXTERNAL PROGRAMS
    
    metaProt uses the following external programs:
                    
        - HMMer:  
                    An HMM suite of tools, available either from
                    the repositories of the major Linux distributions
                    or from the official website.
                    
                    root@computer: apt-get install hmmer
                    
          http://hmmer.janelia.org/
                    
        - Pepstats:
                   General information about the composition of proteins.
                   You can install it from the repositories by installing
                   the EMBOSS suite.
                   
                   root@computer: apt-get install emboss
                   
          http://emboss.sourceforge.net/apps/cvs/emboss/apps/pepstats.html            
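
    Whether the external programs are reachable on the PATH can be checked
    with a short script such as the one below. It is only a sketch: the helper
    name missing_programs is hypothetical, and it assumes Python 3.3 or later
    for shutil.which (on Python 2.7, distutils.spawn.find_executable plays the
    same role). hmmpress is used here because it is the HMMer command that
    appears in this README; pepstats is the EMBOSS executable.

```python
import shutil


def missing_programs(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # hmmpress comes with HMMer, pepstats comes with EMBOSS.
    absent = missing_programs(["hmmpress", "pepstats"])
    if absent:
        print("Not found on PATH: " + ", ".join(absent))
    else:
        print("All external programs were found.")
```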

2.4 DATABASES

    2.4.1 HMMer databases
         You can download them already formatted or format them yourself.
         To format them yourself, concatenate all the HMMs into a single file
         and then press that file with hmmpress.

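         EX:
            wget ftp.jcvi.org/pub/data/TIGRFAMs/TIGRFAMs_12.0_HMM.tar.gz
            # Unpack into TG/ (assumes the archive holds one HMM file per model)
            mkdir TG
            tar xzf TIGRFAMs_12.0_HMM.tar.gz -C TG
            ls -1 TG/* | while read -r file; do cat "$file" >> TIGRFAMS.hmm; done
            hmmpress TIGRFAMS.hmm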
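
         The concatenation step above can also be done from Python, which
         avoids shell quoting issues. This is a minimal sketch, not code from
         metaProt: the function name concatenate_hmms is hypothetical, and it
         assumes that every regular file in the source directory is an HMM
         file and that the output file lives outside that directory.

```python
import glob
import os
import shutil


def concatenate_hmms(src_dir, out_path):
    """Write every regular file in src_dir into out_path, in sorted order.

    Returns the number of files merged. Unlike the shell loop with '>>',
    out_path is overwritten rather than appended to.
    """
    paths = sorted(
        p for p in glob.glob(os.path.join(src_dir, "*")) if os.path.isfile(p)
    )
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as handle:
                shutil.copyfileobj(handle, out)
    return len(paths)


if __name__ == "__main__":
    if os.path.isdir("TG"):
        count = concatenate_hmms("TG", "TIGRFAMS.hmm")
        print("Merged %d files into TIGRFAMS.hmm" % count)
```

         The merged file still has to be pressed afterwards with
         hmmpress TIGRFAMS.hmm, as in the shell example.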
    
         
2.5 DIRECTORY STRUCTURE

    metaProt.py
    \___ Main script, where the magic is made!
    
    Config/
    \___ Config directory where the default config files to define the databases 
         and programs are stored
         
    Lib/
    \___ Libraries and modules directory
    
    \___ __init__.py
         \___  Init file to import the libraries
          
    \___ DB.py
         \___ SQLite class. Controls the creation, maintenance and interaction
              of the program with the SQLite database.

    \___ functions
	 \___ General functions used in the main() script.
	      
    \___ handler	
         \___ Handler class. Controls the flow of the program, executes the
              different associated programs / scripts and parses the outputs.
	              	      
    \___ hmm.py
    	 \___ HMMer class. Controls the parameters, execution and parsing of 
    	       the HMMer results
    	       
    \___ pepstats.py
         \___ Pepstats class. Controls the execution and parsing of pepstats
               analysis.

    \___ progress_bar.py 
	 \___ Library to create dynamic progress bars in terminals. Downloaded 
              from
	          
    \___ statistics.py
         \___ Library to plot data retrieved from the database in several forms
              (histograms, barplots and boxplots).
	          
    SQL/
    \___ Directory of SQL scripts to initiate the database tables


    TMP/
    \___ Directory for temporary files created in the intermediate steps of the
         pipeline
    
    databases/
    \___ Default directory where the different databases should be located.
     
3. USAGE
--------------------------------------------------------------------------------
    3.1 INPUT FILES
    
        metaProt takes as input a standard formatted (multi)FASTA file with
        protein sequences. Note that, by default, the program uses pepstats as
        part of the pipeline. If you want to use it with nucleotide sequences,
        adapt the databases and remove the pepstats line from the config.cfg
        file (or create a new config file and pass it on the command line).

    3.2 CONFIG AND SQL FILES
        
        3.2.1 CONFIG FILES

        The configuration file, found in Config/ by default, specifies which
        programs the pipeline runs and with which databases. The format is very
        simple, but you must specify the names without any ambiguity. The
        structure is as follows:

                PROGRAM_NAME: the first column specifies the name of the program
                              to be used. At the moment only two are available.
                        OPTIONS: HMMer
                                 PEPSTATS
                DATABASE_NAME: name to be used in the SQLite database. This name
                               must not contain any white space and has to
                               correspond EXACTLY to the name of an SQL file in
                               the SQL/ directory (e.g. PFAM for SQL/PFAM.sql).

                DATABASE_PATH: relative or absolute (preferred) path to the
                               database to be used. It refers to HMM databases,
                               blastable databases, etc. Avoid white spaces.
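
        To make the three-column layout concrete, here is a hedged sketch of
        how such a file could be read. It is not metaProt's own parser. It
        assumes that columns are separated by white space (the README only
        says to avoid white space inside values), that blank lines and lines
        starting with # are ignored, and that a PEPSTATS line may omit the
        database columns. The names Entry and parse_config and the paths in
        the sample are hypothetical.

```python
from collections import namedtuple

# One config line: program, database name, database path.
Entry = namedtuple("Entry", ["program", "db_name", "db_path"])

# The two program names the README lists as available.
KNOWN_PROGRAMS = ("HMMer", "PEPSTATS")


def parse_config(text):
    """Parse whitespace-separated config lines into a list of Entry tuples."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are skipped
        fields = line.split()
        if fields[0] not in KNOWN_PROGRAMS:
            raise ValueError("line %d: unknown program %r" % (lineno, fields[0]))
        if len(fields) > 3:
            raise ValueError("line %d: expected at most 3 columns" % lineno)
        # Pad missing columns with None so every entry has three fields.
        fields += [None] * (3 - len(fields))
        entries.append(Entry(*fields))
    return entries


if __name__ == "__main__":
    sample = """
    HMMer    PFAM      /data/hmm/Pfam-A.hmm
    HMMer    TIGRFAMS  /data/hmm/TIGRFAMS.hmm
    PEPSTATS
    """
    for entry in parse_config(sample):
        print(entry)
```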
        

        3.2.2 SQL FILES
	
        SQL files are used to initialize the tables used later in the pipeline.
        If you want to add a new database for an existing tool, just copy the
        appropriate file and change its name.
        For instance, for a new HMMer database named MOTIFS, you should run:
                cp SQL/PFAM.sql SQL/MOTIFS.sql

        and specify MOTIFS as a database in your config.cfg file.

        Two example SQL files for HMMer databases (TIGRFAMS.sql, PFAM.sql) are
        provided, along with the one for the main table and the one for the
        Pepstats tool.
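
        The link between database names and SQL files can be illustrated with
        the sketch below, which runs SQL/<name>.sql for each database name
        against a SQLite connection. It is not the code in Lib/DB.py: the
        function name init_tables is hypothetical, and the contents of the
        provided .sql files are not shown in this README, so the demo writes
        its own one-table script.

```python
import os
import sqlite3
import tempfile


def init_tables(conn, sql_dir, db_names):
    """Run <sql_dir>/<name>.sql on conn for every name in db_names."""
    for name in db_names:
        script_path = os.path.join(sql_dir, name + ".sql")
        if not os.path.isfile(script_path):
            raise IOError("no SQL script for database %r: %s" % (name, script_path))
        with open(script_path) as handle:
            conn.executescript(handle.read())
    conn.commit()


if __name__ == "__main__":
    # Hypothetical schema, only for demonstration.
    sql_dir = tempfile.mkdtemp()
    with open(os.path.join(sql_dir, "MOTIFS.sql"), "w") as handle:
        handle.write("CREATE TABLE MOTIFS (id TEXT, evalue REAL);\n")
    conn = sqlite3.connect(":memory:")
    init_tables(conn, sql_dir, ["MOTIFS"])
    print(conn.execute("SELECT name FROM sqlite_master").fetchall())
```

        Failing early when a script is missing mirrors the requirement that
        DATABASE_NAME match a file in SQL/ exactly.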

    3.3 OPTIONS

        Usage: metaProt.py [options]

        Options:
            -h, --help            show this help message and exit
            -c CONFIG, --config=CONFIG
                        Configuration file
            -f INFILE, --file=INFILE
                        Sequences file (Fasta format)
            -p PROJECT, --project=PROJECT
                        Project tag name (output directory name)
            -n, --non-processing  Use a database already created without reprocessing a
                        fasta file
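
        The option list above has the layout produced by Python's optparse
        module. The sketch below rebuilds an equivalent parser to show how the
        options fit together. It is a reconstruction, not the code in
        metaProt.py: the function name build_parser and the attribute names
        (config, infile, project, non_processing) are assumptions inferred
        from the help text.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser with the same four options as the help text above."""
    parser = OptionParser(usage="%prog [options]", prog="metaProt.py")
    parser.add_option("-c", "--config", dest="config", metavar="CONFIG",
                      help="Configuration file")
    parser.add_option("-f", "--file", dest="infile", metavar="INFILE",
                      help="Sequences file (Fasta format)")
    parser.add_option("-p", "--project", dest="project", metavar="PROJECT",
                      help="Project tag name (output directory name)")
    parser.add_option("-n", "--non-processing", dest="non_processing",
                      action="store_true", default=False,
                      help="Use a database already created without "
                           "reprocessing a fasta file")
    return parser


if __name__ == "__main__":
    options, args = build_parser().parse_args()
    print(options)
```

        A typical invocation would then look like this (the file and project
        names are placeholders):

            python metaProt.py -c Config/config.cfg -f proteins.fasta -p my_project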
   
    3.4 OUTPUT FILES
        The output files are written to the project directory, together with a
        folder containing the plots.
        Each input sequence has its analysis results in its own file. Each
        section of data starts with a line giving its name preceded by a #,
        followed by a line with the headers of each column.
	
	Ex:
	# TIGRFAMS
	# ID	evalue	...
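
        A result file with this layout can be read back with a few lines of
        Python. The sketch below is not part of metaProt. It assumes that the
        first # line of a section is its name, that the # line right after it
        holds the column headers, and that data rows are tab-separated. The
        function name parse_result_file and the sample row are hypothetical.

```python
def parse_result_file(text):
    """Split a result file into {section: {"columns": [...], "rows": [...]}}."""
    sections = {}
    current = None
    expect_header = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            content = line[1:].strip()
            if expect_header:
                # Second '#' line in a row: the column headers of the section.
                sections[current]["columns"] = content.split("\t")
                expect_header = False
            else:
                # First '#' line: the name of a new section.
                current = content
                sections[current] = {"columns": [], "rows": []}
                expect_header = True
        elif current is not None:
            sections[current]["rows"].append(line.split("\t"))
            expect_header = False
    return sections


if __name__ == "__main__":
    sample = "# TIGRFAMS\n# ID\tevalue\nTIGR00001\t1e-10\n"
    for name, data in parse_result_file(sample).items():
        print(name, data["columns"], data["rows"])
```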
        
        
4. FUTURE WORK
--------------------------------------------------------------------------------
	+ Add new tools to the pipeline
	+ Add an annotator module as described in the VMGAP 
	  (http://standardsingenomics.org/index.php/sigen/article/view/sigs.1694706/586)
	+ Refine the plotting step