---------------------------------
MSProGene
---------------------------------
MSProGene is a Java program that constructs customized transcript databases
for tandem mass spectra search. RNA-Seq data is used to generate transcripts and to
resolve shared peptide protein inference in a proteogenomic network.
---------------------------------
Copyright (c) 2014,
Franziska Zickmann,
ZickmannF@rki.de, Robert Koch-Institute, Berlin, Germany
Distributed under the GNU Lesser General Public License, version 3.0.
When using MSProGene, please cite the following manuscript:
MSProGene - Integrative proteogenomics beyond six-frames and single nucleotide polymorphisms
Franziska Zickmann and Bernhard Y. Renard
submitted
---------------------------------
INSTALLATION
---------------------------------
MSProGene is designed to run on a Linux system with the following minimum requirements for installed software:
- Python (http://www.python.org/), as well as the packages pysam and matplotlib
- Java 7 (http://www.java.com)
- GNU R (http://www.r-project.org/), as well as the mixtools package
- MSGF+ (http://proteomics.ucsd.edu/Software/MSGFPlus/)
- the CPLEX Optimizer (http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/)
(free for academic use) -> a version of MSProGene that works with GLPK will follow soon
For CPLEX, the path to the file "cplex.jar" and the directory used as Djava.library.path for CPLEX
have to be passed as parameters in each MSProGene run (refer to the parameter description below).
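Part of the requirements above can be checked before a first run. The following is a minimal Python sketch and not part of MSProGene: it only tests whether executables are found on the PATH and whether the two Python packages can be located. The executable names "java", "python" and "Rscript" are assumptions and may differ on your system. The R package mixtools, MSGF+ and CPLEX are not checked.

```python
import importlib.util
import shutil


def missing_prerequisites(executables=("java", "python", "Rscript"),
                          python_packages=("pysam", "matplotlib")):
    """Return the names of executables and Python packages that were not found.

    The default names are assumptions; adjust them to your installation.
    """
    # shutil.which returns None when an executable is not on the PATH
    missing = [exe for exe in executables if shutil.which(exe) is None]
    # find_spec returns None when a top-level package cannot be located
    missing += [pkg for pkg in python_packages
                if importlib.util.find_spec(pkg) is None]
    return missing


if __name__ == "__main__":
    for name in missing_prerequisites():
        print("not found:", name)
```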
To install MSProGene, download the compressed file MSProGene.tar.gz from http://sourceforge.net/projects/msprogene/ and unpack the package with:
> tar -xzf MSProGene.tar.gz
This creates a folder named "MSProGene" in your current directory. This folder includes the executable MSProGene.jar.
To display the help message of MSProGene, type:
> java -jar MSProGene/MSProGene.jar --help
Note that MSProGene needs several helper scripts to call external programs. These scripts are included
in the directory MSProGene/scripts/. To run MSProGene, this folder must either be located in the same directory as
the file MSProGene.jar, or the full path to the directory containing "scripts" must be specified with the parameter "-scripts".
---------------------------------
RUN MSProGene - EXAMPLE
---------------------------------
In the following example we assume that the file MSProGene.jar is contained in the directory foo/.
Further, we assume that the path to the file "cplex.jar" and to the file libcplex124.so (required as Djava.library.path for cplex) is foo_CPLEX/.
The test data files can be downloaded from http://sourceforge.net/projects/msprogene/files/testdata/.
Please adapt the file example_msgf_setting.txt contained in this folder to the paths of your MSGF+ and MSProGene installations
(refer to details on file format below). In the following we assume that the folder testdata is contained in foo/.
In the example, we use the file "example_transcript_file.fasta" that contains already constructed transcripts for the
Litomosoides sigmodontis dataset analyzed in the MSProGene paper (refer to details on file format below).
> cd foo/
> java -Xmx10000m -jar MSProGene.jar -libPath foo_CPLEX/ -cp foo_CPLEX/cplex.jar -transcriptFile testdata/example_transcript_file.fasta -sF testdata/exampleSpectra.mgf -spectraSearchSetting testdata/example_msgf_setting.txt -out MSProGene_example/ -outName test
A directory named MSProGene_example that contains the temporary and output files generated by MSProGene should appear in "foo/".
Predicted protein coordinates are written to the GTF file proGenePrediction_test.gtf. For a first post-processing step you can use the
script parseMSProGenePrediction.py (contained in MSProGene/scripts/), which distinguishes between predictions with and without spectra hits
and counts the spectra support.
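For a quick look at the predicted coordinates, the GTF file can also be read directly. The following Python sketch is not a replacement for parseMSProGenePrediction.py and does not reproduce its logic. It assumes that the file follows the standard GTF layout of nine tab-separated columns and makes no assumption about the content of the attribute column.

```python
def read_gtf_coordinates(lines):
    """Yield (seqname, feature, start, end, strand) for each GTF record.

    `lines` can be an open file handle or any iterable of strings.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip empty lines and comment lines
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # skip lines that are not standard nine-column records
        seqname, _source, feature, start, end, _score, strand = fields[:7]
        yield seqname, feature, int(start), int(end), strand
```

With the example run above, the function could be applied to an open handle of MSProGene_example/proGenePrediction_test.gtf, for instance to count the records per contig.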
---------------------------------
PARAMETERS OF MSProGene
---------------------------------
General information:
1) At the moment, MSProGene requires the CPLEX optimizer (free for academic use) to solve the linear program (an alternative version using GLPK will be available soon).
Please provide the absolute path to the directory containing the file libcplex124.so as well as to the file cplex.jar (included in the directory of your CPLEX installation):
> java -jar MSProGene.jar -libPath PATH_TO_CPLEX/ -cp PATH_TO_CPLEX/cplex.jar
2) Depending on the size of your dataset, you might have to assign more memory to MSProGene to avoid an out of memory error.
To do so, set a higher Xmx value when calling MSProGene, e.g. 5GB (="5000m"):
> java -Xmx5000m -jar MSProGene.jar
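The two points above can be combined in a small launcher. The following Python sketch is an illustration and not part of MSProGene. It assembles the call as an argument list using only the parameters documented in this README, and it assumes that cplex.jar lies in the same directory as libcplex124.so, as in the example above. The function name and its arguments are hypothetical.

```python
import os


def build_msprogene_command(jar, cplex_dir, spectra, msgf_setting,
                            out_dir, out_name="genes", memory_mb=5000,
                            transcript_file=None):
    """Assemble the java call for MSProGene as a list of arguments."""
    # MSProGene expects absolute CPLEX paths; keep the trailing separator
    lib_dir = os.path.join(os.path.abspath(cplex_dir), "")
    cmd = [
        "java", "-Xmx{}m".format(memory_mb), "-jar", jar,
        "-libPath", lib_dir,
        "-cp", os.path.join(lib_dir, "cplex.jar"),  # assumed location of cplex.jar
        "-sF", spectra,
        "-spectraSearchSetting", msgf_setting,
        "-out", out_dir,
        "-outName", out_name,
    ]
    if transcript_file is not None:
        cmd += ["-transcriptFile", transcript_file]
    return cmd


if __name__ == "__main__":
    # Paths taken from the example section of this README
    command = build_msprogene_command(
        "MSProGene.jar", "foo_CPLEX/", "testdata/exampleSpectra.mgf",
        "testdata/example_msgf_setting.txt", "MSProGene_example/",
        out_name="test", memory_mb=10000,
        transcript_file="testdata/example_transcript_file.fasta")
    print(" ".join(command))
```

The returned list can be passed to subprocess.run(command, check=True) to start the run. Building a list instead of a single string avoids quoting problems with paths that contain spaces.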
options:
-h : print the help text and exit
-sF [specFile] : contains spectra in mgf format (MANDATORY)
-spectraSearchSetting [PATH] : path to file that contains the parameters for the MSGF+ search, refer to section "FILE FORMATS" below for more information. (MANDATORY)
-libPath [PATH] : specify the absolute path to the directory containing libcplex124.so (required as Djava.library.path for cplex). (MANDATORY)
-cp [PATH] : the absolute path to the cplex jar file cplex.jar (MANDATORY)
-scripts [PATH] : the absolute path to the directory containing the required helper scripts, DEFAULT: directory of MSProGene.jar
-out [PATH] : specify the directory that shall contain the results files, DEFAULT: current directory
-outName [outputName] : desired name for output files, DEFAULT: genes
-prokaryote : if specified, genome is treated as prokaryotic, no spliced reads are accepted, and structural genes are resolved. DEFAULT: turned off
-writeOutTranscriptFile : if specified, a fasta file with predicted GIIRA transcripts is written for future analyses.
-transcriptFile [PATH] : specify a fasta file containing previously predicted transcripts (optional), refer to section "FILE FORMATS" below for more information.
-nT [numberThreads] : specify the maximal number of threads that are allowed to be used, DEFAULT: 1
-cutoff [double] : define the false discovery rate threshold for the spectra search, DEFAULT: 0.01
-noExtraDecoy : if specified, the spectra search is performed without a decoy database, DEFAULT: turned off.
-minPepLength [int] : specify minimum peptide length, DEFAULT: 5. If specified in MSGF+ setting, value is taken from there.
-snpVCFfile [PATH] : specify a vcf file containing previously called SNPs (optional). Contained SNPs are integrated into the reference sequence (that needs to be specified with parameter -iG).
-snpThreshold [int] : specify threshold for SNP integration, DEFAULT: 2.
-conta_DB [PATH] : specify path and name of a fasta file containing contamination sequences of MS experiments (if none is provided, contamination sequences are simply left out).
---------------------------------
FILE FORMATS
---------------------------------
Transcript file (example file in MSProGene/exampleFiles/):
Previously predicted transcripts or gene sequences can be presented to MSProGene in fasta format. All sequences have to be extracted from the
forward strand, regardless of whether they are in forward or reverse direction (MSProGene handles the direction internally).
The header of each sequence has to include the following features:
- contig XXX - the tag "contig" needs to be followed by a whitespace and the contig name of the contig/chromosome the transcript originates from
- id XXX - the tag "id" needs to be followed by a whitespace and an identifier for the transcripts (no restrictions on usage of letters, numbers and symbols)
- strand XXX - the tag "strand" needs to be followed by a whitespace and an indication of the direction of this transcript. The identifier can be either "forward", "+", or "1" to indicate forward direction, or "reverse", "-", or "0" to indicate reverse direction. If no direction information is available, indicate this by a ".".
- score XXX - (optional): the tag "score" needs to be followed by a whitespace and a number indicating the reliability of this prediction (scores are scaled afterwards, so scores can be in any range)
- begin XXX - the tag "begin" needs to be followed by a whitespace and the start position of the transcript
- end XXX - the tag "end" needs to be followed by a whitespace and the end position of the transcript
- exons ( XXX,XXX XXX,XXX ) - the tag "exons" needs to be followed by a whitespace and a series of pairs of start and end positions of exons (in brackets) associated with this transcript. Each pair is separated by a whitespace from the next pair, and within each pair the start position is separated from the end position by a comma
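To check a transcript file before a run, the header features listed above can be parsed as follows. This is a minimal Python sketch and not MSProGene's own parser. It assumes that the features are separated from each other by whitespace, which the description above does not state explicitly. Mapping the strand identifiers to "+", "-" and "." is a choice made for this sketch only.

```python
SIMPLE_TAGS = ("contig", "id", "strand", "score", "begin", "end")
REQUIRED_TAGS = ("contig", "id", "strand", "begin", "end", "exons")  # score is optional
STRAND_CODES = {"forward": "+", "+": "+", "1": "+",
                "reverse": "-", "-": "-", "0": "-", ".": "."}


def parse_transcript_header(header):
    """Parse one fasta header line into a dict of the documented features."""
    tokens = header.lstrip(">").split()
    record = {}
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag in SIMPLE_TAGS and i + 1 < len(tokens):
            record[tag] = tokens[i + 1]  # the value is the token after the tag
            i += 2
        elif tag == "exons":
            exons = []
            i += 1
            while i < len(tokens):
                token = tokens[i]
                i += 1
                pair = token.strip("()")  # tolerate brackets attached to a pair
                if pair:
                    start, end = pair.split(",")
                    exons.append((int(start), int(end)))
                if token.endswith(")"):
                    break  # closing bracket ends the exon list
            record["exons"] = exons
        else:
            i += 1  # ignore tokens that are not documented tags
    missing = [tag for tag in REQUIRED_TAGS if tag not in record]
    if missing:
        raise ValueError("header lacks: " + ", ".join(missing))
    if record["strand"] not in STRAND_CODES:
        raise ValueError("unknown strand identifier: " + record["strand"])
    record["strand"] = STRAND_CODES[record["strand"]]
    record["begin"] = int(record["begin"])
    record["end"] = int(record["end"])
    if "score" in record:
        record["score"] = float(record["score"])
    return record
```

For a made-up header such as ">contig chr1 id tx_1 strand reverse begin 100 end 400 exons ( 100,200 300,400 )", the function returns the contig name, the identifier, the strand "-", the integer positions and the exon list [(100, 200), (300, 400)].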
Setting file for MSGF+ (example file in MSProGene/exampleFiles/):
Indicate the settings for MSGF+ and in particular the path to the MSGF+ installation on your system and the location of the MSGF+ modification file (if one is desired).
The format is a standard program call of MSGF+, with whitespaces between the parameters.