=========================================================
1. VERSION: gpredgc.2.4.1.src
=========================================================
GPRED-GC is a Gene PREDiction program that accounts for the 5'-3' GC gradient.
For questions, comments, or suggestions, please email yannisun@msu.edu.
=========================================================
2. INSTALLATION
=========================================================
1) unpack
> tar -xzf gpredgc.2.4.1.src.tar.gz
The tar-archive contains one directory 'gpredgc.2.4.1.src' with the following
sub-directories:
bin
src
include
config
examples
scripts
docs
datasets
2) compile (if not already compiled)
> cd src
> make
3) set environment variable GPREDGC_CONFIG_PATH
> export GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config/
GPRED-GC requires the environment variable GPREDGC_CONFIG_PATH to point to the config directory, which contains the
configuration and parameter files. This is the directory 'gpredgc.2.4.1.src/config'. To make the setting permanent, add the export line above to a shell startup script (such as ~/.bashrc).
Alternatively, you can specify this directory on the command line when you run gpredgc:
--GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config/
You may want to add the path of the executable to the PATH environment variable or copy gpredgc into a common directory (e.g. /usr/bin/).
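A quick sanity check before running the tools can catch an unset or mistyped GPREDGC_CONFIG_PATH. The following Python sketch only verifies that the variable is set and names an existing directory; the helper name check_config_path is hypothetical and is not part of GPRED-GC.

```python
import os


def check_config_path(env=None):
    """Return the config directory from GPREDGC_CONFIG_PATH, or raise.

    Hypothetical helper, not part of GPRED-GC: it only checks that the
    variable is set and points to an existing directory.
    """
    env = os.environ if env is None else env
    path = env.get("GPREDGC_CONFIG_PATH", "")
    if not path:
        raise RuntimeError("GPREDGC_CONFIG_PATH is not set")
    if not os.path.isdir(path):
        raise RuntimeError(f"GPREDGC_CONFIG_PATH is not a directory: {path}")
    return path
```

Calling check_config_path() with no argument inspects the current environment; passing a dictionary makes the check easy to try out without changing the shell.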
=========================================================
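3. TRAINING GPRED-GC
=========================================================
You can train GPRED-GC with either 'etraining' (section 3.1) or 'pipeline.py' (section 3.2). The two differ in how the transition probabilities are obtained.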
3.1 The program 'etraining' reads the meta parameters from the .cfg file and a GenBank file with annotated genes, and writes the remaining species-specific parameters into three .pbl files: species_exon_probs.pbl, species_intron_probs.pbl, and species_igenic_probs.pbl. The meta parameter file contains per-species settings such as the order of the Markov model and the size of the window used for the splice site models.
***The transition probabilities were divided into three equal parts.***
Usage:
etraining --species=SPECIES --lowT=0.39 --highT=0.61 trainfilename --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config
'--lowT' is the cutoff of low GC content.
'--highT' is the cutoff of high GC content.
'trainfilename' is the path (relative paths are allowed) to the GenBank-format file containing the training sequences.
The sequences may contain multiple genes and genes on the reverse strand. However, the genes must not overlap, and only one transcript per gene is allowed.
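To illustrate what the two cutoffs express, the Python sketch below computes the GC fraction of a sequence and sorts it into a low, medium, or high class. This is only an illustration under assumptions: the function names, the handling of values exactly at a cutoff, and the exclusion of unknown bases are choices made here, and the way GPRED-GC itself applies --lowT and --highT internally may differ.

```python
def gc_fraction(seq):
    """Fraction of G/C among the known bases (a, c, g, t) of seq."""
    seq = seq.lower()
    known = sum(seq.count(b) for b in "acgt")
    if known == 0:
        return 0.0
    return (seq.count("g") + seq.count("c")) / known


def gc_class(seq, low_t=0.39, high_t=0.61):
    """Label seq 'low', 'medium', or 'high' by its GC fraction.

    Assumption: values below low_t are 'low', values above high_t are
    'high', and everything in between (cutoffs included) is 'medium'.
    """
    gc = gc_fraction(seq)
    if gc < low_t:
        return "low"
    if gc > high_t:
        return "high"
    return "medium"
```

The defaults mirror the values 0.39 and 0.61 used in the usage line above.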
For example,
etraining --species=rice_megan_sample --lowT=0.39 --highT=0.61 /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/sample.gb.train --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config
etraining estimates all probabilities from the meta parameters of the species rice_megan_sample and the training file sample.gb.train, and writes three .pbl files: rice_megan_sample_exon_probs.pbl, rice_megan_sample_intron_probs.pbl, and rice_megan_sample_igenic_probs.pbl.
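The naming pattern of the three output files can be written down as a small Python helper, for example to verify that training finished. This is a sketch: the function names are hypothetical, and the directory in which the .pbl files are written is not stated here, so it is passed in as a parameter.

```python
import os


def pbl_filenames(species):
    """Names of the three parameter files written for a species."""
    return [f"{species}_{part}_probs.pbl" for part in ("exon", "intron", "igenic")]


def missing_pbl_files(directory, species):
    """Return the expected .pbl files that are absent from directory."""
    return [
        name
        for name in pbl_filenames(species)
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

An empty list from missing_pbl_files means all three files are present in the given directory.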
=========================================================
3.2 The program 'pipeline.py' reads the meta parameters from the .cfg file and a GenBank file with annotated genes, and writes the remaining species-specific parameters into three .pbl files: species_exon_probs.pbl, species_intron_probs.pbl, and species_igenic_probs.pbl. The meta parameter file contains per-species settings such as the order of the Markov model and the size of the window used for the splice site models.
***The transition probabilities were trained using maximum likelihood estimation.***
Usage:
python pipeline.py --species=SPECIES --lowT=0.39 --highT=0.61 trainfilename --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config/
'--lowT' is the cutoff of low GC content.
'--highT' is the cutoff of high GC content.
'trainfilename' is the path (relative paths are allowed) to the GenBank-format file containing the training sequences.
The sequences may contain multiple genes and genes on the reverse strand. However, the genes must not overlap, and only one transcript per gene is allowed.
For example,
python pipeline.py --species=rice_megan_sample --lowT=0.39 --highT=0.61 /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/sample.gb.train --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config
=========================================================
4. RUNNING GPRED-GC AND EVALUATION OF GENE PREDICTION
=========================================================
GPRED-GC has two mandatory arguments: the query file and the species.
The query file contains the DNA input sequence and must be in uncompressed (multiple) FASTA format.
For example, the file may look like this:
>name_of_sequence_1
agtgctgcatgctagctagct
>name_of_sequence_2
gtgctngcatgctagctagctggtgtnntgaaaaatt
Every letter other than a,c,g,t,A,C,G and T is interpreted as an unknown base. Digits and white spaces are ignored. The number of characters per line is not restricted.
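The rules above can be expressed as a short Python sketch that reads (multiple) FASTA text and normalizes each sequence. It is an illustration of the stated rules, not GPRED-GC's own parser; the function name and the choice of 'n' as the symbol for an unknown base are assumptions.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs.

    Digits and white space are dropped; every remaining character other
    than a, c, g, t (either case) is replaced by 'n' (an assumed symbol
    for an unknown base).
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            for ch in line:
                if ch.isdigit() or ch.isspace():
                    continue
                chunks.append(ch if ch in "acgtACGT" else "n")
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

Because sequence lines are simply concatenated, the line length of the input does not matter, in keeping with the unrestricted number of characters per line.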
Usage:
gpredgc [parameters] --species=SPECIES queryfilename
or, to redirect the output to a file:
gpredgc [parameters] --species=SPECIES queryfilename > output.gff
For example,
gpredgc --species=rice_megan_sample /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/sample.test.fa --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config/ > /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/output.gff
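When predictions are launched from a script, the same call can be assembled in Python. The sketch below builds the argument list and, in a second function, runs it with the standard output written to a file. It uses no options beyond those shown above, and the helper names gpredgc_command and run_gpredgc are hypothetical.

```python
import subprocess


def gpredgc_command(species, query, config_path=None):
    """Build the argument list for a gpredgc call (hypothetical helper)."""
    cmd = ["gpredgc", f"--species={species}", query]
    if config_path is not None:
        cmd.append(f"--GPREDGC_CONFIG_PATH={config_path}")
    return cmd


def run_gpredgc(species, query, output, config_path=None):
    """Run gpredgc and write its standard output to the file `output`.

    Assumes the gpredgc executable can be found via PATH.
    """
    with open(output, "w") as out:
        subprocess.run(gpredgc_command(species, query, config_path),
                       stdout=out, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.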
The output is in GTF, a format similar to the General Feature Format (GFF). For instance,
at003 GPREDGC gene 1 4395 0.97 + . g1
at003 GPREDGC transcript 1 4395 0.97 + . g1.t1
at003 GPREDGC intron 1 1079 0.98 + . transcript_id "g1.t1"; gene_id "g1";
at003 GPREDGC intron 1380 1502 1 + . transcript_id "g1.t1"; gene_id "g1";
The columns of the output file are:
Column 1: sequence name
Column 2: source of this annotation
Column 3: feature name
Column 4: beginning position of the feature
Column 5: end position of the feature
Column 6: score
Column 7: strand +/-
Column 8: frame ('.' if not applicable)
Column 9: transcript and gene name
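For downstream processing, an output line can be split into named fields. The sample lines above have nine whitespace-separated fields (the eighth is '.', which this sketch treats as the frame field, as in GFF). The Python function below is a minimal sketch under that assumption; its name and the dictionary keys are chosen here for illustration.

```python
def parse_output_line(line):
    """Split one output line into a dict of its nine fields.

    Assumes whitespace-separated fields as in the sample lines; the last
    field keeps its internal spaces (e.g. 'transcript_id "g1.t1"; ...').
    """
    fields = line.split(None, 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        # A '.' score means "no score" in GFF-style files.
        "score": None if score == "." else float(score),
        "strand": strand,
        "frame": frame,
        "attributes": attrs.strip(),
    }
```

For example, parse_output_line applied to the first sample line yields the feature 'gene' with start 1, end 4395, and the name 'g1' in the attributes field.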
GPRED-GC also accepts a GenBank file as input for prediction. In that case, GPRED-GC compares its predicted genes with the annotated genes and prints the evaluation at the nucleotide, exon, and gene levels.
For example,
gpredgc --species=rice_megan_sample /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/sample.gb.test --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config > /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/sample.gb.test.evaluation1