Name                        Modified      Size        Downloads / Week
datasets.tar.gz             2018-11-30    326.1 kB
README.txt                  2018-11-30    6.9 kB
gpredgc.2.4.1.src.tar.gz    2018-11-30    36.0 MB
gpredgc.2.4.src.tar.gz      2018-03-05    35.0 MB
Totals: 4 Items                           71.3 MB     0
=========================================================

1. VERSION: gpredgc.2.4.1.src

=========================================================

GPRED-GC is a Gene PREDiction program that accounts for the 5'-3' GC gradient.


For questions, comments, or suggestions, please email yannisun@msu.edu.




=========================================================

2. INSTALLATION


=========================================================


1) unpack



> tar -xzf gpredgc.2.4.1.src.tar.gz



The tar archive contains one directory, 'gpredgc.2.4.1.src', with the following sub-directories:

bin

src

include

config

examples

scripts

docs

datasets



2) compile (if not already compiled)


> cd src

> make



3) set environment variable GPREDGC_CONFIG_PATH



> export GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config/



The program requires the environment variable GPREDGC_CONFIG_PATH to be set to the config directory, 'gpredgc.2.4.1.src/config', which contains the configuration and parameter files. You probably want to add the export line above to a startup script (such as ~/.bashrc).

Alternatively, you can specify this directory on the command line when you run gpredgc:

--GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config/

You may want to add the path of the executable to the PATH environment variable or copy gpredgc into a common directory (e.g. /usr/bin/).
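
The two ways of supplying the config directory can be checked before launching a run. The following is a minimal sketch, not part of GPRED-GC: the helper name resolve_config_path is hypothetical, and giving the command-line value precedence over the environment variable is an assumption, since the text above only says the option is an alternative.

```python
import os


def resolve_config_path(cli_value=None, environ=None):
    """Return the config directory to use, or raise if none is usable.

    Assumption: a value given on the command line wins over the
    GPREDGC_CONFIG_PATH environment variable.
    """
    environ = os.environ if environ is None else environ
    path = cli_value or environ.get("GPREDGC_CONFIG_PATH")
    if not path:
        raise RuntimeError(
            "GPREDGC_CONFIG_PATH is not set and no "
            "--GPREDGC_CONFIG_PATH option was given"
        )
    if not os.path.isdir(path):
        raise RuntimeError("config directory does not exist: " + path)
    return path
```

Calling resolve_config_path() with no arguments reads the current environment, which is a quick way to confirm that the export line in ~/.bashrc took effect.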




=========================================================

3. TRAINING GPRED-GC

=========================================================

You can choose to run either etraining or pipeline.py.

3.1 The program 'etraining' reads the meta parameters from the .cfg file and a GenBank file with annotated genes, and writes the remaining species-specific parameters into three .pbl files: species_exon_probs.pbl, species_intron_probs.pbl, and species_igenic_probs.pbl. The meta parameter file holds per-species settings such as the order of the Markov model and the size of the window used for the splice site models.
***The transition probabilities were divided into three equal parts.*** 


Usage:

etraining --species=SPECIES --lowT=0.39 --highT=0.61 trainfilename --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config

'--lowT' is the cutoff of low GC content.
'--highT' is the cutoff of high GC content.
'trainfilename' is the filename (including relative path) to the file in genbank format containing the training sequences. 

These can be multi-gene sequences and genes on the reverse strand. However, the genes must not overlap and only one transcript is allowed.
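
To make the role of the two cutoffs concrete, here is a small sketch that sorts a sequence into a GC class. It rests on assumptions this README does not state: GC content is taken as the fraction of G and C among the known bases of the whole sequence, values strictly below lowT count as low, and values strictly above highT count as high. How GPRED-GC applies the cutoffs internally may differ. The function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G/C among known bases (A, C, G, T); None if there are none."""
    seq = seq.upper()
    known = sum(seq.count(base) for base in "ACGT")
    if known == 0:
        return None
    return (seq.count("G") + seq.count("C")) / known


def gc_class(seq, low_t=0.39, high_t=0.61):
    """Classify a sequence as 'low', 'medium' or 'high' GC.

    Assumption: boundaries are exclusive, so a fraction equal to a
    cutoff falls into the 'medium' class.
    """
    frac = gc_fraction(seq)
    if frac is None:
        return "unknown"
    if frac < low_t:
        return "low"
    if frac > high_t:
        return "high"
    return "medium"
```

With the default cutoffs, a sequence with 50% GC falls into the medium class.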



For example,

etraining --species=rice_megan_sample --lowT=0.39 --highT=0.61 /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/sample.gb.train --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config


In this example, etraining estimates the probabilities for the species rice_megan_sample from the training file sample.gb.train and outputs three .pbl files: rice_megan_sample_exon_probs.pbl, rice_megan_sample_intron_probs.pbl, and rice_megan_sample_igenic_probs.pbl.
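
When training is scripted, it can help to assemble the command line and the expected output names in one place. This is a sketch only: both helper names are hypothetical, the argument order simply mirrors the usage line above, and nothing is assumed about the directory the .pbl files are written to.

```python
def expected_pbl_files(species):
    """Names of the three parameter files that training writes for a species."""
    return [
        "{}_{}_probs.pbl".format(species, kind)
        for kind in ("exon", "intron", "igenic")
    ]


def build_etraining_cmd(species, trainfile, config_path,
                        low_t=0.39, high_t=0.61):
    """Argument list for etraining, in the order of the usage line above."""
    return [
        "etraining",
        "--species={}".format(species),
        "--lowT={}".format(low_t),
        "--highT={}".format(high_t),
        trainfile,
        "--GPREDGC_CONFIG_PATH={}".format(config_path),
    ]
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with paths that contain spaces.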

=========================================================
3.2 The program 'pipeline.py' reads the meta parameters from the .cfg file and a GenBank file with annotated genes, and writes the remaining species-specific parameters into three .pbl files: species_exon_probs.pbl, species_intron_probs.pbl, and species_igenic_probs.pbl. The meta parameter file holds per-species settings such as the order of the Markov model and the size of the window used for the splice site models.
***The transition probabilities were trained using maximum likelihood estimation.*** 


Usage:
python pipeline.py --species=SPECIES --lowT=0.39 --highT=0.61 trainfilename --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config/

'--lowT' is the cutoff of low GC content.
'--highT' is the cutoff of high GC content.
'trainfilename' is the filename (including relative path) to the file in genbank format containing the training sequences. 

These can be multi-gene sequences and genes on the reverse strand. However, the genes must not overlap and only one transcript is allowed.


For example,

python pipeline.py --species=rice_megan_sample --lowT=0.39 --highT=0.61 /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/sample.gb.train --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config


=========================================================

4. RUNNING GPRED-GC AND EVALUATION OF GENE PREDICTION

=========================================================


GPRED-GC has 2 mandatory arguments: the query file and the species.

The query file contains the DNA input sequence and must be in uncompressed (multiple) fasta format.

For example, the file may look like this

>name_of_sequence_1

agtgctgcatgctagctagct

>name_of_sequence_2

gtgctngcatgctagctagctggtgtnntgaaaaatt



Every letter other than a,c,g,t,A,C,G and T is interpreted as an unknown base. Digits and white spaces are ignored. The number of characters per line is not restricted.
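
The input rules above can be mimicked to preview how a query file will be read. This is an illustrative sketch, not the parser GPRED-GC uses. Its assumptions: unknown bases are written as 'n', known bases are lower-cased, and non-letter symbols other than digits and white space are dropped (the text above does not say how those are handled). The function names are hypothetical.

```python
def normalize(raw):
    """Apply the input rules to one raw sequence string."""
    out = []
    for ch in raw:
        if ch.isdigit() or ch.isspace():
            continue                      # digits and white space are ignored
        if ch in "acgtACGT":
            out.append(ch.lower())        # known base (lower-casing is our choice)
        elif ch.isalpha():
            out.append("n")               # any other letter: unknown base
        # anything else is dropped (assumption, not stated in the README)
    return "".join(out)


def read_fasta(text):
    """Parse (multiple) fasta text into {sequence name: normalized sequence}."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: normalize("".join(chunks)) for n, chunks in parts.items()}
```

Because line breaks are ignored, a sequence wrapped at 60 characters and the same sequence on a single line produce identical results.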



Usage:
gpredgc [parameters] --species=SPECIES queryfilename

or, to redirect the output to a file:

gpredgc [parameters] --species=SPECIES queryfilename > output.gff

For example,

gpredgc --species=rice_megan_sample /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/sample.test.fa --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config/ > /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/output.gff



The output is in GTF format, which is similar to the General Feature Format (GFF). For instance,

at003	GPREDGC	gene	1	4395	0.97	+	.	g1

at003	GPREDGC	transcript	1	4395	0.97	+	.	g1.t1

at003	GPREDGC	intron	1	1079	0.98	+	.	transcript_id "g1.t1"; gene_id "g1";

at003	GPREDGC	intron	1380	1502	1	+	.	transcript_id "g1.t1"; gene_id "g1";



The columns of the output file are:

Column 1: sequence name

Column 2: source of this annotation

Column 3: feature name

Column 4: beginning position of the feature

Column 5: end position of the feature

Column 6: score

Column 7: strand +/-

Column 8: frame ('.' in the example above)

Column 9: transcript and gene name
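
For downstream processing, each output line can be split into named fields. The sketch below assumes that fields are tab-separated, as in the sample lines above, and that every feature line has nine fields (the sample lines carry a '.' in the eighth position, which is the frame column in standard GTF). The names Feature and parse_line are hypothetical. Lines that do not have nine fields raise an error, so any other lines should be filtered out first.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    attributes: str


def parse_line(line):
    """Split one tab-separated output line into a Feature."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 fields, got {}".format(len(fields)))
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return Feature(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),   # '.' means no score
        strand, frame, attrs,
    )
```

Feature lengths then follow from the inclusive coordinates as end - start + 1.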



GPRED-GC also accepts a GenBank file as input for prediction. In that case it compares its predicted genes with the annotated genes and prints evaluations at the nucleotide level, exon level, and gene level.



For example, 

gpredgc --species=rice_megan_sample /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/sample.gb.test --GPREDGC_CONFIG_PATH=/my_path_to_GPREDGC/gpredgc.2.4.1.src/config > /my_path_to_GPREDGC/gpredgc.2.4.1.src/datasets/rice_megan_sample/sample.gb.test.evaluation1
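
This README does not define the evaluation measures. As a rough illustration of what a nucleotide-level comparison involves, the sketch below uses the definitions common in gene-prediction benchmarks: sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP), counted over coding positions. Whether GPRED-GC reports exactly these quantities is an assumption, and the function names are hypothetical.

```python
def covered_positions(intervals):
    """Set of positions covered by inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) at the nucleotide level.

    A value is None when its denominator is zero, i.e. when nothing
    was annotated (sensitivity) or nothing was predicted (specificity).
    """
    pred = covered_positions(predicted)
    anno = covered_positions(annotated)
    true_pos = len(pred & anno)
    sensitivity = true_pos / len(anno) if anno else None
    specificity = true_pos / len(pred) if pred else None
    return sensitivity, specificity
```

Building explicit position sets is simple but memory-hungry; for long sequences an interval-based overlap computation would be the better choice.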

Source: README.txt, updated 2018-11-30