TransGeneScan Code

TransGeneScan is a gene finding tool for metatranscriptomic sequences

Brought to you by: wazimismail, yuzhenye

File	Date	Author	Commit
example	2014-02-08	Wazim MohammedIsmail	[b3c3b8] Address code review
scripts	2014-02-08	Wazim MohammedIsmail	[b3c3b8] Address code review
train	2014-01-28	Wazim MohammedIsmail	[2c14bd] Initial commit
FGS_gff.py	2014-01-28	Wazim MohammedIsmail	[2c14bd] Initial commit
FragGeneScan	2014-02-08	Wazim MohammedIsmail	[b3c3b8] Address code review
Makefile	2014-01-28	Wazim MohammedIsmail	[2c14bd] Initial commit
README	2014-02-25	Wazim MohammedIsmail	[0edb8d] new changes
hmm.h	2014-01-28	Wazim MohammedIsmail	[2c14bd] Initial commit
hmm_lib.c	2014-02-08	Wazim MohammedIsmail	[b3c3b8] Address code review
hmm_lib.o	2014-02-08	Wazim MohammedIsmail	[b3c3b8] Address code review
post_process.pl	2014-02-25	Wazim MohammedIsmail	[0edb8d] new changes
processFragOut.py	2014-01-28	Wazim MohammedIsmail	[2c14bd] Initial commit
run_TransGeneScan.pl	2014-02-25	Wazim MohammedIsmail	[0edb8d] new changes
run_hmm.c	2014-01-28	Wazim MohammedIsmail	[2c14bd] Initial commit
run_hmm.o	2014-02-08	Wazim MohammedIsmail	[b3c3b8] Address code review
util_lib.c	2014-02-08	Wazim MohammedIsmail	[b3c3b8] Address code review
util_lib.h	2014-01-28	Wazim MohammedIsmail	[2c14bd] Initial commit
util_lib.o	2014-02-08	Wazim MohammedIsmail	[b3c3b8] Address code review

Read Me

Installation
=============
To install TransGeneScan, please follow the steps below:

1. Untar the downloaded file "TransGeneScan.tar.gz". This will automatically generate the directory "TransGeneScan".

2. Make sure that you have a C compiler such as "gcc" and a Perl interpreter installed.

3. Use the Makefile to compile and build the executable:
	make clean
	make fgs


Running the program
====================
1.  To run TransGeneScan, 

./run_TransGeneScan.pl -in=[seq_file_name] -out=[output_file_name]

[seq_file_name]: sequence file name including the full path
[output_file_name]: output file name including the full path
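
The call above can also be scripted. The following is a minimal Python sketch, not part of TransGeneScan: the helper names build_command and run_transgenescan are hypothetical, and the output name "example/TGSout" is only a placeholder. It resolves both file names to full paths, as the options require, and passes them using the -in= and -out= syntax shown above.

```python
import os
import subprocess


def build_command(seq_file, out_file, script="./run_TransGeneScan.pl"):
    """Return the argument list for one TransGeneScan run (hypothetical helper)."""
    return [
        script,
        "-in=" + os.path.abspath(seq_file),    # full path to the sequence file
        "-out=" + os.path.abspath(out_file),   # full path to the output file name
    ]


def run_transgenescan(seq_file, out_file, script="./run_TransGeneScan.pl"):
    """Run the front-end script; raises CalledProcessError if it exits non-zero."""
    if not os.path.isfile(seq_file):
        raise FileNotFoundError(seq_file)
    subprocess.run(build_command(seq_file, out_file, script), check=True)


if __name__ == "__main__":
    # Only prints the command; the output name is a placeholder.
    print(" ".join(build_command("example/transcripts.fasta", "example/TGSout")))
```

The sketch assumes it is started from the TransGeneScan directory, since the default script path is relative.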


Assembly of Transcripts
=======================
1. To assemble transcripts based on read mappings onto a single reference genome,

./scripts/pipeline.sh [reference_file] [reads_prefix] [TGSHome] [n] [k] [t]

[reference_file]: reference sequence file including full path
[reads_prefix]: prefix of the paired-end read files, including the full path but not the _1.fastq and _2.fastq suffixes. The script appends these suffixes itself, so the read files must be named [reads_prefix]_1.fastq and [reads_prefix]_2.fastq.
[TGSHome]: Full path of TransGeneScan home directory
[n],[k],[t]: These are bwa parameters (please see bwa documentation for more information). The values used for testing were 4,4,4
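
Because the script appends the read-file suffixes itself, it can help to check the inputs before starting a long run. The following is a minimal Python sketch under that assumption. The helper names (read_files, missing_inputs, build_pipeline_command) are hypothetical and not part of the package, and the defaults of 4 for n, k and t are simply the values reported above as used for testing.

```python
import os


def read_files(reads_prefix):
    """Return the two paired-end file names the pipeline derives from the prefix."""
    return (reads_prefix + "_1.fastq", reads_prefix + "_2.fastq")


def missing_inputs(reference_file, reads_prefix):
    """List the expected input files that do not exist on disk."""
    candidates = (reference_file,) + read_files(reads_prefix)
    return [path for path in candidates if not os.path.isfile(path)]


def build_pipeline_command(reference_file, reads_prefix, tgs_home, n=4, k=4, t=4):
    """Return the argument list in the order pipeline.sh expects."""
    script = os.path.join(tgs_home, "scripts", "pipeline.sh")
    return [script, reference_file, reads_prefix, tgs_home, str(n), str(k), str(t)]


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    ref, prefix, home = "/data/ref.fasta", "/data/sample", "/opt/TransGeneScan"
    for path in missing_inputs(ref, prefix):
        print("missing:", path)
    print(" ".join(build_pipeline_command(ref, prefix, home)))
```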

Source files included
=====================
1. run_hmm.c, util_lib.c, util_lib.h, hmm.h, hmm_lib.c
These files contain the main Hidden Markov Model (HMM) framework of the prediction system. Most of the code is reused unchanged from FragGeneScan.

2. run_TransGeneScan.pl
This script is the main front end for the user to call the program for prediction. (See "Running the program" above)

3. post_process.pl
This script is part of the original FragGeneScan. It corrects the position of the start codon based on a prediction model (see "Reference" for more details) and is reused unchanged in TransGeneScan.

4. FGS_gff.py
This script converts the TransGeneScan output format (which is the same as FragGeneScan output format) into gff format. 

5. processFragOut.py
This script is used to output predictions on sense transcripts and antisense transcripts as separate files.

6. train/*
These files include the training parameters used by the HMM. 

7. scripts/*
These scripts are used to do the assembly of transcripts based on read mappings (See "Assembly of Transcripts" above). 
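
To illustrate the kind of conversion FGS_gff.py (item 4 above) performs, here is a minimal Python sketch that turns FGS-format text into GFF3 lines. It is not the code of FGS_gff.py, whose exact output layout is not described here. The source name "TransGeneScan", the feature type "CDS", the ID and frame attributes, and the phase value 0 (which assumes each reported start is the first base of a codon) are all assumptions made for the example.

```python
def fgs_to_gff(text, source="TransGeneScan", feature="CDS"):
    """Convert FGS-format text to a GFF3 string (illustrative layout only)."""
    lines = ["##gff-version 3"]
    seqid = None
    count = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new transcript; restart the gene counter.
            seqid = line[1:]
            count = 0
            continue
        if seqid is None:
            raise ValueError("gene line found before any header line")
        # Only the first five columns are used; trailing fields are ignored.
        start, end, strand, frame, score = line.split()[:5]
        count += 1
        attributes = "ID={}_{};frame={}".format(seqid, count, frame)
        # Phase is written as 0 (assumption: start is the first base of a codon).
        lines.append("\t".join(
            [seqid, source, feature, start, end, score, strand, "0", attributes]
        ))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = ">ftranscript:6551:6792\n1\t242\t-\t3\t1.304437\tI:\tD:\n"
    print(fgs_to_gff(sample), end="")
```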

Sample files included
=====================
1. example/transcripts.fasta
This file is the transcript assembly output produced by running scripts/pipeline.sh on paired-end reads downloaded from the Sequence Read Archive (SRR442380) and mapped onto E. coli (NC_000913) as the reference.

2. example/TGSout.out, TGSout.ffn, TGSout.faa, TGSout.gff
Prediction output from TransGeneScan in FGS output format (see below), nucleic acid fasta format, amino acid fasta format and gff format, respectively.

3. example/TGSout.sn
Prediction output from TransGeneScan containing only sense transcripts in FGS output format. 

4. example/TGSout.as
Prediction output from TransGeneScan containing only antisense transcripts. Since each entire transcript is an antisense transcript, no start/stop ranges are specified.


FGS output format
=================
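This format lists the coordinates of putative genes. Each line starting with ">" names a transcript, and each line below it describes one gene predicted on that transcript in five columns (start position, end position, strand, frame, score). The trailing I: and D: fields are inherited from FragGeneScan, where they list the positions of predicted insertions and deletions; they are empty in this example. For example,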

>ftranscript:1741:5049
217     1059    +       1       1.297925        I:      D:
1061    1993    +       2       1.310458        I:      D:
1994    3280    +       2       1.289984        I:      D:
>ftranscript:6551:6792
1       242     -       3       1.304437        I:      D:
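
A file in this format can be read with a few lines of Python. The following is a minimal sketch, not part of TransGeneScan: the names Gene and parse_fgs_output are hypothetical, and the parser keeps only the five documented columns, ignoring any fields after the score.

```python
from typing import Dict, List, NamedTuple


class Gene(NamedTuple):
    start: int
    end: int
    strand: str
    frame: int
    score: float


def parse_fgs_output(text: str) -> Dict[str, List[Gene]]:
    """Group predicted genes by the ">" header line that precedes them."""
    genes: Dict[str, List[Gene]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            genes.setdefault(current, [])
            continue
        if current is None:
            raise ValueError("gene line found before any header line")
        # Columns: start, end, strand, frame, score; the rest is ignored.
        start, end, strand, frame, score = line.split()[:5]
        genes[current].append(
            Gene(int(start), int(end), strand, int(frame), float(score))
        )
    return genes


SAMPLE = """\
>ftranscript:1741:5049
217     1059    +       1       1.297925        I:      D:
1061    1993    +       2       1.310458        I:      D:
1994    3280    +       2       1.289984        I:      D:
>ftranscript:6551:6792
1       242     -       3       1.304437        I:      D:
"""

if __name__ == "__main__":
    for name, predicted in parse_fgs_output(SAMPLE).items():
        print(name, len(predicted))
```

Keeping the genes grouped by header preserves the link between each prediction and the transcript it was made on.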


Reference
=========
Rho, M., Tang, H., Ye, Y.: FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Research 38(20), e191 (2010)

License
============
Copyright (C) 2013 Wazim Mohammed Ismail, Yuzhen Ye and Haixu Tang.
You may redistribute this software under the terms of the GNU General Public License.
