##############################################
# Change log #
#############################################
2013-02-13 (v0.5):
- First published version. Implements the "flow method" as described in our paper (link in Traph's webpage).
2013-03-14 (v0.6):
- Added an approach to search for a bounded number of paths using dynamic programming. Activated with parameter -k.
- Added --outlier parameter to activate the outlier mode (the default cover version is better most of the time, though).
- Removed --fitness parameter, as tests didn't show improvement with different fitness functions.
2013-05-10 (v0.6.1):
- Fixed a bug in graph creation that at times caused the program to terminate.
2013-06-13 (v0.7):
- Added many optional parameters.
- Added a mode that uses annotation (at this point only used to guide the creation of the splicing graph).
- Force "powonpaths 1" for flow if coverage is high enough, as the LEMON library cannot cope with the number of
edges created for simulating squared error when coverage is very high.
- Changed most internal errors within the pipeline to notify and skip instead of terminating the program.
###############################################
# Running Traph-pipeline #
###############################################
The command to run Traph-pipeline is "python runtraph.py".
Usage: runtraph.py [options]

Options:
  -h, --help            show this help message and exit

  Required options:
    -i INPUT, --input=INPUT
                        Input file
    -o OUTPUT, --output=OUTPUT
                        Output directory
    -l READ_LENGTH, --read-length=READ_LENGTH
                        Read length

  Flow options:
    --flowrounding      Activate flowrounding mode
    --flowratio=FLOWRATIO
                        Flowratio value
    --expressionratio=EXPRESSIONRATIO
                        Expressionratio value. Default mode with value 0.05.

  Dynamic programming options:
    -k K                Activates dynamic programming to search for k paths.
                        Give 'min' as argument to search for number of paths
                        with minimum cost.
    --maxk=MAXK         Maximum number of paths being searched with dynamic
                        programming when looking for lowest value of objective
                        function. Default is 15. Lower values make Traph run
                        faster.
    --ga-iterations=GA_ITERATIONS
                        Number of iterations for genetic algorithm (only for
                        dynamic programming, default is 5)
    --nosplit           Nosplit command for dynamic programming (use if Traph
                        stalls completely, temporary fix till the cause is
                        found)

  Annotation/no annotation options:
    -a ANNOTATION, --annotation=ANNOTATION
                        Annotation GTF file (optional, for the time being only
                        used for exon recognition)
    --alternative-threshold=ALTERNATIVE_THRESHOLD
                        Slope threshold for searching for alternative
                        transcripts starts/ends. The smaller, the more
                        sensitive, but too small value can cause false
                        positives if coverage is very uneven. Default 0.3
                        (version without annotation only)

  Other options:
    --outlier           Run outlier version (only for dynamic programming)
    --powonpaths=POWONPATHS
                        Objective function for error on paths (see README)
    --powoutsidepaths=POWOUTSIDEPATHS
                        Objective function for error outside paths (only for
                        dynamic programming, see README)
    --length-correction
                        Use exon length in the objective function
    -d, --debug         Debug mode
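
As an illustration, the options above can be assembled into a command line from Python. This is only a
sketch: the helper name build_traph_command and the file names reads.bam and traph_out are hypothetical,
and only flags listed in the help text are used.

```python
def build_traph_command(bam_path, out_dir, read_length, k=None, annotation=None):
    """Assemble an argument list for runtraph.py (hypothetical helper, not part of Traph)."""
    # The three required options: input BAM, output directory, read length.
    cmd = ["python", "runtraph.py",
           "-i", bam_path,
           "-o", out_dir,
           "-l", str(read_length)]
    # Optional: dynamic programming with a fixed k or with "min".
    if k is not None:
        cmd += ["-k", str(k)]
    # Optional: annotation GTF file to guide splicing graph creation.
    if annotation is not None:
        cmd += ["-a", annotation]
    return cmd


# Example with placeholder file names.
example = build_traph_command("reads.bam", "traph_out", 75, k="min")
print(" ".join(example))
```

The resulting list can be passed to subprocess.check_call from the directory that contains runtraph.py.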
###########################################
# Input and output #
###########################################
Traph takes as input a BAM file (single-end or paired-end). There is no need to tell Traph whether the data is single-end or paired-end; this is deduced from the alignment tags. However, because single-end and paired-end data use different methods for creating the splicing graph, Traph will not perform optimally if the data is paired-end and has been aligned with a program that does not assign paired-end tags.
All reads in the BAM file must have the same length, and that length must be supplied to the runtraph.py script with the -l (--read-length) parameter.
Traph outputs the predicted transcripts in standard GTF format. In addition to an FPKM value for each transcript, Traph reports the raw weight, that is, the average coverage per base of the transcript.
##############################
# Flow options #
##############################
Traph computes a min-cost flow on a splicing graph, and then iteratively splits
this flow into different paths, which constitute the transcripts it reports.
In the default mode, Traph does not report transcripts whose expression level is
less than 5% of the most expressed transcript. This can be changed with
parameter --expressionratio.
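
The default filter can be pictured with the following minimal sketch. It is not Traph's code: the function
name and the dictionary of transcript weights are hypothetical, and it assumes the "most expressed
transcript" is the maximum within the set of transcripts passed in.

```python
def filter_by_expression_ratio(transcripts, expression_ratio=0.05):
    """Drop transcripts expressed below expression_ratio times the top transcript.

    transcripts: dict mapping a transcript name to its expression level
    (hypothetical representation chosen for this sketch).
    """
    if not transcripts:
        return {}
    cutoff = expression_ratio * max(transcripts.values())
    # Transcripts below the cutoff are not reported.
    return {name: level for name, level in transcripts.items() if level >= cutoff}


kept = filter_by_expression_ratio({"t1": 100.0, "t2": 10.0, "t3": 4.0})
print(sorted(kept))
```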
Another strategy for splitting the min-cost flow, activated with option
--flowratio=FLOWRATIO, is to iteratively split the flow into different paths
(transcripts) until FLOWRATIO of the total coverage of the splicing graph has been
explained. In our simulated experiments, --flowratio=0.95 gave similar accuracy to
--expressionratio=0.05 (the default mode).
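
The stopping rule of the flowratio strategy can be sketched as follows. This is a simplification under
stated assumptions: the paths are taken as already extracted, each path is represented only by the amount
of flow it carries, and paths are taken in order of decreasing flow. The function name is hypothetical;
Traph itself splits the flow on the splicing graph.

```python
def split_until_flow_ratio(path_flows, flow_ratio=0.95):
    """Report paths until flow_ratio of the total flow has been explained."""
    total = sum(path_flows)
    reported = []
    explained = 0.0
    # Assumption for this sketch: heavier paths are peeled off first.
    for flow in sorted(path_flows, reverse=True):
        if total == 0 or explained >= flow_ratio * total:
            break
        reported.append(flow)
        explained += flow
    return reported


print(split_until_flow_ratio([60, 30, 8, 2], flow_ratio=0.95))
```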
In the flowrounding mode (--flowrounding), all weights are rounded to the closest multiple of the minimum
weight of an edge, and then the optimal flow is computed. This is intended to help the heuristic
for splitting the optimal flow into few paths work better.
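
A minimal sketch of the rounding step, assuming edge weights are held in a dictionary keyed by edge and
that only positive weights are considered when choosing the minimum (the text does not say how zero-weight
edges are treated). The function name is hypothetical.

```python
def round_to_min_weight(edge_weights):
    """Round every edge weight to the closest multiple of the smallest positive weight."""
    positive = [w for w in edge_weights.values() if w > 0]
    if not positive:
        return dict(edge_weights)
    unit = min(positive)
    # round() picks the nearest integer multiple of the unit weight.
    return {edge: round(w / unit) * unit for edge, w in edge_weights.items()}


weights = {("a", "b"): 10, ("b", "c"): 24, ("a", "c"): 36}
print(round_to_min_weight(weights))
```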
#########################################
# Dynamic programming options #
#########################################
Parameter -k activates a dynamic programming algorithm that searches for a specific
number of paths (note: this is the number of paths per gene region, not for the whole sample).
If the value is an integer, Traph attempts to find that many paths. If the value is
"min", Traph tries various values for the number of paths and selects
the one with the least error.
Please note that values of k higher than 3 or 4 can make Traph very slow on large graphs.
With the "min" option, for high values of k Traph uses a heuristic: it searches for a small number of
paths and removes them from the graph, repeating until the desired number of paths has been found.
Option --maxk sets the highest value of k that a search triggered with "-k min" will try. Default is 15.
Setting this value higher allows Traph to find more transcripts per gene, but slows it down
substantially.
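
The "-k min" search bounded by --maxk can be pictured with the sketch below. It is not Traph's
implementation: cost_for_k is a hypothetical callback standing in for the dynamic programming search
that returns the error of the best solution with k paths, and the sketch ignores the heuristic used for
high values of k.

```python
def choose_k_with_min_cost(cost_for_k, max_k=15):
    """Try k = 1..max_k and return the (k, cost) pair with the lowest cost."""
    best_k, best_cost = None, float("inf")
    for k in range(1, max_k + 1):
        cost = cost_for_k(k)
        # Keep the smallest k among equally good solutions.
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


# Toy error values for one gene region (made up for illustration).
toy_costs = {1: 9.0, 2: 4.0, 3: 4.5}
print(choose_k_with_min_cost(toy_costs.get, max_k=3))
```

A lower max_k means fewer calls to the expensive search, which is why lowering --maxk makes Traph faster.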
If using a higher value of k with "-k <int>", at the very least set parameter --ga-iterations, which
controls the number of times Traph runs the genetic algorithm to find the best solution,
lower than its default (5).
##################################################
# Annotation/no annotation options #
##################################################
Traph is a genome-based de novo transcript prediction tool, and therefore does not use annotation in the prediction itself.
However, an annotation can be supplied (-a) to guide the creation of the splicing graph and to name the transcripts based
on the annotation.
For non-guided creation of the splicing graph, parameter --alternative-threshold controls how sensitive the graph
creation is to alternative sources or sinks within an exon. If the predicted transcripts seem fragmented, increase
the value, as graph creation is then likely interpreting coverage variation as sources and/or sinks.
#################################################
# Other options #
#################################################
The remaining options are rarely needed; they are included for completeness.
"--outlier" activates the outlier variant (see k-UTEO in our paper). In the vast majority of cases, the cover variant (default)
is the better option.
"--powonpaths" sets the objective function (error function) for nodes and edges that lie on the path of a transcript.
"--powoutsidepaths" sets the objective function for nodes and edges that do not lie on the path of a transcript.
The flow engine does not support setting these to different values, but dynamic programming does. Any positive integer i
is an acceptable value, resulting in the objective function |observed-predicted|^i. In addition, there are two special cases:
a value of -1 sets the objective function to |observed-predicted|/observed, and a value of -2 sets the objective function
to |observed-predicted|^2/observed.
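
The mapping from the parameter value to the error term can be written out as a short sketch. The function
name path_error is hypothetical, and the sketch assumes observed is greater than zero for the two special
cases (the text above does not say how zero coverage is handled).

```python
def path_error(observed, predicted, power):
    """Error term for one node or edge, selected by a powonpaths-style value."""
    diff = abs(observed - predicted)
    if power == -1:
        # Special case: relative absolute error.
        return diff / observed
    if power == -2:
        # Special case: squared error scaled by the observed value.
        return diff ** 2 / observed
    if isinstance(power, int) and power > 0:
        return diff ** power
    raise ValueError("power must be a positive integer, -1 or -2")


for p in (1, 2, -1, -2):
    print(p, path_error(10, 7, p))
```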
"--length-correction" applies a length correction to the objective function based on the length of the exon (that is,
shorter exons are penalized less and longer exons more). In our tests this did not seem to make a difference.
When debug mode (-d) is activated, Traph prints various status reports that can help with finding problem cases.