Name                    Modified    Size      Downloads / Week
0.7.2                   2014-09-02
0.7.1                   2014-01-31
0.7                     2013-06-14
0.6.1                   2013-05-13
0.6                     2013-03-20
0.5                     2013-02-13
README                  2013-06-14  9.1 kB
validation-scripts.zip  2013-02-13  24.6 kB
Totals: 8 items                     33.7 kB   0
##############################################
# Change log                                 #
#############################################

2013-02-13 (v0.5): 
- First published version. Implements the "flow method" as described in our paper (link in Traph's webpage).

2013-03-14 (v0.6): 
- Added an approach that searches for a bounded number of paths using dynamic programming. Activated with parameter -k.
- Added --outlier parameter to activate the outlier mode (the default cover version is better most of the time, though).
- Removed --fitness parameter, as tests didn't show improvement with different fitness functions.

2013-05-10 (v0.6.1):
- Fixed a bug in graph creation that occasionally caused the program to terminate.

2013-06-13 (v0.7):
- Added many optional parameters
- Added a mode that uses annotation (only used to guide the creation of the splicing graph at this point)
- Force "powonpaths 1" for flow if coverage is high enough, as LEMON library cannot cope with the number of 
  edges created for simulating squared error when coverage is very high
- Changed most internal errors within pipeline to notify and skip instead of terminating program

###############################################
# Running Traph-pipeline                      #
###############################################

The command to run Traph-pipeline is "python runtraph.py".

Usage: runtraph.py [options]

Options:
  -h, --help            show this help message and exit

  Required options:
    -i INPUT, --input=INPUT
                        Input file
    -o OUTPUT, --output=OUTPUT
                        Output directory
    -l READ_LENGTH, --read-length=READ_LENGTH
                        Read length

  Flow options:
    --flowrounding      Activate flowrounding mode
    --flowratio=FLOWRATIO
                        Flowratio value
    --expressionratio=EXPRESSIONRATIO
                        Expressionratio value. Default mode with value 0.05.

  Dynamic programming options:
    -k K                Activates dynamic programming to search for k paths.
                        Give 'min' as argument to search for number of paths
                        with minimum cost.
    --maxk=MAXK         Maximum number of paths being searched with dynamic
                        programming when looking for lowest value of objective
                        function. Default is 15. Lower values make Traph run
                        faster.
    --ga-iterations=GA_ITERATIONS
                        Number of iterations for genetic algorithm (only for
                        dynamic programming, default is 5)
    --nosplit           Nosplit command for dynamic programming (use if Traph
                        stalls completely, temporary fix till the cause is
                        found)

  Annotation/no annotation options:
    -a ANNOTATION, --annotation=ANNOTATION
                        Annotation GTF file (optional, for the time being only
                        used for exon recognition)
    --alternative-threshold=ALTERNATIVE_THRESHOLD
                        Slope threshold for searching for alternative
                        transcripts starts/ends. The smaller, the more
                        sensitive, but too small value can cause false
                        positives if coverage is very uneven. Default 0.3
                        (version without annotation only)

  Other options:
    --outlier           Run outlier version (only for dynamic programming)
    --powonpaths=POWONPATHS
                        Objective function for error on paths (see README)
    --powoutsidepaths=POWOUTSIDEPATHS
                        Objective function for error outside paths (only for
                        dynamic programming, see README)
    --length-correction
                        Use exon length in the objective function
    -d, --debug         Debug mode



###########################################
# Input and output                        #
###########################################

Traph takes as input a BAM file (single-end or paired-end). It is not necessary to tell Traph whether the data is single-end or paired-end; this is deduced from the alignment tags. However, because single-end and paired-end data use different methods for creating the splicing graph, Traph will not perform optimally if the data is paired-end and has been aligned with a program that does not assign paired-end tags.

All reads in the BAM file must have the same length, and that length must be supplied to the runtraph.py script with the -l/--read-length parameter.

Traph outputs the predicted transcripts in standard GTF format. In addition to FPKM values for each transcript, Traph supplies the raw weight, that is, the average coverage per base of the transcript.
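
As an illustration of the required options, the following Python sketch assembles a minimal command line. The helper build_traph_command and the file names "reads.bam" and "traph_out" are placeholders invented for this example; only the flags -i, -o and -l come from the usage text above.

```python
def build_traph_command(bam_path, out_dir, read_length, extra_args=()):
    """Return the argument list for a runtraph.py call with the required options.

    This helper is hypothetical; it only strings together flags that the
    usage text documents.
    """
    if int(read_length) <= 0:
        raise ValueError("read length must be a positive integer")
    command = [
        "python", "runtraph.py",
        "-i", bam_path,           # input BAM file
        "-o", out_dir,            # output directory
        "-l", str(read_length),   # read length (all reads must share it)
    ]
    command.extend(extra_args)    # optional flags, e.g. ["--expressionratio=0.05"]
    return command


cmd = build_traph_command("reads.bam", "traph_out", 76)
print(" ".join(cmd))
# To actually launch Traph, pass the list to subprocess.check_call(cmd).
```

Building the command as a list rather than a single string avoids shell quoting problems if the paths contain spaces.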


##############################
# Flow options               #
##############################

Traph computes a min-cost flow on a splicing graph, and then iteratively splits 
this flow into different paths, which constitute the transcripts it reports. 
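
To make the idea of splitting a flow into paths concrete, here is a small Python sketch of one generic way to do it: repeatedly peel off the source-to-sink path with the largest bottleneck. This README does not specify the heuristic Traph actually uses, so treat the function names, the toy graph and the choice of "widest path first" as assumptions made for illustration.

```python
def widest_path(flow, source, sink, order):
    """Find the source-to-sink path with the largest bottleneck in a DAG.

    flow:  dict mapping (u, v) edges to the flow still left on them
    order: all nodes in topological order
    """
    best = {node: 0.0 for node in order}   # best bottleneck reaching each node
    pred = {}
    best[source] = float("inf")
    for u in order:
        if best[u] == 0.0:
            continue                        # u is not reachable with flow left
        for (a, b), f in flow.items():
            if a != u or f <= 0:
                continue
            width = min(best[u], f)
            if width > best[b]:
                best[b] = width
                pred[b] = u
    if best[sink] == 0.0:
        return None, 0.0
    path = [sink]
    while path[-1] != source:
        path.append(pred[path[-1]])
    path.reverse()
    return path, best[sink]


def decompose_flow(flow, source, sink, order):
    """Greedily split a flow into weighted paths (illustrative only)."""
    flow = dict(flow)                       # work on a copy
    paths = []
    while True:
        path, width = widest_path(flow, source, sink, order)
        if path is None:
            break
        for u, v in zip(path, path[1:]):
            flow[(u, v)] -= width           # remove the path's share of the flow
        paths.append((path, width))
    return paths


# Toy splicing graph: one shared first exon "a", then two alternatives.
toy_flow = {("s", "a"): 8, ("a", "b"): 5, ("a", "c"): 3, ("b", "t"): 5, ("c", "t"): 3}
for path, weight in decompose_flow(toy_flow, "s", "t", ["s", "a", "b", "c", "t"]):
    print(path, weight)
```

Each round removes all remaining flow from at least one edge, so the loop stops after at most as many rounds as there are edges.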

In the default mode, Traph does not report transcripts whose expression level is 
less than 5% of the most expressed transcript. This can be changed with 
parameter --expressionratio.

Another strategy for splitting the min-cost flow, activated with option 
--flowratio=FLOWRATIO, is to iteratively split the flow into different paths 
(transcripts) until the fraction FLOWRATIO of the total coverage of the splicing graph 
has been explained. In our simulated experiments, --flowratio=0.95 gave accuracy similar 
to --expressionratio=0.05 (the default mode).
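
The two reporting criteria can be compared with a short Python sketch. It works on a plain list of path weights rather than on Traph's internal data structures, and it assumes that the "total coverage" is a single number passed in by the caller; both function names are made up for this example.

```python
def filter_by_expression_ratio(weights, ratio=0.05):
    """Keep paths whose weight is at least `ratio` times the largest weight."""
    if not weights:
        return []
    cutoff = ratio * max(weights)
    return [w for w in weights if w >= cutoff]


def take_until_flow_ratio(weights, total_coverage, flow_ratio=0.95):
    """Take paths, heaviest first, until flow_ratio of the coverage is explained."""
    kept, explained = [], 0.0
    for w in sorted(weights, reverse=True):
        if explained >= flow_ratio * total_coverage:
            break
        kept.append(w)
        explained += w
    return kept


# Made-up path weights for one gene region.
weights = [100, 40, 6, 3]
print(filter_by_expression_ratio(weights, 0.05))
print(take_until_flow_ratio(weights, sum(weights), 0.95))
```

On this toy input both criteria keep the three heaviest paths and drop the lightest one, which is consistent with the observation above that the two settings behave similarly.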

In flowrounding mode (--flowrounding), all weights are rounded to the nearest multiple of the 
minimum edge weight before the optimal flow is computed. This is intended to help the heuristic 
that splits the optimal flow into few paths work better.
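
The rounding step can be sketched in a few lines of Python. The dictionary layout, the function name and the tie-breaking rule (halves round up) are assumptions for this example, not details taken from Traph's source.

```python
import math


def round_to_min_multiple(weights):
    """Round every weight to the nearest multiple of the smallest positive weight.

    weights: dict mapping edges to their (positive) weights.
    Ties are rounded up; this choice is an assumption.
    """
    unit = min(w for w in weights.values() if w > 0)
    return {
        edge: unit * math.floor(w / unit + 0.5)
        for edge, w in weights.items()
    }


# Made-up edge weights; the smallest one (4) becomes the rounding unit.
edge_weights = {("a", "b"): 4, ("b", "c"): 9, ("a", "c"): 15}
print(round_to_min_multiple(edge_weights))
```

After rounding, every weight is a whole multiple of the same unit, so path weights peeled off the flow are multiples of that unit as well.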


#########################################
# Dynamic programming options           #
#########################################

Parameter -k makes Traph use a dynamic programming algorithm to search for a specific
number of paths (note: this is the number of paths per gene region, not for the whole sample). 
If the value is an integer, Traph attempts to find that many paths. If the value is
"min", Traph tries various numbers of paths and selects the one with the least
error.

Please note that values of k higher than 3 or 4 can make Traph very slow on large graphs.
With "-k min", Traph handles high values of k with a heuristic: it searches for a small number of paths,
removes them from the graph, and repeats until the desired number of paths has been found.

Option --maxk sets the highest value of k that a search triggered with "-k min" will try. The default is 15.
Setting this value higher allows Traph to find more transcripts per gene, but slows it down
substantially.
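
The "-k min" selection can be pictured as a simple loop over candidate values of k. In the Python sketch below, cost_for_k stands in for whatever Traph computes for a fixed k; it is a hypothetical callback, and the toy costs are invented. The sketch also ignores the path-removal heuristic mentioned above.

```python
def select_k_min(cost_for_k, maxk=15):
    """Try k = 1..maxk and return (k, cost) for the lowest cost found.

    cost_for_k: hypothetical callback returning the objective value of the
                best solution that uses exactly k paths.
    On ties the smaller k wins, which is an assumption of this sketch.
    """
    best_k, best_cost = None, float("inf")
    for k in range(1, maxk + 1):
        cost = cost_for_k(k)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


# Invented costs for one gene region, just to exercise the loop.
toy_costs = {1: 40.0, 2: 12.5, 3: 9.0, 4: 9.0, 5: 11.0}
print(select_k_min(lambda k: toy_costs[k], maxk=5))
```

Because every candidate k requires a full solve, lowering maxk directly reduces the number of solves, which is why smaller values make the search faster.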

If you use a high value of k with "-k <int>", at the very least set parameter --ga-iterations, which 
controls the number of times Traph runs the genetic algorithm to find the best solution, 
lower than its default of 5.

##################################################
# Annotation/no annotation options               #
##################################################

Traph is a genome-based de novo transcript prediction tool, so it does not use annotation in the prediction itself.
However, we have added an annotation-guided mode for creating the splicing graph and for naming the transcripts
based on the annotation.

For non-guided creation of the splicing graph, parameter --alternative-threshold controls how sensitive the graph
creation is to alternative sources or sinks within an exon. If the predicted transcripts seem fragmented, increase 
the value, because graph creation is then probably interpreting coverage variation as sources and/or sinks.


#################################################
# Other options                                 #
#################################################

The remaining options are rarely needed; they are included for completeness.

"--outlier" activates the outlier variant (see our paper for k-UTEO). In the vast majority of cases, the cover
variant (the default) is the better option.

"--powonpaths" sets the objective function (error function) for nodes and edges that lie on path of a transcript. 
"--powoutsidepaths" sets the objective functions for nodes and edges that do not lie on path of a transcript.

The flow engine does not support setting these to different values, but dynamic programming does. Any positive integer i
is an acceptable value and results in the objective function |observed-predicted|^i. In addition, there are two special cases: 
a value of -1 sets the objective function to |observed-predicted|/observed, and a value of -2 sets it 
to |observed-predicted|^2/observed.
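
The mapping from a --powonpaths / --powoutsidepaths value to an error term can be written out as a small Python function. This is a direct transcription of the formulas above, not code from Traph; the function name is invented, and the sketch assumes observed is nonzero for the two special cases.

```python
def error_term(observed, predicted, power):
    """Error of one node or edge for a given objective-function setting.

    power > 0  -> |observed - predicted| ** power
    power = -1 -> |observed - predicted| / observed
    power = -2 -> |observed - predicted| ** 2 / observed
    The special cases divide by observed, so it must not be zero there.
    """
    diff = abs(observed - predicted)
    if power == -1:
        return diff / observed
    if power == -2:
        return diff ** 2 / observed
    if isinstance(power, int) and power > 0:
        return diff ** power
    raise ValueError("power must be a positive integer, -1 or -2")


# Observed coverage 10, predicted coverage 7.
for power in (1, 2, -1, -2):
    print(power, error_term(10, 7, power))
```

The two special cases scale the error by the observed coverage, so the same absolute deviation counts for less on highly covered nodes and edges.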

"--length-correction" applies a length correction to the objective function based on the length of the exon (that is,
shorter exons get penalized less and longer more). In our tests this did not seem to make a difference.

When debug mode (-d) is activated, Traph prints various status reports that can help with finding problem cases.


Source: README, updated 2013-06-14