Architecture


[GEPETTO] uses the jBPM (JBoss Business Process Management) workflow engine to define and execute the prioritization process. We built the development around this workflow engine because it is developed by Red Hat and has an active, well-established community.

jBPM is a flexible Business Process Management suite. It is lightweight, fully open source (distributed under the Apache license), and written in Java. It lets users model, execute, and monitor business processes throughout their life cycle.

Figure: GEPETTO workflow for jBPM

Dataset Loader

The first step loads the data needed to identify the genes in the training and test sets and their associated proteins (or vice versa). These data are then used to build the relationships between those features.

This step uses data (gene/protein cards) provided by SM2PH Central.
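
As an illustration of the gene/protein relationships built in this step, here is a minimal Java sketch of a two-way index. The class name `GeneProteinIndex`, its methods, and the identifiers used with it are hypothetical and are not taken from the GEPETTO code base or from the SM2PH Central data format.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical two-way index between gene and protein identifiers. */
final class GeneProteinIndex {
    private final Map<String, Set<String>> proteinsByGene = new HashMap<>();
    private final Map<String, Set<String>> genesByProtein = new HashMap<>();

    /** Records that a gene is associated with a protein, in both directions. */
    void link(String gene, String protein) {
        proteinsByGene.computeIfAbsent(gene, k -> new TreeSet<>()).add(protein);
        genesByProtein.computeIfAbsent(protein, k -> new TreeSet<>()).add(gene);
    }

    /** Proteins associated with a gene (empty set if the gene is unknown). */
    Set<String> proteinsOf(String gene) {
        return proteinsByGene.getOrDefault(gene, Collections.emptySet());
    }

    /** Genes associated with a protein (empty set if the protein is unknown). */
    Set<String> genesOf(String protein) {
        return genesByProtein.getOrDefault(protein, Collections.emptySet());
    }
}
```

Keeping both directions in sync inside `link` means a lookup can start from either a gene or a protein, which matches the "or vice versa" in the description above.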

Modules - Local prioritization

The [GEPETTO] [Modules] are based on different parameters and do not interact with each other. The parallel gateway is well suited to this model: in this gateway, each local prioritization module (sub-branch) corresponds to a separate service task.

A parallel gateway is used either to split the outgoing sequence flow or to synchronize the incoming sequence flow.

A parallel gateway with one incoming sequence flow and more than one outgoing sequence flow is called a 'parallel split' or an 'AND-split'. All outgoing sequence flows are taken in parallel. It is used in [GEPETTO] to go from the genes / proteins loader to the local prioritization modules.
A parallel gateway with multiple incoming sequence flows and one outgoing sequence flow is called a 'parallel join' or an 'AND-join'. All incoming sequence flows need to arrive at this parallel join before the outgoing sequence flow is taken. It is used in [GEPETTO] to go from the local prioritization modules to the data fusion (global prioritization) step.
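
The split/join semantics can be illustrated outside jBPM with plain `java.util.concurrent`. The sketch below is only an analogy, not jBPM code and not the GEPETTO implementation: the class `ParallelGatewaySketch`, the method `runAll`, and the modelling of a module as a function from a gene list to a ranked gene list are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Plain-Java analogy of an AND-split followed by an AND-join. */
final class ParallelGatewaySketch {

    /**
     * Runs every module on the same gene list in parallel and returns
     * one ranking per module, keyed by module name.
     * A "module" here is a hypothetical function: gene list -> ranked gene list.
     */
    static Map<String, List<String>> runAll(
            List<String> genes,
            Map<String, Function<List<String>, List<String>>> modules) {

        // AND-split: start one asynchronous branch per module.
        Map<String, CompletableFuture<List<String>>> branches = new LinkedHashMap<>();
        modules.forEach((name, module) ->
                branches.put(name, CompletableFuture.supplyAsync(() -> module.apply(genes))));

        // AND-join: block until every branch has completed.
        CompletableFuture.allOf(branches.values().toArray(new CompletableFuture[0])).join();

        // Collect the per-module rankings for the next (fusion) step.
        Map<String, List<String>> rankings = new LinkedHashMap<>();
        branches.forEach((name, future) -> rankings.put(name, future.join()));
        return rankings;
    }
}
```

Because the modules do not interact, no synchronization is needed between branches; the only coordination point is the join, which mirrors the behavior described for the parallel gateway.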

The following diagram shows how a parallel gateway can be used.

Data Fusion - Global prioritization

The global prioritization method is based on the data fusion approach presented by Aerts et al., 2006: order statistics.

The rankings from the separate data sources are combined using order statistics. A Q statistic is calculated from all rank ratios using the joint cumulative distribution of an N-dimensional order statistic, as previously done by Stuart et al., 2003:

$$Q(r_1, r_2, \ldots, r_N) = N! \int_0^{r_1} \int_{s_1}^{r_2} \cdots \int_{s_{N-1}}^{r_N} ds_N \, ds_{N-1} \cdots ds_1$$

They propose the following recursive formula to compute the above integral:

where r_i is the rank ratio for data source i, N is the number of data sources used, and r_0 = 0. However, two problems arose when we tried to use this formula. First, we noticed that this formula is highly inefficient for moderate values of N, and even intractable for N > 12, because its complexity is O(N!). We therefore implemented a much faster alternative formula with complexity O(N^2):

$$V_k = \sum_{i=1}^{k} (-1)^{i-1} \, \frac{V_{k-i}}{i!} \, r_{N-k+1}^{\,i}$$

with Q(r_1, r_2, ..., r_N) = N! V_N, V_0 = 1, and r_i the rank ratio for data source i.
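
A minimal Java sketch of this O(N^2) computation follows. It assumes the recursion V_k = sum over i = 1..k of (-1)^(i-1) * V_(k-i) / i! * r_(N-k+1)^i with V_0 = 1 and Q = N! V_N, and it assumes the rank ratios must be sorted in ascending order before use. The class and method names are hypothetical, and this is not the GEPETTO source code.

```java
import java.util.Arrays;

/** Sketch of the Q statistic via the O(N^2) recursion (names are illustrative). */
final class OrderStatistics {

    /**
     * Computes Q(r_1, ..., r_N) = N! * V_N for rank ratios in [0, 1].
     * The input is copied and sorted ascending (assumption: r_1 <= ... <= r_N).
     */
    static double qStatistic(double[] rankRatios) {
        int n = rankRatios.length;
        double[] r = rankRatios.clone();
        Arrays.sort(r);

        double[] v = new double[n + 1];
        v[0] = 1.0; // V_0 = 1

        for (int k = 1; k <= n; k++) {
            double x = r[n - k];   // r_{N-k+1} in 1-based notation
            double term = 1.0;     // holds x^i / i! as i grows
            double sum = 0.0;
            for (int i = 1; i <= k; i++) {
                term *= x / i;
                double sign = (i % 2 == 1) ? 1.0 : -1.0; // (-1)^(i-1)
                sum += sign * v[k - i] * term;
            }
            v[k] = sum;
        }

        double factorial = 1.0;
        for (int i = 2; i <= n; i++) {
            factorial *= i;
        }
        return factorial * v[n];
    }
}
```

For N = 2 with rank ratios 0.2 and 0.5, the recursion gives V_1 = 0.5 and V_2 = 0.5 * 0.2 - 0.2^2 / 2 = 0.08, so Q = 2 * 0.08 = 0.16, which agrees with the closed form 2 r_1 r_2 - r_1^2 of the double integral. The alternating sum may lose precision for large N, so this sketch should be checked numerically before being used on many data sources.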

According to the authors, this alternative formula reduces the computational cost and so improves the performance of the ranking step.

We integrated all individual prioritizations into a single overall rank by implementing an algorithm based on order statistics. With this algorithm, the probability of finding a gene at all the observed positions is calculated and a single overall rank is obtained by ranking genes according to these probabilities.
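
The final ordering step can be sketched as follows. The sketch assumes that each gene already has a combined probability (its Q value) and that a smaller value means a better overall rank; the direction of the ordering and the tie-breaking by gene name are assumptions, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Sketch: derive a single overall ranking from per-gene Q values. */
final class OverallRanking {

    /**
     * Returns gene identifiers ordered by ascending Q value
     * (assumption: smaller is better); ties are broken alphabetically.
     */
    static List<String> rankByQ(Map<String, Double> qByGene) {
        List<String> genes = new ArrayList<>(qByGene.keySet());
        genes.sort(Comparator
                .comparingDouble((String gene) -> qByGene.get(gene))
                .thenComparing(Comparator.naturalOrder()));
        return genes;
    }
}
```

The position of a gene in the returned list (starting from 1) is then its overall rank.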


GEPETTO (GEne Prioritization ExTended TOol)