Name | Modified | Size | Downloads / Week |
---|---|---|---|
Parent folder | |||
dgeee.tar.gz | 2018-07-21 | 18.8 kB | |
Totals: 1 Item | 18.8 kB | 0 |
Differential gene expression from exon expression
This is a python implementation of the model for the analysis of differential gene expression in correlated RNA-Seq data. It takes exon counts as an input. We do analysis one gene at a time (gene-by-gene).
Prerequisites
The software is written in python 3.6 with the usage of following packages: numpy, scipy, pandas, openpyxl, patsy, numdifftools.
The software was tested on Linux-based systems
Quick start
Should take about 10 minutes
First run
We will run an analysis of two genes from a matched-paired experiment.
After unpacking targz archive make sure that all the files in the folder named dgeee
are executable. Create an empty folder for the results of this analysis.
Open file name start.py
and enter the path to the toydata
folder for the variable named dataDIR
(line 20). Next, enter the path to the results folder which you've created for the variable ResDIR
.
In terminal enter dgeee
folder and execute start.py
script. For example, python start.py
.
First, let's make sure, that the software works correctly and then we will discuss data structure, input configuration and the output. Compare table named part_1.csv
which appeared in the results folder which you've created with the table under the same name with the folder toyresults
. The order of gene1 and gene2 lines may differ, but otherwise two tables should be nearly identical. If there are discrepancies, or you see error messages in the terminal which does not make sense to you then contact dzianis.kazakevich[at]uhasselt.be
Input data
The software takes as an input .csv
files with exon counts. It process all .csv files in the directory dataDIR
and ignores files of other types or subfolders.
Let's take a look at gene1 file. It is a simulated data set from matched-pair experiment. For the sake of concreteness, imagine we would like to know whether gene1 (which consists of 3 exons) is differentially expressed between patients with metastases and patients without metastases. According to some criteria patient are grouped into pairs. In each pair one patient has metastases and another one does not.
First, instead of trying to parse actual gene1 file data let's rewrite the data for pair 1 in wide format.
exon1 | exon2 | exon3 | exon1 | exon2 | exon3 | |
---|---|---|---|---|---|---|
26 | 61 | 165 | 76 | 205 | 525 |
Patient on the left does not have metastases, whereas patient on the right does; and we would like to know whether the difference between counts on the left and on the right is significant.
Three exon counts on the left (26, 61, 165) in the table above are correlated, because they come from the same gene. Three counts on the right (76, 205, 525) are of course correlated as well. All six counts in the table are correlated because they come from the same pair.
There are 12 pairs in this data set. In each pair one patient has metastases and another one is metastases-free.
Let's now reshape the data into software-friendly format.
pair | indiv | exon | count | meta |
---|---|---|---|---|
1 | 1 | exon1 | 26 | 0 |
1 | 1 | exon2 | 61 | 0 |
1 | 1 | exon3 | 165 | 0 |
1 | 2 | exon1 | 76 | 1 |
1 | 2 | exon2 | 205 | 1 |
1 | 2 | exon3 | 525 | 1 |
The table above is an actual excerpt from gene1.csv file for pair 1 with some columns omitted. The variable indiv indicates subject. The variable pair indicates cluster, to which individual subject is part of. The variables exon and count refer to exon name and exon count respectively. meta in this case is a single explanatory variable.
Putting aside for a moment correlations and offsets, the essence of the model is very simple. Just like in regression models the main research question is whether predictor (in this case meta) is statistically significant and if so, what is the magnitude and the direction of the relationship. Just like in regression models there can be multiple predictors and combinations of thereof.
Configuration file
Configuration parameters should be written directly into start.py
script. Most of the parameters are strings and, according to python syntax, should be written in quotes. Please, refer to Configuration section for more.
Output
We have applied the model to the two genes. The output is saved in the results folder (which should be created by user) in file part_1.csv
. It is a table with point estimates of the parameters and their standard errors for each gene which is in the input folder. The same estimates, held in a separate file each, are stored in the directory GBGresDIR
(stands for Gene By Gene results).
The main parameter of interest is in this case meta. In the table meta means point estimate for that parameter and SEmeta means standard error of this estimate. If confidence interval [meta +- 1.96*k*SEmeta] does not contain zero then this predictor is significant.
IMPORTANT: Our simulations revealed that standard errors for predictors are too small. Empirical inflation inflation factor k should be applied in calculating confidence intervals. we've chosen to use the following factor: (N/(N-p)) , where N is number of clusters and p is the number of fixed effects parameters. The following gross oversimplification should work in the most cases: p equals number of predictors plus 1.
In this case we have just one predictor meta, so p equals to 2; and inflation factor is equal to 12/(12-2)=12/10=1.2
In our case we see, that for gene1 meta is 1.03 and for gene 2 it is 0.93, whereas SEmeta is about 0.05 for both. So, in both case this predictor is significant.
Configuration
Please, contact dzianis.kazakevich[at]uhasselt.be for more information