| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| README | 2015-05-07 | 3.4 kB | |
| config.cfg | 2015-05-07 | 169 Bytes | |
| zgst.sh | 2015-05-07 | 30.4 kB | |
| Totals: 3 Items | | 34.0 kB | 0 |

PROGRAM: zgst.sh (Zed's Gene Set Test)
VERSION: 0.1
CONTACT: Mark Ziemann <mark.ziemann@gmail.com>
USAGE
./zgst.sh <-c configfile.config>
./zgst.sh <-d DGE.xls,GeneIDcol,FCcol,Pvalcol> <-g gmtfile.gmt> [-w] [-b] [-t N]
./zgst.sh -h
OPTIONS
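-c [file] config file
-d [file,int,int,int] DGE table, followed by the gene ID, fold change and p-value column numbers
-g [file] gene sets in GMT format
-w weighted mode
-b bootstrap mode (robust but slow)
-t [int] number of threads (default: nproc autodetect)
-h print usage information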
DESCRIPTION
Zed's Gene Set Test (zgst) identifies differentially expressed gene sets from a table of
differential gene expression data (DGE table) and a gene set matrix in GMT format. The output is
a table of gene sets with their aggregate differential expression score, p-value and
FDR-adjusted p-value. Detailed reports are generated for a small number of gene sets. zgst is
designed to be fast and easy to use (hopefully).
INPUTS
zgst requires either a config file, or both a GMT file and a DGE table. A config file can be
provided alongside a GMT file and DGE table, in which case command line arguments override the
corresponding config file settings.
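The override rule can be pictured with a short Python sketch. This is only an illustration, not how zgst.sh is implemented: the function name `resolve_settings` is hypothetical, and the mapping of the -d, -g and -t arguments onto the XLS, IDCOL, FCCOL, PCOL, GMT and MAXCPU keys is inferred from the USAGE line and the example config file.

```python
def resolve_settings(config, dge_arg=None, gmt_arg=None, threads=None):
    """Overlay command line values on config file values (command line wins).

    Hypothetical helper: the key names follow the example config file, and the
    mapping from command line options to keys is an assumption.
    """
    settings = dict(config)  # start from the config file values, if any
    if dge_arg is not None:
        # -d takes four comma-separated fields: file, ID col, FC col, p-value col
        xls, idcol, fccol, pcol = dge_arg.split(",")
        settings.update(XLS=xls, IDCOL=idcol, FCCOL=fccol, PCOL=pcol)
    if gmt_arg is not None:
        settings["GMT"] = gmt_arg
    if threads is not None:
        settings["MAXCPU"] = str(threads)
    # Either source is fine, but the DGE table and GMT file must come from somewhere
    missing = [k for k in ("XLS", "IDCOL", "FCCOL", "PCOL", "GMT") if k not in settings]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    return settings


config = {"XLS": "edgeR.xls", "IDCOL": "1", "FCCOL": "2", "PCOL": "4",
          "GMT": "Reactome.v4.0.symbols.gmt"}
merged = resolve_settings(config, dge_arg="edgeR.xls,1,3,5", gmt_arg="mSigDB.gmt")
print(merged["FCCOL"], merged["PCOL"], merged["GMT"])  # prints: 3 5 mSigDB.gmt
```

Settings that are only present in the config file (here IDCOL keeps its value, and any NPERMUT or MINGSSIZE entry would too) pass through unchanged.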
ALGORITHM
The differential expression score of each gene is based on its significance. zgst uses the raw
p-value rather than the adjusted p-value because the raw p-value has a smoother distribution.
zgst uses R to scale the DGE scores prior to gene set analysis. The sign of the fold change
indicates the direction of the differential expression. Duplicate gene IDs are a problem for
many analyses (for example in R), so only the entry with the most significant differential
expression is retained. zgst also expects a header to be present on the DGE file. To estimate a
p-value, zgst randomly selects the same number of genes as the gene set of interest from the
DGE table and determines how often the random set shows a greater or lower aggregate expression
score. Finally, false discovery rate correction of the p-values is performed using the
Benjamini & Hochberg procedure. By default, zgst uses the classic mode, in which only the
position of the gene in the ranking is considered rather than its differential expression
score. The weighted mode (-w option) uses the scaled DGE score. Small gene sets can sometimes
be considered significant because of one or a few very significant individual genes, so
bootstrapping (-b option) is recommended when using the weighted method. Bootstrapping is
performed by discarding 10% of the genes in the set prior to each permutation. This method
takes 10x longer to complete than the classic unbootstrapped procedure.
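The steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the code in zgst.sh: the score formula (-log10 of the raw p-value, signed by the fold change), the use of the mean as the aggregate, the "smaller tail plus one" p-value estimate and all function names are assumptions. The R scaling step and the bootstrap mode are left out.

```python
import math
import random


def gene_scores(rows):
    """rows: iterable of (gene_id, fold_change, raw_p). Returns {gene_id: score}.

    Assumed score: -log10(raw p) carrying the sign of the fold change.
    For duplicate gene IDs only the most significant row is kept.
    """
    best = {}
    for gene, fc, p in rows:
        if gene not in best or p < best[gene][1]:
            best[gene] = (fc, p)
    return {g: math.copysign(-math.log10(max(p, 1e-300)), fc)
            for g, (fc, p) in best.items()}


def to_ranks(scores):
    """Classic mode: replace each score by its rank (1 = lowest score)."""
    ordered = sorted(scores, key=scores.get)
    return {g: i + 1 for i, g in enumerate(ordered)}


def permutation_p(values, gene_set, n_permut=1000, rng=None):
    """Compare the mean value of a gene set with random sets of the same size."""
    rng = rng or random.Random()
    members = [values[g] for g in gene_set if g in values]
    if not members:
        raise ValueError("no gene of the set is present in the DGE table")
    observed = sum(members) / len(members)
    background = list(values.values())
    greater = lower = 0
    for _ in range(n_permut):
        sample = rng.sample(background, len(members))
        mean = sum(sample) / len(sample)
        greater += mean >= observed
        lower += mean <= observed
    # Smaller tail, with a +1 pseudocount so p is never exactly 0 (an assumption)
    return observed, (min(greater, lower) + 1) / (n_permut + 1)


def benjamini_hochberg(pvalues):
    """Benjamini & Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data: 100 genes with increasingly positive scores, one set of the top 10
scores = {"g%d" % i: float(i) for i in range(100)}
top_set = ["g%d" % i for i in range(90, 100)]
mean_rank, p = permutation_p(to_ranks(scores), top_set, rng=random.Random(1))
print(mean_rank, p, benjamini_hochberg([p, 0.5]))
```

Passing `to_ranks(scores)` corresponds to the classic mode; passing `scores` directly would correspond to the weighted mode.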
FORMATTING
zgst expects the DGE table to be a tab-delimited table of values. Quotation marks are OK.
zgst also expects a header to be present on the DGE table. Naturally, the gene identifiers in
the GMT file need to match those in the DGE table.
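A quick way to check that inputs meet these expectations is to parse them before running zgst. The Python sketch below is only a suggestion: the helper names `read_dge` and `read_gmt` are hypothetical, column numbers are assumed to be 1-based as in the -d argument, and the GMT layout (set name, description, then gene identifiers, all tab separated) is the standard GMT convention rather than something stated in this README.

```python
import csv
import io


def read_dge(handle, id_col, fc_col, p_col):
    """Read a tab-delimited DGE table with a header; columns are 1-based."""
    reader = csv.reader(handle, delimiter="\t")  # also strips double quotes
    next(reader)  # skip the header line
    rows = []
    for fields in reader:
        try:
            rows.append((fields[id_col - 1],
                         float(fields[fc_col - 1]),
                         float(fields[p_col - 1])))
        except (IndexError, ValueError):
            continue  # skip short lines and non-numeric values
    return rows


def read_gmt(handle):
    """Read a GMT file into {set_name: set of gene identifiers}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # need a name, a description and at least one gene
        gene_sets[fields[0]] = {f for f in fields[2:] if f}
    return gene_sets


dge_text = ('GeneID\tlogFC\tlogCPM\tPValue\n'
            '"TP53"\t1.5\t5.0\t0.001\n'
            'MYC\t-2.0\t6.1\t0.02\n')
gmt_text = "SET_A\texample description\tTP53\tMYC\tEGFR\n"

rows = read_dge(io.StringIO(dge_text), 1, 2, 4)
gene_sets = read_gmt(io.StringIO(gmt_text))
dge_ids = {gene for gene, _, _ in rows}
for name, genes in gene_sets.items():
    # Few or no matches usually means the identifier types differ
    print(name, len(genes & dge_ids), "of", len(genes), "genes found in DGE table")
```

With the toy data above, two of the three genes in SET_A are present in the DGE table.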
OUTPUTS
A table of average ranks and p-values for each gene set. A detailed report for a select number of
extreme gene sets. Enrichment plots and volcano plots for each gene set that has a detailed
report.
DEPENDENCIES
zgst requires R-base, GNU parallel, shuf and gnuplot.
EXAMPLE
./zgst.sh -c config.txt
./zgst.sh -d DESeq.xls,1,2,4 -g KEGG.gmt -t 4
./zgst.sh -d edgeR.xls,1,3,5 -g mSigDB.gmt -c config.txt
EXAMPLE CONFIG FILE
The following text shows the formatting of the config file.
XLS=edgeR.xls
IDCOL=1
FCCOL=2
PCOL=4
GMT=Reactome.v4.0.symbols.gmt
NPERMUT=1000
MINGSSIZE=10
NUMDETAILEDREPORTS=50
WEIGHTED=TRUE
BOOTSTRAP=TRUE
MAXCPU=4
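
For scripts that generate or validate such a file, the KEY=VALUE layout can be read with a few lines of Python. This is a sketch only: how zgst.sh itself reads the config file is not described here, the function name `parse_config` is hypothetical, and treating lines that start with # as comments is an assumption.

```python
INT_KEYS = {"IDCOL", "FCCOL", "PCOL", "NPERMUT", "MINGSSIZE",
            "NUMDETAILEDREPORTS", "MAXCPU"}
BOOL_KEYS = {"WEIGHTED", "BOOTSTRAP"}


def parse_config(text):
    """Parse KEY=VALUE lines into a dict with integers and booleans converted."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # comment syntax is an assumption
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        key, value = key.strip(), value.strip()
        if key in INT_KEYS:
            settings[key] = int(value)
        elif key in BOOL_KEYS:
            settings[key] = value.upper() == "TRUE"
        else:
            settings[key] = value
    return settings


EXAMPLE = """XLS=edgeR.xls
IDCOL=1
FCCOL=2
PCOL=4
GMT=Reactome.v4.0.symbols.gmt
NPERMUT=1000
MINGSSIZE=10
NUMDETAILEDREPORTS=50
WEIGHTED=TRUE
BOOTSTRAP=TRUE
MAXCPU=4
"""

cfg = parse_config(EXAMPLE)
print(cfg["XLS"], cfg["NPERMUT"], cfg["WEIGHTED"])  # prints: edgeR.xls 1000 True
```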