PROGRAM: zgst.sh (Zed's Gene Set Test)
VERSION: 0.1
CONTACT: Mark Ziemann <mark.ziemann@gmail.com>

    USAGE
	./zgst.sh <-c configfile.config>
	./zgst.sh <-d DGE.xls,GeneIDcol,FCcol,Pvalcol> <-g gmtfile.gmt> [-w] [-b] [-t N]
	./zgst.sh -h

    OPTIONS
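	-c	 config file
	-d	 DGE table and column numbers (DGE.xls,GeneIDcol,FCcol,Pvalcol)
	-g	 gene set file in GMT format
	-w	 weighted mode
	-b	 bootstrap mode (robust but slow)
	-t [int] number of threads (default: nproc autodetect)
	-h	 show help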

    DESCRIPTION
	Zed's Gene Set Test (zgst) identifies differentially expressed gene sets from a table of
	differential gene expression data (DGE table) and a gene set matrix in GMT format. The output
	is a table of gene sets with their aggregate differential expression score, p-value and
	FDR-adjusted p-value. Detailed reports are generated for a small number of gene sets. zgst is
	designed to be fast and easy to use (hopefully).

    INPUTS
	zgst requires either a config file, or a GMT file and a DGE table. A config file can be
	provided alongside a GMT file and a DGE table, in which case command line arguments override
	the config file settings.
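
	The override rule can be illustrated with a small Python sketch. It is not taken from
	zgst.sh: the function names are hypothetical, '#' comment handling is an assumption, and
	the mapping of the -d fields onto the XLS, IDCOL, FCCOL and PCOL keys is inferred from the
	example config file.

```python
def parse_config(text):
    """Parse KEY=VALUE lines; blank lines and '#' comments are skipped (assumption)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def merge_settings(config, cli):
    """Start from the config file and let every command line value that was given win."""
    merged = dict(config)
    merged.update({k: v for k, v in cli.items() if v is not None})
    return merged


config_text = """XLS=edgeR.xls
IDCOL=1
FCCOL=2
PCOL=4
GMT=Reactome.v4.0.symbols.gmt
MAXCPU=4
"""

# Roughly what "-d edgeR.xls,1,3,5 -g mSigDB.gmt -c config.txt" would supply;
# None marks an option that was not given on the command line.
cli = {"XLS": "edgeR.xls", "IDCOL": "1", "FCCOL": "3", "PCOL": "5",
       "GMT": "mSigDB.gmt", "MAXCPU": None}

settings = merge_settings(parse_config(config_text), cli)
print(settings["GMT"], settings["FCCOL"], settings["PCOL"], settings["MAXCPU"])
```

	Here the GMT file and column numbers come from the command line, while MAXCPU falls back
	to the config file value.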

    ALGORITHM
	The differential expression score of each gene is based on its significance. zgst uses the raw
	p-value rather than the adjusted p-value because the raw p-value has a smoother distribution.
	The sign of the fold change indicates the direction of the differential expression. zgst uses R
	to scale the DGE scores prior to gene set analysis. Duplicate gene IDs are a problem for many
	analyses (for example in R), so only the member with the most significant differential
	expression is retained.

	To estimate a p-value, zgst randomly selects the same number of genes as the gene set of
	interest from the DGE table and determines how often the random set shows a greater or lower
	aggregate expression score. Finally, false discovery rate correction of p-values is conducted
	using the Benjamini & Hochberg procedure.

	By default, zgst uses the classic mode, in which only the position of the gene in the rank is
	considered rather than the differential score. The weighted mode (-w option) uses the scaled
	DGE score. Small gene sets can sometimes be considered significant because of one or a few
	very significant individual genes, so bootstrapping (-b option) is recommended when using the
	weighted method. Bootstrapping is performed by discarding 10% of the genes in the set prior to
	each permutation. This method takes 10x longer to complete than the classic unbootstrapped
	procedure.
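
	The classic mode can be sketched in Python as follows. This is an illustration under stated
	assumptions, not the zgst.sh implementation: the score is assumed to be -log10(raw p-value)
	signed by the fold change, the aggregate is assumed to be the mean rank, the test is assumed
	to be two-sided around the background mean, and one is added to the numerator and
	denominator to avoid p = 0. All function names are hypothetical. The R scaling step and the
	bootstrap mode are left out.

```python
import math
import random


def signed_scores(rows):
    """rows: (gene, fold_change, p_value) tuples -> {gene: signed score}."""
    best = {}
    for gene, fc, p in rows:
        # Duplicate gene IDs: keep only the most significant entry.
        if gene not in best or p < best[gene][1]:
            best[gene] = (fc, p)
    # Assumed score: -log10(raw p), signed by the direction of the fold change.
    return {g: math.copysign(-math.log10(max(p, 1e-300)), fc)
            for g, (fc, p) in best.items()}


def to_ranks(scores):
    """Classic mode: replace each score by its rank (1 = lowest score)."""
    ordered = sorted(scores, key=scores.get)
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


def permutation_pvalue(values, gene_set, n_permut=1000, rng=random):
    """Return (mean value of the set, fraction of random sets at least as extreme)."""
    members = [g for g in gene_set if g in values]
    if not members:
        raise ValueError("no gene of the set is present in the DGE table")
    background = list(values.values())
    center = sum(background) / len(background)
    observed = sum(values[g] for g in members) / len(members)
    hits = 0
    for _ in range(n_permut):
        sample = rng.sample(background, len(members))
        if abs(sum(sample) / len(sample) - center) >= abs(observed - center):
            hits += 1
    return observed, (hits + 1) / (n_permut + 1)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy DGE table: 200 genes, alternating up and down, plus one duplicate ID.
rows = [("g%d" % i, 1.0 if i % 2 == 0 else -1.0, (i // 2 + 1) / 101)
        for i in range(200)]
rows.append(("g0", 1.0, 0.5))  # less significant duplicate, so it is ignored

ranks = to_ranks(signed_scores(rows))
gene_sets = {
    "UP": ["g%d" % i for i in range(0, 20, 2)],  # ten strongest up-regulated genes
    "MIXED": ["g%d" % i for i in range(10)],     # five up and five down
}
rng = random.Random(1)
results = {name: permutation_pvalue(ranks, genes, 1000, rng)
           for name, genes in gene_sets.items()}
fdr = benjamini_hochberg([p for _, p in results.values()])
for (name, (mean_rank, p)), q in zip(results.items(), fdr):
    print(name, mean_rank, p, q)
```

	Passing the signed scores instead of the ranks to permutation_pvalue would give a rough
	analogue of the weighted mode, again without the scaling that zgst performs in R.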

    FORMATTING
	zgst expects the DGE table to be a tab-delimited table of values. Quotation marks are OK.
	zgst also expects a header to be present on the DGE table. The gene identifiers in the GMT
	file need to match those in the DGE table.
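
	A quick way to check both files before a run is sketched below in Python. The helper names
	are hypothetical, column numbers are assumed to be 1-based as in the -d option, and rows
	with non-numeric values such as NA are skipped as an assumption. The GMT layout (set name,
	description, then one gene per tab-separated field) is the standard one.

```python
import csv
import io


def read_dge(handle, id_col, fc_col, p_col):
    """Yield (gene, fold_change, p_value) from a tab-delimited table with a header."""
    reader = csv.reader(handle, delimiter="\t", quotechar='"')
    next(reader, None)  # skip the header row
    for fields in reader:
        try:
            yield (fields[id_col - 1],
                   float(fields[fc_col - 1]),
                   float(fields[p_col - 1]))
        except (IndexError, ValueError):
            continue  # assumption: skip short rows and non-numeric values such as NA


def read_gmt(handle):
    """Return {set_name: [gene, ...]} from GMT lines: name, description, genes."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            gene_sets[fields[0]] = [g for g in fields[2:] if g]
    return gene_sets


# Small in-memory stand-ins for a DGE table and a GMT file.
dge_text = ('GeneID\tlogFC\tlogCPM\tPValue\n'
            '"TP53"\t1.5\t7.1\t0.001\n'
            '"MYC"\t-2.0\t6.3\t0.04\n'
            '"GAPDH"\tNA\t9.0\tNA\n')
gmt_text = "SET_A\thttp://example.org/set_a\tTP53\tMYC\tEGFR\n"

rows = list(read_dge(io.StringIO(dge_text), id_col=1, fc_col=2, p_col=4))
gene_sets = read_gmt(io.StringIO(gmt_text))

# Report set members whose identifiers do not appear in the DGE table.
known = {gene for gene, _, _ in rows}
for name, genes in gene_sets.items():
    missing = [g for g in genes if g not in known]
    print(name, "genes absent from the DGE table:", missing)
```

	A large share of absent genes usually means the two files use different identifier types,
	for example gene symbols in one and Ensembl IDs in the other.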


    OUTPUTS
	A table of average ranks and p-values for each gene set. A detailed report for a select number
	of extreme gene sets. Enrichment plots and volcano plots for each gene set that has a detailed
	report.

    DEPENDENCIES
	zgst requires R-base, GNU parallel, shuf and gnuplot.
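
	Whether these tools are on the PATH can be checked with a few lines of Python. The
	executable names are assumptions: R-base is assumed to provide Rscript and GNU parallel to
	be installed as parallel. zgst.sh may call them differently.

```python
import shutil


def missing_tools(tools):
    """Return the executables from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed executable names for R-base, GNU parallel, shuf and gnuplot.
REQUIRED = ["Rscript", "parallel", "shuf", "gnuplot"]

absent = missing_tools(REQUIRED)
if absent:
    print("Missing dependencies:", ", ".join(absent))
else:
    print("All dependencies found")
```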

    EXAMPLE
	./zgst.sh -c config.txt
	./zgst.sh -d DESeq.xls,1,2,4 -g KEGG.gmt -t 4
	./zgst.sh -d edgeR.xls,1,3,5 -g mSigDB.gmt -c config.txt

    EXAMPLE CONFIG FILE
	The following text shows the formatting of the config file.

XLS=edgeR.xls
IDCOL=1
FCCOL=2
PCOL=4
GMT=Reactome.v4.0.symbols.gmt
NPERMUT=1000
MINGSSIZE=10
NUMDETAILEDREPORTS=50
WEIGHTED=TRUE
BOOTSTRAP=TRUE
MAXCPU=4

Source: README, updated 2015-05-07