Name                                   Modified     Size
EncodeGeneSets_PosterPresentation.pdf  2017-04-03   2.0 MB
analysis_and_scripts_v0.2.zip          2017-04-03   43.8 MB
README.txt                             2017-04-03   2.7 kB
gmt_v0.2.zip                           2017-04-03   36.7 MB
analysis_and_scripts_v0.1.zip          2017-02-10   126.5 MB
gmt_v0.1.zip                           2017-02-10   36.0 MB
Totals: 6 Items                                     245.0 MB

The ENCODE Gene Set Hub	v0.2	2017-04-03	Mark.ziemann@gmail.com

Copyright 2017 Mark Ziemann.

This repository of data files and code is distributed under the terms 
of the GNU General Public License version 3 (GPLv3).

If you use this in your academic publications, please cite the 
following:
The ENCODE Gene Set Hub. Ziemann M, Kaspi A, Rafehi H and El-Osta A. 
2017 Lorne Genome Conference. DOI: 10.13140/RG.2.2.34302.59208

The project aims to collate and curate the vast amounts of ENCODE 
profiling data into gene sets. These gene sets can be useful in 
performing pathway level analyses such as GSEA, CAMERA and many 
others. 

This repo is broken down into two parts: gmt and analysis_and_scripts.

GMT
This is the folder with the current version of the published gmt 
files. These are ready to use with GSEA or other pathway analysis 
tools.
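
As a quick sanity check before loading a gmt file into a pathway 
analysis tool, the sketch below counts the genes in each set. It 
assumes the standard GMT layout (tab separated, one set per line: set 
name, description, then the genes). The function name gmt_set_sizes 
and the two demo sets are made up for illustration and are not part 
of this repo.

```shell
# Print "<set name><TAB><number of genes>" for every line of a GMT file.
# Assumes column 1 = set name, column 2 = description, columns 3+ = genes.
gmt_set_sizes() {
  awk -F'\t' '{ printf "%s\t%d\n", $1, NF - 2 }' "$1"
}

# Hypothetical two-set file, for illustration only
demo=$(mktemp)
printf 'SET_A\tna\tGENE1\tGENE2\tGENE3\nSET_B\tna\tGENE2\n' > "$demo"
gmt_set_sizes "$demo"
rm -f "$demo"
```

To check a real file, pass its path instead of the demo file.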

Analysis and scripts
This is the nitty gritty. TFBS, DHS and histone data for mouse and 
human each have separate directories with this structure:

humanTFBS/
├── humanTFBSgmt_v0.1.sh
└── subtract_gmts
    ├── gmt_subtract.sh
    ├── gsea
    └── gsea_enh

The humanTFBSgmt_v0.1.sh shell script generates the gmt files 
according to two parameters set in the script: SIZE_RANGE (the max 
allowed distance between a peak and a regulatory landmark such as a 
TSS or enhancer/distal regulatory element) and NUMRANGE (the max 
number of genes allowed in the set). This script was written to work 
on Ubuntu 16.04 LTS and requires bedtools, liftOver and GNU parallel. 
It will download the necessary data from ENCODE, Ensembl and other 
places. To run it as described in the poster: ./humanTFBSgmt_v0.1.sh
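
Since the script depends on three external tools, it may be worth 
checking that they are on the PATH before starting a long run. The 
following is a minimal sketch and not part of the repo; the function 
name missing_tools is made up for illustration.

```shell
# Print the name of every listed tool that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tools named in this README as requirements
missing=$(missing_tools bedtools liftOver parallel)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found"
fi
```

To see where the two parameters are set before editing them, a plain 
grep on the names given above should be enough: 
grep -n -E 'SIZE_RANGE|NUMRANGE' humanTFBSgmt_v0.1.sh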

The subtract_gmts folder contains a script called gmt_subtract.sh 
which enables subtraction of gene sets. This is useful for 
benchmarking and optimising parameters for gene selection. In this 
case the peak distance to regulatory landmark and the number of genes 
in the set were optimised. To get it to work, copy or link gmt files 
into the subtract_gmts folder and run the script: ./gmt_subtract.sh
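
This README does not spell out how the script defines subtraction. 
One plausible reading, shown here purely as an illustration and NOT 
as the script's actual logic, is to remove from each set in file A 
any gene listed under the same set name in file B. The function name 
gmt_minus and both file arguments are hypothetical, and the standard 
tab separated GMT layout is assumed.

```shell
# Illustrative only: this is an assumed definition of "subtraction",
# not a copy of what gmt_subtract.sh does.
# Usage: gmt_minus A.gmt B.gmt  -> prints A with B's genes removed per set
gmt_minus() {
  awk -F'\t' -v OFS='\t' -v b="$2" '
    BEGIN {
      # Read file B first and remember each (set name, gene) pair
      while ((getline row < b) > 0) {
        n = split(row, f, "\t")
        for (i = 3; i <= n; i++) drop[f[1], f[i]] = 1
      }
      close(b)
    }
    {
      # For each set in A, keep name and description, then every gene
      # that was not listed under the same set name in B
      line = $1 OFS $2
      for (i = 3; i <= NF; i++)
        if (!(($1, $i) in drop)) line = line OFS $i
      print line
    }
  ' "$1"
}
```

A set that loses all of its genes is still printed with its name and 
description, so the number of sets in the output matches file A.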

Within the subtract_gmts folder there are two additional folders, 
gsea and gsea_enh, which contain the TSS-centric and enhancer-centric 
GSEA results respectively. Reports were extracted from the bulky GSEA 
output folders and the other data was deleted. Each XLS file is a 
summary of a GSEA run. GSEA was run with 3 different preranked 
RNA-seq profiles (files with the .rnk suffix). If you want to 
replicate my work, follow these steps:

ln *test2.gmt gsea
cd gsea
# replace the first hyphen in each gmt filename with a plus sign
rename 's/\-/+/' *gmt
# run gsea. It will take a while
./run_gsea.sh
# generate report for each GSEA run
./gsea_parse.sh
# collect, aggregate and summarise GSEA data
./eval_gmt.sh
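
The rename expression 's/\-/+/' has no g flag, so it changes only the 
first hyphen in each filename to a plus sign. The sketch below 
applies the same substitution with sed and only prints the result, 
which can be handy for previewing the new names before renaming 
anything. The function name hyphen_to_plus is made up for 
illustration and is not part of the repo.

```shell
# Replace the first hyphen in a name with "+", mirroring s/\-/+/ (no g flag).
hyphen_to_plus() {
  printf '%s\n' "$1" | sed 's/-/+/'
}

# Preview the new name for every gmt file in the current directory
for f in *gmt; do
  [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
  echo "$f -> $(hyphen_to_plus "$f")"
done
```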
Source: README.txt, updated 2017-04-03