Name | Modified | Size | Downloads / Week |
---|---|---|---|
EncodeGeneSets_PosterPresentation.pdf | 2017-04-03 | 2.0 MB | |
analysis_and_scripts_v0.2.zip | 2017-04-03 | 43.8 MB | |
README.txt | 2017-04-03 | 2.7 kB | |
gmt_v0.2.zip | 2017-04-03 | 36.7 MB | |
analysis_and_scripts_v0.1.zip | 2017-02-10 | 126.5 MB | |
gmt_v0.1.zip | 2017-02-10 | 36.0 MB | |
Totals: 6 Items | | 245.0 MB | 0 |

The ENCODE Gene Set Hub v0.2
2017-04-03
Mark.ziemann@gmail.com

Copyright 2017 Mark Ziemann. This repository of data files and code is distributed under the terms of the GNU General Public License version 3 (GPLv3).

If you use this in your academic publications, please cite the following:
The ENCODE Gene Set Hub. Ziemann M, Kaspi A, Rafehi H and El-Osta A. 2017 Lorne Genome Conference. DOI: 10.13140/RG.2.2.34302.59208

The project aims to collate and curate the vast amounts of ENCODE profiling data into gene sets. These gene sets can be useful in performing pathway level analyses such as GSEA, CAMERA and many others.

This repo is broken down into two parts: gmt and analysis_and_scripts.

GMT
This is the folder with the current version of published gmt files. These are ready to use with GSEA or other pathway analysis tools.

Analysis and scripts
This is the nitty gritty. TFBS, DHS and histone data for mouse and human each have separate directories with this structure:

    humanTFBS/
    ├── humanTFBSgmt_v0.1.sh
    └── subtract_gmts
        ├── gmt_subtract.sh
        ├── gsea
        └── gsea_enh

The humanTFBSgmt_v0.1.sh shell script generates the gmt files according to two parameters set in the script: SIZE_RANGE (the maximum allowed distance between a peak and a regulatory landmark such as a TSS or enhancer/distal regulatory element) and NUMRANGE (the maximum number of genes allowed in a set). This script was written to work on Ubuntu 16.04 LTS and requires bedtools, liftOver and GNU parallel. It will download necessary data from ENCODE, Ensembl and other places. Run it like this to reproduce the results described in the poster:

    ./humanTFBSgmt_v0.1.sh

The subtract_gmts folder contains a script called gmt_subtract.sh, which enables subtraction of gene sets. This is useful for benchmarking and optimising parameters for gene selection. In this case the peak distance to regulatory landmark and the number of genes in the set were optimised.
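
As a rough illustration of what subtracting one gene set from another can mean, the sketch below removes from each set in one gmt file the genes listed under the same set name in a second gmt file. It is not the repository's gmt_subtract.sh, whose internal logic is not described in this README. The function name subtract_gmt and the match-by-set-name rule are assumptions made for the example. It relies only on the standard GMT layout: one set per line, tab-separated, with the set name, a description, then the genes.

```shell
#!/bin/sh
# Hypothetical helper, NOT the repository's gmt_subtract.sh.
# Usage: subtract_gmt A.gmt B.gmt > A_minus_B.gmt
# For every set in A, drop the genes that B lists under the same set name.
subtract_gmt() {
  awk -F'\t' -v OFS='\t' '
    # First file on the command line (B): remember each (set name, gene) pair.
    FILENAME == ARGV[1] {
      for (i = 3; i <= NF; i++) drop[$1, $i] = 1
      next
    }
    # Second file (A): rebuild each line from the genes B does not list.
    {
      line = $1 OFS $2
      kept = 0
      for (i = 3; i <= NF; i++) {
        if (!(($1, $i) in drop)) {
          line = line OFS $i
          kept++
        }
      }
      # Sets left with no genes are omitted from the output.
      if (kept > 0) print line
    }
  ' "$2" "$1"
}
```

With hypothetical file names, `subtract_gmt top1000.gmt top500.gmt > extra500.gmt` would keep only the genes that appear in the larger sets but not in the smaller ones, which is the kind of comparison that helps when tuning a set-size limit.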
To run gmt_subtract.sh, copy or link gmt files into the subtract_gmts folder and run the script:

    ./gmt_subtract.sh

Within the subtract_gmts folder there are two additional folders, gsea and gsea_enh, which contain the TSS-centric and enhancer-centric GSEA results respectively. Reports were extracted from the bulky GSEA output folders and the other data was deleted. Each XLS file is a summary of a GSEA run. GSEA was run with three different preranked RNA-seq profiles (files with the .rnk suffix). If you want to replicate my work, follow these steps:

    ln *test2.gmt gsea
    cd gsea
    # replace the first hyphen in each gmt filename with a plus sign
    rename 's/\-/+/' *gmt
    # run gsea. it will take a while
    ./run_gsea.sh
    # generate report for each GSEA run
    ./gsea_parse.sh
    # collect, aggregate and summarise GSEA data
    ./eval_gmt.sh
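
The rename step in the replication commands uses the Perl-expression form of rename, as shipped on Debian and Ubuntu. On systems where rename takes different arguments, a loop like the following sketch should have the same effect: it replaces the first hyphen in each filename ending in gmt with a plus sign. The function name rename_first_hyphen is made up for this example and is not part of the repository.

```shell
#!/bin/sh
# Hypothetical stand-in for: rename 's/\-/+/' *gmt
# Usage: rename_first_hyphen [directory]   (defaults to the current directory)
rename_first_hyphen() {
  dir=${1:-.}
  for f in "$dir"/*gmt; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    base=${f##*/}
    # sed without the g flag changes only the first hyphen, like the rename call.
    new=$(printf '%s\n' "$base" | sed 's/-/+/')
    # Nothing to do for names without a hyphen.
    [ "$base" = "$new" ] && continue
    mv -- "$f" "$dir/$new"
  done
}
```

Run inside the gsea folder, `rename_first_hyphen .` would take the place of the rename line; the remaining steps are unchanged.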