README for DHC-MEGE 31-1-2013

DHC-MEGE is a program for the identification of enriched motifs in genes found to
be differentially expressed in a microarray or RNA-seq experiment. What sets this
program apart from others is that it utilises a DNase hypersensitivity connectivity
map which allows for the identification of cis elements which may be hundreds of
kilobases away, but could be acting as enhancers/silencers/etc.

Starting from the gene lists, DHC-MEGE runs HOMER motif finding, then records the
occurrence of each identified motif in other DH regions across the genome and writes
the result as a GMT file. The GMT file can then be used in GSEA.

Program Inputs
You need a list of upregulated genes, a list of downregulated genes, a DNase
hypersensitivity connectivity map (BED8 format) and the genome sequence (FASTA
format). These need to be specified in a config file. See the section on the config
file below.
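
Before a long run it can help to confirm that the connectivity map really has eight
fields on every line. The following is a minimal sketch, not part of DHC-MEGE; the
function name check_bed8 is hypothetical, and it assumes the map has no "track" or
"browser" header lines.

```shell
# Hypothetical pre-flight check, not part of DHC-MEGE itself.
# Prints every line of a BED file that does not have exactly 8
# whitespace-separated fields and returns non-zero if any were found.
check_bed8() {
    awk 'NF != 8 { bad = 1; printf "line %d has %d fields\n", NR, NF }
         END { exit bad + 0 }' "$1"
}
```

For example, check_bed8 /path/to/DHC-MAP.bed prints nothing and returns 0 when every
line has eight fields.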

Running the program
Run the program as you would any Unix shell script: make the script executable with
chmod, then execute it, specifying the config file as the first argument. Here is an
example:
chmod +x DHC-MEGE.sh
./DHC-MEGE.sh /path/to/config.cfg

Program Outputs
The program will create a results directory with a unique timestamp and deposit all
results there. The main output is the GMT file, a collection of gene lists which
details the occurrence of sequence motifs across the hypersensitive regions of the
genome. HOMER generates reports for each of the identified motifs and also makes
sequence logos if the software is correctly configured.

How can I use the output file?
GMT files can be used in gene set enrichment analysis (GSEA) of the array/mRNA-seq
experiment. Check out the Broad Institute website for more info on GSEA.
http://www.broadinstitute.org/gsea/msigdb/index.jsp
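
A GMT file is tab-delimited with one gene set per line: the set name, a description,
then the member genes. As a quick sanity check before loading the file into GSEA, the
sketch below lists each set with its number of genes. The function name gmt_set_sizes
is hypothetical and not part of DHC-MEGE.

```shell
# Hypothetical helper, not part of DHC-MEGE itself.
# For each line of a GMT file, print the gene set name and the number
# of genes (all tab-separated fields after name and description).
gmt_set_sizes() {
    awk -F '\t' '{ printf "%s\t%d\n", $1, NF - 2 }' "$1"
}
```

No set should report more genes than the GENESETSIZE given in the config file.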

Need more help/information?
Our article has been accepted in Bioinformation: “Motif analysis in DNAse
hypersensitivity regions uncovers distal cis-elements associated with gene
expression”. It shows an example of DHC-MEGE in action.

The configuration file
Here is what an example config file looks like:
#####################################################################################
##CONFIG FILE FOR DHC-MEGE PIPELINE

##ENTER THE CORRELATION THRESHOLD
CORRTHRESH=0.8

##ENTER THE DESIRED SIZE OF GENESETS
GENESETSIZE=1000

##ENTER SIMILARITY THRESHOLD
SIMTHRESHOLD=10

##NUMBER CPUS
NR_CPUS=4

#ENTER THE LIST OF UPREGULATED GENES
UPGENELIST=/path/to/UPgenes.txt

#ENTER THE LIST OF DOWNREGULATED GENES
DOWNGENELIST=/path/to/DOWNgenes.txt

#PROVIDE AN IDENTIFICATION FOR THE EXPERIMENT
RUNNAME=ExperimentName

#PROVIDE THE PATH TO THE Genome
GENOME=/path/to/genome/sequence.fa

#PROVIDE THE PATH TO THE DHC-MAP
#HUMAN DHC MAP CAN BE OBTAINED FROM ENCODE
DHCMAP=/path/to/DHC-MAP.bed
#####################################################################################
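
The example above uses plain KEY=value lines with # comments, so a POSIX shell can
read it directly. The sketch below loads a config file that way and checks that the
four input files exist. Whether DHC-MEGE.sh itself reads the config like this is an
assumption, and the function name check_config is hypothetical.

```shell
# Hypothetical pre-flight check, not part of DHC-MEGE itself.
# Assumes the config file is valid shell (KEY=value lines, # comments).
# The body runs in a subshell so the config variables do not leak out.
check_config() (
    cfg=$1
    [ -r "$cfg" ] || { echo "cannot read config: $cfg" >&2; exit 1; }
    # "." searches PATH for names without a slash, so force a path.
    case $cfg in */*) ;; *) cfg=./$cfg ;; esac
    . "$cfg"
    status=0
    for f in "${UPGENELIST:-}" "${DOWNGENELIST:-}" "${GENOME:-}" "${DHCMAP:-}"; do
        [ -r "$f" ] || { echo "missing or unreadable: $f" >&2; status=1; }
    done
    exit "$status"
)
```

For example, check_config /path/to/config.cfg returns 0 only when all four paths
point to readable files.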

Notes on the parameters in the configuration file

CORRTHRESH is the minimum correlation coefficient required between the proximal and
distal hypersensitivity regions. The more correlated they are, the more influential
the distal element is for gene expression. The minimum CORRTHRESH that ENCODE use as
significant is 0.7. In our paper, we use 0.8.
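
To illustrate what a correlation threshold does to the map, the sketch below keeps
only the lines whose correlation is at or above a cutoff. It is not taken from
DHC-MEGE: the function name filter_by_corr is hypothetical, the correlation is
assumed to be in column 8 of the BED8 file, and the use of "at or above" rather than
"strictly above" is also an assumption.

```shell
# Hypothetical illustration, not part of DHC-MEGE itself.
# $1 = DHC map, $2 = correlation threshold,
# $3 = column holding the correlation (ASSUMED to be 8 if omitted).
filter_by_corr() {
    awk -v t="$2" -v c="${3:-8}" '$c + 0 >= t + 0' "$1"
}
```

For example, filter_by_corr /path/to/DHC-MAP.bed 0.8 | wc -l shows how many
proximal/distal pairs survive a threshold of 0.8.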

GENESETSIZE is the maximum size of the gene sets which will be output. This is an
important parameter because if we don't limit the gene set size, ubiquitous sequence
motifs may produce gene sets containing nearly all genes, which is not informative
when it comes to GSEA. We recommend using a limit of 1000. This limit ensures that
only the top 1000 sequence matches are kept in each gene set.

SIMTHRESHOLD is the similarity threshold for inclusion of genes into the GMT. It is
the log odds value computed by HOMER and is used to sort identified motif instances.
We have used a threshold of 10.
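
The way GENESETSIZE and SIMTHRESHOLD work together can be sketched as follows: drop
motif instances scoring below the threshold, sort the rest by score, and keep the
top entries. This is an illustration only, not the DHC-MEGE code. The function name
top_hits and the two-column input (gene name, tab, score) are hypothetical.

```shell
# Hypothetical illustration, not part of DHC-MEGE itself.
# $1 = file with "gene<TAB>score" lines (ASSUMED layout),
# $2 = minimum score (SIMTHRESHOLD), $3 = maximum set size (GENESETSIZE).
top_hits() {
    awk -F '\t' -v s="$2" '$2 + 0 >= s + 0' "$1" |
        sort -t "$(printf '\t')" -k2,2nr |
        head -n "$3" |
        cut -f1
}
```

For example, top_hits scores.txt 10 1000 prints at most 1000 gene names, highest
score first.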

NR_CPUS is the number of parallel threads to run. The program may use a couple more or less at any point in time depending on the available resources. We will be working to refine this behaviour in future versions.

UPGENELIST/DOWNGENELIST these files need to contain a list of genes, one on each new
line. Please ensure that the gene names used EXACTLY match those in the DHC map,
otherwise you could be missing a large number of motifs!
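
One way to catch naming mismatches early is to list the genes that never appear in
the map. The sketch below is not part of DHC-MEGE: the function name missing_genes is
hypothetical, and the gene name is assumed to be in column 4 of the DHC map, so check
your own map and pass a different column number if needed.

```shell
# Hypothetical pre-flight check, not part of DHC-MEGE itself.
# $1 = gene list (one name per line), $2 = DHC map,
# $3 = map column holding the gene name (ASSUMED to be 4 if omitted).
# Prints every gene in the list with no exact, case-sensitive match.
missing_genes() {
    awk -v c="${3:-4}" '
        FILENAME == ARGV[1] { seen[$c] = 1; next }
        NF && !($1 in seen) { print $1 }
    ' "$2" "$1"
}
```

For example, missing_genes /path/to/UPgenes.txt /path/to/DHC-MAP.bed prints nothing
when every gene in the list is present in the map.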

RUNNAME if you like, you can give your experiment a name for identification of the 
results directory.

GENOME during the motif searching process, Homer will sample bed regions from the
genes selected and needs to extract the actual sequence from the fasta file.

DHCMAP the path to the DHC map. The ENCODE DHC map for human is available from the 
EBI website. It has an 8 field bed format. Please see the ENCODE DNase
hypersensitivity analysis paper supplementary data (PMID: 22955617) for the link.

#####################################################################################
# This software is free to distribute and use under the GPL license.
# If you use this software in your academic work please cite our article in
# Bioinformation - Ziemann et al, 2012.
# Please report bugs and give suggestions for future improvement.
# mark.ziemann@gmail.com
#####################################################################################
Source: README.txt, updated 2013-01-31