goSTAG: Gene Ontology Subtrees to Tag and Annotate Genes within a set
Developed by Brian Bennett and Pierre Bushel
National Institutes of Health
National Institute of Environmental Health Sciences
RTP, NC
Report bugs, corrections and suggestions to:
brian.bennett@nih.gov
bushel@niehs.nih.gov
Public Domain Notice:
This is U.S. government work. Under 17 U.S.C. 105 no copyright is claimed and it may be freely distributed and copied.
goSTAG is a platform-independent R script that runs on any machine with Rscript installed. It has been tested
in Linux and Windows OS environments with R version 3.0.0.
The software is available through SourceForge: http://gostag.sourceforge.net
Unzip the goSTAG folder to a directory of your choice.
The folder contains the following files and folders:
The bin folder:
goSTAG_vXXXX.R: The R source code for goSTAG
Sample_data folder:
myTopo_Oxali_DEGs.gmt: GMT file containing RefSeq gene symbols of the differentially expressed genes (DEGs) from the Davis et al., 2015 publication.
GO_genes_rat.gmt: GMT file of rat genes associated with Gene Ontology (GO) terms
GO_ontology.gmt: GMT file of the GO terms and their relationships in the hierarchical structure
Sample_output:
Topo_Oxali_heatmap_min_5_pval_0.05_corr_0.9_subtree_30_GO_bps.png: png image file with heat map and labeling of clusters
Topo_Oxali_heatmap_min_5_pval_0.05_corr_0.9_subtree_30_GO_bps.gmt: output GMT file with the goSTAG clusters and their GO terms
README.txt: This readme file
Usage:
Rscript goSTAG.R [options] --gene_lists <GMT_file> --go_genes <GMT_file> --ontology <GMT_file> --out_heatmap <PNG_file>
goSTAG uses GMT files, which have the following format:
1. Each line in the GMT file corresponds to a list
2. The entries on each line are tab-delimited, and lines may contain different numbers of entries
3. The first entry is the list name
4. The second entry is the list description (usually ignored, but still must be present)
5. The subsequent entries are the items in the list
See here for specification of the GMT format: www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
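The five rules above can be sketched as a small parser. This is only an illustration in Python, not the code goSTAG uses (goSTAG itself is an R script). The function names read_gmt and demo_read_gmt, the list names, and the gene symbols in the demo are hypothetical. Skipping blank lines and dropping empty trailing entries are assumptions rather than part of the format description.

```python
import os
import tempfile


def read_gmt(path):
    """Parse a GMT file into {list_name: (description, [items])}."""
    lists = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue  # assumption: blank lines are skipped
            fields = line.split("\t")
            if len(fields) < 2:
                # Rules 3 and 4: a name and a description must both be present.
                raise ValueError("GMT line needs a name and a description: %r" % line)
            name, description = fields[0], fields[1]
            # Rule 5: everything after the description is an item in the list.
            # Assumption: empty entries left by stray tabs are dropped.
            items = [item for item in fields[2:] if item]
            lists[name] = (description, items)
    return lists


def demo_read_gmt():
    """Write a two-line hypothetical GMT file, parse it, and clean up."""
    content = "listA\tfirst list\tTp53\tMyc\tEgfr\nlistB\tna\tBrca1\n"
    with tempfile.NamedTemporaryFile("w", suffix=".gmt", delete=False) as tmp:
        tmp.write(content)
    try:
        return read_gmt(tmp.name)
    finally:
        os.remove(tmp.name)
```

Calling demo_read_gmt() returns one entry per line of the file, which shows that lines of different lengths are handled the same way.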
Required arguments:
--gene_lists <GMT_file> or --gene_lists_dir <directory>
Provide either a single GMT file or a directory of gene list files. If using a GMT file, each line is a
gene list. The first entry is the gene list name, the second entry is ignored by the software (but still must be
present), and the subsequent entries are the gene symbols of the genes in the gene list. If using a directory, each
file in the directory is a gene list. By default, these files contain no header and have the gene symbols of the
genes in the first column of the file. This default behavior can be changed by using the --files_have_header and/or
the --gene_symbol_column arguments.
--go_genes <GMT_file>
A single GMT file listing the GO terms that are to be analyzed, along with the genes associated
with those GO terms. Each line is a GO term. The first entry is the GO ID of the GO term, the second entry is the
name of the GO term, and the subsequent entries are the gene symbols of the genes associated with the GO term.
--ontology <GMT_file>
A single GMT file containing the GO ontology. Each line is a GO term in the ontology. The first entry is
the GO ID of the GO term, the second entry is ignored (but still must be present), and the subsequent entries are the
GO IDs of the parents of the GO term.
--out_heatmap <PNG_file>
The required filename of the output PNG image.
GO terms and their relationships in the hierarchical structure are obtained from the Gene Ontology website: geneontology.org/page/download-ontology
The annotation of genes to GO terms and the gene symbols according to the RefSeq gene model are obtained from gene2go.txt
and gene2refseq.txt, both downloaded from ftp.ncbi.nlm.nih.gov/gene/DATA/
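Because each line of the --ontology file lists a GO term followed by its parents, the file can be read as a child-to-parents map. The sketch below is written in Python purely for illustration and is not taken from the goSTAG source. It walks such a map to collect a term's ancestors and to count the distinct paths from a term to the root, the quantity by which the --out_clusters output is described as being sorted. The function names and the toy GO IDs are hypothetical. Treating any term with no parents as a root, and assuming the ontology has no cycles, are both assumptions.

```python
def ancestors(term, parents):
    """Return the set of all GO IDs reachable by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def count_paths_to_root(term, parents, cache=None):
    """Count distinct parent-link paths from term up to a parentless term.

    Assumes the parent map is acyclic, as expected for a GO hierarchy.
    """
    if cache is None:
        cache = {}
    if term in cache:
        return cache[term]
    term_parents = parents.get(term, [])
    if not term_parents:
        cache[term] = 1  # assumption: a term with no parents is a root
        return 1
    total = sum(count_paths_to_root(p, parents, cache) for p in term_parents)
    cache[term] = total
    return total


# Hypothetical toy ontology: GO:D has two parents, both children of GO:A.
toy_parents = {
    "GO:A": [],
    "GO:B": ["GO:A"],
    "GO:C": ["GO:A"],
    "GO:D": ["GO:B", "GO:C"],
}
```

In the toy ontology, GO:D reaches GO:A through either GO:B or GO:C, so it has two paths to the root while GO:B and GO:C each have one.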
Optional arguments:
--out_clusters <GMT_file>
An output GMT file with the goSTAG clusters and their GO terms. Each line is a cluster. The first entry is the name
of the cluster, the second entry is the name of the representative GO term for that cluster, and the subsequent
entries are the GO IDs of the GO terms in the cluster, sorted by number of paths to the root GO term.
--min_num_genes <number> (default: 5)
The minimum number of genes required to be associated with a GO term for it to be included in the analysis. Any GO
term with fewer than this number of genes associated with it is removed.
--go_domain <BP, MF, CC, or all> (default: BP)
This will filter the GO terms to only include those with the selected domain (biological process, molecular function,
cellular component, or all domains). Any GO term that doesn't belong to the selected domain is removed.
--filter_method <pval or FDR> (default: pval)
Only significant GO terms will be included in the heatmap, hierarchical clustering, and GO term clusters. This will
specify whether to use p-value or FDR value to determine which GO terms are significant.
--significance_threshold <number> (default: 0.05)
The p-value or FDR threshold used to determine which GO terms are significant.
--distance_metric <euclidean or correlation> (default: correlation)
This will specify whether to use Euclidean distance or 1 - abs( Pearson correlation ) as the distance metric used in
the hierarchical clustering.
--distance_threshold <number> (default: 0.2)
GO terms in the hierarchical clustering dendrogram with a distance of less than this value will be merged into
clusters.
--min_num_terms <number> (default: 10)
Clusters with this many or more GO terms will be labeled in the output PNG image. Clusters with fewer than this many
GO terms will not be labeled, but will still be present in the clusters GMT file.
--maximum_p_value <-log10_number> (default: 10)
When setting the colors in the heatmap, all -log10 p-values greater than this number will be capped at this
value. Decreasing this number may improve contrast in the heatmap at the cost of reduced dynamic
range.
--extra_color <TRUE or FALSE> (default: FALSE)
By default, the color scale will range from grey to red. Setting this to true will cause the color scale to range
from grey, to yellow, to red.
--heatmap_width <pixels> (default: 1600)
The width (in pixels) of the output PNG image.
--heatmap_height <pixels> (default: 1200)
The height (in pixels) of the output PNG image.
--files_have_header <TRUE or FALSE> (default: FALSE)
If set to true, the first line in the gene list files will be ignored.
--gene_symbol_column <number> (default: 1)
The column in the gene list files with the gene symbols of the genes in the gene list.
--heatmap_margin <number> (default: 0.01)
The fraction of the image (between 0 and 1) to devote to the margin.
--heatmap_dendrogram_width <number> (default: 0.4)
The fraction of the image (between 0 and 1) to devote to the dendrogram. Increase this value if you want a larger dendrogram.
--heatmap_cluster_width <number> (default: 0.5)
The fraction of the dendrogram (between 0 and 1) to devote to the cluster labels. Increase this value if you want the
cluster labels to be larger.
--heatmap_header_height <number> (default: 0.2)
The fraction of the image (between 0 and 1) to devote to the heatmap header labels. Increase this value if you want the
header labels to be larger.
--heatmap_dendrogram_lwd <number> (default: 2)
The dendrogram line thickness (in R lwd scale). Increase this value if you want the dendrogram lines to be thicker.
--heatmap_cluster_label_cex <number> (default: 2)
The cluster label text size (in R cex scale). Increase this value if you want the cluster label text to be larger.
--heatmap_header_label_cex <number> (default: 2)
The heatmap header label text size (in R cex scale). Increase this value if you want the header label text to be
larger.
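Two of the numeric options above, --distance_metric and --maximum_p_value, can be illustrated with a short sketch. It is written in Python for illustration only and does not reproduce the goSTAG implementation. All function names are hypothetical. The sketch assumes that each GO term is represented by a vector of numeric values, one per gene list, and that p-values lie in the interval (0, 1].

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """1 - abs(Pearson correlation), as described for --distance_metric."""
    return 1.0 - abs(pearson(x, y))


def euclidean_distance(x, y):
    """Euclidean distance, the alternative --distance_metric setting."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def capped_neg_log10(p_value, maximum=10.0):
    """Return -log10(p), limited to the --maximum_p_value setting."""
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in the interval (0, 1]")
    return min(-math.log10(p_value), maximum)
```

Because the correlation distance takes the absolute value, two perfectly anti-correlated vectors are at distance 0, just like two perfectly correlated ones. With the default maximum of 10, a p-value of 1e-20 is drawn with the same color as a p-value of 1e-10.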
References:
Bennett BD and Bushel PR. goSTAG: Gene Ontology Subtrees to Tag and Annotate Genes within a set
Davis M, Li J, Knight E, Eldridge SR, Daniels KK, Bushel PR. Toxicogenomics
profiling of bone marrow from rats treated with topotecan in combination with
oxaliplatin: a mechanistic strategy to inform combination toxicity. Front Genet.
2015 Feb 12;6:14