goSTAG: Gene Ontology Subtrees to Tag and Annotate Genes within a set

Developed by Brian Bennett and Pierre Bushel
National Institutes of Health
National Institute of Environmental Health Sciences
RTP, NC

Report bugs, corrections and suggestions to:
brian.bennett@nih.gov
bushel@niehs.nih.gov

Public Domain Notice:
This is a U.S. government work. Under 17 U.S.C. 105 no copyright is claimed and it may be freely distributed and copied.



goSTAG is a platform-independent R script that runs on any machine with Rscript installed. It has been tested in Linux
and Windows environments with R version 3.0.0.

The software is available through SourceForge: http://gostag.sourceforge.net 


Unzip the goSTAG folder to a directory of your choice.
The folder contains the following files and folders:
	bin folder:
		goSTAG_vXXXX.R: The R source code for goSTAG

	Sample_data folder:
		myTopo_Oxali_DEGs.gmt: GMT file containing RefSeq gene symbols of the differentially expressed genes (DEGs) from the Davis et al., 2015 publication
		GO_genes_rat.gmt: GMT file of rat genes associated with Gene Ontology (GO) terms
		GO_ontology.gmt: GMT file of the GO terms and their relationships in the hierarchical structure

	Sample_output folder:
		Topo_Oxali_heatmap_min_5_pval_0.05_corr_0.9_subtree_30_GO_bps.png: PNG image file with the heat map and cluster labels
		Topo_Oxali_heatmap_min_5_pval_0.05_corr_0.9_subtree_30_GO_bps.gmt: output GMT file with the goSTAG clusters and their GO terms

	README.txt: This readme file


Usage:
  Rscript goSTAG.R [options] --gene_lists <GMT_file> --go_genes <GMT_file> --ontology <GMT_file> --out_heatmap <PNG_file>

goSTAG uses GMT files, which have the following format:
  1. Each line in the GMT file corresponds to a list
  2. The entries on a line are tab-delimited, and lines can have different numbers of entries
  3. The first entry is the list name
  4. The second entry is the list description (usually ignored, but still must be present)
  5. The subsequent entries are the items in the list

See here for specification of the GMT format:  www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
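
As an illustration of the format rules above, here is a minimal Python sketch of a GMT reader. It is not part of
goSTAG (which is an R script); the function names parse_gmt_line and parse_gmt and the example gene symbols are
hypothetical.

```python
def parse_gmt_line(line):
    """Split one GMT line into (name, description, items)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("a GMT line needs at least a name and a description")
    name, description = fields[0], fields[1]
    items = [field for field in fields[2:] if field]  # skip empty entries
    return name, description, items


def parse_gmt(text):
    """Parse GMT text into a dict mapping each list name to its items."""
    lists = {}
    for line in text.splitlines():
        if line.strip():  # ignore blank lines
            name, _description, items = parse_gmt_line(line)
            lists[name] = items
    return lists


# Hypothetical two-line GMT content: name, description, then the items.
example = "list_A\tna\tGeneA\tGeneB\tGeneC\nlist_B\tna\tGeneB\n"
gene_lists = parse_gmt(example)
print(gene_lists)
```

The description entry is read and then discarded, which mirrors the note that it is usually ignored but must still be
present.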

Required arguments:
  --gene_lists <GMT_file> or --gene_lists_dir <directory>
    It is required to have either a single GMT file or a directory containing files. If using a GMT file, each line is a
    gene list. The first entry is the gene list name, the second entry is ignored by the software (but still must be
    present), and the subsequent entries are the gene symbols of the genes in the gene list. If using a directory, each
    file in the directory is a gene list. By default, these files contain no header and have the gene symbols of the
    genes in the first column of the file. This default behavior can be changed by using the --files_have_header and/or
    the --gene_symbol_column arguments.
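
The following Python sketch illustrates how the header and column settings could be applied to one gene list file
from a directory. It is an illustration only, not goSTAG's R code. The function name read_gene_list is hypothetical,
and the sketch assumes tab-delimited columns, which this README does not state.

```python
def read_gene_list(text, files_have_header=False, gene_symbol_column=1):
    """Return the gene symbols found in the text of one gene list file."""
    lines = text.splitlines()
    if files_have_header:
        lines = lines[1:]  # ignore the first line
    index = gene_symbol_column - 1  # the option counts columns from 1
    genes = []
    for line in lines:
        columns = line.split("\t")  # assumption: tab-delimited columns
        if index < len(columns) and columns[index].strip():
            genes.append(columns[index].strip())
    return genes


# Hypothetical file contents.
plain = "GeneA\nGeneB\nGeneC\n"
with_header = "id\tsymbol\n1\tGeneA\n2\tGeneB\n"

default_genes = read_gene_list(plain)
header_genes = read_gene_list(with_header, files_have_header=True,
                              gene_symbol_column=2)
print(default_genes)
print(header_genes)
```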

  --go_genes <GMT_file>
    It is required to have a single GMT file with the GO terms that are to be analyzed, along with the genes associated
    with those GO terms. Each line is a GO term. The first entry is the GO ID of the GO term, the second entry is the
    name of the GO term, and the subsequent entries are the gene symbols of the genes associated with the GO term.    

  --ontology <GMT_file>
    It is required to have a single GMT file with the GO ontology. Each line is a GO term in the ontology. The first entry is
    the GO ID of the GO term, the second entry is ignored (but still must be present), and the subsequent entries are the
    GO IDs of the parents of the GO term.

  --out_heatmap <PNG_file>
    The required filename of the output PNG image.

GO terms and their relationships in the hierarchical structure are obtained from the Gene Ontology website: geneontology.org/page/download-ontology
The annotation of genes to GO terms and the gene symbols according to the RefSeq gene model are obtained from “gene2go.txt”
and “gene2refseq.txt”, both downloaded from ftp.ncbi.nlm.nih.gov/gene/DATA/

Optional arguments:
  --out_clusters <GMT_file>
    An output GMT file with the goSTAG clusters and their GO terms. Each line is a cluster. The first entry is the name
    of the cluster, the second entry is the name of the representative GO term for that cluster, and the subsequent
    entries are the GO IDs of the GO terms in the cluster, sorted by number of paths to the root GO term.
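
To make "number of paths to the root GO term" concrete, the Python sketch below counts such paths from a
child-to-parents mapping of the kind stored in the --ontology GMT file. It is an illustration under assumptions, not
goSTAG's implementation: the function name count_paths_to_root and the term IDs are hypothetical, and a term with no
parents is treated as a root with exactly one path.

```python
from functools import lru_cache


def count_paths_to_root(parents):
    """parents maps each term ID to the list of its parent term IDs."""

    @lru_cache(maxsize=None)
    def paths(term):
        term_parents = parents.get(term, [])
        if not term_parents:
            return 1  # assumption: a root counts as one path to itself
        # Every path to the root passes through exactly one parent.
        return sum(paths(parent) for parent in term_parents)

    return {term: paths(term) for term in parents}


# Hypothetical ontology: R is the root, C has two parents.
ontology = {
    "R": [],
    "A": ["R"],
    "B": ["R"],
    "C": ["A", "B"],
    "D": ["C", "A"],
}
path_counts = count_paths_to_root(ontology)
print(path_counts)
```

Whether the output file sorts these counts in ascending or descending order is not stated here, so the sketch stops
at the counting step.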

  --min_num_genes <number> (default: 5)
    The minimum number of genes required to be associated with a GO term for it to be included in the analysis. Any GO
    term with fewer than this number of genes associated with it is removed.
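
A minimal Python sketch of this filter, for illustration only (goSTAG itself is an R script). The function name
filter_go_terms and the term and gene names are hypothetical; counting each gene once per term is an assumption.

```python
def filter_go_terms(go_genes, min_num_genes=5):
    """Keep only the terms associated with at least min_num_genes genes."""
    return {
        go_id: genes
        for go_id, genes in go_genes.items()
        if len(set(genes)) >= min_num_genes  # assumption: count unique genes
    }


# Hypothetical GO term to gene associations.
go_genes = {
    "term_1": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "term_2": ["g1", "g2"],
    "term_3": ["g1", "g2", "g3", "g4", "g5"],
}
kept = filter_go_terms(go_genes, min_num_genes=5)
print(sorted(kept))
```

A term with exactly the minimum number of genes is kept; only terms below the minimum are removed.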

  --go_domain <BP, MF, CC, or all> (default: BP)
    This will filter the GO terms to only include those with the selected domain (biological process, molecular function,
    cellular component, or all domains). Any GO term that doesn't belong to the selected domain is removed.

  --filter_method <pval or FDR> (default: pval)
    Only significant GO terms will be included in the heatmap, hierarchical clustering, and GO term clusters. This will
    specify whether to use p-value or FDR value to determine which GO terms are significant.

  --significance_threshold <number> (default: 0.05)
    The p-value or FDR threshold used to determine which GO terms are significant.
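
The difference between the two filter methods can be sketched in Python as follows. This README does not say how
goSTAG computes FDR values, so the Benjamini-Hochberg adjustment used here is an assumption, as is the strict "less
than" comparison against the threshold. The function names and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_terms(p_values, filter_method="pval", threshold=0.05):
    """Return the indices of the terms that pass the chosen filter."""
    values = benjamini_hochberg(p_values) if filter_method == "FDR" else p_values
    return [i for i, value in enumerate(values) if value < threshold]


# Hypothetical enrichment p-values for four GO terms.
p_values = [0.01, 0.04, 0.03, 0.20]
by_pval = significant_terms(p_values, "pval", 0.05)
by_fdr = significant_terms(p_values, "FDR", 0.05)
print(by_pval, by_fdr)
```

With the same threshold, the FDR filter is at least as strict as the p-value filter, so it never keeps more terms.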

  --distance_metric <euclidean or correlation> (default: correlation)
    This will specify whether to use Euclidean distance or 1 - abs( Pearson correlation ) as the distance metric used in
    the hierarchical clustering.
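
The two metrics can be written out in Python as below. This is an illustrative sketch rather than goSTAG's R code;
the function names and the example vectors (one value per gene list for each GO term) are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation; undefined if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """1 - abs( Pearson correlation ), as described for this option."""
    return 1 - abs(pearson_correlation(x, y))


# Hypothetical per-gene-list values for three GO terms.
term_a = [1.0, 2.0, 3.0]
term_b = [2.0, 4.0, 6.0]  # perfectly correlated with term_a
term_c = [3.0, 2.0, 1.0]  # perfectly anti-correlated with term_a

print(correlation_distance(term_a, term_b))
print(correlation_distance(term_a, term_c))
print(euclidean_distance(term_a, term_b))
```

Because of the absolute value, perfectly correlated and perfectly anti-correlated terms both have a correlation
distance of zero, whereas their Euclidean distance generally differs.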

  --distance_threshold <number> (default: 0.2)
    GO terms in the hierarchical clustering dendrogram with a distance of less than this value will be merged into
    clusters.

  --min_num_terms <number> (default: 10)
    Clusters with this many or more GO terms will be labeled in the output PNG image. Clusters with fewer than this many
    GO terms will not be labeled, but will still be present in the clusters GMT file.

  --maximum_p_value <-log10_number> (default: 10)
    When setting the colors in the heatmap, all values with a -log10 p-value that is greater than this number will be
    capped at this value. Decreasing this number may improve contrast in the heatmap at the cost of reduced dynamic
    range.
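
A short Python sketch of this limit on the color scale, for illustration only. The function name capped_neg_log10 is
hypothetical, and p-values are assumed to be greater than zero.

```python
import math


def capped_neg_log10(p_value, maximum_p_value=10):
    """Return -log10(p_value), limited to at most maximum_p_value."""
    # Assumption: p_value > 0, otherwise the logarithm is undefined.
    return min(-math.log10(p_value), maximum_p_value)


moderate = capped_neg_log10(1e-3)    # below the limit, left unchanged
extreme = capped_neg_log10(1e-15)    # above the limit, reduced to it
tighter = capped_neg_log10(1e-15, maximum_p_value=5)
print(moderate, extreme, tighter)
```

Lowering the limit compresses all very small p-values into the same color, which leaves more of the color scale for
the remaining values.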

  --extra_color <TRUE or FALSE> (default: FALSE)
    By default, the color scale will range from grey to red. Setting this to true will cause the color scale to range
    from grey, to yellow, to red.

  --heatmap_width <pixels> (default: 1600)
    The width (in pixels) of the output PNG image.

  --heatmap_height <pixels> (default: 1200)
    The height (in pixels) of the output PNG image.

  --files_have_header <TRUE or FALSE> (default: FALSE)
    If set to true, the first line in the gene list files will be ignored.

  --gene_symbol_column <number> (default: 1)
    The column in the gene list files with the gene symbols of the genes in the gene list.

  --heatmap_margin <number> (default: 0.01)
    The fraction of the image to devote to the margin.

  --heatmap_dendrogram_width <number> (default: 0.4)
    The fraction of the image to devote to the dendrogram. Increase this value if you want a larger dendrogram.

  --heatmap_cluster_width <number> (default: 0.5)
    The fraction of the dendrogram to devote to the cluster labels. Increase this value if you want the cluster labels
    to be larger.

  --heatmap_header_height <number> (default: 0.2)
    The fraction of the image to devote to the heatmap header labels. Increase this value if you want the header labels
    to be larger.
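
As a worked example of the layout options, the Python sketch below converts the default values into pixel sizes. It
assumes the values are fractions between 0 and 1 (as the defaults suggest), that the dendrogram value applies to the
image width, and that the header value applies to the image height. None of this is confirmed by the goSTAG source,
and the function name layout_in_pixels is hypothetical.

```python
def layout_in_pixels(heatmap_width=1600, heatmap_height=1200,
                     dendrogram_width=0.4, cluster_width=0.5,
                     header_height=0.2):
    """Convert layout fractions into approximate pixel sizes."""
    # Assumption: the dendrogram fraction is taken from the image width.
    dendrogram_px = round(heatmap_width * dendrogram_width)
    # The cluster labels take a share of the dendrogram, not of the image.
    cluster_label_px = round(dendrogram_px * cluster_width)
    # Assumption: the header fraction is taken from the image height.
    header_px = round(heatmap_height * header_height)
    return {
        "dendrogram": dendrogram_px,
        "cluster_labels": cluster_label_px,
        "header": header_px,
    }


layout = layout_in_pixels()
print(layout)
```

Under these assumptions the defaults give a 640-pixel dendrogram area, of which 320 pixels hold the cluster labels,
and a 240-pixel header.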

  --heatmap_dendrogram_lwd <number> (default: 2)
    The dendrogram line thickness (in R lwd scale). Increase this value if you want the dendrogram lines to be thicker.

  --heatmap_cluster_label_cex <number> (default: 2)
    The cluster label text size (in R cex scale). Increase this value if you want the cluster label text to be larger.

  --heatmap_header_label_cex <number> (default: 2)
    The heatmap header label text size (in R cex scale). Increase this value if you want the header label text to be
    larger.

References:

Bennett BD and Bushel PR. goSTAG: Gene Ontology Subtrees to Tag and Annotate Genes within a set

Davis M, Li J, Knight E, Eldridge SR, Daniels KK, Bushel PR. Toxicogenomics
profiling of bone marrow from rats treated with topotecan in combination with
oxaliplatin: a mechanistic strategy to inform combination toxicity. Front Genet. 
2015 Feb 12;6:14


Source: README.txt, updated 2016-05-03