ClustAGE


CONTENTS:

  1. Introduction
  2. Requirements
  3. Installation
  4. Usage
  5. Output Files
  6. Utilities
  7. ClustAGE Plot
  8. Support Software
  9. License
  10. Contact

1. INTRODUCTION:

ClustAGE takes a set of nucleotide sequences of accessory genomic elements (AGEs) from bacteria or other small genome organisms and clusters them to identify the minimum set of accessory genomic elements in the genomes. ClustAGE will also determine the distribution of each accessory genomic element among the provided genomic sequences.

Figure 1: ClustAGE algorithm schema

For more information about the identification of accessory genomic elements, see documentation for Spine and AGEnt.

ClustAGE is also available as a web-based application. The web version is limited to a maximum of 15 accessory genome sequence sets and does not support read-correction of AGEs. See http://vfsmspineagent.fsm.northwestern.edu/cgi-bin/clustage.cgi.


2. REQUIREMENTS:

  • Perl 5.10 or above
  • Mac OSX or Linux. We provide no guarantees that this will work on
    Windows or other operating systems.

3. INSTALLATION:

Simply download the version appropriate for your operating system (Mac OSX or Linux 64-bit) and move the ClustAGE directory to the desired location.

If you would like to use this software on another operating system, you will have to download and compile Blast+ manually:

blastn v2.3.0 and makeblastdb v2.3.0

  1. Either download the pre-compiled version appropriate for your system or build from source available from here: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.3.0/
  2. Copy the blastn and makeblastdb executables into the 'bin' directory in the same directory as ClustAGE.pl

Optional Software Installation:

gnuplot >= v5.0 (for graphical output)
Linux (Ubuntu / Debian):

sudo apt-get install libcairo2-dev libpango1.0-dev
sudo apt-get install gnuplot

Linux (Fedora / Red Hat):

sudo yum install cairo-devel pango-devel
sudo yum install gnuplot

Mac OS X (using MacPorts):

sudo port install gnuplot +pangocairo

Mac OS X (using Homebrew):

brew install gnuplot --with-cairo

From source:

  1. Install cairo using instructions given at https://www.cairographics.org/download
  2. Download gnuplot source code from https://sourceforge.net/projects/gnuplot/files/gnuplot
  3. Follow installation instructions given in the INSTALL file
  4. Copy the gnuplot executable into the 'bin' directory in the same directory as ClustAGE.pl or to a directory in your PATH

bwa >= v0.7.13 (for read confirmation of AGE distributions):

  1. Download bwa from here: https://sourceforge.net/projects/bio-bwa-files/
  2. tar -jxvf bwa-0.7.13.tar.bz2
  3. cd bwa-0.7.13
  4. make
  5. Copy the bwa executable into the 'bin' directory in the same directory as ClustAGE.pl or to a directory in your PATH


phylip >= v3.695 (for AGE distribution tree, required by utilities/subelements_to_tree.pl):

  1. Download source code from http://evolution.genetics.washington.edu/phylip/ (most recent archive).
  2. tar -zxvf phylip-3.696.tar.gz
  3. cd phylip-3.696/src
  4. make install
  5. Copy the following executables in phylip-3.696/exe to the 'bin' directory in the same directory as ClustAGE.pl or to a directory in your PATH:
    • neighbor
    • seqboot
    • retree

4. USAGE:

Basic command: perl ClustAGE.pl -f age_files.txt

For list of options, call the script without any inputs: perl ClustAGE.pl

4.1 Required Inputs:

-f or --file
File of accessory genome element fasta files for comparison
format:

/path/to/accessory_elements_1.fasta<tab>genome_name_1<tab>(optional)rank
/path/to/accessory_elements_2.fasta<tab>genome_name_2<tab>(optional)rank

(no spaces in genome names)

More on ranks:

  • 'rank' is a numeric value assigned to the strain. This can be a real number (e.g. cytotoxicity assay value) or a relative number (e.g. relative virulence rank).
  • Decimals, negative numbers, and scientific notation (e.g. 1E-14) are allowed.
  • RANK VALUE IS OPTIONAL. Please leave blank if no ranking information is available or given.
  • Ranking can also be assigned or re-assigned to the data after ClustAGE processing using the included script "re-rank.pl" in the 'utilities' directory. See "Utilities" section below for more information.
  • If the rank is set as R, the sequence will be considered 'reference' and sequences belonging to this genome will NOT be used as bin representatives, but alignments of this genome against bin representatives will be reported.
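The file list format above can be generated programmatically. The following is a minimal sketch, not part of ClustAGE: the helper name `format_age_list` and the example paths and ranks are hypothetical. It writes one tab-separated line per genome, leaves the rank column out when no rank is available, and rejects genome names that contain whitespace.

```python
def format_age_list(rows):
    """Build the text of a ClustAGE -f file list.

    rows: iterable of (fasta_path, genome_name, rank), where rank is
    None (no rank given), a number, or the string "R" for a reference genome.
    """
    lines = []
    for path, name, rank in rows:
        # Genome names must not contain spaces (or any other whitespace).
        if any(ch.isspace() for ch in name):
            raise ValueError(f"genome name contains whitespace: {name!r}")
        fields = [path, name]
        if rank is not None:
            fields.append(str(rank))
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


rows = [
    ("/path/to/accessory_elements_1.fasta", "genome_name_1", 3.5),
    ("/path/to/accessory_elements_2.fasta", "genome_name_2", None),  # no rank
    ("/path/to/accessory_elements_3.fasta", "genome_name_3", "R"),   # reference
]
age_list_text = format_age_list(rows)
print(age_list_text, end="")
```

The resulting text can be saved to a file such as age_files.txt and passed to ClustAGE with -f.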

4.2 Optional Inputs:

--annot
File of annotation information to include in the output. The value of this option should be the path to a list of annotation files in the following format:

/path/to/gen1.accessory_loci.txt<tab>genome_name_1
/path/to/gen2.accessory_loci.txt<tab>genome_name_2
  • Annotation files should be in the format output by Spine or AGEnt ("loci.txt"), i.e.
locusID<tab>Source contig ID<tab>Source start<tab>Source stop<tab>Strand<tab>Accessory sequence ID<tab>Accessory start<tab>Accessory stop<tab>% of gene<tab>Overlap<tab>Gene product
  • Genome names must EXACTLY match those in the file given to option '-f'
  • 'Overlap' is number of bases of the gene not contained in the accessory element. See Spine documentation for more information
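To make the eleven "loci.txt" columns listed above concrete, here is a minimal parsing sketch. It assumes each data line holds exactly those tab-separated fields in that order; the `Locus` type, the `parse_locus_line` helper and the example values are hypothetical, and real files may also contain a header line that would need to be skipped.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    locus_id: str
    source_contig: str
    source_start: int
    source_stop: int
    strand: str
    accessory_id: str
    accessory_start: int
    accessory_stop: int
    pct_of_gene: float
    overlap: int  # bases of the gene NOT contained in the accessory element
    product: str


def parse_locus_line(line):
    """Parse one tab-separated annotation line into a Locus record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(fields)}")
    return Locus(
        locus_id=fields[0],
        source_contig=fields[1],
        source_start=int(fields[2]),
        source_stop=int(fields[3]),
        strand=fields[4],
        accessory_id=fields[5],
        accessory_start=int(fields[6]),
        accessory_stop=int(fields[7]),
        pct_of_gene=float(fields[8].rstrip("%")),  # tolerate a trailing % sign
        overlap=int(fields[9]),
        product=fields[10],
    )


# Hypothetical example line, for illustration only.
example = "\t".join([
    "PA2185", "contig_3", "1200", "2250", "+",
    "acc_12", "1", "1051", "100.00", "0", "non-heme catalase KatN",
])
locus = parse_locus_line(example)
print(locus.locus_id, locus.pct_of_gene, locus.product)
```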

--age
Fasta-formatted file of AGE sequences to be used as bin representatives. If this file is given, sequence files given above (-f) will not be used to identify new bin representatives. Instead they will only be aligned to the sequences given here.

-e or --evalue
maximum BLAST e-value cutoff
(default: 1E-6)

-i or --pctid
minimum nucleotide sequence identity, in %
(default: 85)

--dustoff
turns off default low complexity filtering by blast. Useful for species with high degrees of low complexity sequence.
(default: dust masking on)

-a or --maxalign
maximum number of BLAST alignments to report. Values too low may give incorrect results, especially when comparing a very large number of sequences.
(default: 100,000)

-x or --min_age
minimum accessory element size, in bp. This is the shortest possible sequence that will be used by ClustAGE as a bin representative.
(default: 200)

--min_align
minimum size of alignments against accessory elements to report
(default: 100)

-o or --out
prefix for output files
(default: "out")

--skip_se
skip determination of subelements within bin representatives
(default: subelements and subelement key files WILL be output)

-s or --min_se
minimum size of subelements, in bp, to be reported in output
(default: 1)

--min_se_seq
minimum size of subelement sequences, in bp, to be output in 'subelements.fasta' file
(default: 100)

-g or --min_gen
minimum number of genomes in which a subelement must be present to be included in csv output
(default: 1)

-v or --verbose
verbose output

--license
print license information and quit

Figure Output Option: (Requires gnuplot)

-p or --graph
Output graphical representations of AGEs and their distributions in the input data sets. See Output Files section below for more information.
(default: no figures will be output)

--gnuplot
path to gnuplot executable
(default: will search for gnuplot first in the 'bin' directory, then in PATH)

--graph_se
plot subelement dividers in output figures
(default: no subelement dividers will be plotted)

--g_se_min
if subelement dividers are requested, minimum subelement size, in bp, to plot
(default: 20)

--g_type
output type. Choices are 'png' or 'pdf'. PDF output requires that gnuplot was compiled with pdfcairo terminal.
(default: png)

Result Confirmation Options: (Requires bwa)

Whole-genome assemblers using short reads can sometimes leave sequences out of the contigs even though they are present in the read set, so an AGE may appear to be missing from some genomes. ClustAGE can optionally align raw reads to the set of AGEs it identifies to determine whether AGEs missing from the assemblies of some of the included genomes can be found in the reads. This process is only additive, i.e. AGEs will only be added to a genome's AGE profile based on read alignments, never subtracted. Subelement sequences identified by read alignment are reported in a separate set of files labeled 'read_corrected'.
WARNING: This step can be VERY slow, but is recommended for draft genome sequences produced using de novo assembly of short reads (e.g. Illumina, IonTorrent, etc.)

-r or --reads
file with paths to sequencing reads, for confirmation. If sequencing reads are given, distribution of elements among genomes will be confirmed by read alignment.
File format:

genome_name<tab>/path/to/reads.fastq<tab>(optional)/path/to/reads_2.fastq
  • Read files must be in fastq format
  • Genome names must EXACTLY match those in the file given to option '-f'
  • If forward and reverse read files are available for a particular genome, they can be given, in order, separated by a tab
  • Gzipped read files (ending with '.gz') are allowed
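Because genome names in the reads file must exactly match those in the -f file list, a quick pre-flight check can catch mismatches before a long run. This is a minimal sketch with hypothetical helper names and example data; it relies only on the column positions described above (genome name in column 2 of the -f list and column 1 of the reads list).

```python
def genome_names_from_age_list(text):
    """Genome names are the second tab-separated column of the -f file."""
    return [line.split("\t")[1] for line in text.splitlines() if line.strip()]


def genome_names_from_reads_list(text):
    """Genome names are the first tab-separated column of the -r file."""
    return [line.split("\t")[0] for line in text.splitlines() if line.strip()]


def unmatched_read_genomes(age_text, reads_text):
    """Return names in the reads list that have no exact match in the -f list."""
    known = set(genome_names_from_age_list(age_text))
    return [n for n in genome_names_from_reads_list(reads_text) if n not in known]


# Hypothetical file contents; note the case mismatch in "strainb".
age_text = "/path/a.fasta\tstrainA\t1\n/path/b.fasta\tstrainB\n"
reads_text = (
    "strainA\t/reads/a_1.fastq.gz\t/reads/a_2.fastq.gz\n"
    "strainb\t/reads/b.fastq\n"
)
mismatches = unmatched_read_genomes(age_text, reads_text)
print(mismatches)  # ['strainb']
```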

-c or --core
fasta file of sequences considered to be "core" or present in the majority of the input genomes. Can use the "backbone.fasta" file produced by Spine.
If given along with a reads file above, ClustAGE will align reads to the core genome sequence file and note reads aligning to the core. If these same reads are then found aligning to AGEs, they will only be considered a true alignment if the alignment quality is greater than for the alignment against core.
Not required, but recommended to reduce false-positive poor quality alignments.
(default: no core sequence)
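The core-filtering rule described for -c can be pictured with a short sketch. This is an illustration of the stated rule only, not ClustAGE's implementation: the function name is hypothetical, and a single numeric score per read is assumed to stand in for "alignment quality", which this README does not define further.

```python
def keep_age_alignments(age_scores, core_scores):
    """Return read IDs whose AGE alignment should count as a true alignment.

    age_scores / core_scores: dicts mapping read ID -> alignment score
    (assumed numeric, higher is better). A read that also aligned to the
    core is kept only if its AGE score is strictly greater than its core score.
    """
    kept = []
    for read_id, age_score in age_scores.items():
        core_score = core_scores.get(read_id)
        if core_score is None or age_score > core_score:
            kept.append(read_id)
    return kept


age_scores = {"read1": 60, "read2": 30, "read3": 40}
core_scores = {"read2": 50, "read3": 40}
kept_reads = keep_age_alignments(age_scores, core_scores)
print(kept_reads)  # ['read1']
```

Here read2 aligns better to the core and read3 ties with it, so only read1 supports the presence of an AGE.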

-d or --depth
minimum read depth
(default: 5)

--bwa
path to bwa executable
(default: will search for bwa first in the 'bin' directory, then in PATH)

-t or --threads
number of threads (for bwa only)
(default will be automatically determined based on number of available CPUs)


5. OUTPUT FILES:

command.txt
Version numbers of ClustAGE, version numbers of support software used by ClustAGE, and list of parameters given to ClustAGE

AGEs.key.txt
Characteristics of accessory genomic element (AGE) representatives

| Column header | Description |
| --- | --- |
| bin_id | Unique identifier given to the representative sequence. These IDs correspond to the sequence IDs in the "AGEs.fasta" file. |
| source_id | ID of the sequence that served as the source for this representative |
| source_genome | Genome name for the source sequence |
| source_length | Length of the source sequence, in bases |
| bin_start | Start coordinate of the region on the source sequence that corresponds to this representative (1-based) |
| bin_stop | Stop coordinate of the region on the source sequence that corresponds to this representative (1-based) |
| bin_length | Length of the representative accessory region, in bases |

AGEs.fasta
Nucleotide sequences of the representative AGE sequences output by ClustAGE. Original sources of the sequences are given on the ID line or can be determined by cross-referencing with AGEs.key.txt file.

AGEs.annotations.txt (if annotation files were included as input to ClustAGE)
Genes contained within representative accessory regions.

| Column header | Description |
| --- | --- |
| bin_id | Unique identifier given to the representative sequence. Corresponds to sequence headers in AGEs.fasta file and bin_ids in AGEs.key.txt file |
| annotation(s) | Comma-separated list of genes within the AGE. Each entry takes the form of locus ID followed by the percentage of the gene contained within the AGE (by nucleotide length) in square brackets, followed by the gene product in double quotation marks. Example: PA2185[100.00%]"non-heme catalase KatN",PA2186[100.00%]"hypothetical protein" |
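The annotation(s) field can be split back into its parts with a regular expression. This is a minimal sketch based only on the entry format and example given above; the helper name is hypothetical, and it assumes gene products do not themselves contain double quotation marks.

```python
import re

# locus ID, then [percent%], then "gene product"
ENTRY = re.compile(r'([^\[,"]+)\[([0-9.]+)%\]"([^"]*)"')


def parse_annotations(field):
    """Return a list of (locus_id, percent_of_gene, product) tuples."""
    return [
        (locus, float(pct), product)
        for locus, pct, product in ENTRY.findall(field)
    ]


field = 'PA2185[100.00%]"non-heme catalase KatN",PA2186[100.00%]"hypothetical protein"'
genes = parse_annotations(field)
for locus, pct, product in genes:
    print(locus, pct, product)
```

Matching whole entries rather than splitting on commas keeps gene products that contain commas intact.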

subelements.key.txt
Characteristics of subelements of AGEs. Subelements are subdivisions of AGEs based on the distribution of parts of the AGE among the strains being examined. Subelement borders occur at points where there is a change in the group of strains in which a discrete part of the representative AGE is found.
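As an illustration of how such borders arise, the sketch below cuts an AGE wherever the set of genomes covering it changes. It is a simplified model and not ClustAGE's actual code: the function name and the example intervals are hypothetical, and alignment intervals are assumed to be 1-based, inclusive coordinates on the representative AGE.

```python
def split_into_subelements(age_length, hits):
    """Split an AGE into segments with a constant set of genomes.

    hits: dict mapping genome name -> list of (start, stop) intervals,
          1-based and inclusive, on the representative AGE.
    Returns a list of (start, stop, frozenset_of_genomes).
    """
    # A new segment can only begin at position 1, at an interval start,
    # or just after an interval stop.
    breakpoints = {1, age_length + 1}
    for intervals in hits.values():
        for start, stop in intervals:
            breakpoints.add(start)
            breakpoints.add(stop + 1)
    ordered = sorted(b for b in breakpoints if 1 <= b <= age_length + 1)

    segments = []
    for seg_start, next_start in zip(ordered, ordered[1:]):
        seg_stop = next_start - 1
        present = frozenset(
            genome
            for genome, intervals in hits.items()
            if any(s <= seg_start and seg_stop <= e for s, e in intervals)
        )
        if segments and segments[-1][2] == present:
            # Same group of genomes as the previous segment: extend it.
            segments[-1] = (segments[-1][0], seg_stop, present)
        else:
            segments.append((seg_start, seg_stop, present))
    return segments


hits = {"A": [(1, 1000)], "B": [(1, 400)], "C": [(301, 1000)]}
subelements = split_into_subelements(1000, hits)
for start, stop, genomes in subelements:
    print(start, stop, sorted(genomes))
```

In this example the 1000 bp AGE is divided into three subelements (1-300, 301-400 and 401-1000), each found in a different group of genomes.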

| Column header | Description |
| --- | --- |
| subelement | ID of the subelement section. |
| bin_id | ID of the AGE from which the subelement was derived |
| source_id | ID of the sequence that served as the source for the AGE |
| source_genome | Genome name for the source sequence |
| start | Start coordinate of the subelement section along the AGE (1-based) |
| stop | Stop coordinate of the subelement section along the AGE (1-based) |
| length | Length of the subelement |
| avg_rank | If ranking information was provided, this is the average of the rank values for ranked genomes that contain this subelement |
| num_genomes | Total number of genomes in which this subelement was identified |
| "genome_name (rank)" | All subsequent columns, one per genome, show the presence (1) or absence (0) of the subelement in that genome |

subelements.fasta
Nucleotide sequences of subelements output by ClustAGE. By default, ClustAGE only outputs subelement sequences at least 100 bp in length. This can be adjusted using the --min_se_seq option.

subelements.annotations.txt (if annotation files were included as input to ClustAGE)
Genes contained within subelement regions.

| Column header | Description |
| --- | --- |
| subelement | Unique identifier given to the subelement. Corresponds to sequence headers in subelements.fasta file and subelements.key.txt file |
| annotation(s) | Comma-separated list of genes within the subelement. Each entry takes the form of locus ID followed by the percentage of the gene contained within the subelement (by nucleotide length) in square brackets, followed by the gene product in double quotation marks. Example: PA2185[100.00%]"non-heme catalase KatN",PA2186[100.00%]"hypothetical protein" |

subelements.csv
Comma-separated list of subelement distributions among the included genomes.

  • First column is the genome name.
  • Second column is the rank of genome (if given)
  • Each subsequent column corresponds to a subelement named in the first row. If the subelement is present in a particular genome, this will be indicated by a 1. Absence of a subelement is indicated by a 0.
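The csv layout described above maps naturally onto a per-genome set of subelements. The following is a minimal loading sketch, assuming a single header row whose first two cells label the genome and rank columns; the header labels, subelement IDs and values in the example are hypothetical.

```python
import csv
import io


def load_presence(handle):
    """Read a subelements.csv-style table.

    Returns (subelement_ids, ranks, presence), where presence maps each
    genome name to the set of subelement IDs marked with a 1.
    """
    reader = csv.reader(handle)
    header = next(reader)
    subelement_ids = header[2:]  # columns after genome name and rank
    ranks = {}
    presence = {}
    for row in reader:
        if not row:
            continue
        genome, rank = row[0], row[1]
        ranks[genome] = rank  # empty string if no rank was given
        presence[genome] = {
            se for se, flag in zip(subelement_ids, row[2:]) if flag == "1"
        }
    return subelement_ids, ranks, presence


csv_text = (
    "genome,rank,bin1_se1,bin1_se2,bin2_se1\n"
    "strainA,1,1,1,0\n"
    "strainB,,0,1,1\n"
)
subelement_ids, ranks, presence = load_presence(io.StringIO(csv_text))
print(sorted(presence["strainA"]), sorted(presence["strainB"]))
```

For a real run, pass an open file handle for the subelements.csv file instead of the in-memory example.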

subelements.alignments.txt
Sources of sequences in each genome containing subelements.
Column headers and descriptions:

  • First column (subelement) is the subelement ID
  • Each subsequent group of three columns corresponds to one of the included genomes
    • Column 1 of 3 (_contig): ID of the source sequence on which the subelement is found. If the subelement was not found in this genome, the value will be "-"
    • Column 2 of 3 (_start): start coordinate of the subelement in the source sequence (1-based). Negative numbers indicate the sequence was found on the reverse strand of the sequence. If the subelement was not found in this genome, the value will be 0.
    • Column 3 of 3 (_stop): stop coordinate of the subelement in the source sequence (1-based). Negative numbers indicate the sequence was found on the reverse strand of the sequence. If the subelement was not found in this genome, the value will be 0.
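The three-column groups described above can be decoded into a contig, a coordinate range and a strand. This is a minimal sketch that assumes the file is tab-separated and that the list of genomes is supplied in column order; the helper names, IDs and coordinates are hypothetical.

```python
def decode_location(contig, start, stop):
    """Turn one (_contig, _start, _stop) group into (contig, lo, hi, strand).

    Returns None when the subelement was not found in the genome
    (contig "-" with coordinates 0).
    """
    if contig == "-":
        return None
    start, stop = int(start), int(stop)
    # Negative coordinates mark the reverse strand of the source sequence.
    strand = "-" if start < 0 or stop < 0 else "+"
    lo, hi = sorted((abs(start), abs(stop)))
    return contig, lo, hi, strand


def parse_alignment_row(genomes, row):
    """row: [subelement_id, contig_1, start_1, stop_1, contig_2, ...]."""
    subelement, values = row[0], row[1:]
    locations = {}
    for index, genome in enumerate(genomes):
        contig, start, stop = values[3 * index: 3 * index + 3]
        locations[genome] = decode_location(contig, start, stop)
    return subelement, locations


row = "bin1_se1\tcontig_7\t-1500\t-1201\t-\t0\t0".split("\t")
subelement, locations = parse_alignment_row(["strainA", "strainB"], row)
print(subelement, locations)
```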

graphs folder
Contains graphical representations of distributions of each AGE output by ClustAGE. Each file corresponds to one AGE as indicated by the filename. If annotation information for the AGE representative genome was given, this will be shown at the top of the figure with genes on the forward strand shown as green arrows and genes on the reverse strand shown as orange arrows. If a gene begins and/or ends outside the boundaries of the AGE, this will be indicated by a vertical dashed line at the beginning or end of the line. The AGE in the reference genome will be a red bar. Presence of the AGE or portions of the AGE will be indicated in blue on the lines corresponding to the genome. If read confirmation was selected and read alignment revealed AGEs not present in the assembled sequences, these regions will be indicated with green bars. The intensity of color in the bars corresponds to the sequence identity of the alignment as indicated by scale bars on the right side of the figure.

Figure 2: Example AGE graph output


6. UTILITIES

These scripts are located in the 'utilities' directory included with ClustAGE.

6.1 subelements_to_tree.pl

This pipeline script calculates a Bray-Curtis distance matrix from distributions of accessory elements that it uses to create a neighbor joining tree of accessory element distribution patterns. Note these distances are not based on sequence similarity, but only on presence or absence of an accessory element within a genome within the threshold parameters given to ClustAGE. It will also produce output files that can be used to create a heatmap of Bray-Curtis similarity values.
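For presence/absence data, the Bray-Curtis distance between two genomes reduces to one minus twice the number of shared elements divided by the total number of elements in both genomes. The sketch below shows that calculation on hypothetical data. It is not taken from subelements_to_tree.pl, which may treat subelements differently (for example by applying its minimum size filter or by weighting), so treat it only as an illustration of the measure.

```python
from itertools import combinations


def bray_curtis_distance(a, b):
    """Bray-Curtis distance between two sets of present subelements."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # two empty profiles are treated as identical here
    return 1.0 - 2.0 * len(a & b) / total


# Hypothetical presence profiles (genome -> set of subelement IDs).
presence = {
    "strainA": {"se1", "se2", "se3"},
    "strainB": {"se2", "se3", "se4"},
    "strainC": set(),
}
distances = {
    (g1, g2): bray_curtis_distance(presence[g1], presence[g2])
    for g1, g2 in combinations(sorted(presence), 2)
}
for pair, dist in distances.items():
    # Similarity, as used for the heatmap, is 1 - distance.
    print(pair, round(dist, 3), "similarity:", round(1 - dist, 3))
```

strainA and strainB share two of their three subelements each, giving a distance of 1 - 4/6, or about 0.333.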

Software:

The phylip executables 'neighbor', 'seqboot', and 'retree' are required. See 3. Installation above for instructions on downloading and installing these components.

The directory 'stt_support' must be in the same directory as subelements_to_tree.pl

Optional: You may want to use FigTree or a similar viewer to view and manipulate the intermediate tree.

Optional: Tree and heatmap files can also be viewed online using iTOL (http://itol.embl.de/)

Usage:

perl subelements_to_tree.pl -c clustage.subelements.csv -k clustage.subelements.key.txt

Required Inputs:

| Argument | Flag | Value | Example | Comment |
| --- | --- | --- | --- | --- |
| ClustAGE subelement csv file | -c | file name | -c clustage.subelements.csv | Can be the read-corrected csv file, if available |
| ClustAGE subelement key file | -k | file name | -k clustage.subelements.key.txt | This can be the read-corrected key file, if available |

Optional Inputs:

| Argument | Flag | Value | Example | Comment |
| --- | --- | --- | --- | --- |
| Minimum subelement size | -s | integer | -s 100 | Default: 100 |
| Number of bootstraps | -b | integer | -b 100 | Default: 100. Can be set to 0 to turn off bootstrap calculation |
| Collapse branches | -d | float | -d 0.5 | Collapse branches with bootstrap support below the value given. Must be a number between 0 and 1. Default: 0 (i.e. all branches will be shown) |
| Output file prefix | -o | string | -o output | Default: "output" |
| Leaf order rearrangement | -r | string | -r midpoint | See Note 1 below for description of options. Default: "midpoint" |
| Keep intermediate tree files | -x | no value | -x | Keeps unbootstrapped tree and bootstrap trees. Default: intermediate tree files will be deleted |
| Path to phylip directory | -p | directory path | -p /path/to/phylip-3.69/exe | This should be the directory that contains the phylip executables 'seqboot', 'neighbor' and 'retree'. Default: will look first in ClustAGE/bin folder or in PATH |

Note 1
Options for -r:

  • "midpoint" : (default) Midpoint root the tree such that the root is equidistant from the two farthest points on the tree
  • "user" : Program will pause after creating the initial tree and provide instructions in the terminal for manually rerooting the tree using FigTree before continuing.
  • "none" : No rearrangement of the initial tree will be performed

Outputs:

<prefix>.tre
Newick-formatted neighbor joining tree based on the Bray-Curtis distance matrix of accessory genome differences. If bootstraps were calculated, these will be included as branch labels. The tree can be viewed using FigTree or similar tree viewer or online using iTOL or EvolView.

<prefix>.sim_matrix.csv
Comma-separated table of pairwise accessory genome Bray-Curtis similarity values (1 - Bray-Curtis distance).

<prefix>.heatmap.txt
iTOL heatmap annotation file. If iTOL is being used to view the neighbor joining tree, this file can be dragged into the tree viewer window of iTOL to produce a heatmap of Bray-Curtis similarities. Under the "Advanced" tab in the "Controls" pane, make sure to set "Leaf sorting" to "None" to ensure the order of tree leaves matches the heatmap.

<prefix>.BCdist.phy
Phylip-formatted matrix of pairwise accessory genome Bray-Curtis distances. Dummy names are given to comply with phylip format name lengths. Key of dummy names to actual genome names is given in the file <prefix>.dummy_names.txt.

Figure 3: Example tree and heatmap output as plotted in iTOL. Tree has branches with less than 0.5 bootstrap support collapsed.


6.2 re-rank.pl

This script will change ranking information in ClustAGE output files. This will save you from having to re-run ClustAGE if you want to examine the same dataset in relation to another phenotype or characteristic.

Software:

Requires gnuplot if re-ordering of figures is desired.

Usage:

perl re-rank.pl -c clustage.subelements.csv -k clustage.subelements.key.txt

Required Inputs:

| Argument | Flag | Value | Example | Comment |
| --- | --- | --- | --- | --- |
| ClustAGE subelement csv file | -c | file name | -c subelements.csv | Can be the read-corrected csv file, if available |
| ClustAGE subelement key file | -k | file name | -k subelements.key.txt | This can be the read-corrected key file, if available |

Optional Inputs:

| Argument | Flag | Value | Example | Comment |
| --- | --- | --- | --- | --- |
| New ranks | -f | file name | -f new_ranks.txt | This can be the same format as the file list given to the -f input of ClustAGE. See Note 2 below. |
| Output prefix | -o | string | -o rerank | Default: "rerank" |
| Read-corrected csv | -C | file name | -C subelements.read_corrected.csv | |
| Read-corrected key | -K | file name | -K subelements.read_corrected.key.txt | |
| Output figures | -p | no value | -p | Default: No new figures will be output. Requires gnuplot |
| Annotation file list | -a | file name | -a annotation_list.txt | Same annotation file list given to ClustAGE. Used to add gene information to output figures |
| AGEs key file | -A | file name | -A clustage.AGEs.key.txt | File output by ClustAGE. Only needed for adding gene information to output figures |
| gnuplot location | --gnuplot | file name | --gnuplot /path/to/gnuplot | Default: will look first in ClustAGE/bin folder or in PATH |

Note 2
Description of new ranking file for -f:
File format can be the same as the tab-separated file list given to ClustAGE, i.e.

/path/to/accessory/file<tab>genome_name<tab>rank
  • The file path (first column) can be left blank as long as there is a <tab> before the genome name. Any characters before the first <tab> in the file will be ignored.
  • Only genomes with ranks to be changed need to be included in this file.
  • Genomes not included will not have their original rank values changed
  • Genomes that are included but not given a rank value will have ranks changed to "NA"
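The re-ranking rules above can be summarized in a few lines. This is a minimal sketch of the stated behavior only, not the code of re-rank.pl: the function name and example data are hypothetical, and skipping genome names that are not already known is an assumption made here for simplicity.

```python
def apply_new_ranks(current_ranks, new_rank_text):
    """Apply a new-ranks file (tab-separated text) to existing ranks.

    current_ranks: dict mapping genome name -> rank (as a string).
    Returns a new dict; the input dict is left unchanged.
    """
    updated = dict(current_ranks)
    for line in new_rank_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # no <tab> before the genome name: cannot be parsed
        # Anything before the first tab (the file path) is ignored.
        genome = fields[1].strip()
        rank = fields[2].strip() if len(fields) > 2 else ""
        if genome not in updated:
            continue  # assumption: unknown genome names are skipped
        # Listed without a rank value -> rank becomes "NA".
        updated[genome] = rank if rank else "NA"
    return updated


current = {"strainA": "1", "strainB": "2", "strainC": "3"}
new_rank_text = "\tstrainA\t7.5\n/ignored/path.fasta\tstrainB\n"
reranked = apply_new_ranks(current, new_rank_text)
print(reranked)
```

strainA receives the new rank, strainB is listed without a rank and becomes "NA", and strainC is not listed so it keeps its original rank.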

Outputs:

Will output new subelements.key.txt and subelements.csv files and, if requested, subelements.read_corrected.csv, subelements.read_corrected.key.txt, and graphs folder with the new rankings. See description of ClustAGE outputs above for more information.


7. ClustAGE Plot

Online utility to visualize accessory element distribution patterns throughout the population. See http://vfsmspineagent.fsm.northwestern.edu/cgi-bin/clustage_plot.cgi for instructions and more information.

Figure 4: Example ClustAGE Plot output


8. SUPPORT SOFTWARE

BLAST+ software (blastn and makeblastdb) is provided by the National Library of Medicine / National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov).
Reference: Altschul, S F et al. "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs." Nucleic Acids Research 25.17 (1997): 3389-3402.

gnuplot: Copyright 1986 - 1993, 1998, 2004 Thomas Williams, Colin Kelley. See copyright included with gnuplot for more information.

bwa: Provided under GNU GPL version 3. See copyright included with bwa for more information.

phylip: Copyright (c) 1980-2014, Joseph Felsenstein. All rights reserved. See copyright included with phylip for more information.

CompareToBootstrap.pl and MOTree.pm are provided by Morgan N. Price under GNU GPL version 2. Copyright 2008-2011 The Regents of the University of California.


9. LICENSE:

ClustAGE
Copyright (C) 2016-2018 Egon A. Ozer

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. See LICENSE.txt


10. CONTACT:

Contact Egon Ozer with questions or comments.

Written with MacDown.