ClustAGE takes a set of nucleotide sequences of accessory genomic elements (AGEs) from bacteria or other small genome organisms and clusters them to identify the minimum set of accessory genomic elements in the genomes. ClustAGE will also determine the distribution of each accessory genomic element among the provided genomic sequences.
Figure 1: ClustAGE algorithm schema
For more information about the identification of accessory genomic elements, see documentation for Spine and AGEnt.
ClustAGE is also available as a web-based application. The web version is limited to a maximum of 15 accessory genome sequence sets and does not support read-correction of AGEs. See http://vfsmspineagent.fsm.northwestern.edu/cgi-bin/clustage.cgi.
Download the version appropriate for your operating system (Mac OS X or 64-bit Linux) and move the ClustAGE directory to the desired location.
To use this software on another operating system, you will have to download and compile BLAST+ manually:
blastn v2.3.0 and makeblastdb v2.3.0
gnuplot >= v5.0 (for graphical output)
Linux (Ubuntu / Debian):
sudo apt-get install libcairo2-dev libpango1.0-dev
sudo apt-get install gnuplot
Linux (Fedora / Red Hat):
sudo yum install cairo-devel pango-devel
sudo yum install gnuplot
Mac OS X (using MacPorts):
sudo port install gnuplot +pangocairo
Mac OS X (using Homebrew):
brew install gnuplot --with-cairo
From source:
bwa >= v0.7.13 (for read confirmation of AGE distributions):
tar -jxvf bwa-0.7.13.tar.bz2
cd bwa-0.7.13
make
phylip >= v3.695 (for AGE distribution tree, required by utilities/subelements_to_tree.pl):
tar -zxvf phylip-3.696.tar.gz
cd phylip-3.696/src
make install
Copy the 'neighbor', 'seqboot', and 'retree' executables from phylip-3.696/exe to the 'bin' directory in the same directory as ClustAGE.pl or to a directory in your PATH.
Basic command:
perl ClustAGE.pl -f age_files.txt
For a list of options, call the script without any inputs:
perl ClustAGE.pl
-f or --file
File listing the accessory genomic element fasta files to compare, one per line, in the format:
/path/to/accessory_elements_1.fasta<tab>genome_name_1<tab>(optional)rank
/path/to/accessory_elements_2.fasta<tab>genome_name_2<tab>(optional)rank
(no spaces in genome names)
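The file list format above can be checked before a long run. The following is a minimal Python sketch, not part of ClustAGE: the names `AgeEntry` and `parse_age_file_list` are hypothetical, and the rank column is kept as raw text rather than interpreted.

```python
from typing import List, NamedTuple, Optional


class AgeEntry(NamedTuple):
    """One line of the -f file list (hypothetical helper type)."""
    fasta_path: str
    genome_name: str
    rank: Optional[str]  # optional third column, kept as raw text


def parse_age_file_list(text: str) -> List[AgeEntry]:
    """Parse tab-separated lines: path, genome name, optional rank."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) not in (2, 3):
            raise ValueError(f"line {line_no}: expected 2 or 3 tab-separated fields")
        path, name = fields[0], fields[1]
        if " " in name:
            raise ValueError(f"line {line_no}: genome name contains a space")
        rank = fields[2] if len(fields) == 3 and fields[2] else None
        entries.append(AgeEntry(path, name, rank))
    return entries
```

Calling `parse_age_file_list(open("age_files.txt").read())` would raise a `ValueError` on the first malformed line instead of letting the problem surface partway through a run.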
More on ranks:
Numeric ranks, including scientific notation (e.g. 1E-14), are allowed.
If the rank is given as R, the sequence will be considered 'reference': sequences belonging to this genome will NOT be used as bin representatives, but alignments of this genome against bin representatives will be reported.
--annot
File of annotation information to include in the output. The value of this option should be the path to a list of annotation files, one per line, in the following format:
/path/to/gen1.accessory_loci.txt<tab>genome_name_1
/path/to/gen2.accessory_loci.txt<tab>genome_name_2
Each annotation file should contain the following tab-separated columns:
locusID<tab>Source contig ID<tab>Source start<tab>Source stop<tab>Strand<tab>Accessory sequence ID<tab>Accessory start<tab>Accessory stop<tab>% of gene<tab>Overlap<tab>Gene product
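To make the eleven-column layout concrete, here is a small Python sketch that splits one annotation line into named fields. The field names in `ANNOT_COLUMNS` and the function `parse_annotation_line` are hypothetical, and treating the four coordinate columns as integers is an assumption; the other columns are left as text because their exact formats are not specified here.

```python
from typing import Dict, Union

# Hypothetical field names, in the column order given above.
ANNOT_COLUMNS = [
    "locus_id", "source_contig_id", "source_start", "source_stop", "strand",
    "accessory_seq_id", "accessory_start", "accessory_stop",
    "pct_of_gene", "overlap", "gene_product",
]


def parse_annotation_line(line: str) -> Dict[str, Union[str, int]]:
    """Split one tab-separated annotation line into a dict of named fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(ANNOT_COLUMNS):
        raise ValueError(f"expected {len(ANNOT_COLUMNS)} columns, got {len(fields)}")
    record: Dict[str, Union[str, int]] = dict(zip(ANNOT_COLUMNS, fields))
    # Assumption: coordinates are plain integers.
    for key in ("source_start", "source_stop", "accessory_start", "accessory_stop"):
        record[key] = int(record[key])
    return record
```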
--age
Fasta-formatted file of AGE sequences to be used as bin representatives. If this file is given, sequence files given above (-f) will not be used to identify new bin representatives. Instead they will only be aligned to the sequences given here.
-e or --evalue
maximum BLAST e-value cutoff
(default: 1E-6)
-i or --pctid
minimum nucleotide sequence identity, in %
(default: 85)
--dustoff
turns off default low complexity filtering by BLAST. Useful for species with high degrees of low complexity sequence.
(default: dust masking on)
-a or --maxalign
maximum number of BLAST alignments to report. Values too low may give incorrect results, especially when comparing a very large number of sequences.
(default: 100,000)
-x or --min_age
minimum accessory element size, in bp. This is the shortest possible sequence that will be used by ClustAGE as a bin representative.
(default: 200)
--min_align
minimum size of alignments against accessory elements to report
(default: 100)
-o or --out
prefix for output files
(default: "out")
--skip_se
skip determination of subelements within bin representatives
(default: subelements and subelement key files WILL be output)
-s or --min_se
minimum size of subelements, in bp, to be reported in output
(default: 1)
--min_se_seq
minimum size of subelement sequences, in bp, to be output in 'subelements.fasta' file
(default: 100)
-g or --min_gen
minimum number of genomes in which a subelement must be present to be included in csv output
(default: 1)
-v or --verbose
verbose output
--license
print license information and quit
-p or --graph
Output graphical representations of AGEs and their distributions in the input data sets. See the description of output files below for more information.
(default: no figures will be output)
--gnuplot
path to gnuplot executable
(default: will search for gnuplot first in the 'bin' directory, then in PATH)
--graph_se
plot subelement dividers in output figures
(default: no subelement dividers will be plotted)
--g_se_min
if subelement dividers are requested, minimum subelement size, in bp, to plot
(default: 20)
--g_type
output type. Choices are 'png' or 'pdf'. PDF output requires that gnuplot was compiled with the pdfcairo terminal.
(default: png)
Whole-genome assemblers using short reads can sometimes fail to assemble into contigs sequences that are present in the read set. ClustAGE can optionally align raw reads to the set of AGEs it has identified to determine whether AGEs missing from the assemblies of some of the included genomes can be found in the reads. This process is only additive, i.e. AGEs will only be added to a genome's AGE profile based on read alignments, never subtracted. Subelement sequences found by read alignment will be reported in a separate set of files labeled 'read_corrected'.
WARNING: This step can be VERY slow, but is recommended for draft genome sequences produced using de novo assembly of short reads (e.g. Illumina, IonTorrent, etc.)
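The "only additive" rule can be pictured as a logical OR between the assembly-based profile and the read-based profile. The Python sketch below is illustrative only: `read_correct` and the dictionary layout (subelement ID mapped to 1 or 0) are assumptions, not ClustAGE's internal data structures.

```python
from typing import Dict


def read_correct(assembly: Dict[str, int], reads: Dict[str, int]) -> Dict[str, int]:
    """Combine presence (1) / absence (0) calls so that reads can only add presence."""
    corrected = {}
    for subelement in sorted(set(assembly) | set(reads)):
        found = assembly.get(subelement, 0) or reads.get(subelement, 0)
        corrected[subelement] = 1 if found else 0
    return corrected
```

A subelement present in the assembly stays present even when no reads support it, which matches the statement that AGEs are never subtracted.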
-r or --reads
file with paths to sequencing reads, for confirmation. If sequencing reads are given, distribution of elements among genomes will be confirmed by read alignment.
File format:
genome_name<tab>/path/to/reads.fastq<tab>(optional)/path/to/reads_2.fastq
-c or --core
fasta file of sequences considered to be "core" or present in the majority of the input genomes. Can use the "backbone.fasta" file produced by Spine.
If given along with a reads file above, ClustAGE will align reads to the core genome sequence file and note reads aligning to the core. If these same reads are then found aligning to AGEs, they will only be considered a true alignment if the alignment quality is greater than for the alignment against core.
Not required, but recommended to reduce false-positive poor quality alignments.
(default: no core sequence)
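The core-versus-AGE comparison described for this option amounts to a simple per-read rule. The sketch below expresses it in Python under stated assumptions: `counts_as_age_alignment` is a hypothetical name, and "alignment quality" is represented as a single number where higher is better; the actual quality measure ClustAGE uses is not specified here.

```python
from typing import Optional


def counts_as_age_alignment(age_quality: float, core_quality: Optional[float]) -> bool:
    """Decide whether a read's alignment to an AGE is kept.

    core_quality is None when the read did not align to the core sequences.
    """
    if core_quality is None:
        return True
    # Keep the AGE alignment only if it is strictly better than the core alignment.
    return age_quality > core_quality
```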
-d or --depth
minimum read depth
(default: 5)
--bwa
path to bwa executable
(default: will search for bwa first in the 'bin' directory, then in PATH)
-t or --threads
number of threads (for bwa only)
(default: will be automatically determined based on number of available CPUs)
command.txt
Version numbers of ClustAGE, version numbers of support software used by ClustAGE, and list of parameters given to ClustAGE
AGEs.key.txt
Characteristics of accessory genomic element (AGE) representatives
Column header | Description |
---|---|
bin_id | Unique identifier given to the representative sequence. These IDs correspond to the sequence IDs in the "AGEs.fasta" file. |
source_id | ID of the sequence that served as the source for this representative |
source_genome | Genome name for the source sequence |
source_length | Length of the source sequence, in bases |
bin_start | Start coordinate of the region on the source sequence that corresponds to this representative (1-based) |
bin_stop | Stop coordinate of the region on the source sequence that corresponds to this representative (1-based) |
bin_length | Length of the representative accessory region, in bases |
AGEs.fasta
Nucleotide sequences of the representative AGE sequences output by ClustAGE. Original sources of the sequences are given on the ID line or can be determined by cross-referencing with AGEs.key.txt file.
AGEs.annotations.txt (if annotation files were included as input to ClustAGE)
Genes contained within representative accessory regions.
Column header | Description |
---|---|
bin_id | Unique identifier given to the representative sequence. Corresponds to sequence headers in AGEs.fasta file and bin_ids in AGEs.key.txt file |
annotation(s) | Comma-separated list of genes within the AGE. Each entry takes the form of locus ID followed by the percentage of the gene contained within the AGE (by nucleotide length) in square brackets, followed by the gene product in double quotation marks. Example: PA2185[100.00%]"non-heme catalase KatN",PA2186[100.00%]"hypothetical protein" |
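The annotation(s) column packs several values into one string, so a small parser can help when post-processing this file. The Python sketch below is one way to split it; `parse_annotations` is a hypothetical helper, and it assumes gene products never contain a double quotation mark.

```python
import re
from typing import List, Tuple

# locus ID, then [percent%], then "gene product"
ENTRY = re.compile(r'([^\[",]+)\[([0-9.]+)%\]"([^"]*)"')


def parse_annotations(cell: str) -> List[Tuple[str, float, str]]:
    """Return (locus ID, percent of gene, gene product) for each entry in the cell."""
    return [(m.group(1), float(m.group(2)), m.group(3)) for m in ENTRY.finditer(cell)]
```

Because the gene product is matched between its quotation marks, a comma inside a product name does not split the entry.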
subelements.key.txt
Characteristics of subelements of AGEs. Subelements are subdivisions of AGEs based on the distribution of parts of the AGE among the strains being examined. Subelement borders occur at points where there is a change in the group of strains in which a discrete part of the representative AGE is found.
Column header | Description |
---|---|
subelement | ID of the subelement section. |
bin_id | ID of the AGE from which the subelement was derived |
source_id | ID of the sequence that served as the source for the AGE |
source_genome | Genome name for the source sequence |
start | start coordinate of the subelement section along the AGE (1-based) |
stop | stop coordinate of the subelement section along the AGE (1-based) |
length | length of the subelement |
avg_rank | if ranking information was provided, this is the average of the rank values for ranked genomes that contain this subelement |
num_genomes | total number of genomes in which this subelement was identified |
"genome_name (rank)" | All subsequent columns will show the presence (1) or absence (0) of the subelement |
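The idea of subelement borders, and the avg_rank column, can be illustrated with a short Python sketch. This is not ClustAGE's implementation: `split_into_subelements`, `average_rank`, and the input layout (genome name mapped to 1-based inclusive intervals along the AGE) are assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, stop) along the AGE


def split_into_subelements(
    age_length: int, hits: Dict[str, List[Interval]]
) -> List[Tuple[int, int, FrozenSet[str]]]:
    """Cut the AGE wherever the set of genomes covering it changes."""
    cuts = {1, age_length + 1}
    for intervals in hits.values():
        for start, stop in intervals:
            cuts.update((start, stop + 1))
    ordered = sorted(c for c in cuts if 1 <= c <= age_length + 1)
    segments: List[Tuple[int, int, FrozenSet[str]]] = []
    for left, right in zip(ordered, ordered[1:]):
        present = frozenset(
            g for g, ivs in hits.items() if any(s <= left <= e for s, e in ivs)
        )
        if segments and segments[-1][2] == present:
            # Same genome set as the previous segment: extend it instead of cutting.
            segments[-1] = (segments[-1][0], right - 1, present)
        else:
            segments.append((left, right - 1, present))
    return segments


def average_rank(present: FrozenSet[str], ranks: Dict[str, float]) -> Optional[float]:
    """Mean rank over the ranked genomes that contain the subelement."""
    values = [ranks[g] for g in present if g in ranks]
    return sum(values) / len(values) if values else None
```

For a 1000 bp AGE found in full in genome A, at positions 1-400 in genome B, and at positions 301-1000 in genome C, this yields three subelements: 1-300 (A, B), 301-400 (A, B, C), and 401-1000 (A, C).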
subelements.fasta
Nucleotide sequences of subelements output by ClustAGE. By default, ClustAGE only outputs subelement sequences at least 100 bp in length. This can be adjusted using the --min_se_seq option.
subelements.annotations.txt (if annotation files were included as input to ClustAGE)
Genes contained within subelement regions.
Column header | Description |
---|---|
subelement | Unique identifier given to the subelement. Corresponds to sequence headers in subelements.fasta file and subelements.key.txt file |
annotation(s) | Comma-separated list of genes within the subelement. Each entry takes the form of locus ID followed by the percentage of the gene contained within the subelement (by nucleotide length) in square brackets, followed by the gene product in double quotation marks. Example: PA2185[100.00%]"non-heme catalase KatN",PA2186[100.00%]"hypothetical protein" |
subelements.csv
Comma-separated list of subelement distributions among the included genomes.
subelements.alignments.txt
Sources of sequences in each genome containing subelements.
Column headers and descriptions:
subelement: the subelement ID
Columns ending in _contig: ID of the source sequence on which the subelement is found. If the subelement was not found in this genome, the value will be "-"
Columns ending in _start: start coordinate of the subelement in the source sequence (1-based). Negative numbers indicate the sequence was found on the reverse strand of the sequence. If the subelement was not found in this genome, the value will be 0.
Columns ending in _stop: stop coordinate of the subelement in the source sequence (1-based). Negative numbers indicate the sequence was found on the reverse strand of the sequence. If the subelement was not found in this genome, the value will be 0.
graphs folder
Contains graphical representations of distributions of each AGE output by ClustAGE. Each file corresponds to one AGE as indicated by the filename. If annotation information for the AGE representative genome was given, this will be shown at the top of the figure with genes on the forward strand shown as green arrows and genes on the reverse strand shown as orange arrows. If a gene begins and/or ends outside the boundaries of the AGE, this will be indicated by a vertical dashed line at the beginning or end of the line. The AGE in the reference genome will be a red bar. Presence of the AGE or portions of the AGE will be indicated in blue on the lines corresponding to the genome. If read confirmation was selected and read alignment revealed AGEs not present in the assembled sequences, these regions will be indicated with green bars. The intensity of color in the bars corresponds to the sequence identity of the alignment as indicated by scale bars on the right side of the figure.
Figure 2: Example AGE graph output
These scripts are located in the 'utilities' directory included with ClustAGE.
The subelements_to_tree.pl pipeline script calculates a Bray-Curtis distance matrix from the distributions of accessory elements and uses it to create a neighbor joining tree of accessory element distribution patterns. Note that these distances are not based on sequence similarity, but only on the presence or absence of an accessory element within a genome, within the threshold parameters given to ClustAGE. The script also produces output files that can be used to create a heatmap of Bray-Curtis similarity values.
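For readers unfamiliar with the measure, the following Python sketch computes a Bray-Curtis distance between two genomes from presence (1) / absence (0) vectors, using the standard definition (sum of absolute differences divided by sum of totals). It is a generic illustration, not the script's code: `bray_curtis` is a hypothetical name, each subelement is counted equally, and whether the script applies any weighting (for example by subelement length) is not stated here.

```python
from typing import Sequence


def bray_curtis(x: Sequence[float], y: Sequence[float]) -> float:
    """Bray-Curtis distance: sum(|x_i - y_i|) / sum(x_i + y_i)."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    # Two empty profiles are treated as identical (an assumption of this sketch).
    return numerator / denominator if denominator else 0.0


# Two genomes scored for four subelements; they differ at two of them.
genome_a = [1, 1, 0, 1]
genome_b = [1, 0, 1, 1]
distance = bray_curtis(genome_a, genome_b)  # 2 / 6
similarity = 1 - distance                   # value reported as Bray-Curtis similarity
```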
The phylip executables 'neighbor', 'seqboot', and 'retree' are required. See the installation instructions above for downloading and installing these components.
The directory 'stt_support' must be in the same directory as subelements_to_tree.pl.
Optional: You may want to use FigTree or a similar viewer to view and manipulate the intermediate tree.
Optional: Tree and heatmap files can also be viewed online using iTOL (http://itol.embl.de/)
perl subelements_to_tree.pl -c clustage.subelements.csv -k clustage.subelements.key.txt
Argument | Flag | Value | Example | Comment |
---|---|---|---|---|
ClustAGE subelement csv file | -c | file name | -c clustage.subelements.csv | Can be the read-corrected csv file, if available |
ClustAGE subelement key file | -k | file name | -k clustage.subelements.key.txt | Can be the read-corrected key file, if available |
Argument | Flag | Value | Example | Comment |
---|---|---|---|---|
Minimum subelement size | -s | integer | -s 100 | Default: 100 |
Number of bootstraps | -b | integer | -b 100 | Default: 100. Can be set to 0 to turn off bootstrap calculation |
Collapse branches | -d | float | -d 0.5 | Collapse branches with bootstrap support below the value given. Must be a number between 0 and 1. Default: 0 (i.e. all branches will be shown) |
Output file prefix | -o | string | -o output | Default: "output" |
Leaf order rearrangement | -r | string | -r midpoint | See Note 1 below for description of options. Default: "midpoint" |
Keep intermediate tree files | -x | no value | -x | Keeps unbootstrapped tree and bootstrap trees. Default: intermediate tree files will be deleted |
Path to phylip directory | -p | directory path | -p /path/to/phylip-3.69/exe | This should be the directory that contains the phylip executables 'seqboot', 'neighbor' and 'retree'. Default: will look first in ClustAGE/bin folder or in PATH |
Note 1
Options for -r:
\<prefix>.tre
Newick-formatted neighbor joining tree based on the Bray-Curtis distance matrix of accessory genome differences. If bootstraps were calculated, these will be included as branch labels. The tree can be viewed using FigTree or a similar tree viewer, or online using iTOL or EvolView.
\<prefix>.sim_matrix.csv
Comma-separated table of pairwise accessory genome Bray-Curtis similarity values (1 - Bray-Curtis distance).
\<prefix>.heatmap.txt
iTOL heatmap annotation file. If iTOL is being used to view the neighbor joining tree, this file can be dragged into the tree viewer window of iTOL to produce a heatmap of Bray-Curtis similarities. Under the "Advanced" tab in the "Controls" pane, make sure to set "Leaf sorting" to "None" to ensure the order of tree leaves matches the heatmap.
\<prefix>.BCdist.phy
Phylip-formatted matrix of pairwise accessory genome Bray-Curtis distances. Dummy names are given to comply with phylip format name lengths. A key of dummy names to actual genome names is given in the file \<prefix>.dummy_names.txt.
Figure 3: Example tree and heatmap output as plotted in iTOL. Tree has branches with less than 0.5 bootstrap support collapsed.
The re-rank.pl script changes the ranking information in ClustAGE output files. This saves you from having to re-run ClustAGE if you want to examine the same dataset in relation to another phenotype or characteristic.
Requires gnuplot if re-ordering of figures is desired.
perl re-rank.pl -c clustage.subelements.csv -k clustage.subelements.key.txt
Argument | Flag | Value | Example | Comment |
---|---|---|---|---|
ClustAGE subelement csv file | -c | file name | -c subelements.csv | Can be the read-corrected csv file, if available |
ClustAGE subelement key file | -k | file name | -k subelements.key.txt | Can be the read-corrected key file, if available |
Argument | Flag | Value | Example | Comment |
---|---|---|---|---|
New ranks | -f | file name | -f new_ranks.txt | This can be the same format as the file list given to the -f input of ClustAGE. See Note 2 below. |
Output prefix | -o | string | -o rerank | Default: "rerank" |
Read-corrected csv | -C | file name | -C subelements.read_corrected.csv | |
Read-corrected key | -K | file name | -K subelements.read_corrected.key.txt | |
Output figures | -p | no value | -p | Default: No new figures will be output. Requires gnuplot |
Annotation file list | -a | file name | -a annotation_list.txt | Same annotation file list given to ClustAGE. Used to add gene information to output figures |
AGEs key file | -A | file name | -A clustage.AGEs.key.txt | File output by ClustAGE. Only used for adding gene information to output figures |
gnuplot location | --gnuplot | file name | --gnuplot /path/to/gnuplot | Default: will look first in ClustAGE/bin folder or in PATH |
Note 2
Description of new ranking file for -f:
File format can be the same as the tab-separated file list given to ClustAGE, i.e.
/path/to/accessory/file<tab>genome_name<tab>rank
The script will output new subelements.key.txt and subelements.csv files and, if requested, subelements.read_corrected.csv, subelements.read_corrected.key.txt, and a graphs folder with the new rankings. See the description of ClustAGE outputs above for more information.
ClustAGE Plot is an online utility to visualize accessory element distribution patterns throughout the population. See http://vfsmspineagent.fsm.northwestern.edu/cgi-bin/clustage_plot.cgi for instructions and more information.
Figure 4: Example ClustAGE Plot output
BLAST+ software (blastn and makeblastdb) is provided by the National Library of Medicine / National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov).
Reference: Altschul, S F et al. "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs." Nucleic Acids Research 25.17 (1997): 3389-3402.
gnuplot: Copyright 1986 - 1993, 1998, 2004 Thomas Williams, Colin Kelley. See copyright included with gnuplot for more information.
bwa: Provided under GNU GPL version 3. See copyright included with bwa for more information.
phylip: Copyright (c) 1980-2014, Joseph Felsenstein. All rights reserved. See copyright included with phylip for more information.
CompareToBootstrap.pl and MOTree.pm are provided by Morgan N. Price under GNU GPL version 2. Copyright 2008-2011 The Regents of the University of California.
ClustAGE
Copyright (C) 2016-2018 Egon A. Ozer
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. See LICENSE.txt
Contact Egon Ozer with questions or comments.
Written with MacDown.