ClustAGE Code

Brought to you by: egonozer

File	Date	Author	Commit
bin_linux	2017-06-28	egonozer	[3caf30] raw file format update. Option to turn off dust...
bin_mac	2017-02-01	egonozer	[087310] Added output for gui frontend. Allow Excel line...
readme_figures	2017-06-28	egonozer	[3caf30] raw file format update. Option to turn off dust...
utilities	2020-09-25	egonozer	[9d7cf1] added reference subelement capability
.gitignore	2016-08-01	egonozer	[01d895] initial commit to git
ClustAGE.pl	2020-10-20	egonozer	[3134ca] Added age set extension and ability to select e...
ClustAGE_tkx	2020-09-25	egonozer	[9d7cf1] added reference subelement capability
LICENSE.txt	2016-09-22	egonozer	[6fa8c8] added license
README.md	2018-02-06	egonozer	[0bf5ec] updated README

Read Me

ClustAGE

1. INTRODUCTION:

ClustAGE takes a set of nucleotide sequences of accessory genomic elements (AGEs) from bacteria or other small genome organisms and clusters them to identify the minimum set of accessory genomic elements in the genomes. ClustAGE will also determine the distribution of each accessory genomic element among the provided genomic sequences.

Figure 1: ClustAGE algorithm schema

For more information about the identification of accessory genomic elements, see documentation for Spine and AGEnt.

ClustAGE is also available as a web-based application. The web version is limited to a maximum of 15 accessory genome sequence sets and does not support read-correction of AGEs. See http://vfsmspineagent.fsm.northwestern.edu/cgi-bin/clustage.cgi.

2. REQUIREMENTS:

Perl 5.10 or above
Mac OSX or Linux. We provide no guarantees that this will work on
Windows or other operating systems.

3. INSTALLATION:

Simply download the version appropriate for your operating system (Mac OSX or Linux 64-bit) and move the ClustAGE directory to the desired location.

If you would like to use this software on another operating system, you will need to obtain BLAST+ yourself, either as a pre-compiled download or by compiling it from source:

blastn v2.3.0 and makeblastdb v2.3.0

Either download the pre-compiled version appropriate for your system or build from source available from here: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.3.0/
Copy the blastn and makeblastdb executables into the 'bin' directory in the same directory as ClustAGE.pl

Optional Software Installation:

gnuplot >= v5.0 (for graphical output)
Linux (Ubuntu / Debian):

sudo apt-get install libcairo2-dev libpango1.0-dev
sudo apt-get install gnuplot

Linux (Fedora / Red Hat):

sudo yum install cairo-devel pango-devel
sudo yum install gnuplot

Mac OS X (using MacPorts):

sudo port install gnuplot +pangocairo

Mac OS X (using Homebrew):

brew install gnuplot --with-cairo

From source:

Install cairo using instructions given at https://www.cairographics.org/download
Download gnuplot source code from https://sourceforge.net/projects/gnuplot/files/gnuplot
Follow installation instructions given in the INSTALL file
Copy the gnuplot executable into the 'bin' directory in the same directory as ClustAGE.pl or to a directory in your PATH

bwa >= v0.7.13 (for read confirmation of AGE distributions):

Download bwa from here: https://sourceforge.net/projects/bio-bwa-files/
tar -jxvf bwa-0.7.13.tar.bz2
cd bwa-0.7.13
make
Copy the bwa executable into the 'bin' directory in the same directory as ClustAGE.pl or to a directory in your PATH

phylip >= v3.695 (for AGE distribution tree, required by utilities/subelements_to_tree.pl):

Download source code from http://evolution.genetics.washington.edu/phylip/ (most recent archive).
tar -zxvf phylip-3.696.tar.gz
cd phylip-3.696/src
make install
Copy the following executables in phylip-3.696/exe to the 'bin' directory in the same directory as ClustAGE.pl or to a directory in your PATH:
- neighbor
- seqboot
- retree

4. USAGE:

Basic command: perl ClustAGE.pl -f age_files.txt

For list of options, call the script without any inputs: perl ClustAGE.pl

4.1 Required Inputs:

-f or --file
File of accessory genome element fasta files for comparison
format:

/path/to/accessory_elements_1.fasta<tab>genome_name_1<tab>(optional)rank
/path/to/accessory_elements_2.fasta<tab>genome_name_2<tab>(optional)rank

(no spaces in genome names)
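
A file list in this format can be written by hand or generated. The following Python sketch is illustrative only: the function name `build_file_list` and the choice to take the genome name from the file name up to the first '.' are assumptions made for this example, not part of ClustAGE.

```python
from pathlib import Path


def build_file_list(fasta_paths, ranks=None):
    """Return tab-separated lines for the ClustAGE '-f' file list.

    fasta_paths: iterable of paths to accessory element fasta files.
    ranks: optional dict mapping genome name -> rank value.
    Genome names are taken from the file name up to the first '.'
    (an assumption made for this example).
    """
    ranks = ranks or {}
    lines = []
    for path in fasta_paths:
        path = Path(path)
        name = path.name.split(".")[0]
        # Genome names may not contain spaces.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid genome name: {name!r}")
        fields = [str(path), name]
        if name in ranks:
            fields.append(str(ranks[name]))  # optional third column
        lines.append("\t".join(fields))
    return lines
```

Writing the returned lines to a file, one per line, gives a list that can be passed to `-f`.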

4.2 Optional Inputs:

--annot
File of annotation information to include in the output. The value of this option should be the path to a list of annotation files in the following format:

/path/to/gen1.accessory_loci.txt<tab>genome_name_1
/path/to/gen2.accessory_loci.txt<tab>genome_name_2

Annotation files should be in the format output by Spine or AGEnt ("loci.txt"), i.e.

locusID<tab>Source contig ID<tab>Source start<tab>Source stop<tab>Strand<tab>Accessory sequence ID<tab>Accessory start<tab>Accessory stop<tab>% of gene<tab>Overlap<tab>Gene product

Genome names must EXACTLY match those in the file given to option '-f'
'Overlap' is the number of bases of the gene not contained in the accessory element. See the Spine documentation for more information.

--age
Fasta-formatted file of AGE sequences to be used as bin representatives. If this file is given, sequence files given above (-f) will not be used to identify new bin representatives. Instead they will only be aligned to the sequences given here.

-e or --evalue
maximum BLAST e-value cutoff
(default: 1E-6)

-i or --pctid
minimum nucleotide sequence identity, in %
(default: 85)

--dustoff
turns off default low complexity filtering by blast. Useful for species with high degrees of low complexity sequence.
(default: dust masking on)

-a or --maxalign
maximum number of BLAST alignments to report. Values too low may give incorrect results, especially when comparing a very large number of sequences.
(default: 100,000)

-x or --min_age
minimum accessory element size, in bp. This is the shortest possible sequence that will be used by ClustAGE as a bin representative.
(default: 200)

--min_align
minimum size of alignments against accessory elements to report
(default: 100)

-o or --out
prefix for output files
(default: "out")

--skip_se
skip determination of subelements within bin representatives
(default: subelements and subelement key files WILL be output)

-s or --min_se
minimum size of subelements, in bp, to be reported in output
(default: 1)

--min_se_seq
minimum size of subelement sequences, in bp, to be output in 'subelements.fasta' file
(default: 100)

-g or --min_gen
minimum number of genomes in which a subelement must be present to be included in csv output
(default: 1)

-v or --verbose
verbose output

--license
print license information and quit

Figure Output Option: (Requires gnuplot)

-p or --graph
Output graphical representations of AGEs and their distributions in the input data sets. See Output Files section below for more information.
(default: no figures will be output)

--gnuplot
path to gnuplot executable
(default: will search for gnuplot first in the 'bin' directory, then in PATH)

--graph_se
plot subelement dividers in output figures
(default: no subelement dividers will be plotted)

--g_se_min
if subelement dividers are requested, minimum subelement size, in bp, to plot
(default: 20)

--g_type
output type. Choices are 'png' or 'pdf'. PDF output requires that gnuplot was compiled with pdfcairo terminal.
(default: png)

Result Confirmation Options: (Requires bwa)

Whole-genome assemblers working from short reads sometimes omit sequences that are present in the read set, so a sequence may fail to be assembled into contigs in some genomes. ClustAGE can optionally align raw reads to the set of AGEs it identifies to determine whether AGEs missing from the assemblies of some of the included genomes can be found in the reads. This process is only additive, i.e. AGEs will only be added to a genome's AGE profile based on read alignments, never subtracted. Subelement sequences identified by read alignment will be reported in a separate set of files labeled 'read_corrected'.
WARNING: This step can be VERY slow, but is recommended for draft genome sequences produced using de novo assembly of short reads (i.e. Illumina, IonTorrent, etc.)

-r or --reads
file with paths to sequencing reads, for confirmation. If sequencing reads are given, distribution of elements among genomes will be confirmed by read alignment.
File format:

genome_name<tab>/path/to/reads.fastq<tab>(optional)/path/to/reads_2.fastq

Read files must be in fastq format
Genome names must EXACTLY match those in the file given to option '-f'
If forward and reverse read files are available for a particular genome, they can be given, in order, separated by a tab
Gzipped read files (ending with '.gz') are allowed
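
Because genome names in the reads file must exactly match those in the '-f' file, it can help to compare the two lists before starting a long run. The Python sketch below is one way to do that; the function names are hypothetical and it assumes both files are tab-separated as described above.

```python
def genome_names(lines, name_column):
    """Collect genome names from tab-separated list lines."""
    names = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        names.add(line.split("\t")[name_column])
    return names


def unmatched_read_names(age_list_lines, reads_list_lines):
    """Return names in the reads list that are absent from the '-f' list."""
    age_names = genome_names(age_list_lines, 1)     # -f: path<tab>name<tab>rank
    read_names = genome_names(reads_list_lines, 0)  # -r: name<tab>reads...
    return sorted(read_names - age_names)
```

An empty result means every genome name in the reads file also appears in the '-f' file. The comparison is case-sensitive, in keeping with the exact-match requirement.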

-c or --core
fasta file of sequences considered to be "core" or present in the majority of the input genomes. Can use the "backbone.fasta" file produced by Spine.
If given along with a reads file above, ClustAGE will align reads to the core genome sequence file and note which reads align to the core. If these same reads are then found aligning to AGEs, they will only be considered a true alignment if the alignment quality is greater than that of the alignment against the core.
Not required, but recommended to reduce false-positive poor quality alignments.
(default: no core sequence)

-d or --depth
minimum read depth
(default: 5)

--bwa
path to bwa executable
(default: will search for bwa first in the 'bin' directory, then in PATH)

-t or --threads
number of threads (for bwa only)
(default: determined automatically from the number of available CPUs)

5. OUTPUT FILES:

command.txt
Version numbers of ClustAGE, version numbers of support software used by ClustAGE, and list of parameters given to ClustAGE

AGEs.key.txt
Characteristics of accessory genomic element (AGE) representatives

Column header	Description
`bin_id`	Unique identifier given to the representative sequence. These IDs correspond to the sequence IDs in the "AGEs.fasta" file.
`source_id`	ID of the sequence that served as the source for this representative
`source_genome`	Genome name for the source sequence
`source_length`	Length of the source sequence, in bases
`bin_start`	Start coordinate of the region on the source sequence that corresponds to this representative (1-based)
`bin_stop`	Stop coordinate of the region on the source sequence that corresponds to this representative (1-based)
`bin_length`	Length of the representative accessory region, in bases
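
The `bin_start` and `bin_stop` columns can be used to recover a representative from its source sequence. The Python sketch below shows the coordinate conversion; the function name is hypothetical, and it assumes the 1-based coordinates are inclusive with `bin_start` no greater than `bin_stop`, which this README does not state explicitly.

```python
def representative_region(source_seq, bin_start, bin_stop):
    """Return the part of source_seq covered by a bin representative.

    Coordinates are 1-based; this sketch assumes they are inclusive
    and that bin_start <= bin_stop.
    """
    if not 1 <= bin_start <= bin_stop <= len(source_seq):
        raise ValueError("coordinates fall outside the source sequence")
    # Convert 1-based inclusive coordinates to a 0-based Python slice.
    return source_seq[bin_start - 1:bin_stop]
```

Under these assumptions the length of the returned string equals `bin_stop - bin_start + 1`.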

AGEs.fasta
Nucleotide sequences of the representative AGE sequences output by ClustAGE. Original sources of the sequences are given on the ID line or can be determined by cross-referencing with AGEs.key.txt file.

AGEs.annotations.txt (if annotation files were included as input to ClustAGE)
Genes contained within representative accessory regions.

Column header	Description
`bin_id`	Unique identifier given to the representative sequence. Corresponds to sequence headers in AGEs.fasta file and bin_ids in AGEs.key.txt file
`annotation(s)`	Comma-separated list of genes within the AGE. Each entry takes the form of locus ID followed by the percentage of the gene contained within the AGE (by nucleotide length) in square brackets, followed by the gene product in double quotation marks. Example: `PA2185[100.00%]"non-heme catalase KatN",PA2186[100.00%]"hypothetical protein"`
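
The `annotation(s)` value can be split back into its parts with a regular expression. The following Python sketch is one possible parser, not code shipped with ClustAGE; it assumes locus IDs contain no commas, brackets, or double quotes and that gene products contain no double quotes.

```python
import re

# One entry: locus ID, then [percent%], then "gene product".
ENTRY = re.compile(r'([^,\[\]"]+)\[([0-9.]+)%\]"([^"]*)"')


def parse_annotations(field):
    """Split an 'annotation(s)' value into (locus_id, percent, product) tuples."""
    return [
        (locus, float(pct), product)
        for locus, pct, product in ENTRY.findall(field)
    ]
```

Matching whole entries rather than splitting on commas keeps gene products that themselves contain commas intact.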

subelements.key.txt
Characteristics of subelements of AGEs. Subelements are subdivisions of AGEs based on the distribution of parts of the AGE among the strains being examined. Subelement borders occur at points where there are changes in the group of strains in which a discrete part of the representative AGE is found.

Column header	Description
`subelement`	ID of the subelement section.
`bin_id`	ID of the AGE from which the subelement was derived
`source_id`	ID of the sequence that served as the source for the AGE
`source_genome`	Genome name for the source sequence
`start`	start coordinate of the subelement section along the AGE (1-based)
`stop`	stop coordinate of the subelement section along the AGE (1-based)
`length`	length of the subelement
`avg_rank`	if ranking information was provided, this is the average of the rank values for ranked genomes that contain this subelement
`num_genomes`	total number of genomes in which this subelement was identified
"genome_name (rank)"	All subsequent columns, one per genome, show the presence (1) or absence (0) of the subelement in that genome

subelements.fasta
Nucleotide sequences of subelements output by ClustAGE. By default, ClustAGE only outputs subelement sequences at least 100 bp in length. This can be adjusted using the --min_se_seq option.

subelements.annotations.txt (if annotation files were included as input to ClustAGE)
Genes contained within subelement regions.

Column header	Description
`subelement`	Unique identifier given to the subelement. Corresponds to sequence headers in subelements.fasta file and subelements.key.txt file
`annotation(s)`	comma-separated list of genes within the subelement. Each entry takes the form of locus ID followed by the percentage of the gene contained within the subelement (by nucleotide length) in square brackets, followed by the gene product in double quotation marks. Example: `PA2185[100.00%]"non-heme catalase KatN",PA2186[100.00%]"hypothetical protein"`

subelements.csv
Comma-separated list of subelement distributions among the included genomes.

First column is the genome name.
Second column is the rank of genome (if given)
Each subsequent column corresponds to a subelement named in the first row. If the subelement is present in a particular genome, this will be indicated by a 1. Absence of a subelement is indicated by a 0.
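
For downstream analysis, the layout described above can be read into a presence/absence table. The Python sketch below is a minimal reader with a hypothetical function name; it assumes the rank column is always present (possibly empty), which should be checked against your own output.

```python
import csv


def read_subelement_matrix(lines):
    """Read a subelements.csv-style table from an iterable of lines.

    Returns (subelement_ids, presence) where presence maps
    genome name -> set of subelement IDs marked with 1.
    """
    reader = csv.reader(lines)
    header = next(reader)
    subelement_ids = header[2:]  # first two columns: genome name, rank
    presence = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        genome, flags = row[0], row[2:]
        presence[genome] = {
            se for se, flag in zip(subelement_ids, flags) if flag.strip() == "1"
        }
    return subelement_ids, presence
```

An open file handle can be passed directly as `lines`, since `csv.reader` accepts any iterable of strings.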

subelements.alignments.txt
Sources of sequences in each genome containing subelements.
Column headers and descriptions:

First column (subelement) is the subelement ID
Each subsequent group of three columns corresponds to one of the included genomes
- Column 1 of 3 (_contig): ID of the source sequence on which the subelement is found. If the subelement was not found in this genome, the value will be "-"
- Column 2 of 3 (_start): start coordinate of the subelement in the source sequence (1-based). Negative numbers indicate the sequence was found on the reverse strand of the sequence. If the subelement was not found in this genome, the value will be 0.
- Column 3 of 3 (_stop): stop coordinate of the subelement in the source sequence (1-based). Negative numbers indicate the sequence was found on the reverse strand of the sequence. If the subelement was not found in this genome, the value will be 0.
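
The three-column groups and the sign convention for strand can be decoded as follows. This Python sketch is illustrative: the function name is hypothetical, and it assumes each row has already been split into columns and that the per-genome header columns end in `_contig`, `_start`, and `_stop` as suggested above.

```python
def parse_alignment_row(header, row):
    """Map genome name -> (contig, start, stop, strand) for one subelement.

    header and row are lists of column values. Genomes in which the
    subelement was not found (contig "-") are omitted.
    """
    hits = {}
    # Column 0 is the subelement ID; the rest come in groups of three.
    for i in range(1, len(header) - 2, 3):
        genome = header[i].rsplit("_contig", 1)[0]
        contig, start, stop = row[i], int(row[i + 1]), int(row[i + 2])
        if contig == "-":
            continue
        # Negative coordinates mark a hit on the reverse strand.
        strand = "-" if start < 0 or stop < 0 else "+"
        hits[genome] = (contig, abs(start), abs(stop), strand)
    return hits
```

The coordinates are returned as absolute values with the strand reported separately, which is often easier to pass to other tools.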

graphs folder
Contains graphical representations of distributions of each AGE output by ClustAGE. Each file corresponds to one AGE as indicated by the filename. If annotation information for the AGE representative genome was given, this will be shown at the top of the figure with genes on the forward strand shown as green arrows and genes on the reverse strand shown as orange arrows. If a gene begins and/or ends outside the boundaries of the AGE, this will be indicated by a vertical dashed line at the beginning or end of the line. The AGE in the reference genome will be a red bar. Presence of the AGE or portions of the AGE will be indicated in blue on the lines corresponding to the genome. If read confirmation was selected and read alignment revealed AGEs not present in the assembled sequences, these regions will be indicated with green bars. The intensity of color in the bars corresponds to the sequence identity of the alignment as indicated by scale bars on the right side of the figure.

Figure 2: Example AGE graph output

6. UTILITIES

These scripts are located in the 'utilities' directory included with ClustAGE.

6.1 subelements_to_tree.pl

This pipeline script calculates a Bray-Curtis distance matrix from distributions of accessory elements that it uses to create a neighbor joining tree of accessory element distribution patterns. Note these distances are not based on sequence similarity, but only on presence or absence of an accessory element within a genome within the threshold parameters given to ClustAGE. It will also produce output files that can be used to create a heatmap of Bray-Curtis similarity values.
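
To make the distance measure concrete, the Python sketch below computes Bray-Curtis distances from presence/absence profiles. It is not the code used by subelements_to_tree.pl: the function names are hypothetical, and it treats every subelement equally, whereas the script may weight or filter subelements (for example by size) in ways not described here.

```python
def bray_curtis_distance(a, b):
    """Bray-Curtis distance between two sets of subelement IDs.

    For presence/absence data this is
    (|A| + |B| - 2 * |A & B|) / (|A| + |B|).
    """
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    return (total - 2 * len(a & b)) / total


def distance_matrix(presence):
    """Pairwise distances for a dict of genome name -> set of subelements."""
    names = sorted(presence)
    matrix = [
        [bray_curtis_distance(presence[x], presence[y]) for y in names]
        for x in names
    ]
    return names, matrix
```

Two genomes that share one of their two subelements each have a distance of 0.5 under this definition. The corresponding similarity is 1 minus the distance.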

Software:

The phylip executables 'neighbor', 'seqboot', and 'retree' are required. See 3. Installation above for instructions on downloading and installing these components.

The directory 'stt_support' must be in the same directory as subelements_to_tree.pl

Optional: You may want to use FigTree or a similar viewer to view and manipulate the intermediate tree.

Optional: Tree and heatmap files can also be viewed online using iTOL (http://itol.embl.de/)

Usage:

perl subelements_to_tree.pl -c clustage.subelements.csv -k clustage.subelements.key.txt

Required Inputs:

Argument	Flag	Value	Example	Comment
ClustAGE subelement csv file	`-c`	file name	`-c clustage.subelements.csv`	Can be the read-corrected csv file, if available
ClustAGE subelement key file	`-k`	file name	`-k clustage.subelements.key.txt`	This can be the read-corrected key file, if available

Optional Inputs:

Argument	Flag	Value	Example	Comment
Minimum subelement size	`-s`	integer	`-s 100`	Default: 100
Number of bootstraps	`-b`	integer	`-b 100`	Default: 100. Can be set to 0 to turn off bootstrap calculation
Collapse branches	`-d`	float	`-d 0.5`	Collapse branches with bootstrap support below the value given. Must be a number between 0 and 1. Default: 0 (i.e. all branches will be shown)
Output file prefix	`-o`	string	`-o output`	Default: "output"
Leaf order rearrangement	`-r`	string	`-r midpoint`	See Note 1 below for description of options. Default: "midpoint"
Keep intermediate tree files	`-x`	no value	`-x`	Keeps unbootstrapped tree and bootstrap trees. Default: intermediate tree files will be deleted
Path to phylip directory	`-p`	directory path	`-p /path/to/phylip-3.69/exe`	This should be the directory that contains the phylip executables 'seqboot', 'neighbor' and 'retree'. Default: will look first in ClustAGE/bin folder or in PATH

Note 1
Options for -r:

"midpoint" : (default) Midpoint root the tree such that the root is equidistant from the two farthest points on the tree
"user" : Program will pause after creating the initial tree and provide instructions in the terminal for manually rerooting the tree using FigTree before continuing.
"none" : No rearrangement of the initial tree will be performed

Outputs:

<prefix>.tre
Newick-formatted neighbor joining tree based on the Bray-Curtis distance matrix of accessory genome differences. If bootstraps were calculated, these will be included as branch labels. The tree can be viewed using FigTree or similar tree viewer or online using iTOL or EvolView.

<prefix>.sim_matrix.csv
Comma-separated table of pairwise accessory genome Bray-Curtis similarity values (1 - Bray-Curtis distance).

<prefix>.heatmap.txt
iTOL heatmap annotation file. If iTOL is being used to view the neighbor joining tree, this file can be dragged into the tree viewer window of iTOL to produce a heatmap of Bray-Curtis similarities. Under the "Advanced" tab in the "Controls" pane, make sure to set "Leaf sorting" to "None" to ensure the order of tree leaves matches the heatmap.

<prefix>.BCdist.phy
Phylip-formatted matrix of pairwise accessory genome Bray-Curtis distances. Dummy names are given to comply with phylip format name lengths. Key of dummy names to actual genome names is given in the file <prefix>.dummy_names.txt.

Figure 3: Example tree and heatmap output as plotted in iTOL. Tree has branches with less than 0.5 bootstrap support collapsed.

6.2 re-rank.pl

This script will change ranking information in ClustAGE output files. This will save you from having to re-run ClustAGE if you want to examine the same dataset in relation to another phenotype or characteristic.

Software:

Requires gnuplot if re-ordering of figures is desired.

Usage:

perl re-rank.pl -c clustage.subelements.csv -k clustage.subelements.key.txt

Required Inputs:

Argument	Flag	Value	Example	Comment
ClustAGE subelement csv file	`-c`	file name	`-c subelements.csv`	Can be the read-corrected csv file, if available
ClustAGE subelement key file	`-k`	file name	`-k subelements.key.txt`	This can be the read-corrected key file, if available

Optional Inputs:

Argument	Flag	Value	Example	Comment
New ranks	`-f`	file name	`-f new_ranks.txt`	This can be the same format as the file list given to the `-f` input of ClustAGE. See Note 2 below.
Output prefix	`-o`	string	`-o rerank`	Default: "rerank"
Read-corrected csv	`-C`	file name	`-C subelements.read_corrected.csv`
Read-corrected key	`-K`	file name	`-K subelements.read_corrected.key.txt`
Output figures	`-p`	no value	`-p`	Default: No new figures will be output. Requires gnuplot
Annotation file list	`-a`	file name	`-a annotation_list.txt`	Same annotation file list given to ClustAGE. To add gene information to output figures
AGEs key file	`-A`	file name	`-A clustage.AGEs.key.txt`	File output by ClustAGE. Only needed for adding gene information to output figures
gnuplot location	`--gnuplot`	file name	`--gnuplot /path/to/gnuplot`	Default: will look first in ClustAGE/bin folder or in PATH

Note 2
Description of new ranking file for -f:
File format can be the same as the tab-separated file list given to ClustAGE, i.e.

/path/to/accessory/file<tab>genome_name<tab>rank

The file path (first column) can be left blank as long as there is a <tab> before the genome name. Any characters before the first <tab> in the file will be ignored.
Only genomes with ranks to be changed need to be included in this file.
Genomes not included will not have their original rank values changed
Genomes that are included but not given a rank value will have ranks changed to "NA"
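
Since the path column may be left blank, a new ranks file can be produced from just genome names and ranks. The Python sketch below follows the rules in this note; the function name `rank_lines` is hypothetical, and using `None` to mean "no rank given" is a convention of this example only.

```python
def rank_lines(new_ranks):
    """Build lines for a re-rank.pl '-f' file from genome name -> rank.

    A rank of None leaves the rank column off, which per the note
    above changes that genome's rank to "NA".
    """
    lines = []
    for genome, rank in new_ranks.items():
        fields = ["", genome]  # empty path column, so each line starts with a tab
        if rank is not None:
            fields.append(str(rank))
        lines.append("\t".join(fields))
    return lines
```

Only genomes whose ranks should change need to appear in the dictionary. Genomes left out keep their original rank values.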

Outputs:

Will output new subelements.key.txt and subelements.csv files and, if requested, subelements.read_corrected.csv, subelements.read_corrected.key.txt, and graphs folder with the new rankings. See description of ClustAGE outputs above for more information.

7. ClustAGE Plot

Online utility to visualize accessory element distribution patterns throughout the population. See http://vfsmspineagent.fsm.northwestern.edu/cgi-bin/clustage_plot.cgi for instructions and more information.

Figure 4: Example ClustAGE Plot output

8. SUPPORT SOFTWARE

BLAST+ software (blastn and makeblastdb) is provided by the National Library of Medicine / National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov).
Reference: Altschul, S F et al. "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs." Nucleic Acids Research 25.17 (1997): 3389-3402.

bwa: Provided under GNU GPL version 3. See copyright included with bwa for more information.

9. LICENSE:

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. See LICENSE.txt

10. CONTACT:

Contact Egon Ozer with questions or comments.

Written with MacDown.

