Warning!! You are strongly encouraged to install one of the
packages of executables and not to try to install the source code.
The only reason to download the source code is if you need to modify
that code for your own purposes. If you work from the source code no
support will be provided. You are entirely on your own. Please
direct bug reports to barryghall@gmail.com.
Compiled executables for Mac and Linux can be downloaded and
should run as is with no additional installations, after making sure
everything has executable permissions. A directory with the source
code is also available, although to run from the source code you'll
need to install the other software kSNP requires and put it in your
path. A detailed User Guide is included. Two example inputs for
testing are also included, with explanation and directions for how to
run them in the User Guide.
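If the downloaded files have lost their executable permissions, they
can be restored from a terminal or with a few lines of Python. The
following is a minimal sketch and is not part of kSNP. The function
name is made up for this example, and the directory you pass in is
wherever you unpacked the package.

```python
import os
import stat
from pathlib import Path


def make_executable(directory):
    """Add execute permission (user, group, other) to each regular file.

    Returns the names of the files whose mode was actually changed.
    """
    changed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories and other non-file entries
        mode = path.stat().st_mode
        new_mode = mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
        if new_mode != mode:
            os.chmod(path, new_mode)
            changed.append(path.name)
    return changed
```

Calling make_executable on the unpacked package directory leaves files
that are already executable untouched.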
Please cite:
Hall, B. G. and J. Nisbet. 2023. Building Phylogenetic Trees from
Genome Sequences with kSNP4. Mol. Biol. Evol. 40 https://doi.org/10.1093/molbev/msad235
Gardner, S.N., T. Slezak, and B.G. Hall. 2015. kSNP3.0: SNP
detection and phylogenetic analysis of genomes without genome
alignment or reference genomes. Bioinformatics 31: 2877-2878 doi:
10.1093/bioinformatics/btv271.
Gardner, S.N. and Hall, B.G. 2013. When whole-genome alignments
just won't work: kSNP v2 software for alignment-free SNP discovery
and phylogenetics of hundreds of microbial genomes. PLoS ONE,
8(12):e81760. doi:10.1371/journal.pone.0081760
Gardner, S.N. and Slezak, T.R. 2010. Scalable SNP analyses of
100+ bacterial or viral genomes. Journal of Forensic Research, 1:107.
*********************************************************************
*********************************************************************
July 28, 2023 Version 4.1 released. Major upgrade
Version 4.1 has been tested on Mac OS 10.15.7 (Catalina) and macOS 13.4
(Ventura) and on Ubuntu Linux 22.04. On average kSNP4.1 is 30% faster
than kSNP4, and it uses memory about twice as efficiently as does kSNP4.
October 26, 2022 Version 4.0 released. Major upgrade
Depending on the data set, version 4 is up to 3 times
faster than version 3.1.2. Most changes will be transparent to the user.
Version 4 has been tested on Mac OS 10.15.7 (Catalina) and on the following
Linux OS: Ubuntu 16.04, Ubuntu 22.04, Fedora 36 and CentOS Stream 9.
kSNP4.0 requires no programming skills or knowledge. The packages include a
revised kSNP4 User Guide, a guide to downloading genome sequences from NCBI, a
guide to troubleshooting kSNP4, and a BSD Opensource License.
November 9, 2019. Version 3.1.2 released. Minor upgrade
Two utilities added: check_genbank_from_NCBI and fix_old_fasta_headers.
Both utilities are for troubleshooting kSNP3.1 and later. Both are discussed
in the User Guide for version 3.1.2.
Sept. 20, 2017 Version 3.1 released. Major upgrade. Version 3.1
fixes the problems with SNP annotation that arose when NCBI discontinued
use of GI numbers. Please read carefully the Preface (page 3) and the
File of annotated genomes section (pages 9-10) in the version 3.1 User
Guide. Thanks to Tom Slezak for revising the get_genbank_file3 script and
to Tod Stuber (USDA) for testing version 3.1 even though he doesn't need
the annotation feature. All users are encouraged to upgrade to version
3.1.
Known issues: Redhat Linux: annotation function requires Redhat
version 7 or above.
July 15, 2016 parse_assembly_summary updated to accommodate NCBI's
modified format of the assembly_summary.txt file for downloading
genomes by FTP.
July 12, 2019 Bugs were found in NodeChiSquare2Tree3, which is
now replaced by NodeChiSquare2Tree31. Aside from working properly,
kSNP 3.1 NodeChiSquare2Tree31 differs only in that the default
tree_type is parsimony. The documentation accompanying kSNP3.1 now
reflects that change. Users need not reinstall kSNP3.1
to use NodeChiSquare2Tree31. Simply download the separate
NodeChiSquare2Tree31 file, put it into the kSNP3 folder, and discard
the old NodeChiSquare2Tree3 file. Linux users should change the file
name NodeChiSquare2Tree31-linux to NodeChiSquare2Tree31.
June 17, 2016 kSNP3.021 released. All-numeric file names are now
allowed. Thanks to Egon Ozer of Northwestern University Feinberg
School of Medicine for fixing the bugs that led to prohibiting
all-numeric file names.
May 1, 2016 kSNP3.02 released. It was discovered that on some
systems incorrectly naming the input genome sequence files can have
disastrous results that can lead to incorrect SNP counts and
incorrect SNP annotations without the run failing. Thanks to Egon
Ozer of Northwestern University Feinberg School of Medicine for
discovering this important bug. kSNP3.02 now checks the input file
for incorrect names and terminates the kSNP3 run when it finds them.
Please see page 7 of the updated kSNP3.02 documentation for a
description of the naming rules and what to do when illegal names are
detected.
February 10, 2016 kSNP3.01 released. As the result of changes at
NCBI's FTP site for genome sequences the utilities
FetchFinishedGenomes and FetchGenomeAssemblies that were included in
kSNP3.0 no longer work. Those utilities have been replaced by
parse_assembly_summary and FTPgenomes.
February 5, 2015 kSNP3 released. v3 has different command line
options, and several major changes that are summarized below.
****This README file is not a substitute for the kSNP3 User
Guide. It is important to read that guide before using kSNP3. The
User Guide describes several new kSNP utilities that facilitate
downloading genome sequences, creation of the Kchooser input file,
etc. It also includes a set of hints intended to simplify life for
kSNP3 users. ****
kSNP3 MAJOR CHANGES from kSNP version 2:
1. Each genome must be
provided in a separate fasta file which can contain multiple reads or
contigs. This differs from kSNP2 where all genomes were in a single
fasta file, which required merging reads and contigs. It also avoids
creating massively large and unwieldy fasta files. So you don't need
to run merge_fasta_(reads|contigs) anymore before running kSNP3.
2. The input file in the -in option must contain the full path
location of each genome and the genome name, one line per genome, tab
delimited between full path to genome fasta file in column 1 and
genome name in column 2. This format allows
multi-read, multi-chromosome and plasmid, and multi-contig genomes,
each genome in a separate fasta file. This allows annotation of sequences
composed of multiple chromosomes, contigs, and plasmids, each of
which has a gi number. The user can edit the genome names by editing
this file instead of editing the fasta files. The SNPs_all file
contains an extra column with the fasta defline of the contig, and
positional information is given relative to that contig.
3. Core and majority trees are calculated using parsimony instead
of maximum likelihood, since simulations indicated that parsimony SNP
trees are more accurate (Hall, 2014, submitted). If you still want to
use ML for core and majority, go into the kSNP3 script and uncomment
the lines where indicated.
4. Calculation of ML, core, and majority trees is now optional.
The default is to only calculate a parsimony tree from the full SNP
matrix.
5. Instead of using only 1 best parsimony tree, it now computes a
consensus parsimony tree from all the trees that tie for the most
parsimonious of the trees created by parsimonator, using "consense"
from PHYLIP modified to allow sequence names up to 100 characters.
6. There is now an option to add genomes to an existing SNP run
instead of doing SNP discovery. It will search for the SNPs already
found in a previous kSNP3 run (specified with the -SNPs_all option)
in the new genomes listed in the -in file.
7. The -u and -c options from kSNP2 are obsolete. Instead, the
code automatically determines which genomes are high coverage raw
reads versus those that are either assembled or low coverage, and
automatically picks the minimum kmer frequency for consideration as a
SNP as a proxy for coverage. It calculates this minimum kmer
frequency from the kmer counts for each genome as the average of the
median and mean kmer count for that genome. This is a heuristic that
allows a flexible kmer count threshold for each genome that depends
on the coverage of any given unassembled genome, and always results
in a threshold of 1 for assembled genomes. This is helpful for
comparing a mix of high and low coverage genomes, such as when some
genomes in the kSNP3 run are low coverage reads extracted from a
metagenome for the species of interest.
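The input file format described in item 2 above (full path in column
1, genome name in column 2, tab delimited, one genome per line) can
be produced with a short script. This is an illustrative sketch, not
a kSNP utility. The function name, the list of accepted file
extensions, and the choice to take each genome name from the file
name are all assumptions made for the example.

```python
from pathlib import Path


def write_input_list(fasta_dir, out_file, extensions=(".fasta", ".fa", ".fna")):
    """Write one 'full_path<TAB>genome_name' line per fasta file in fasta_dir.

    Returns the number of genomes written.
    """
    lines = []
    for path in sorted(Path(fasta_dir).iterdir()):
        if path.is_file() and path.suffix in extensions:
            # Assumption: the genome name is the file name without its extension.
            lines.append(f"{path.resolve()}\t{path.stem}")
    Path(out_file).write_text("".join(line + "\n" for line in lines))
    return len(lines)
```

Because the genome names live in column 2 of this file, they can be
edited there afterwards without touching the fasta files.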
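The heuristic in item 7 above (the average of the median and the mean
kmer count for a genome) can be sketched as follows. This is not the
kSNP3 code. In particular, truncating the average to an integer and
never going below 1 are assumptions chosen so that a mostly
single-copy (assembled) genome yields a threshold of 1, as the text
describes.

```python
from statistics import mean, median


def min_kmer_frequency(kmer_counts):
    """Average of the median and mean kmer count, as an integer threshold."""
    if not kmer_counts:
        raise ValueError("kmer_counts must not be empty")
    average = (median(kmer_counts) + mean(kmer_counts)) / 2
    # Assumption: truncate to an integer and use 1 as the lowest threshold.
    return max(1, int(average))


# Assembled genome: almost every kmer occurs once.
assembled = min_kmer_frequency([1, 1, 1, 1, 2])
# Raw reads: genuine kmers occur many times, error kmers once.
raw_reads = min_kmer_frequency([18, 20, 22, 1, 1, 40])
```

With these toy counts the assembled genome gets a threshold of 1 and
the raw-read genome gets a threshold of 18.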
#####################
####################################################################################
5/20/2014 Minor errors were corrected in the User Guide. Since
there were no changes to the code, only the User Guide that can be
downloaded separately from the Mac and Linux executables was updated
on sourceforge.
3/31/2014 Minor change in annotation code so that it will
recognize gi numbers when they are in a format gi_448814763_....
Previously, it would only recognize the gi number preceded by a "_"
if it was followed by a space, not by any other non-digit character.
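The kind of matching described in this entry can be illustrated with
a regular expression. The pattern below is a sketch written for this
README, not the expression used in the kSNP annotation code. Accepting
"|" as a separator in addition to "_" is an assumption, and the text
after the gi number in the examples is made up.

```python
import re

# Assumption: a gi number is the run of digits that follows "gi" and a
# "_" or "|" separator. The match ends at the first non-digit character
# (or at the end of the line), whatever that character is.
GI_PATTERN = re.compile(r"(?<![A-Za-z0-9])gi[_|](\d+)")


def find_gi_numbers(defline):
    """Return every gi number found in a fasta defline, as strings."""
    return GI_PATTERN.findall(defline)
```

For example, find_gi_numbers(">gi_448814763_rest_of_header") returns
["448814763"], and so does the same header with a space after the
number.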
3/30/2014 (still V2.1.2) Recompiled the script NodeChiSquare2tree
in Linux and Mac executables, since the previous compiled versions
were not finding the required perl modules. This is an extra script
the user can call that is not called in the main kSNP code. Fixed
minor errors in User Guide.
3/23/2014 V2.1.2 Modified kSNP wrapper script so that now you
MUST indicate the path to the kSNP executables in that script. This
means that now you do not need to add kSNP directory to your path
environment variable, but you do need to edit this line in the kSNP
file to point to the directory with all the kSNP executables: set
kSNP=/usr/local/kSNP
2/20/2014 V2.1.1 Fixed executable version of label_tree_nodes,
since it was failing to find a perl module, and as a result the
files containing a tree with labeled nodes were empty.
Table of Contents on User Guide was incomplete, and this has been
corrected.
Permissions are automatically set to 755 for the kSNP file in the
Mac and Linux versions. Previously, the user needed to make the file
executable after downloading.
The above fixes only affect the Linux and Mac executables so I
didn't do a new upload of the source code version.
1/31/2014 V2.1
Fixed annotation bug, since it was failing to annotate many SNPs.
The bug was in the genbank file downloader, so with v2.1 it always
downloads the annotations if they are there. Previously it skipped
the annotations for some gi#'s and so SNPs were incompletely
annotated.
Added a new script NodeChiSquare2tree to assign SNPs to nodes
based on ChiSquare, allowing for imperfect but significant association
of SNPs with tree nodes. In some cases, this allows more SNPs to be
mapped to nodes even if there is not a perfect correspondence, e.g.
if the allele is missing in one of the leaf genomes down that branch
or present outside the branch. This should help assign more SNPs to
nodes when draft genomes are included. Look at the User Guide for
more information.
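The idea of an imperfect but significant association can be
illustrated with a 2x2 chi-square test of allele presence against
membership in the clade below a node. This is only a sketch of the
general statistic, not the NodeChiSquare2tree implementation. The
function names and the table layout are assumptions, and 3.841 is the
standard critical value for one degree of freedom at p = 0.05.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table

                        allele present   allele absent
        inside clade          a                b
        outside clade         c                d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty row or column gives no evidence of association
    return n * (a * d - b * c) ** 2 / denom


def allele_node_statistic(genomes_with_allele, clade_genomes, all_genomes):
    """Chi-square statistic for one allele against one node's clade.

    Assumes clade_genomes is a subset of all_genomes.
    """
    with_allele = set(genomes_with_allele) & set(all_genomes)
    clade = set(clade_genomes)
    everyone = set(all_genomes)
    a = len(with_allele & clade)             # in clade, allele present
    b = len(clade - with_allele)             # in clade, allele absent
    c = len(with_allele - clade)             # outside clade, allele present
    d = len(everyone - clade - with_allele)  # outside clade, allele absent
    return chi_square_2x2(a, b, c, d)
```

With ten genomes, a five-genome clade, and the allele present in four
of the five clade members and nowhere else, the statistic is about
6.67, which exceeds 3.841 even though the correspondence is not
perfect. With counts this small the chi-square approximation is rough.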
FastTreeMP now prints support values at the nodes, shown in the
tree.ML.tre, tree.parsimony.tre, tree.core.tre, and
tree.majority0.5.tre files. The root may be different than shown in
the other trees with SNP allele counts since kSNP reroots the trees
after the support values have been replaced by node numbers.
Rewrote label_tree_nodes to use bioperl functions instead of text
parsing, for easier labeling when support values are present.
Files tree_nodeLabel.*.tre are kept instead of being moved to
TempFilesToDelete, so the user can run NodeChiSquare2tree.
Added the -c [minimum kmer count] argument to kSNP. This
specifies the minimum number of times a kmer must occur in an
unassembled raw read genome for it to be considered as a SNP locus in
that genome. It defaults to 10. This argument enables the user to
control for sequencing coverage. Note that this count is not exactly
the same as coverage depth, since it will be lower due to bases that
fall near the ends of reads, so do not contain the entire kmer.
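The gap between kmer count and coverage depth follows from simple
arithmetic: a read of length L contains only L - k + 1 complete kmers
of length k. The sketch below applies that standard relation. It
ignores sequencing errors and uneven coverage, and the function name
and the example numbers are hypothetical.

```python
def expected_kmer_count(coverage_depth, read_length, k):
    """Approximate number of times a genomic kmer is seen in raw reads.

    Each read of length read_length holds (read_length - k + 1) complete
    kmers, so the kmer count is the base coverage scaled by that fraction.
    """
    if k > read_length:
        return 0.0  # no read is long enough to contain a complete kmer
    return coverage_depth * (read_length - k + 1) / read_length


# Hypothetical run: 30x coverage, 100-base reads, k = 21.
example = expected_kmer_count(30, 100, 21)
```

In this hypothetical case a kmer is expected about 24 times rather
than 30, which is worth keeping in mind when choosing a value for -c.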
10/9/13 Fixed bug that caused kSNP to eliminate very long genomes
(>~2GB, e.g. unassembled genomes) and any subsequently listed genomes
from the analyses.
9/7/13 Added kChooser to identify the optimal value of k prior to
running kSNP. Made it optional to create a .vcf file, since this
script could require a lot of RAM.
8/27/13 Added extract_nth_locus script to pull out the nth locus
from the core_SNPs or SNPs_in_majority# file, handy if you're looking
at position n in the core SNPs matrix or SNPs_in_majority matrix and
you want to know what locus it is.
Made kSNP default to not calculate a NJ tree, and added the
command line option -j if the NJ tree is desired. Need to write
faster code to calculate distance matrix from SNPs matrix. Current
code does slow loops that take as long as #SNP loci x # pairwise
combinations of genomes.
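The cost described above (number of SNP loci times number of genome
pairs) is easy to see in a naive distance calculation. The following
sketch is not the kSNP code. The function name is hypothetical, and
treating "-" as a missing allele that is skipped in a comparison is an
assumption.

```python
from itertools import combinations


def pairwise_snp_distances(snp_matrix, missing="-"):
    """Return {(name_a, name_b): number of loci at which the two differ}.

    snp_matrix maps each genome name to a string of alleles, one
    character per SNP locus. Loci where either genome has the missing
    symbol are skipped (assumption).
    """
    lengths = {len(seq) for seq in snp_matrix.values()}
    if len(lengths) > 1:
        raise ValueError("all genomes must have the same number of SNP loci")
    distances = {}
    # One pass over every locus for every pair: loci x pairs comparisons.
    for name_a, name_b in combinations(sorted(snp_matrix), 2):
        seq_a, seq_b = snp_matrix[name_a], snp_matrix[name_b]
        distances[(name_a, name_b)] = sum(
            1
            for x, y in zip(seq_a, seq_b)
            if x != y and x != missing and y != missing
        )
    return distances
```

The number of pairs grows with the square of the number of genomes,
which is why this step becomes slow for large runs.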
6/27/13 Modified select_node_annotations so it will work on a
Mac.
6/6/13 Added trees with no node labels to the results directory.
Modified the SNP_annotations file so that there are fixed columns
for gene, product, notes, etc. and improved memory efficiency of
annotating the SNPs that should help for data sets with over a
million SNPs. Added select_node_annotations script so a user can pull
out the annotated SNP loci which map to a particular user-specified
node of a tree. Fixed miscount in the Annotation_summary file.