SplitMEM: Graphical pan-genome analysis with suffix skips
Shoshana Marcus, Hayan Lee, and Michael Schatz
April 2014
=========================================================
This program builds a compressed de Bruijn graph to represent a fasta file by traversing a suffix tree of the genome. The suffix tree is marked with Maximal Exact Matches (MEMs) in the genome and augmented with suffix skips. After the initial preprocessing to build the suffix tree, the program can work with several sizes of kmers without rebuilding the suffix tree.
Compiling
============
make
Dependencies
============
None.
Input
============
The input file is a fasta or multi-fasts file with no ambiguity codes.
Options
============
-file <file>
Load sequence from indicated fasta file
-mem <len>
Locate MEMs at least this long
-manyMEMs <file>
File of minimum MEM lengths
-cdg <out>
Filename of compressed de Bruijn graph
-multiFa
Indicates the input file is a multi-fasta file for pan-genome analysis. In this mode, the program has an additional output file, <out>fastaPos.txt that indicates the positions at which a genome begins in the multi-fasta file.
-printGenome <outFasta>
optional. prints the fasta file as it is represented in memory and indexed by the suffix tree
One of the options -mem and -manyMEMs must always be specified.
Ordinarily, -cdg is followed by the name of the file that the compressed de Bruijn graph will be written to.
In -manyMEMs mode, -cdg provides the prefix for the output dot files. The prefix will be followed with the kmer length and the .dot extension in naming the output files.
Output
============
The output is a directed graph in .dot file format representing the compressed de Bruijn graph for the genome. The nodes are labeled by comma separated start positions then a colon and length in base pairs. In the case that a set of kmer lengths are specified, several dot files are created as output.
In multi-fasta mode, the program has an additional output file, <out>fastaPos.txt that indicates the positions at which a genome begins in the input file. This is useful for interpreting the output dot file in terms of which start positions belong to which genome.
Example
============
For a single kmer length of 50:
./splitMEM -file ecoli.fa -mem 50 -cdg ecoli.dot
For many kmer lengths:
./splitMEM -file ecoli.fa -manyMEMs kmerLens.txt -cdg ecoli
Also see test/test.sh for examples with expected outputs
=========================================================
Utilities
=========================================================
classifyNodesByStrainFrequency
useful for analyzing a graph created in multi-fasts mode. This utility augments the node labels with the number of strains that a node encompasses. This information is appended to the node label, separated by a semicolon.
Compiling
============
g++ -O3 -o classifyNodesByStrainFrequency classifyNodesByStrainFrequency.cpp
Dependencies
============
None.
Input
============
The input is a dot file containing a compressed de Bruijn graph or a file containing only the node labels.
Options
============
-dot <dotFile>
input dot file to modify
-nodeLabels <nodeLabelFile>
input file of only node labels
-starts <file>
filename that contains offsets of fasta files in multi-fasta file
-numGenomes <n>
number of strains in pan-genome
-k <len>
kmer length used in input graph
-printModDot <outDotFile>
prints input dot file augmenting nodes with number of strains they occur in after semicolon
-printModNodeLabels <outNodeLabelFile>
appends to each node label the number of strains the node occurs in, after semicolon
-printGenomeCoords <strainNum> <outputCoordFile>
prints coordinates of nodes with strain frequency for specified genome in set, first is 1
Output
============
Dot file augmented with number of genomes occurring in each node.
Can also print the coordinates of the nodes for a specific genome in the multi-fasts file.
See printModDot, printModNodeLabels and printGenomeCoords above
Example
============
see runEcoliCoreGenomeCalc.sh
=========================================================
graphCoreGenomeDist.py
Dependencies
============
Python and NetworkX
Input
============
The input is a dot file with a compressed de Bruijn graph whose nodes are augmented to include the number of genomes that occur in each node.
Can first simplify graph with simplifyDot.sh to remove multi-edges to improve the runtime of this script.
Options
============
--path <dot>
input dot file to analyze
--k <len>
kmer length in input graph
--numGenomes <num>
number of genomes in the underlying multi-fasts file
--p <percent>
percentage of genomes that defines core genome
Output
============
File with node distance distribution to core genome.
3 columns: node id, distance to core genome, node length
Example
============
see runEcoliCoreGenomeCalc.sh
=========================================================
mergeFasta.sh
creates a multi-fasts file from several fasta files
=========================================================
simplifyDot.sh
converts a set of compressed de Bruijn graphs to a set of simple graphs with no multi-edges
=========================================================
runEcoliManyStrains.sh
runs splitMEM for E. coli in multi-fasts mode
=========================================================
runEcoliCoreGenomeCalc.sh
runs graphCoreGenomeDist.py