SplitMEM Code

Graphical pan-genome analysis with suffix skips

Brought to you by: hayanlee, mcschatz, smarcus

Tree [bfff76] master / History

HTTPS access

File	Date	Author	Commit
support	2014-11-20	Michael Schatz	[7e8a4d] add support Makefile
test	2014-11-20	Michael Schatz	[bfff76] Update to new output format
Makefile	2014-11-20	Michael Schatz	[8ce897] specific c++ standard
README	2014-04-08	Shoshana Marcus	[6b3416] updated README
splitMEM.cc	2014-11-20	Michael Schatz	[9cccb5] print help if no arguments are passed in

Read Me

SplitMEM: Graphical pan-genome analysis with suffix skips
Shoshana Marcus, Hayan Lee, and Michael Schatz
April 2014
=========================================================

This program builds a compressed de Bruijn graph to represent a fasta file by traversing a suffix tree of the genome.  The suffix tree is marked with Maximal Exact Matches (MEMs) in the genome and augmented with suffix skips.  After the initial preprocessing  to build the suffix tree, the program can work with several sizes of kmers without rebuilding the suffix tree.


Compiling
============
make


Dependencies
============
None.


Input
============
The input file is a fasta or multi-fasts file with no ambiguity codes.


Options
============
-file <file>
	Load sequence from indicated fasta file

-mem <len>
	Locate MEMs at least this long

-manyMEMs <file>
	File of minimum MEM lengths  

-cdg <out>
	Filename of compressed de Bruijn graph 

-multiFa 
	Indicates the input file is a multi-fasta file for pan-genome analysis.  In this mode, the program has an additional output file, <out>fastaPos.txt that indicates the positions at which a genome begins in the multi-fasta file.  

-printGenome <outFasta> 
	optional.  prints the fasta file as it is represented in memory and indexed by the suffix tree 

One of the options -mem and -manyMEMs must always be specified.

Ordinarily, -cdg is followed by the name of the file that the compressed de Bruijn graph will be written to.
In -manyMEMs mode, -cdg provides the prefix for the output dot files.  The prefix will be followed with the kmer length and the .dot extension in naming the output files.


Output
============
The output is a directed graph in .dot file format representing the compressed de Bruijn graph for the genome.  The nodes are labeled by comma separated start positions then a colon and length in base pairs.  In the case that a set of kmer lengths are specified, several dot files are created as output.

In multi-fasta mode, the program has an additional output file, <out>fastaPos.txt that indicates the positions at which a genome begins in the input file.  This is useful for interpreting the output dot file in terms of which start positions belong to which genome.


Example
============
For a single kmer length of 50:

./splitMEM -file ecoli.fa -mem 50  -cdg ecoli.dot  



For many kmer lengths:

./splitMEM -file ecoli.fa -manyMEMs kmerLens.txt  -cdg ecoli 


Also see test/test.sh for examples with expected outputs



=========================================================

Utilities

=========================================================

classifyNodesByStrainFrequency


useful for analyzing a graph created in multi-fasts mode.  This utility augments the node labels with the number of strains that a node encompasses.  This information is appended to the node label, separated by a semicolon.


Compiling
============
g++ -O3 -o classifyNodesByStrainFrequency classifyNodesByStrainFrequency.cpp


Dependencies
============
None.


Input
============
The input is  a dot file containing a compressed de Bruijn graph or a file containing only the node labels.


Options
============
-dot <dotFile>                            
	input dot file to modify

-nodeLabels <nodeLabelFile>               
	input file of only node labels

-starts <file>                            
	filename that contains offsets of fasta files in multi-fasta file

-numGenomes <n>                           
	number of strains in pan-genome

-k  <len>                         
	kmer length used in input graph

-printModDot <outDotFile>              
	prints input dot file augmenting nodes with number of strains they occur in after semicolon

-printModNodeLabels <outNodeLabelFile> 
	appends to each node label the number of strains the node occurs in, after semicolon

-printGenomeCoords <strainNum> <outputCoordFile>      
	prints coordinates of nodes with strain frequency for specified genome in set, first is 1


Output
============
Dot file augmented with number of genomes occurring in each node.
Can also print the coordinates of the nodes for a specific genome in the multi-fasts file.

See printModDot, printModNodeLabels and printGenomeCoords above


Example
============
see runEcoliCoreGenomeCalc.sh


=========================================================

graphCoreGenomeDist.py


Dependencies
============
Python and NetworkX


Input
============
The input is a dot file with a compressed de Bruijn graph whose nodes are augmented to include the number of genomes that occur in each node.
Can first simplify graph with simplifyDot.sh to remove multi-edges to improve the runtime of this script.


Options
============
--path <dot> 
	input dot file to analyze

--k <len> 
	kmer length in input graph

--numGenomes <num>
	number of genomes in the underlying multi-fasts file

 --p <percent>
	percentage of genomes that defines core genome


Output
============
File with node distance distribution to core genome.  
3 columns: node id, distance to core genome, node length


Example
============
see runEcoliCoreGenomeCalc.sh


=========================================================

mergeFasta.sh

creates a multi-fasts file from several fasta files

=========================================================

simplifyDot.sh

converts a set of compressed de Bruijn graphs to a set of simple graphs with no multi-edges  


=========================================================

runEcoliManyStrains.sh

runs splitMEM for E. coli in multi-fasts mode 


=========================================================

runEcoliCoreGenomeCalc.sh

runs graphCoreGenomeDist.py

SplitMEM Code

Graphical pan-genome analysis with suffix skips

Branches

Tree [bfff76] master / Download Snapshot History

Read Me

Tree [bfff76] master /

History