Chengxi Ye, Zhanshan (Sam) Ma, Charles H. Cannon, Mihai Pop, Douglas W. Yu
Please refer to https://sites.google.com/site/sparseassembler/ for the compiled beta version supporting paired reads.
For results & comparisons:
https://sites.google.com/site/sparseassembler/home/results-comparisons
Motivation: A major limitation of the de Bruijn graph approach for de novo genome assembly is the large memory requirement for graph construction, particularly for large genomes. In this paper, we introduce a new general model for genome assembly based upon a sparse k-mer graph model, implemented in the SparseAssembler.
Results: We demonstrate that the exhaustive decomposition of reads into all overlapping k-mers is unnecessary. Instead, a sparse graph structure, saving only a sparse subset of observed k-mers and the links between them, greatly reduces the memory space requirements for de novo genome assembly. We traverse this sparse k-mer graph to assemble contigs and ultimately complete the genome assembly. We adopt a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms. Here, we test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining assembly accuracy, but without sacrificing speed in comparison to existing assemblers.
Tips for testing SparseAssembler:
Parameters:
k: kmer size, support 15~63 (important).
g: sparseness factor, or number of skipped intermediate k-mers, support 1-25, suggest a variety from 10-25 (important).
f: single end reads data, input filename, input multiple files like f ReadFile1 f ReadFile2 f ReadFile3.
GS: estimated genome size, set a larger value to preallocate space for both the real k-mers and the false k-mers (sequencing errors). (You can set it to 10X real genome size to preallocate space for the errors.)
NodeCovTh: coverage threshold for spurious k-mers, support 0-16, suggest 0-3 (important).
EdgeCovTh: coverage threshold for spurious links, support 0-16, suggest 0-2 (important, set to 0 for low-coverage data).
PathCovTh: coverage threshold for merging bubbles in the breadth-first search, support 0-1000 (important).
LD: 1 to load a previously built sparse k-mer graph, so you can set different values for the above 3 thresholds.
Because it is a newly developed showing-the-purpose assembler, it can have some problems. If you have further concerns, please contact Chengxi Ye: cxy AT umd.edu.