Last update: July 7, 2012
Added support for scaffolding.
The formal version of our work, titled 'Exploiting Sparseness in de novo Genome Assembly' has been published in BMC Bioinformatics.
http://www.biomedcentral.com/1471-2105/13/S6/S1
For results & comparisons:
https://sites.google.com/site/sparseassembler/home/results-comparisons
Motivation: A major limitation of the de Bruijn graph approach for de novo genome assembly is the large memory requirement for graph construction, particularly for large genomes. In this paper, we introduce a new general model for genome assembly based upon a sparse k-mer graph model, implemented in the SparseAssembler.
Results: We demonstrate that the exhaustive decomposition of reads into all overlapping k-mers is unnecessary. Instead, a sparse graph structure, saving only a sparse subset of observed k-mers and the links between them, greatly reduces the memory space requirements for de novo genome assembly. We traverse this sparse k-mer graph to assemble contigs and ultimately complete the genome assembly. We adopt a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms. Here, we test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining assembly accuracy, but without sacrificing speed in comparison to existing assemblers.
Updates:
Last update: July 7, 2012
Added support for scaffolding.
June 20, 2012
A beta version which is more efficient and contains error correction is released. Some preliminary results: NA12878 human genome, k 31 g 25 memory peak: ~ 15 GB, k 51 g 20, memory peak: ~20GB. ERA000206 E. coli, ~600X, (k 51 g 25) memory peak: ~ 120 MB.
If you have suggestions or questions, please contact Chengxi Ye: cxy AT umd.edu.