Chengxi Ye, Zhanshan (Sam) Ma, Charles H. Cannon, Mihai Pop, Douglas W. Yu
The formal version of our work, titled 'Exploiting Sparseness in de novo Genome Assembly' has been published in BMC Bioinformatics.
http://www.biomedcentral.com/1471-2105/13/S6/S1
For results & comparisons:
https://sites.google.com/site/sparseassembler/home/results-comparisons
Motivation: A major limitation of the de Bruijn graph approach for de novo genome assembly is the large memory requirement for graph construction, particularly for large genomes. In this paper, we introduce a new general model for genome assembly based upon a sparse k-mer graph model, implemented in the SparseAssembler.
Results: We demonstrate that the exhaustive decomposition of reads into all overlapping k-mers is unnecessary. Instead, a sparse graph structure, saving only a sparse subset of observed k-mers and the links between them, greatly reduces the memory space requirements for de novo genome assembly. We traverse this sparse k-mer graph to assemble contigs and ultimately complete the genome assembly. We adopt a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms. Here, we test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining assembly accuracy, but without sacrificing speed in comparison to existing assemblers.
Last update: June 17, 2012
A beta version which is more efficient and contains error correction is released. Some preliminary results: NA12878 human genome, (k 31 g 25) memory peak: ~ 23 GB. ERA000206 E. coli, ~600X, (k 51 g 25) memory peak: ~ 120 MB.
If you have suggestions or questions, please contact Chengxi Ye: cxy AT umd.edu.