SparseAssembler Home

Brought to you by: yechengxi

Home

Project Admins:

Chengxi Ye

SparseAssembler: Sparse k-mer Graph for Memory Efficient de novo Genome Assembly
Chengxi Ye, Zhanshan (Sam) Ma, Charles H. Cannon, Mihai Pop, Douglas W. Yu

Last update: Aug 17, 2012
Source code for the beta version is released. K-mer counting is enabled. Several bugs fixed.

The formal version of our work, titled 'Exploiting Sparseness in de novo Genome Assembly' has been published in BMC Bioinformatics.
http://www.biomedcentral.com/1471-2105/13/S6/S1

For results & comparisons:
https://sites.google.com/site/sparseassembler/home/results-comparisons

Motivation: A major limitation of the de Bruijn graph approach for de novo genome assembly is the large memory requirement for graph construction, particularly for large genomes. In this paper, we introduce a new general model for genome assembly based upon a sparse k-mer graph model, implemented in the SparseAssembler.

Results: We demonstrate that the exhaustive decomposition of reads into all overlapping k-mers is unnecessary. Instead, a sparse graph structure, saving only a sparse subset of observed k-mers and the links between them, greatly reduces the memory space requirements for de novo genome assembly. We traverse this sparse k-mer graph to assemble contigs and ultimately complete the genome assembly. We adopt a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms. Here, we test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining assembly accuracy, but without sacrificing speed in comparison to existing assemblers.

Updates:

Last update: Aug 17, 2012
Source code for the beta version is released. K-mer counting with the sparse k-mers is enabled. Several bugs fixed.

July 8, 2012
Added support for scaffolding in the beta version. Please use the ExpCov parameter to input the expected coverage of a unique contig, this value can be well determined by looking at the "CovHist.txt" file generated by single end assembly. The first column in the file means the coverage, the second column records the number of k-mers with this coverage. The non-noise peak value should be used for ExpCov.

June 20, 2012
A beta version which is more efficient and contains error correction is released. Some preliminary results: NA12878 human genome, k 31 g 25 memory peak: ~ 15 GB, k 51 g 20, memory peak: ~20GB. ERA000206 E. coli, ~600X, (k 51 g 25) memory peak: ~ 120 MB.

If you have suggestions or questions, please contact Chengxi Ye: cxy AT umd.edu.