SparseAssembler Home

Brought to you by: yechengxi

Home

Authors:

There is a newer version of this page. You can find it here.

Project Admins:

Chengxi Ye

SparseAssembler: Sparse k-mer Graph for Memory Efficient de novo Genome Assembly

Chengxi Ye, Zhanshan (Sam) Ma, Charles H. Cannon, Mihai Pop, Douglas W. Yu

The formal version of our work, titled 'Exploiting Sparseness in de novo Genome Assembly' has been accepted by Recomb-seq 2012 conference (in Barcelona) and will be published in BMC Bioinformatics.

For results & comparisons:
https://sites.google.com/site/sparseassembler/home/results-comparisons

Motivation: A major limitation of the de Bruijn graph approach for de novo genome assembly is the large memory requirement for graph construction, particularly for large genomes. In this paper, we introduce a new general model for genome assembly based upon a sparse k-mer graph model, implemented in the SparseAssembler.

Results: We demonstrate that the exhaustive decomposition of reads into all overlapping k-mers is unnecessary. Instead, a sparse graph structure, saving only a sparse subset of observed k-mers and the links between them, greatly reduces the memory space requirements for de novo genome assembly. We traverse this sparse k-mer graph to assemble contigs and ultimately complete the genome assembly. We adopt a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms. Here, we test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining assembly accuracy, but without sacrificing speed in comparison to existing assemblers.

Tips for testing SparseAssembler:

Parameters:

k: kmer size, support 15~63 (important).

g: sparseness factor, or number of skipped intermediate k-mers, support 1-25, suggest a variety from 10-25 (important).

f: single end reads data, input filename, input multiple files like f ReadFile1 f ReadFile2 f ReadFile3.

GS: estimated genome size, set a larger value to preallocate space for both the real k-mers and the false k-mers (sequencing errors). (You can set it to 10X real genome size to preallocate space for the errors.)

NodeCovTh: coverage threshold for spurious k-mers, support 0-16, suggest 0-3 (important).

EdgeCovTh: coverage threshold for spurious links, support 0-16, suggest 0-2 (important, set to 0 for low-coverage data).

PathCovTh: coverage threshold for merging bubbles in the breadth-first search, support 0-1000 (important).

LD: 1 to load a previously built sparse k-mer graph, so you can set different values for the above 3 thresholds.

If you have suggestions or questions, please contact Chengxi Ye: cxy AT umd.edu.