SparseAssembler Home

Brought to you by: yechengxi

Home

Authors:

There is a newer version of this page. You can find it here.

Project Admins:

Chengxi Ye

SparseAssembler: Sparse k-mer Graph for Memory Efficient de novo Genome Assembly

Chengxi Ye, Zhanshan (Sam) Ma, Charles H. Cannon, Mihai Pop, Douglas W. Yu

Please refer to https://sites.google.com/site/sparseassembler/ for the compiled beta version supporting paired reads.

For results & comparisons:
https://sites.google.com/site/sparseassembler/home/results-comparisons

Motivation: A major limitation of the de Bruijn graph approach for de novo genome assembly is the large memory requirement for graph construction, particularly for large genomes. In this paper, we introduce a new general model for genome assembly based upon a sparse k-mer graph model, implemented in the SparseAssembler.

Results: We demonstrate that the exhaustive decomposition of reads into all overlapping k-mers is unnecessary. Instead, a sparse graph structure, saving only a sparse subset of observed k-mers and the links between them, greatly reduces the memory space requirements for de novo genome assembly. We traverse this sparse k-mer graph to assemble contigs and ultimately complete the genome assembly. We adopt a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms. Here, we test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining assembly accuracy, but without sacrificing speed in comparison to existing assemblers.

Tips for testing SparseAssembler:

Parameters:

k: kmer size, support 15~63 (important).

g: sparseness factor, or number of skipped intermediate k-mers, support 1-25, suggest a variety from 10-25 (important).

f: single end reads data, input filename, input multiple files like f ReadFile1 f ReadFile2 f ReadFile3.

GS: estimated genome size, set a larger value to preallocate space for both the real k-mers and the false k-mers (sequencing errors). (You can set it to 10X real genome size to preallocate space for the errors.)

NodeCovTh: coverage threshold for spurious k-mers, support 0-16, suggest 0-3 (important).

EdgeCovTh: coverage threshold for spurious links, support 0-16, suggest 0-2 (important, set to 0 for low-coverage data).

PathCovTh: coverage threshold for merging bubbles in the breadth-first search, support 0-1000 (important).

LD: 1 to load a previously built sparse k-mer graph, so you can set different values for the above 3 thresholds.

Because it is a newly developed showing-the-purpose assembler, it can have some problems. If you have further concerns, please contact Chengxi Ye: cxy AT umd.edu.