Contrail: Assembly of Large Genomes using Cloud Computing
Michael Schatz, Jeremy Chambers, Avijit Gupta, Rushil Gupta, David Kelley, Jeremy Lewi, Deepak Nettem, Dan Sommer, and Mihai Pop
Schatz Lab, Cold Spring Harbor Laboratory
The first step towards analyzing a previously unsequenced organism is to assemble the reads by merging similar reads into progressively longer sequences. New assemblers such as Velvet and Euler attempt to solve the assembly problem by constructing, simplifying, and traversing the de Bruijn graph of the read sequences. Nodes in the graph represent substrings of the reads, and directed edges connect consecutive substrings. Genome assembly is then modeled as finding an Eulerian tour through the graph, although repeats may lead to multiple possible tours. As such, assemblers primarily focus on correcting errors, reconstructing unambiguous regions, and resolving short repeats. These assemblers have successfully assembled small genomes from short reads, but have had limited success scaling to larger mammalian-sized genomes, in part because they require constructing and manipulating graphs far larger than can fit into memory.
Addressing this limitation, we have developed Contrail, a new assembly program that uses Hadoop for de novo assembly of large genomes from short sequencing reads. Like other leading short read assemblers, Contrail relies on the graph-theoretic framework of de Bruijn graphs. However, unlike these programs, which require large RAM resources, Contrail relies on Hadoop to iteratively transform an on-disk representation of the assembly graph, allowing an in-depth analysis even for large genomes. Preliminary results show Contrail's contigs are of similar size and quality to those generated by Velvet when applied to small (bacterial) genomes, while providing vastly superior scaling capabilities when applied to large genomes. We are also developing extensions to Contrail to efficiently compute a traditional overlap-graph based assembly of large genomes within Hadoop, a strategy that will be especially valuable as read lengths increase beyond 100bp.
Contrail enables de novo assembly of large genomes from short reads by bridging research in computational biology with research in high-performance computing. This combination is essential in light of the large data sets involved, and has the potential to unlock significant discoveries. Whereas the published analyses of the African and Asian human individuals used read mapping to discover conserved regions and regions with small polymorphisms, de novo assembly has the unique potential to also discover large-scale polymorphisms between these individuals and the reference human genome. Mapping the large-scale differences is an important step towards a better understanding of our own biology, and may reveal previously unknown characteristics of the human genome related to health or disease. Furthermore, a short read assembler for large genomes is also essential for sequencing the vast numbers of complex organisms that have never been sequenced before, and will directly contribute to new biological knowledge.
- Feb 06, 2012 - Repository converted to git
- Oct 13, 2010 - Source code posted to SVN
- Nov 12, 2009 - Wiki Online
The massive volume of data and short read lengths from next-generation DNA sequencing machines have spurred the development of new short read genome assemblers. Genome assemblers created for traditional Sanger sequencing, such as the Celera Assembler, are not engineered to work with these data characteristics, and their algorithms are difficult to scale. Instead, several of the new assemblers, such as Velvet and Euler-USR, model the assembly problem as constructing, simplifying, and traversing the de Bruijn graph of the read sequences. Nodes in the de Bruijn graph represent k-mers in the reads, and directed edges connect consecutive k-mers. Under the assumption that the reads fully sample the sequence of the genome without significant errors, genome assembly is modeled as finding an Eulerian tour through the de Bruijn graph that incorporates every edge of the graph.
This framework has several advantages for short read data, including efficient computation of overlapping reads and robust handling of sequencing errors. Furthermore, unlike Hamiltonian paths, Eulerian tours can be computed in linear time, although there may be a large number of Eulerian tours for a given graph. Recent genome assemblers using this model have avoided this last issue by focusing on the regions of the de Bruijn graph that have unique reconstructions, similar to the heuristics employed by overlap-layout-consensus based assemblers to reconstruct the unambiguous regions of the overlap graph. These short read assemblers have demonstrated success for small to moderate sized genomes, but have limited success scaling to larger mammalian-sized genomes, in part because they require constructing and manipulating graphs far larger than can fit into memory.
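As a concrete illustration of this framework (a toy sketch, not any assembler's actual code), the de Bruijn graph of a set of reads can be built by linking each k-mer to the overlapping k-mer that follows it in a read:

```python
# Toy de Bruijn graph construction: nodes are k-mers, and a directed edge
# links each k-mer to the next (overlapping) k-mer in a read.
from collections import defaultdict

def de_bruijn(reads, k):
    """Return adjacency sets mapping each k-mer to its successor k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            u = read[i:i + k]          # current k-mer
            v = read[i + 1:i + k + 1]  # next k-mer, overlapping by k-1 bases
            graph[u].add(v)
    return graph

g = de_bruijn(["ACGTC", "CGTCA"], 3)
# g: "ACG" -> {"CGT"}, "CGT" -> {"GTC"}, "GTC" -> {"TCA"}
```

Note that the two reads share the overlapping k-mers, so their edges merge into a single chain; an Eulerian tour of this tiny graph spells out the sequence ACGTCA.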
See Algorithms for more information.
Error correction before constructing the de Bruijn graph is a good idea: removing untrusted reads simplifies the graph built in the later stages. Preprocessing can consist of two stages:
1. Concatenating mate pairs to produce longer (extended) reads
2. Correcting the reads based on which k-mers are trusted
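As a toy illustration of stage 1 (this is not FLASH's algorithm, and the function names are hypothetical), a mate pair whose ends overlap exactly can be merged into one extended read:

```python
# Toy mate-pair merging: reverse-complement the second read, then join the
# pair on an exact suffix/prefix overlap. Real tools such as FLASH use a
# scored, mismatch-tolerant overlap model instead of exact matching.

def reverse_complement(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def merge_pair(r1, r2, min_overlap=3):
    """Merge r1 with the reverse complement of r2 on the longest exact
    overlap; return None if no overlap of at least min_overlap exists."""
    r2 = reverse_complement(r2)
    for olen in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        if r1[-olen:] == r2[:olen]:
            return r1 + r2[olen:]
    return None

merged = merge_pair("ACGTTAC", "CCGGTAA")  # rev-comp of r2 is "TTACCGG"
# overlap "TTAC" (length 4) -> extended read "ACGTTACCGG"
```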
Serial implementations of software for these preprocessing stages (FLASH for obtaining extended reads and QUAKE for read correction) are always an option. However, their efficiency is severely impaired as the k-mer length increases. Therefore, a parallel invocation of these programs can be extremely beneficial. The sections below summarize our implementation of parallel preprocessing on FastQ files. Following this, we build the de Bruijn graph.
Parallel FLASH Invocation
This involves (a) writing out the FastQ files on individual cluster nodes, (b) invoking FLASH on these locally written files to generate extended fragments, and (c) writing the extended files back into HDFS.
Parallel QUAKE Invocation
This involves (a) counting the frequency of k-mers, (b) calculating a cutoff value based on k-mer frequency (all k-mers below the cutoff are classified untrusted), (c) pruning the k-mer count file to contain only trusted k-mers with frequency above the cutoff, (d) writing out the bulky k-mer count file onto individual cluster nodes and constructing partial bithashes, (e) joining the partial bithashes from all participating cluster nodes to create a global bithash, (f) making the global bithash available to all cluster nodes, (g) writing out the individual FastQ files onto the local storage of cluster nodes and correcting them using the global bithash, and (h) copying the corrected files back onto HDFS.
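Steps (a) through (c) can be sketched serially as follows (an illustrative sketch with a hypothetical fixed cutoff, not QUAKE's actual coverage model): count every k-mer, then keep only the "trusted" k-mers whose frequency exceeds the cutoff.

```python
# Sketch of k-mer counting and cutoff-based pruning (cutoff chosen by hand
# here; QUAKE derives it from the k-mer coverage distribution).
from collections import Counter

def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def trusted_kmers(counts, cutoff):
    # k-mers at or below the cutoff are classified untrusted and dropped
    return {kmer for kmer, c in counts.items() if c > cutoff}

counts = count_kmers(["ACGT", "ACGA", "ACGC"], 3)
# "ACG" occurs 3 times; "CGT", "CGA", "CGC" occur once each (likely errors)
trusted = trusted_kmers(counts, cutoff=1)  # only "ACG" survives
```

In the parallel version, the counting step is itself a MapReduce job, and the pruned, trusted set is what gets packed into the partial bithashes of steps (d) and (e).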
The de Bruijn graph is naturally encoded as key-value pairs for the map and reduce functions. The MapReduce-based construction algorithm scans each read and emits the key-value pair (u, v) to encode the edge from k-mer u to the consecutive k-mer v in the read. After the map function completes, the internal sort phase groups together the edges for the same k-mer and completes the construction. The initial graph construction creates a node for every k-mer in the read set. For the human genome this could be as many as ~3 billion nodes, plus a fraction of nodes created by sequencing errors in k-mers not truly present in the genome. Therefore the graph is further simplified and compressed to enable deeper analysis toward the ultimate goal of a single sequence for each chromosome.
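A minimal Python sketch of this construction (illustrative only; Contrail itself runs as Hadoop jobs, not this code): the mapper emits an edge (u, v) for each pair of consecutive k-mers in a read, and a local sort-and-group stands in for the shuffle phase.

```python
# MapReduce-style de Bruijn graph construction, simulated serially.
from itertools import groupby

def map_read(read, k):
    """Mapper: emit a (u, v) key-value pair for each consecutive k-mer pair."""
    for i in range(len(read) - k):
        yield read[i:i + k], read[i + 1:i + k + 1]

def reduce_edges(pairs):
    """Group sorted (u, v) pairs by key, mimicking the shuffle/sort phase,
    and collect each k-mer's distinct successors."""
    return {u: sorted({v for _, v in group})
            for u, group in groupby(sorted(pairs), key=lambda p: p[0])}

pairs = [p for read in ["ACGTC", "CGTCA"] for p in map_read(read, 3)]
graph = reduce_edges(pairs)
# graph: {"ACG": ["CGT"], "CGT": ["GTC"], "GTC": ["TCA"]}
```

The duplicate edge ("CGT", "GTC") emitted by both reads collapses during grouping, just as duplicate key-value pairs arriving at one reducer would.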
Since the graph is distributed over many compute nodes, Contrail then uses a combination of parallel randomized list ranking to recognize and simplify non-branching, repeat-free regions of the genome, and parallel network motif finding to correct errors and resolve short repeats. The graph transformations include: (a) linear subpaths of the de Bruijn graph are compressed into single nodes while maintaining overall topology; (b) bubbles caused by sequencing errors are merged; (c) a split-decision node r is resolved into two copies r1 and r2; (d) if mate pairs are available, they are bundled and used to construct a linear sequence.
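Transformation (a) can be illustrated with a serial sketch (Contrail performs this in parallel via randomized list ranking; the simplified code below is hypothetical, not Contrail's implementation):

```python
# Compress non-branching linear paths of a de Bruijn graph into contigs.
# graph maps each k-mer to a list of successor k-mers.

def compress_linear_paths(graph):
    """Return a dict mapping each path's start k-mer to its merged sequence."""
    indegree = {}
    for u, vs in graph.items():
        for v in vs:
            indegree[v] = indegree.get(v, 0) + 1
    # A path starts at any node that is not the unique continuation of
    # exactly one predecessor (e.g. sources and branch targets).
    starts = [u for u in graph if indegree.get(u, 0) != 1]
    contigs = {}
    for start in starts:
        seq, node = start, start
        while len(graph.get(node, [])) == 1:
            nxt = graph[node][0]
            if indegree.get(nxt, 0) != 1:
                break  # nxt is a merge point; stop the linear path here
            seq += nxt[-1]  # extend by the single new base of the next k-mer
            node = nxt
        contigs[start] = seq
    return contigs

g = {"ACG": ["CGT"], "CGT": ["GTC"], "GTC": ["TCA"], "TCA": []}
# the whole non-branching chain collapses into one contig: "ACGTCA"
```

The parallel version achieves the same result without ever holding the whole graph on one machine: list ranking lets each node learn its position in its chain through a logarithmic number of MapReduce rounds.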
- Jnomics: Cloud-Scale Sequence Analysis
- CloudBurst: Highly Sensitive Short Read Mapping with MapReduce
- Crossbow: Read Mapping and SNP calling in the clouds
- Myrna: Cloud-Scale differential expression of RNAseq
This work was supported in part by NIH grants R01-LM006845 and R01-HG004885, DHS award NBCH207002, and the NSF Cluster Exploratory Program (CluE) - grant IIS-0844494.