CloudBurst: Highly Sensitive Short Read Mapping with MapReduce
Center for Bioinformatics and Computational Biology, University of Maryland
Bioinformatics 2009 25(11):1363-1369
Next-generation DNA sequencing machines are generating an enormous amount of sequence data, placing unprecedented demands on traditional single-processor read mapping algorithms. CloudBurst is a new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics. It is modeled after the short read mapping program RMAP, and reports either all alignments or the unambiguous best alignment for each read with any number of mismatches or differences. This level of sensitivity could be prohibitively time consuming, but CloudBurst uses the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes.
CloudBurst's running time scales linearly with the number of reads mapped, and with near linear speedup as the number of processors increases. In a 24-processor core configuration, CloudBurst is up to 30 times faster than RMAP executing on a single core, while computing an identical set of alignments. In a large remote compute clouds with 96 cores, CloudBurst reduces the running time from hours to mere minutes for typical jobs involving mapping of millions of short reads to the human genome. CloudBurst is available open-source as a model for parallelizing other bioinformatics algorithms with MapReduce.
CloudBurst uses well-known seed-and-extend algorithms to map reads to a reference genome. It can map reads with any number of differences or mismatches using the observation that for an r bp read to align to the reference with at most k differences, the alignment must have a region of length s=r/k+1 called a seed that exactly matches the reference. For example, if a 30bp read aligns to a reference with at 1 difference, there must be a substring of at least 15bp that aligns exactly (if the first base is different, the remaining 29 must match; if the 2nd base is different then the remaining 28 must match; an so forth.) Given an exact seed, CloudBurst attempts to extend the alignment into an end-to-end alignment with at most k mismatches or differences by either counting mismatches of the two sequences, or with a dynamic programming algorithm to allow for gaps.
CloudBurst uses the Hadoop implementation of MapReduce to catalog and extend the seeds. In the map phase, the map function emits all length-s k-mers from the reference sequences, and all non-overlapping length-s kmers from the reads. In the shuffle phase, read and reference kmers are brought together. In the reduce phase, the seeds are extended into end-to-end alignments. The power of MapReduce and CloudBurst is the map and reduce functions run in parallel over dozens or hundreds of processors.
The running time of a very sensitive alignment algorithms, such as CloudBurst, are usually very high because there will be many random seed matches to evaluate. However, CloudBurst is designed to scale, and the running time of CloudBurst scales linearly with the number of processors that it runs on, i.e., running on 10 processors takes about 1/10 the time of 1 processor and running on 100 processors takes about the 1/100 the time as 1. Therefore CloudBurst can compute very sensitive mapping quickly even at high sensitivity. Researchers can run CloudBurst on their local Hadoop clusters, or on large remote compute clouds. For example, Amazon sells compute time on their Elastic Compute Cloud (EC2) starting at $0.10 per hour per machine. See [Running_CloudBurst_on_Amazon_EC2] for details.
1. Setup and install Hadoop on your cluster
2. Download CloudBurst
3. Convert your reference and read data to binary format:
$ java -jar ConvertFastaForCloud.jar ref.fa ref.br $ java -jar ConvertFastaForCloud.jar qry.fa qry.br
ConvertFastaForCloud will report some statistics on your data sets. Keep track of the minimum read length in qry.fa, as this value will be needed for step 5.
4. Copy data files into the HDFS: (note /path/to/data is the path within the HDFS)
$ hadoop fs -put ref.br /path/to/data $ hadoop fs -put qry.br /path/to/data
5. Launch CloudBurst (Assume 36bp reads, 3 mismatches, 24 nodes, best-alignment only)
$ hadoop jar /path/to/CloudBurst.jar /path/to/data/ref.br /path/to/data/qry.br \ /path/to/results 36 3 0 1 240 48 24 24 128 16 >& cloudburst.err
The paths /path/to/data/ref.br, /path/to/data/qry.br and /path/to/results are all paths inside of the HDFS. The command captures the status output of CloudBurst to a file called cloudburst.err. You can also monitor the progress of your job in the Hadoop JobTracker.
6. Copy the results back to the regular filesystem, and convert to a tab-deliminated file
$ hadoop fs -get /path/to/results results $ java -jar PrintAlignemnts.jar results > results.txt
The output format is 6 column BED (Browser Extensible Data) format, after the format used by the UCSC Genome Browser. The 6 columns give the following information:
chrom start end name score strand
The fields are:
1. chrom: name of the reference sequence 2. start: start coordinate on reference 3. end: end coordinate on reference 4. name: name of read 5. score: number of mismatches/differences 6. strand: +/- for forward/reverse
See [Sample_Results] for sample data and expected output.
This work was supported in part by NIH grant R01 LM006845 and DHS award NBCH207002. I would also like to thank the generous hardware support of IBM and Google via the Academic Cloud Computing Initiative used in the development of CloudBurst, and the Amazon Web Services Hadoop Testing Program for providing access to the EC2.