Hadoop for Computational Biology
From cloudburst-bio
Here is my response to a question about some of CloudBurst's limitations, and the future of MapReduce/Hadoop in Computational Biology.
Thank you for your interest in CloudBurst. I think CloudBurst is most interesting because, as you say, it is the first attempt to parallelize a bioinformatics algorithm with MapReduce/Hadoop. From a systems perspective, a 100x speedup from a couple of months' work is a fantastic result. However, I also appreciate the user's perspective, and I admit that CloudBurst doesn't yet have every feature that every user may need. Paired-end support, quality values in the mapping, and colorspace alignment are all at the top of the list for future versions (support for indels has been present since the beginning). I decided to publish without these features only so that others could start to think about MapReduce/Hadoop from an algorithm and systems perspective as quickly as possible.
I hope your users are not discouraged from using MapReduce/Hadoop because of CloudBurst's limitations. CloudBurst is just the beginning of the story, not the end. At its core, MapReduce/Hadoop is a massive sorting engine, and a great many problems can be solved with it. For day-to-day tasks, I find MapReduce/Hadoop extremely useful for scaling up ad hoc scripts to analyze massive datasets of 100 GB or more. For more systematic tasks, we are also developing an entire series of new algorithms based on MapReduce/Hadoop that directly address the limitations of CloudBurst and also add entirely new capabilities.
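To make the "sorting engine" view concrete, here is a minimal single-process sketch in Python of the map, sort, and reduce flow. It is not Hadoop code and not part of CloudBurst; the k-mer counting task and the names map_kmers, reduce_counts, and run_mapreduce are assumptions chosen only for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def reduce_counts(kmer, values):
    """Reduce step: sum the counts collected for one key."""
    return kmer, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory stand-in for the map -> sort -> reduce flow."""
    pairs = [pair for record in records for pair in mapper(record)]
    # The sort is what brings all values for the same key together.
    pairs.sort(key=itemgetter(0))
    return [reducer(key, (value for _, value in group))
            for key, group in groupby(pairs, key=itemgetter(0))]


# Hypothetical toy reads, purely for illustration.
reads = ["ACGTAC", "CGTACG"]
counts = dict(run_mapreduce(reads, map_kmers, reduce_counts))
```

On a real cluster the framework performs the sort across many machines; the mapper and reducer keep roughly this shape, which is why ad hoc scripts are comparatively easy to scale up this way.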
For example, Ben Langmead and I are nearly ready to publish and distribute a cloud version of Bowtie called Crossbow (http://bowtie-bio.sf.net/crossbow). In addition to aligning reads to a reference genome with Bowtie, Crossbow uses MapReduce/Hadoop's massive sort engine to order the alignments along the genome, and then genotypes the sample using the program SOAPsnp (http://soap.genomics.org.cn/soapsnp.html). Our goal was to create a pipeline that could quickly and accurately reproduce the analysis of a recent whole genome study (http://www.nature.com/nature/journal/v456/n7218/abs/nature07484.html), and we have accomplished exactly that: Crossbow can genotype a human in about 3 hours at >99% accuracy on a 320-core cluster. As input it aligns a mix of 3 billion paired-end and unpaired reads (110 GB of compressed sequence data), and as output it catalogs all the SNPs in the genome. Our tests included running the pipeline on Amazon's EC2, where an end-to-end run costs about $100. This is undoubtedly a fantastic result from both a user's and a systems perspective: it is accurate, fast, and cheap, squeezing 1000 hours of computation into an afternoon, all made possible with MapReduce/Hadoop. Now that we are starting to think in MapReduce/Hadoop terms, several natural extensions to Crossbow are apparent, and we are designing extensions to analyze copy number variations, RNA-seq data, Methyl-Seq, ChIP-seq, structural variations, and so on. I'm also nearly done with a MapReduce/Hadoop-based de novo assembler that scales to assemble mammalian genomes from short reads.
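The idea of ordering alignments along the genome and then genotyping each site can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the input records, the reference lookup, and the simple majority-vote call are all hypothetical and are not how Bowtie, Crossbow, or SOAPsnp actually work.

```python
from collections import Counter
from itertools import groupby

# Hypothetical aligned bases: (chromosome, position, observed base).
aligned_bases = [
    ("chr2", 5, "T"), ("chr1", 10, "A"), ("chr1", 7, "G"),
    ("chr1", 10, "A"), ("chr1", 7, "G"), ("chr1", 10, "C"),
]
# Hypothetical reference bases at the covered positions.
reference = {("chr1", 7): "G", ("chr1", 10): "C", ("chr2", 5): "T"}


def site(record):
    """Sort/group key: the genomic site (chromosome, position)."""
    return record[0], record[1]


def call_variants(records, reference):
    """Sort by site, then report sites whose majority base differs
    from the reference (a toy stand-in for a real genotyper)."""
    calls = []
    for key, group in groupby(sorted(records, key=site), key=site):
        bases = Counter(base for _, _, base in group)
        consensus, _ = bases.most_common(1)[0]
        if consensus != reference[key]:
            calls.append((key[0], key[1], reference[key], consensus))
    return calls


variants = call_variants(aligned_bases, reference)
```

The sort is the step a MapReduce framework would perform at scale: once every observation for a site is adjacent, each site can be examined independently and in parallel.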
In short, there is no shortage of opportunities for using MapReduce/Hadoop in computational biology, so if your users are skeptical now, I just ask that they be patient a little longer and reserve judgment on MapReduce/Hadoop until we can publish a few more results.
Thanks again for your interest. Please let me know if you have any questions.
Michael Schatz
