CloudBurst for Computer Scientists
I was recently asked to describe CloudBurst for someone that was a computer scientist, and had experience with MapReduce & Hadoop, but little experience with biological sequence analysis.
Here is my response:
First, let me try to describe the program with an analogy.
Say you have a large text, like all of the pages of the wikipedia, and you wanted to find all references to you by your first name (Shirley), and your 1000 closest friends. You could just find references that have exactly your name, but since the wikipedia has typos and other errors in it, you might want to find references to your name with some error in it like 'Sherley', 'Shirly', or 'Shirrly' or other misspellings of your name. Notice there are 3 types of errors to consider: substitution errors (i/e in Shirley/Sherley), deletions (e/- in Shirley/Shirly), and insertions (-/r in Shirley/Shirrly with a deletion of the e).
You could try entering a set of regular expressions into the search box to try to match all possible variations of your name. If '?' matches any single character then the following set of 7 regular expressions will cover all possible single substitution errors: Shirle?, Shirl?y, Shir?ey, Shi?ley, Sh?rley, S?irley, ?hirley. In the end you would have to scan the wikipedia 7 times, for your name and 7 times for each of your 1000 friends names. Pretty quickly that becomes a very number of passes through the wikipedia.
Instead, lets split your name into 2 parts, the left hand side (Shir) and the right hand side (ley). Notice if your name appears someplace in the wikipedia with at most 1 error, then either the left hand side or the right hand side *must* occur without an error. There are 2 parts, but only 1 error, and there is no way 1 error can interfere with both parts. Notice in each of the regular expressions above, either 'Shir' or 'ley' occurs without a question mark. Now say you had preprocessed the wikipedia and built a catalog of every page and position that 'Shir' or 'ley' occurred. Then, to find occurrences of 'Shirley', your program could use that catalog to quickly examine the next 3 characters after each place 'Shir' occurs. If the 3 chacters were within 1 error from 'ley', your program could report an (approximate) occurrence of 'Shirley' at position XYZ of page ABC of the wikipedia. It would then also check every position 'ley' occurs and see if the proceeding 4 characters were within 1 error of 'Shir'. Given the catalog of where the "parts" occur, your program only has to explicitly check regions that could possibly have an occurrence and skip over (the vast majority) of the wikipedia that doesn't. The same catalog can be used for your friends names too, without any changes to the program and without any additional scans of the reference.
In CloudBurst, the large text is a reference genome. The file s_suis.br has the complete genome for a bacteria called Streptococcus suis, which is about 2 million DNA characters long (a much smaller genome than the human genome, but sufficient for ensuring everything is working correctly). Instead of names, 100k.br is a collection of 100,000 short DNA sequences called reads, each 36 characters long (they are real sequencing reads, but just a small fraction of what would be needed to study the human genome). The goal is to find where each read occurs in the genome allowing for a fixed amount of errors: up to K substitution errors, or up to K substitution, insertion, or deletion errors depending on the command line arguments. This processes is often called read mapping, because you try to 'map' the read to the reference.
The map function of the first MapReduce cycle does one of two things, depending on if the input sequence is a read or reference. For the reads, it splits the read into K+1 parts, where each part is the same length: 36/(K+1) characters long. Call each of these parts a seed, because they fix the position as you check for an end-to-end mapping ('Shir' or 'ley' were seeds in the wikipedia search, although in CloudBurst they would all be the same length). The map function emits the key value pair (seed, MerInfo), where seed is a DNA sequence and MerInfo is a tuple with information describing where the seed came from. Since the seeds are all the same length, they are also called mers, where a mer is a fixed length substring of DNA characters. The map function will emit a total of (K+1) seeds for each read. As a technical note, since we don't know the orientation of a read (if it was on the 5' or 3' DNA strand), the reverse complement of each read is also processed and (K+1) additional reverse complement seeds are emitted with isRC=1 in the MerInfo.
If instead the sequence was a reference, it also emits the key-value pairs (Seed, MerInfo), but now it emits the seed from every position along the reference sequence. If the reference text was 'Shirley studies computer science' and K was picked to have 3 character seeds, it would emit seeds for 'Shi', 'hir', 'irl', 'rle', 'ley' and so forth. DNA sequences do not have any punctuation, so it has to emit all length |s| mers starting at every postion. All together it emits |R|-|s|+1 reference seeds where |R| is the length of the reference, and |s| is the length of the seed.
The sort/shuffle phase groups together seeds that have the same DNA sequence, without regard to if the seed is from a reference or read. At the end of the phase there will be one group of values for each seed thus building a catalog of where each seed occurs in the reference and read sets.
The reduce function is called once for each seed with a list of MerInfo records. If the seed occurs R times in the reference sequence and Q times in the read set it will evaluate a total of RxQ total potential mappings by counting errors at the proceeding/succeeding characters in the read and reference for each seed. If the end-to-end mapping has at most K errors, it will report the mapping to a file.
It is possible to split up the evaluation of seeds, as long as the entire set of seeds is eventually evaluated. For example, some simple sequence seeds (like 'AAAAA') occur a very large number of times in the reference or read sequences and unbalances the reduce jobs. This is handled by redundantly emitting seeds to further partition those seeds. For example when 'AAAAA' is emited from the reference with redundancy set to 4, the seeds AAAAA-0, AAAAA-1, AAAAA-2, and AAAAA-3 are emitted with the same MerInfo records. Then when 'AAAAA' is emitted from a read, it is randomly assigned to one of the 4 redundant seeds. Ultimately all RxQ potential mappings are evaluated, but the work is split over 4 reduce function calls, each with R seeds and about .25xQ seeds.
The second optional MapReduce cycle filters the alignments to report the unambiguous best mapping for each read. Say in the wikipedia example, the first MapReduce cycle finds exactly one place with 'Shirley', but many other places with 'Sherley' or 'Shirly' or other off-by-one misspellings. In this case, the single occurrence of 'Shirley' is the unambiguous best mapping. If there were 2 occurrences of 'Shirley', then there is no unambiguous best alignment (there are 2 tied for best). Its a simple filtering protocol to determine how good a mapping is- some reads will occur from highly repetitive portions of the genome, and may map equally well to 100s or 1000s of positions. It is very difficult to analyze these reads, so the filtering excludes reads and mappings that are repetitive. The map function re-emits the mapping information, but with the read identifier as the key. The sort/shuffle phase groups together all alignments with the same read identifier, and the reduce function scans the list of mappings and keeps track of the best alignments. If an unambiguous best alignment exists, it is reported to a file. This MapReduce cycle could be started immediately after the first reduce function, but there is no mechanism in Hadoop to reshuffle on-the-fly.