
Hadoop Performance Techniques

Michael Schatz

Algorithms

The first step is to optimize your algorithm: an O(n) algorithm will be much, much faster than an O(n^2) algorithm in any environment. Similarly, if you have a search problem, ask yourself if there is a better index that you could use. Then consider the overall MapReduce strategy: a map-only program is usually faster than a map+reduce program, and a map+reduce program is usually faster than a program that iterates through multiple MapReduce cycles. For very large datasets, the runtime will be dominated by the number of MapReduce cycles, so anything you can do to reduce the number of cycles will be a huge benefit.
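As a hedged sketch of the map-only point (the class and file names below are illustrative, not from CloudBurst, and the exact API varies between Hadoop versions): setting the number of reduce tasks to zero turns a job into a map-only job, so Hadoop skips the shuffle and sort entirely and each mapper writes its output straight to HDFS.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyExample {

      // A trivial pass-through mapper standing in for the real per-record work.
      public static class PassThroughMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text(); // reused across calls

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          outKey.set(Long.toString(key.get()));
          context.write(outKey, value);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyExample.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // the key line: no reducers, so no shuffle or sort
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }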

Profiling

Once you are happy with the algorithm and design, the next most important step is to optimize the runtime of your program on a single machine. Any time you save on a single machine will be multiplied X-fold when you run on X machines. A good starting point is to use a profiler to measure where your program is spending time. For example, on the Mac I really like Shark for peeking inside my C, C++, or Java program to see where in the code the program spends its time. Other languages and environments have other profilers, such as gprof for C/C++ and DProf for Perl. It doesn't make sense to optimize rarely used functions, so focus on the most commonly used and slowest functions.
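If a full profiler isn't handy, even crude manual timing can point at the hot spots. Here is a minimal sketch (the function names are made up for illustration): wrap the candidate steps with System.nanoTime() and accumulate the totals, which is often enough to tell which of a handful of functions dominates the runtime.

    public class TimingSketch {
      public static void main(String[] args) {
        long parseTime = 0, alignTime = 0;
        for (int i = 0; i < 1000; i++) {
          long t0 = System.nanoTime();
          parseRecord(i);                         // stand-in for a cheap step
          parseTime += System.nanoTime() - t0;

          long t1 = System.nanoTime();
          alignRecord(i);                         // stand-in for the expensive step
          alignTime += System.nanoTime() - t1;
        }
        System.out.printf("parse: %.1f ms, align: %.1f ms%n",
            parseTime / 1e6, alignTime / 1e6);
      }

      private static void parseRecord(int i) { /* placeholder work */ }
      private static void alignRecord(int i) { /* placeholder work */ }
    }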

It sounds obvious, but the best way to speed up a slow function is to make it do less. Look for code that gets executed but doesn't really contribute to the output in a meaningful way, and simplify it down to the bare minimum. One common task is to concatenate many strings into a single string. Instead of appending them one at a time, you can often save a lot of time by preallocating a buffer that is exactly the right size, and then copying each string into the buffer at exactly the right offset. Better yet, instead of copying from one buffer to another, can you compute the same result without copying at all? In higher-level languages, you also have to be really careful about "hidden" work, like creating millions of temporary objects that aren't really necessary. Instead of allocating a new object each time, see if there is a way to reuse an object.
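As a hedged sketch of the buffer-preallocation and object-reuse points (the names are illustrative, not from CloudBurst): compute the final size once so the buffer never has to grow and re-copy itself, and set() a single reusable object in a tight loop instead of allocating millions of temporaries.

    import java.util.List;
    import org.apache.hadoop.io.Text;

    public class ReuseSketch {

      // Concatenate many strings: size the StringBuilder up front so it never
      // has to grow and re-copy its internal buffer while appending.
      public static String joinPreallocated(List<String> parts) {
        int total = 0;
        for (String s : parts) {
          total += s.length();
        }
        StringBuilder sb = new StringBuilder(total); // exactly the right capacity
        for (String s : parts) {
          sb.append(s);
        }
        return sb.toString();
      }

      // Object reuse: one Text instance is set() over and over instead of
      // allocating a fresh object per record (a common pattern in Hadoop mappers).
      private final Text outValue = new Text();

      public Text format(String record) {
        outValue.set(record);
        return outValue;
      }
    }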

Locality

Next, since you are probably using Hadoop to process some huge datasets, you might want to explore basic work reordering techniques for improving locality. For example, CloudBurst had something like

    for (int q = 0; q
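Purely as a generic illustration of work reordering (this is not the actual CloudBurst loop, which is cut off above): the goal is to visit data in an order that walks memory sequentially, so the cache does most of the work. Compare a column-major scan of a large 2D array, which strides across memory, with the row-major version that touches consecutive elements.

    public class LocalitySketch {

      // Cache-unfriendly: the inner loop jumps between different row arrays,
      // so on a large matrix nearly every access pulls in a new cache line.
      public static long sumColumnMajor(int[][] a) {
        long sum = 0;
        for (int q = 0; q < a[0].length; q++) {
          for (int r = 0; r < a.length; r++) {
            sum += a[r][q];
          }
        }
        return sum;
      }

      // Cache-friendly: swapping the loops makes the inner loop walk each row
      // array sequentially, the same arithmetic with far fewer cache misses.
      public static long sumRowMajor(int[][] a) {
        long sum = 0;
        for (int r = 0; r < a.length; r++) {
          for (int q = 0; q < a[r].length; q++) {
            sum += a[r][q];
          }
        }
        return sum;
      }
    }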

