MapReduce is a software framework for distributed computation, introduced by Google in 2004 to support the distributed processing of massive datasets on clusters of commodity servers. In a MapReduce system, the data being processed is distributed across the disks of the cluster nodes, and the processing task is pushed to the nodes where the data is stored. This is in contrast to more traditional distributed frameworks, where the data is pushed to the computation nodes.
Logically, the MapReduce computational model consists of two steps, Map and Reduce, which are both defined in terms of <key, value> pairs. The dataset being processed is also considered to consist of <key, value> pairs. For example, the keys could be filenames and the values could be the actual contents of the files; or the keys could be the line numbers of a text file and the values could be the text on the respective lines. In some cases, the key or value isn't important and could be Null.
The Map function takes a <key, value> pair as input and emits a list of <key, value> pairs:

Map(key_in, value_in) → [<key_out(1), value_out(1)>, ..., <key_out(n), value_out(n)>]
Keys emitted by the Map function do not have to be unique; the same key may be emitted multiple times, with the same or different values. The Map function is applied to every item in the input dataset. It only considers the current item being processed and is independent of the other items, which means the Map function can potentially be applied to all the data items in parallel. The output from all the parallel Maps is sorted and grouped (combined) by key, creating a <key(i), [value(1), ..., value(n)]> pair for each unique key.
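
As a concrete illustration (a minimal sketch, not code from ARCOMEM), a word-count style Map function written against the Hadoop Java API could look like the following. It emits a <token, 1> pair for every token in the line of text it is given, so the same key will usually be emitted many times:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line in the input file, value = the line of text
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // emit <token, 1>; keys need not be unique
            }
        }
    }
}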
Each grouped <key(i), [value(1), ..., value(n)]> pair is then processed by a Reduce function. In the original MapReduce paper, the Reduce function returned a list of values:

Reduce(<key(i), [value(1), ..., value(n)]>) → [value_out(1), ..., value_out(n)]
Modern interpretations of MapReduce, such as the Hadoop framework used in ARCOMEM, are more flexible and allow the Reduce function to emit multiple <key, value> pairs (allowing the same key to be emitted multiple times):

Reduce(<key(i), [value(1), ..., value(n)]>) → [<key_out(1), value_out(1)>, ..., <key_out(n), value_out(n)>]
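
Continuing the hypothetical word-count sketch, a corresponding Hadoop Reduce function receives one key together with all of its grouped values and, in the modern interpretation, writes out <key, value> pairs of its own:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // values contains every count emitted for this key by the Map functions
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // emit <token, total count>
    }
}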
In both cases, the number of Reduce functions that can be performed in parallel is limited by the number of unique keys output by the Map functions. Modern frameworks, such as Hadoop, also allow an extra task called a Combine to be specified, which occurs after the Map function but before the keys are sorted, merged and passed to Reduce. Functionally, Combine has the same semantics as Reduce, but it is run directly on the data emitted from a number of Map calls before that data is written to disk. The principal use of Combine is to reduce the amount of data that needs to be passed to Reduce.
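
In Hadoop, the Combine step is configured on the job rather than written as a separate kind of task. In the word-count sketch above, the SumReducer class can double as the combiner because its input and output types match. The driver below is illustrative only; the class names are assumptions, not ARCOMEM code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregate map output locally
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because addition is associative and commutative, pre-summing the counts in the combiner does not change the final result but does reduce the volume of intermediate data shuffled to the reducers, which is exactly the purpose of Combine described above.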
The MapReduce computational model is coupled with a distributed filesystem and with worker processes spread across the cluster nodes in order to complete the framework. This arrangement has a number of desirable features, such as data locality (the computation is moved to the data rather than the data to the computation) and tolerance of individual node failures.
In the ARCOMEM offline processing phase, the modules (with the exception of the local modules) are all implemented as MapReduce jobs. Most of the modules involve applying some processing to documents crawled by the web crawler, and are implemented as standard modules. Standard modules are run directly over the HBase table populated by the crawler. In a standard module, the input key of each Map function is the URL of a document, and the input value is a ResultResourceWrapper that represents the data and metadata of the web object belonging to that URL.
Some of the modules, such as the image processing module, only need to process one individual document at a time. Such modules are described as Map-heavy because they do not use a reducer at all (in terms of implementation, they actually use a NullReducer). Most Map-heavy ARCOMEM modules do not even emit anything to the MapReduce framework from within the Map function, but instead rely on writing directly to the knowledge base.
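
A rough sketch of such a Map-heavy module is shown below; it is not the actual ARCOMEM implementation. The generic Writable interface stands in for the project-specific ResultResourceWrapper value type, and writeToKnowledgeBase is a hypothetical placeholder for the real knowledge-base client:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;

// Map-heavy module sketch: the input key is the document URL and the input
// value is the crawled object (in ARCOMEM, a ResultResourceWrapper; the
// generic Writable interface is used here as a stand-in).
public class ImageProcessingMapper extends Mapper<Text, Writable, NullWritable, NullWritable> {
    @Override
    protected void map(Text url, Writable resource, Context context)
            throws IOException, InterruptedException {
        // Process the single document and write the results straight to the
        // knowledge base rather than emitting anything to the framework.
        writeToKnowledgeBase(url.toString(), resource);  // hypothetical helper
    }

    private void writeToKnowledgeBase(String url, Writable resource) {
        // Placeholder: a real module would call its triple-store client here.
    }
}

Since nothing is emitted to the framework, a job like this could pair the mapper with Hadoop's NullOutputFormat so that the empty output is simply discarded.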
Other modules, such as the GATE NLP module, need to run on collections of documents. These modules are termed Reduce-heavy, as they perform most of their analysis in the Reduce function. Reduce-heavy modules typically use a very lightweight Map function implementation that groups together collections of documents by emitting multiple documents with the same key.
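
A minimal sketch of the Reduce-heavy pattern (again, not the actual ARCOMEM code, and using an assumed single batch key as the grouping scheme) might look like this:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Lightweight Map: tag every document with the same (assumed) batch key so
// that a whole collection of documents arrives at a single Reduce call.
class GroupingMapper extends Mapper<Text, Writable, Text, Text> {
    private static final Text BATCH_KEY = new Text("batch-0");

    @Override
    protected void map(Text url, Writable resource, Context context)
            throws IOException, InterruptedException {
        context.write(BATCH_KEY, url);  // emit <batch key, document URL>
    }
}

// Heavyweight Reduce: the expensive analysis runs over the grouped collection.
class CollectionReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text batchKey, Iterable<Text> urls, Context context)
            throws IOException, InterruptedException {
        List<String> documents = new ArrayList<>();
        for (Text url : urls) {
            documents.add(url.toString());  // copy; Hadoop reuses the Text object
        }
        // Placeholder for the collection-level analysis (e.g. NLP over the batch).
        context.write(batchKey, new Text("processed " + documents.size() + " documents"));
    }
}

Because the degree of Reduce parallelism is bounded by the number of unique keys, a real module would normally spread the documents over several batch keys rather than a single one, so that more than one reducer can run at a time.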
Finally, there are also balanced modules, which take advantage of both the Map and Reduce functions to perform their computation. An example of this is the module responsible for computing the distribution of MIME types across all the documents in a crawl. The MIME-type module also demonstrates the use of a Combine function for efficiency. More details are given on the module implementation page.
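
As an indication of what such a balanced module might look like (a sketch under assumed types, not the implementation described on that page), the Map function could emit <mime-type, 1> for each document, while the same summing logic serves as both the Combine and the Reduce step:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit <mime-type, 1> for every crawled document. The MIME type would be
// read from the crawled object's metadata; extractMimeType is a hypothetical helper.
class MimeTypeMapper extends Mapper<Text, Writable, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(Text url, Writable resource, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(extractMimeType(resource)), ONE);
    }

    private String extractMimeType(Writable resource) {
        return "application/octet-stream";  // placeholder for a real metadata lookup
    }
}

// Reduce (also registered as the Combine step): sum the counts per MIME type.
class MimeTypeCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text mimeType, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        context.write(mimeType, new IntWritable(total));  // emit <mime-type, count>
    }
}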