MapReduce is a software framework for distributed computation, introduced by Google in 2004 to support the distributed processing of massive datasets on clusters of commodity servers. In a MapReduce system, the data being processed is distributed across the disks of the cluster nodes, and the processing task is pushed to the nodes where the data is stored. This is in contrast to more traditional distributed frameworks, where the data is pushed to the computation nodes.
Logically, the MapReduce computational model consists of two steps, Map and Reduce, which are both defined in terms of <key, value> pairs. The dataset being processed is also considered to consist of <key, value> pairs. For example, the keys could be filenames and the values could be the actual contents of the files; or the keys could be the line numbers of a text file and the values could be the text on the respective lines. In some cases, the key or value isn't important and could be Null.
The Map function takes a <key, value> pair as input and emits a list of <key, value> pairs:

Map(key_in, value_in) → [<key_out(1), value_out(1)>, ..., <key_out(n), value_out(n)>]
Keys emitted by the Map function do not have to be unique; the same key may be emitted multiple times, with the same or different values. The Map function is applied to every item in the input dataset. It only considers the current item being processed and is independent of the other items, which means the Map function can potentially be applied to all the data items in parallel. The output from all the parallel Maps is sorted and grouped (combined) by key, creating a <key(i), [value(1), ..., value(n)]> pair for each unique key.
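
As a concrete illustration (a minimal sketch, not code from ARCOMEM), a word-count style Map function written against the Hadoop Java API could look like the following. It emits a <token, 1> pair for every token in the line of text it is given, so the same key will usually be emitted many times:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line in the input file, value = the line of text
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // emit <token, 1>; keys need not be unique
            }
        }
    }
}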
Each grouped <key(i), [value(1), ..., value(n)]> pair is then processed by a Reduce function. In the original MapReduce paper, the Reduce function returned a list of values:

Reduce(<key(i), [value(1), ..., value(n)]>) → [value_out(1), ..., value_out(n)]
Modern interpretations of MapReduce, such as the Hadoop framework used in ARCOMEM, are more flexible and allow the Reduce function to emit multiple <key, value> pairs (allowing the same key to be emitted multiple times):

Reduce(<key(i), [value(1), ..., value(n)]>) → [<key_out(1), value_out(1)>, ..., <key_out(n), value_out(n)>]
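
Continuing the hypothetical word-count sketch, a corresponding Hadoop Reduce function receives one key together with all of its grouped values and, in the modern interpretation, writes out <key, value> pairs of its own:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // values contains every count emitted for this key by the Map functions
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // emit <token, total count>
    }
}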
In both cases, the number of Reduce functions that can be performed in parallel is limited by the number of unique keys output by the Map functions. Modern frameworks, such as Hadoop, also allow an extra task called a Combine to be specified, which occurs after the Map function but before the keys are sorted, merged and passed to Reduce. Functionally, Combine has the same semantics as Reduce, but it is run directly on the data emitted from a number of Map calls before that data is written to disk. The principal use of Combine is to reduce the amount of data that needs to be passed to Reduce.
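
In Hadoop, the Combine step is configured on the job rather than written as a separate kind of task. In the word-count sketch above, the SumReducer class can double as the combiner because its input and output types match. The driver below is illustrative only; the class names are assumptions, not ARCOMEM code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregate map output locally
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because addition is associative and commutative, pre-summing the counts in the combiner does not change the final result but does reduce the volume of intermediate data shuffled to the reducers, which is exactly the purpose of Combine described above.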
The MapReduce computational model is coupled with a distributed filesystem and with worker processes spread across the cluster nodes in order to complete the framework. This arrangement has a number of desirable features, such as data locality (the computation is moved to the data rather than the data to the computation) and tolerance of individual node failures.
In the ARCOMEM offline processing phase, the modules (with the exception of the local modules) are all implemented as MapReduce jobs. Most of the modules involve applying some processing to documents crawled by the web crawler, and are implemented as standard modules. Standard modules are run directly over the HBase table populated by the crawler. In a standard module, the input key of each Map function is the URL of a document, and the input value is a ResultResourceWrapper that represents the data and metadata of the web object belonging to that URL.
Some of the modules, such as the image processing module, only need to process one individual document at a time. Such modules are described as Map-heavy because they do not use a reducer at all (in terms of implementation, they actually use a NullReducer). Most Map-heavy ARCOMEM modules do not even emit anything to the MapReduce framework from within the Map function, but instead rely on writing directly to the knowledge base.
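
A rough sketch of such a Map-heavy module is shown below; it is not the actual ARCOMEM implementation. The generic Writable interface stands in for the project-specific ResultResourceWrapper value type, and writeToKnowledgeBase is a hypothetical placeholder for the real knowledge-base client:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;

// Map-heavy module sketch: the input key is the document URL and the input
// value is the crawled object (in ARCOMEM, a ResultResourceWrapper; the
// generic Writable interface is used here as a stand-in).
public class ImageProcessingMapper extends Mapper<Text, Writable, NullWritable, NullWritable> {
    @Override
    protected void map(Text url, Writable resource, Context context)
            throws IOException, InterruptedException {
        // Process the single document and write the results straight to the
        // knowledge base rather than emitting anything to the framework.
        writeToKnowledgeBase(url.toString(), resource);  // hypothetical helper
    }

    private void writeToKnowledgeBase(String url, Writable resource) {
        // Placeholder: a real module would call its triple-store client here.
    }
}

Since nothing is emitted to the framework, a job like this could pair the mapper with Hadoop's NullOutputFormat so that the empty output is simply discarded.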
Other modules, such as the GATE NLP module, need to run on collections of documents. These modules are termed Reduce-heavy, as they perform most of their analysis in the Reduce function. Reduce-heavy modules typically use a very lightweight Map function implementation that groups together collections of documents by emitting multiple documents with the same key.
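
A minimal sketch of the Reduce-heavy pattern (again, not the actual ARCOMEM code, and using an assumed single batch key as the grouping scheme) might look like this:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Lightweight Map: tag every document with the same (assumed) batch key so
// that a whole collection of documents arrives at a single Reduce call.
class GroupingMapper extends Mapper<Text, Writable, Text, Text> {
    private static final Text BATCH_KEY = new Text("batch-0");

    @Override
    protected void map(Text url, Writable resource, Context context)
            throws IOException, InterruptedException {
        context.write(BATCH_KEY, url);  // emit <batch key, document URL>
    }
}

// Heavyweight Reduce: the expensive analysis runs over the grouped collection.
class CollectionReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text batchKey, Iterable<Text> urls, Context context)
            throws IOException, InterruptedException {
        List<String> documents = new ArrayList<>();
        for (Text url : urls) {
            documents.add(url.toString());  // copy; Hadoop reuses the Text object
        }
        // Placeholder for the collection-level analysis (e.g. NLP over the batch).
        context.write(batchKey, new Text("processed " + documents.size() + " documents"));
    }
}

Because the degree of Reduce parallelism is bounded by the number of unique keys, a real module would normally spread the documents over several batch keys rather than a single one, so that more than one reducer can run at a time.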
Finally, there are also balanced modules, which take advantage of both the Map and Reduce functions to perform their computation. An example of this is the module responsible for computing the distribution of MIME types across all the documents in a crawl. The MIME-type module also demonstrates the use of a Combine function for efficiency. More details are given on the module implementation page.
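
As an indication of what such a balanced module might look like (a sketch under assumed types, not the implementation described on that page), the Map function could emit <mime-type, 1> for each document, while the same summing logic serves as both the Combine and the Reduce step:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit <mime-type, 1> for every crawled document. The MIME type would be
// read from the crawled object's metadata; extractMimeType is a hypothetical helper.
class MimeTypeMapper extends Mapper<Text, Writable, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(Text url, Writable resource, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(extractMimeType(resource)), ONE);
    }

    private String extractMimeType(Writable resource) {
        return "application/octet-stream";  // placeholder for a real metadata lookup
    }
}

// Reduce (also registered as the Combine step): sum the counts per MIME type.
class MimeTypeCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text mimeType, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        context.write(mimeType, new IntWritable(total));  // emit <mime-type, count>
    }
}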