Crawling the web generates large volumes of data. Processing such large datasets is non-trivial and must be distributed across many machines to complete in any reasonable amount of time. In addition, the sheer amount of data means that any data transfer required for the processing itself carries a considerable cost.
The MapReduce framework is designed for exactly this kind of problem: a large volume of data that needs processing across a cluster of machines. The framework assumes that the data is held on a distributed filesystem spread across the machines making up the cluster. When a MapReduce job is run, the framework attempts to optimise the assignment of processing tasks within the job so that each task runs on a machine that hosts the data it processes, pushing the computation to the data rather than pulling the data to the computation (see the section describing MapReduce).
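To make the programming model concrete, the following is a minimal sketch of a Hadoop MapReduce job written in Java. It is a generic word-count example rather than anything specific to ARCOMEM; the class names and input/output paths are placeholders. The framework splits the input into blocks held on HDFS and, where possible, schedules each map task on a machine that holds its block.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Minimal MapReduce example: counts word occurrences in text files held on HDFS. */
public class WordCount {

    /** Mapper: run, where possible, on the node holding the input split. */
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    /** Reducer: sums the counts emitted for each word. */
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}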
The ARCOMEM Offline Analysis Engine has been designed to take full advantage of the MapReduce framework and operates over the data that the crawler stores in the HBase database distributed across the cluster. The remainder of this document describes the components that make up the Offline Analysis Engine and explains how new offline analysis modules can be implemented and run.
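To illustrate what operating over the crawled data in HBase looks like in practice, the sketch below configures a map-only job with the standard TableMapReduceUtil helper from the HBase MapReduce library. The table name "crawl" and the column family and qualifier used here are assumptions for the example, not the actual ARCOMEM schema.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

/** Sketch of a MapReduce job that scans crawl records held in an HBase table. */
public class CrawlScanJob {

    /** Mapper invoked once per HBase row; the row key and column layout are illustrative. */
    public static class RecordMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                throws IOException, InterruptedException {
            byte[] payload = row.getValue(Bytes.toBytes("content"), Bytes.toBytes("raw"));
            if (payload != null) {
                // A real module would analyse the payload here; this just records its size.
                context.write(new Text(Bytes.toString(rowKey.get())),
                        new IntWritable(payload.length));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "crawl scan");
        job.setJarByClass(CrawlScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner batches suit MapReduce workloads
        scan.setCacheBlocks(false);  // avoid polluting the region server block cache

        TableMapReduceUtil.initTableMapperJob(
                "crawl",             // assumed table name, not the real ARCOMEM schema
                scan,
                RecordMapper.class,
                Text.class,
                IntWritable.class,
                job);

        job.setNumReduceTasks(0);    // map-only job for this sketch
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}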
The Offline Analysis Engine consists of two main types of component: analysis
modules and command-line tools for running the modules. There are four types of supported module: standard modules, local modules, HDFS modules and HBase modules.
Most modules that need access to the crawled data will be implemented as standard modules. Modules that only need to communicate with the knowledge base will be implemented as local modules. The HDFS and HBase module types are included so that any unforeseen processing scenarios can be handled.
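Purely to illustrate the division of responsibilities between the module types, the standard and local cases can be pictured along the lines of the hypothetical interfaces below. These names are invented for the example and are not the actual ARCOMEM module API.

import org.apache.hadoop.hbase.client.Result;

/**
 * Hypothetical shape of a "standard" module: it is handed crawled records one
 * at a time from inside a MapReduce task reading HBase. Illustration only.
 */
public interface StandardAnalysisModule {
    /** Called once per crawled record read from HBase. */
    void processRecord(byte[] rowKey, Result record) throws Exception;
}

/**
 * Hypothetical shape of a "local" module: it runs on a single machine and
 * talks to the knowledge base rather than to the crawled data. Illustration only.
 */
interface LocalAnalysisModule {
    /** knowledgeBaseUrl is an illustrative parameter, not a real configuration key. */
    void run(String knowledgeBaseUrl) throws Exception;
}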
There are two command-line tools included in the engine:
These tools can be used in an ad hoc fashion for testing modules, but they also form the basis for running the offline process within the ARCOMEM system as a whole.
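As a general illustration of how Hadoop command-line drivers of this kind are commonly structured, the skeleton below uses the standard Tool and ToolRunner classes so that generic Hadoop options (such as -D key=value) are handled automatically. The class name and argument handling are invented for this example and do not reflect the actual ARCOMEM tools.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Illustrative skeleton of a Hadoop command-line driver. The Tool/ToolRunner
 * pattern provides standard Hadoop option parsing before run() is invoked.
 */
public class RunModuleTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: RunModuleTool <module-class> [module args...]");
            return 1;
        }
        // A real tool would load the named module and submit the corresponding
        // MapReduce job here, using the Configuration obtained from getConf().
        System.out.println("Would run module: " + args[0]);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new RunModuleTool(), args));
    }
}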