Crawling the web generates large volumes of data. Processing such large datasets is non-trivial and must be distributed across many machines to complete in any reasonable amount of time. In addition, the sheer amount of data means that any data transfer required for the processing itself carries a considerable cost.
The MapReduce framework is designed for exactly this kind of problem: a large volume of data that needs processing across a cluster of machines. The framework assumes that the data is held on a distributed filesystem spread across the machines making up the cluster. When a MapReduce job is run, the framework attempts to optimise the assignment of processing tasks within the job so that each task runs on a machine that hosts the data it processes, pushing the computation to the data rather than pulling the data to the computation (see the section describing MapReduce).
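To make the programming model concrete, the following is a minimal sketch of a Hadoop MapReduce job written in Java. It is a generic word-count example rather than anything specific to ARCOMEM; the class names and input/output paths are placeholders. The framework splits the input into blocks held on HDFS and, where possible, schedules each map task on a machine that holds its block.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Minimal MapReduce example: counts word occurrences in text files held on HDFS. */
public class WordCount {

    /** Mapper: run, where possible, on the node holding the input split. */
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    /** Reducer: sums the counts emitted for each word. */
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}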
The ARCOMEM Offline Analysis Engine has been designed to take full advantage of the MapReduce framework and operates over the data that the crawler stores in the HBase database distributed across the cluster. The remainder of this document describes the components that make up the Offline Analysis Engine and explains how new offline analysis modules can be implemented and run.
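To illustrate what operating over the crawled data in HBase looks like in practice, the sketch below configures a map-only job with the standard TableMapReduceUtil helper from the HBase MapReduce library. The table name "crawl" and the column family and qualifier used here are assumptions for the example, not the actual ARCOMEM schema.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

/** Sketch of a MapReduce job that scans crawl records held in an HBase table. */
public class CrawlScanJob {

    /** Mapper invoked once per HBase row; the row key and column layout are illustrative. */
    public static class RecordMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                throws IOException, InterruptedException {
            byte[] payload = row.getValue(Bytes.toBytes("content"), Bytes.toBytes("raw"));
            if (payload != null) {
                // A real module would analyse the payload here; this just records its size.
                context.write(new Text(Bytes.toString(rowKey.get())),
                        new IntWritable(payload.length));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "crawl scan");
        job.setJarByClass(CrawlScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner batches suit MapReduce workloads
        scan.setCacheBlocks(false);  // avoid polluting the region server block cache

        TableMapReduceUtil.initTableMapperJob(
                "crawl",             // assumed table name, not the real ARCOMEM schema
                scan,
                RecordMapper.class,
                Text.class,
                IntWritable.class,
                job);

        job.setNumReduceTasks(0);    // map-only job for this sketch
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}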
The Offline Analysis Engine consists of two main types of component: analysis
modules and command-line tools for running the modules. There are four types of supported module: standard modules, local modules, HDFS modules and HBase modules.
Most modules that need access to the crawled data will be implemented as standard modules. Modules that only need to communicate with the knowledge base will be implemented as local modules. The HDFS and HBase module types are included so that any unforeseen processing scenarios can be handled.
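Purely to illustrate the division of responsibilities between the module types, the standard and local cases can be pictured along the lines of the hypothetical interfaces below. These names are invented for the example and are not the actual ARCOMEM module API.

import org.apache.hadoop.hbase.client.Result;

/**
 * Hypothetical shape of a "standard" module: it is handed crawled records one
 * at a time from inside a MapReduce task reading HBase. Illustration only.
 */
public interface StandardAnalysisModule {
    /** Called once per crawled record read from HBase. */
    void processRecord(byte[] rowKey, Result record) throws Exception;
}

/**
 * Hypothetical shape of a "local" module: it runs on a single machine and
 * talks to the knowledge base rather than to the crawled data. Illustration only.
 */
interface LocalAnalysisModule {
    /** knowledgeBaseUrl is an illustrative parameter, not a real configuration key. */
    void run(String knowledgeBaseUrl) throws Exception;
}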
There are two command-line tools included in the engine:
These tools can be used in an ad hoc fashion for testing modules, but they also form the basis for running the offline process within the ARCOMEM system as a whole.
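As a general illustration of how Hadoop command-line drivers of this kind are commonly structured, the skeleton below uses the standard Tool and ToolRunner classes so that generic Hadoop options (such as -D key=value) are handled automatically. The class name and argument handling are invented for this example and do not reflect the actual ARCOMEM tools.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Illustrative skeleton of a Hadoop command-line driver. The Tool/ToolRunner
 * pattern provides standard Hadoop option parsing before run() is invoked.
 */
public class RunModuleTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: RunModuleTool <module-class> [module args...]");
            return 1;
        }
        // A real tool would load the named module and submit the corresponding
        // MapReduce job here, using the Configuration obtained from getConf().
        System.out.println("Would run module: " + args[0]);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new RunModuleTool(), args));
    }
}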