
ARCOMEM Architecture

The ARCOMEM framework has two main parts: the crawling sub-system and the
analysis sub-system. The crawling sub-system is active during a crawl
campaign, as pages are stored into the main database, and is used to guide
the crawling according to a predefined crawl strategy. This phase is called
the online phase and its design is described in the subsection Online System
Design. The analysis phase runs after the crawling has taken place, separately,
as a clustered process; we call this the offline phase. The offline phase
design is described in the subsection Offline System Design.

Online phase modules help guide the crawler in choosing which web resources to
crawl and must run in near real time; that is, they should take no more than a
couple of seconds per document. The output of an online module is a score for
each web resource passed to it.
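
Conceptually, then, an online module maps one web resource to one score. The
following minimal sketch illustrates that contract; the names here are
illustrative only, and the framework's actual interfaces appear in Table 1
below.

    // Conceptual contract of an online module: one resource in, one numeric score out.
    // Illustrative only; the framework's real interfaces are listed in Table 1.
    public interface OnlineResourceScorer {

        /**
         * Scores a single web resource against the crawl specification.
         * Must return within a few seconds to keep up with the crawler.
         */
        double score(String url, String content);
    }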

The offline phase modules extract information from the resources that were
crawled and, depending on the number of resources and the time and machines
available, they can take longer. The information these modules generate can be
used by the online phase modules to guide the crawl. The output of an offline
module can be anything and can be stored in the triple store, back into HBase,
or on disk.

The simplest way to make the analysis scalable is to process resources
independently. The online analysis works exclusively on this basis. For the
offline analysis, we recommend the map-reduce style of processing, following
the same principle for the same reason. However, if needed, an offline process
can be run in a centralised fashion.
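
As an illustration of the recommended map-reduce style, here is a minimal
sketch of an offline analysis mapper, assuming (purely for this example) that
resources arrive as tab-separated (URL, text) records; the class name and the
analysis performed are hypothetical and not part of the framework.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical offline analysis mapper: each crawled resource is processed
    // independently, which is what keeps the offline analysis scalable.
    public class OfflineAnalysisMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed input format for this sketch: "<url>\t<document text>".
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) {
                return; // skip malformed records
            }
            String url = parts[0];
            String text = parts[1];

            // Placeholder analysis: emit the document length keyed by URL.
            // A real offline module would emit entities, triples, and so on.
            context.write(new Text(url), new Text(Integer.toString(text.length())));
        }
    }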

As a consequence, it is important to realise that all the online modules and
most offline modules run distributed over a Hadoop cluster, so you cannot
assume that modules will be able to talk to each other, even via the local
disk.

The Arcomem framework contains the code to drive online and offline analysis
within the Arcomem system, in addition to providing the functionality for
things such as accessing the knowledge base (triple store). The framework is
separated into four submodules as follows:

  • arcomem-framework-core: core framework bits 'n' bobs, including interfaces
    for the AAH and the triple store.
  • online-analysis: the online analysis; depends on arcomem-framework-core and
    offline-analysis-framework (for running the online process as a Map-Reduce
    job rather than as an HBase trigger).
  • offline-analysis-framework: command-line tools, configuration support and
    interfaces for offline analysis; depends on arcomem-framework-core.
  • offline-analysis-modules: the actual offline module implementations;
    depends on offline-analysis-framework.

Here's the dependency tree for the framework modules:

Arcomem Framework Dependency Tree
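
Reading the dependencies off the submodule list above, the tree can be
summarised roughly as follows:

    offline-analysis-modules
      └── offline-analysis-framework
            └── arcomem-framework-core
    online-analysis
      ├── arcomem-framework-core
      └── offline-analysis-framework
            └── arcomem-framework-core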

Online System Design

During a crawl campaign, a crawler grabs resources from the web and imports
them into an HBase database. The ARCOMEM system then takes each of these
resources in turn (either through a triggered HBase RegionObserver or during
a short-term map-reduce analysis of the data) and performs a quick analysis
to determine whether outlinks from the resource might be interesting to crawl.
The analysis is based on template extraction, link black lists and entity
extraction. Scores are calculated based on the document's similarity to the
crawl specification. Link scores are used to update a sorted queue from the
top of which the crawler takes URLs. The document itself is also given a
score, which is placed back into HBase.
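
The flow above can be sketched as follows. This is a toy outline only: the
types, the keyword-overlap scoring, and the use of plain maps in place of the
priority queue and HBase are all assumptions made for illustration.

    import java.util.List;
    import java.util.Map;

    // Toy outline of the per-resource online flow described above.
    // All names and the scoring logic are illustrative, not the framework's own.
    public final class OnlineFlowSketch {

        /** A crawled resource: its URL, its text and its outlinks. */
        public record Resource(String url, String text, List<String> outlinks) {}

        /**
         * Scores the document against the crawl specification, stores the document
         * score, and pushes scores for the outlinks to the crawler's queue.
         */
        public void process(Resource resource,
                            List<String> specKeywords,
                            Map<String, Double> priorityQueue,   // stands in for the crawler's sorted queue
                            Map<String, Double> documentScores)  // stands in for the HBase score column
        {
            // Quick analysis + document score: here simply keyword overlap with the spec.
            double docScore = 0.0;
            for (String keyword : specKeywords) {
                if (resource.text().contains(keyword)) {
                    docScore += 1.0;
                }
            }
            documentScores.put(resource.url(), docScore);

            // Outlinks inherit the parent document's score in this toy example; the real
            // system uses template extraction, link black lists and entity extraction.
            for (String outlink : resource.outlinks()) {
                priorityQueue.merge(outlink, docScore, Math::max);
            }
        }
    }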

The online processing conceptual design

The figure above shows the modular design of the online phase.
Each arrow within this diagram represents some form of API between the two
modules involved; however, some of these APIs are implemented using existing,
off-the-shelf solutions. Table 1 describes each of these module interactions
and the corresponding API implementation. Within the online phase module, each
of the individual analysis modules is controlled and coordinated by the online
phase module itself, so while the diagram above shows the conceptual workflow,
the table below records the interactions within the online phase as occurring
between each analysis module and the online phase module. In the table, the
scope states whether the interface is entirely within the framework (for
modularisation of the code) or acts as an input or output to the framework
(for framework control or configuration).

Table 1: Online phase module interactions (Source → Destination, data payload, scope).

  • HTML Crawler → HBase (Documents/Web Resources; scope: Input). This API uses
    an off-the-shelf solution and is not part of the framework codebase. The
    resources are first written to a WARC file, which is then imported into
    HBase using a tool from IMF.
  • HBase → Online Phase Module (Document/Web Resource; scope: Internal). This
    API is based on an existing interface definition. As the online phase can
    run in two separate ways, the API used depends on the execution method. The
    online phase may be triggered using a RegionObserver interface, a standard
    part of HBase 0.92.0 and up. Alternatively, the online phase is run
    regularly as a map-reduce job, in which case the documents are received by
    the online phase using the standard Hadoop implementations. This API allows
    any resource generator to utilise the ARCOMEM guidance.
  • Application Aware Helper → Online Phase Module (Document; scope: Internal).
    This API is implemented in Java by the interface ApplicationAwareHelper. It
    accepts a document and returns an augmented document through an
    AugmentedDocument object. This API allows any Application Aware Helper
    implementation to be used within the framework.
  • Analysis Modules → Online Phase Module (Augmented Document; scope:
    Internal). This API is defined by the AnalysisModule interface. An analysis
    module accepts an AugmentedDocument and returns a list of AnalysisResults.
    The AnalysisModule API allows any analysis module to be used within the
    framework during a crawl.
  • Prioritization Module → Online Phase Module (Analysis Results; scope:
    Internal). This module aggregates the AnalysisResults into scores using the
    PriorityAggregationStrategy API. Implementations for aggregation strategies
    are provided in the codebase, but new implementations can be used if this
    API is adhered to.
  • Prioritization Module → Priority Queue (Link scores; scope: Internal). This
    API is implemented in Java by the interface UrlScoresUpdater. It takes a
    set of link scores and allows any implementation to be used for sending
    scores from the online analysis module to the crawler's queue.
  • Prioritization Module → HBase (Document scores; scope: Internal). This API
    is implemented in Java by the interface DocumentScoreStorer. It takes a set
    of document scores and allows any implementation to be used for sending
    scores from the online analysis module to wherever they are stored.
  • Priority Queue → HTML Crawler (Scored links; scope: Output). To integrate a
    crawler within the framework, it must have some implementation of a
    prioritised URL store and implement the queue update interface.
  • API Crawler → H2RDF (Triples; scope: External). This API is provided by the
    H2RDF database and allows triples to be dumped sequentially or in bulk into
    the knowledge base. This API is used by multiple modules in the offline
    phase but only by the API crawler in the online phase. See the offline
    system design section for more information.
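
As a concrete example of the AnalysisModule contract in the table, a module
might look roughly like the sketch below. The AnalysisModule, AugmentedDocument
and AnalysisResult types are the ones named above, but the method name,
accessors and constructors used here are assumptions made for illustration.

    import java.util.Collections;
    import java.util.List;

    // Hedged sketch of an analysis module against the AnalysisModule contract.
    // The interface and the AugmentedDocument/AnalysisResult types are named in
    // Table 1, but every method signature used here is an assumption.
    public class LanguageFilterModule implements AnalysisModule {

        @Override
        public List<AnalysisResult> analyse(AugmentedDocument document) {   // assumed method name
            // Hypothetical rule: only contribute a score when the detected
            // language matches the crawl specification.
            if (!"en".equals(document.getLanguage())) {                      // assumed accessor
                return Collections.emptyList();
            }
            AnalysisResult result = new AnalysisResult();                    // assumed constructor
            result.setScore(1.0);                                            // assumed setter
            return Collections.singletonList(result);
        }
    }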

For information about getting and setting crawl specifications, see
Crawl Specifications.

Offline System Design

The API for the offline system design is based around the Map-Reduce framework
with some other specifics related to the module's output.

For an overview of the offline system design, jump to the
Offline Analysis documentation and for specifics
about the module design, jump to the section on
Implementing an Offline Module.

For specifics about how offline modules output information, jump to the
section on Offline Module Outputs. For information
on using the data model for input and output, see the section on the
Data Model.


Related

Wiki: CrawlSpec
Wiki: DataModel
Wiki: Home
Wiki: OfflineAnalysis
Wiki: OfflineModuleImpl
Wiki: OfflineOutputs
