
Online analysis

Overview

The online phase of the Arcomem framework is active during a crawl. The
crawler inserts resources it has fetched from the web into the document
store, which is, in our case, HBase. The resources are analysed and scored.
Any links in the resources also receive a score, which is sent back to the
URL queue, thereby guiding the crawler on what to crawl next. Pages can also
be blacklisted to prevent the crawler from fetching them at all. An API crawler,
which crawls the data feeds of certain websites, also adds resources
into the document store, thereby adding links to the main crawler's queue.
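
As a minimal sketch of this score-guided feedback loop, the snippet below models a URL priority queue that the analysis writes scores into and the crawler polls from. The names (ScoredUrl, UrlQueue) are illustrative, not part of the Arcomem codebase.

    import java.util.Comparator;
    import java.util.concurrent.PriorityBlockingQueue;

    // Illustrative only: a minimal model of the score-guided URL queue.
    record ScoredUrl(String url, double score) {}

    class UrlQueue {
        // Highest-scoring URLs are crawled first.
        private final PriorityBlockingQueue<ScoredUrl> queue =
                new PriorityBlockingQueue<>(64,
                        Comparator.comparingDouble(ScoredUrl::score).reversed());

        /** Called by the online analysis when a link receives a score. */
        public void offer(String url, double score) {
            queue.offer(new ScoredUrl(url, score));
        }

        /** Called by the crawler to fetch the next most promising URL. */
        public ScoredUrl next() throws InterruptedException {
            return queue.take();
        }
    }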

The analysis of the documents can be triggered in two ways. The first uses an
HBase RegionObserver trigger, which fires when a new item is put into
the store. The second runs a map-reduce job over the HBase. Both approaches
produce the same scores for documents; the RegionObserver is more timely,
while the map-reduce job is likely to be more robust.
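
As a minimal sketch of the trigger-based route, assuming an HBase 1.x-style coprocessor API, a RegionObserver could kick off the analysis like this. The analyse() hook is hypothetical; it stands in for the real Arcomem analysis entry point.

    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

    /**
     * Illustrative only: fires the online analysis whenever a crawled
     * object is put into the store.
     */
    public class AnalysisTrigger extends BaseRegionObserver {

        @Override
        public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                            Put put, WALEdit edit, Durability durability) {
            byte[] row = put.getRow();
            analyse(row); // hypothetical hook into the analysis workflow
        }

        private void analyse(byte[] row) {
            // Look up the full crawled object for this row and run the
            // AAH -> link scorer -> NLP -> document scorer pipeline on it.
        }
    }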

The main crawler (which can be the crawler simulator, Internet Memory's
crawler, or Athena's enhanced Heritrix) crawls URLs it finds in the queue. So, to
start a crawl, some URLs must be manually added to this queue; these are known as
seed URLs.

The API crawler inserts the resources it fetches into the document store, but it also
outputs structured information into a triple store. As the API crawler crawls sites
such as Flickr, Twitter and YouTube, this information includes structured items such as
the author of a particular comment, blog post, or tweet.
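
As a minimal sketch of what such structured output could look like, the snippet below writes an authorship triple using Apache Jena. The choice of Jena, the URIs, and the Dublin Core property are assumptions for illustration; this page does not specify which triple store or RDF vocabulary Arcomem uses.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.DCTerms;

    public class TripleExample {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();

            // Hypothetical URIs: a tweet and its author.
            Resource tweet = model.createResource(
                    "http://twitter.com/example/status/123456789");
            Resource author = model.createResource(
                    "http://twitter.com/example");

            // "This tweet was created by this author."
            tweet.addProperty(DCTerms.creator, author);

            // Serialise as Turtle; a real system would push this
            // into the project's triple store instead.
            model.write(System.out, "TURTLE");
        }
    }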

The diagram below shows a conceptual view of the online phase architecture.

[Figure: Arcomem Online Framework Architecture]

The Crawled Object

Each crawler inserts a row into the HBase for every object it fetches from the web, so
each row in the HBase represents a single crawled object (a web object). There are
three column families associated with each object:

  • URI: The unique identifier of the web object
  • Content: The actual content that was crawled
  • Metadata: Other information about the content, such as the HTTP response.

This structure is wrapped up in the code by a CrawledObject class, which gives generic
access to the underlying data and allows other implementations to be provided if and when
the database structure changes.
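
As a minimal sketch, assuming the three column families above and an HBase 1.x-style client API, such a wrapper could look like the following. The field and method names are illustrative; the real CrawledObject class may differ.

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    /** Illustrative wrapper over one HBase row (one crawled object). */
    public class CrawledObject {
        // Column family names taken from the structure described above.
        private static final byte[] CF_URI      = Bytes.toBytes("URI");
        private static final byte[] CF_CONTENT  = Bytes.toBytes("Content");
        private static final byte[] CF_METADATA = Bytes.toBytes("Metadata");

        private final Result row;

        public CrawledObject(Result row) {
            this.row = row;
        }

        /** The unique identifier of the web object. */
        public String getUri(byte[] qualifier) {
            return Bytes.toString(row.getValue(CF_URI, qualifier));
        }

        /** The raw crawled content. */
        public byte[] getContent(byte[] qualifier) {
            return row.getValue(CF_CONTENT, qualifier);
        }

        /** Metadata such as the HTTP response. */
        public byte[] getMetadata(byte[] qualifier) {
            return row.getValue(CF_METADATA, qualifier);
        }
    }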

Components of the Online Analysis

The online analysis is triggered either by the HBase when an item is inserted, or by
running it as a map-reduce job over the HBase. In either case, the procedure receives a
row of the HBase table as input, representing a single crawled object.

The procedure then follows a specific workflow:

  • Application Aware Helper (AAH): The AAH analyses a page to determine whether it is built by
    a known template-based website provider (such as Wordpress for blogs, PHPBB for forums, etc.).
    If the provider is recognised, the AAH automatically blacklists links that are not worth following,
    or prioritises links that should be followed. The AAH may also augment other parts of the page
    with semantic information. If it does not recognise the document type, the page passes
    through the AAH unaffected. Links from the AAH pass to the link scorer, while augmented
    documents pass to the NLP analysis. A sketch of such template detection follows this list.
  • Link Scorer: The link scorer makes a guess (using regular expressions)
    at the destination of the link from the link URL itself, and scores the link accordingly.
    Currently, the default implementation compares the link URL against a blacklist of
    advertisement providers; a sketch is given after this list.
  • NLP Analysis: The NLP analysis is provided by GATE and performs tokenisation, lemmatisation of open-class
    words (nouns, verbs, adjectives and adverbs), case-insensitive keyword matching, and named-entity recognition,
    all for English and German.
  • Document Scorer: The document scorer module takes the information from the NLP modules
    and calculates a single score for the whole document, based on the relevance of
    the extracted information to the crawl specification (using vector similarity; a
    cosine-similarity sketch follows this list).
  • Relevance Scorer: This determines a score for each of the links in the document and
    sends the new scores to the priority queue for the crawler to retrieve.
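
As a minimal sketch of the AAH's template detection, assuming the check keys off simple HTML markers, the snippet below looks for a WordPress generator meta tag. This is an illustration only; the real AAH is considerably more sophisticated.

    import java.util.regex.Pattern;

    /** Illustrative only: naive detection of a template-based provider. */
    public class TemplateDetector {
        // WordPress pages commonly carry a generator meta tag.
        private static final Pattern WORDPRESS = Pattern.compile(
                "<meta\\s+name=[\"']generator[\"']\\s+content=[\"']WordPress",
                Pattern.CASE_INSENSITIVE);

        /** Returns the detected provider, or null if the page is unknown. */
        public static String detect(String html) {
            if (WORDPRESS.matcher(html).find()) {
                return "wordpress";
            }
            return null; // unknown: the page passes through unaffected
        }
    }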
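
The default link scorer can be imagined along the lines below: a minimal sketch, assuming a hard score of zero for URLs matching an advertisement blacklist and a neutral score otherwise. The blacklist entries and the scoring scale are invented for illustration.

    import java.util.List;
    import java.util.regex.Pattern;

    /** Illustrative only: regex blacklist scoring of link URLs. */
    public class LinkScorer {
        // Hypothetical advertisement-provider patterns.
        private static final List<Pattern> AD_BLACKLIST = List.of(
                Pattern.compile("doubleclick\\.net"),
                Pattern.compile("/ads?/"));

        /** Returns 0.0 for blacklisted links, a neutral 0.5 otherwise. */
        public static double score(String url) {
            for (Pattern p : AD_BLACKLIST) {
                if (p.matcher(url).find()) {
                    return 0.0;
                }
            }
            return 0.5;
        }
    }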
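
The document scorer's vector similarity can be illustrated with plain cosine similarity between a term-frequency vector of the extracted information and one built from the crawl specification. The term-frequency weighting here is an assumption; the page only says that vector similarity is used.

    import java.util.HashMap;
    import java.util.Map;

    /** Illustrative only: cosine similarity between two term vectors. */
    public class DocumentScorer {

        /** Builds a simple term-frequency vector from whitespace tokens. */
        public static Map<String, Double> termVector(String text) {
            Map<String, Double> v = new HashMap<>();
            for (String token : text.toLowerCase().split("\\s+")) {
                v.merge(token, 1.0, Double::sum);
            }
            return v;
        }

        /** Cosine similarity in [0, 1] for non-negative vectors. */
        public static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
                normA += e.getValue() * e.getValue();
            }
            for (double x : b.values()) {
                normB += x * x;
            }
            if (normA == 0.0 || normB == 0.0) {
                return 0.0;
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }

Under these assumptions, cosine(termVector(extractedText), termVector(crawlSpec)) would yield the single score for the whole document.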

Scores are delivered from the document scorer to the priority queue by a
UrlScoreUpdater. There is an implementation
(the RestUrlScoreUpdater) which sends
scores to the crawler's REST interface as JSON; this interface has been defined in Arcomem
to connect the IMF crawler, the Heritrix crawler and the Arcomem online analysis.
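
As a minimal sketch of such a REST score update, assuming a JSON payload of URL and score and an endpoint path invented for illustration (the actual Arcomem JSON schema and endpoint are not given on this page):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    /** Illustrative only: pushing one score to a crawler REST interface. */
    public class RestScoreUpdateExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint and payload shape.
            String json = "{\"url\": \"http://example.org/page\", \"score\": 0.87}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://crawler.example.org/queue/scores"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(json))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Crawler replied: " + response.statusCode());
        }
    }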


Related

Wiki: Simulator
