The online phase of the Arcomem framework is active during a crawl. The
crawler inserts resources it has fetched from the web into the document
store, which in our case is HBase. The resources are analysed and scored.
Any links in the resources also receive scores, which are sent back to the
URL queue, thereby guiding the crawler on what to crawl next. Pages can also
be blacklisted to prevent the crawler from fetching them at all. An API crawler,
which crawls various data-feeds from certain websites, also adds resources
to the document store, thereby adding links to the main crawler's queue.
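As a rough sketch of this first step, the snippet below writes a fetched resource into HBase using the standard client API. The table name ("crawl"), column family ("content") and qualifier ("raw") are illustrative assumptions, not the actual Arcomem schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreResource {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Table and column-family names are assumptions for illustration
        HTable table = new HTable(conf, "crawl");

        // One row per crawled object, keyed here by its URL
        Put put = new Put(Bytes.toBytes("http://example.com/page.html"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"),
                Bytes.toBytes("<html>...</html>"));
        table.put(put);
        table.close();
    }
}
```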
The analysis of the documents can be triggered in two ways. The first is
using an HBase RegionObserver trigger, which fires when a new item is put into
the store. The second is by running a map-reduce job over the HBase table. Either way
results in the same scores for documents. Although the RegionObserver is more timely,
the map-reduce job is likely to be more robust.
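A minimal sketch of the trigger route is given below, using the HBase 0.9x-era coprocessor API; analyseRow() is a hypothetical stand-in for the Arcomem analysis entry point, not the actual implementation.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

public class AnalysisTrigger extends BaseRegionObserver {
    // Fired by the region server after each Put completes
    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e,
                        Put put, WALEdit edit, boolean writeToWAL)
            throws IOException {
        analyseRow(put.getRow()); // score the newly inserted object
    }

    private void analyseRow(byte[] rowKey) {
        // ... hypothetical hook into the document analysis and scoring ...
    }
}
```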
The main crawler (which can be the crawler simulator, Internet Memory's
crawler or Athena's enhanced Heritrix) crawls URLs it finds in the queue. To
start a crawl, some URLs must therefore be added to this queue manually; these
are known as seed URLs.
The API crawler inserts the resources it fetches into the document store, but it also
outputs structured information into a triple store. As the API crawler crawls sites
like Flickr, Twitter and YouTube, this information includes structured items such as
the author of a particular comment, blog-post, or tweet.
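To illustrate, the sketch below records the authorship of a tweet as an RDF triple using Apache Jena; the resource URIs and the predicate are placeholders rather than the actual Arcomem vocabulary.

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class TripleExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // "tweet 12345 has author someuser" -- URIs are illustrative only
        Resource tweet = model.createResource("http://twitter.com/status/12345");
        Property author = model.createProperty("http://example.org/schema#", "author");
        Resource user = model.createResource("http://twitter.com/someuser");
        tweet.addProperty(author, user);

        model.write(System.out, "N-TRIPLE");
    }
}
```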
The diagram below shows a conceptual view of the online phase architecture.
Each of the crawlers inserts a row into the HBase table when it fetches an object
from the web, so each row in the table represents a single crawled object (a web
object). There are three column families associated with each object:
This structure is wrapped up in the code by a CrawledObject class, which gives
generic access to the underlying data and allows other implementations to be
provided if and when the database structure changes.
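The actual CrawledObject API is not reproduced here, but the kind of generic accessor the text describes might look like the following hypothetical sketch:

```java
// Hypothetical interface; the real CrawledObject class may differ
public interface CrawledObjectAccessor {
    /** The URL the object was fetched from (the row key). */
    String getUrl();

    /** The raw bytes of the fetched resource. */
    byte[] getContent();

    /** A named piece of analysis metadata, such as a score. */
    String getMetadata(String name);
}
```

Coding against such an interface rather than against HBase directly is what allows the storage layout to change without affecting the analysis code.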
The online analysis is either triggered by HBase when an item is inserted, or run
as a map-reduce job over the HBase table. In either case, the procedure receives a
row of the table as input, representing a single crawled object.
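A minimal sketch of the map-reduce route is given below, again assuming a table named "crawl"; AnalysisJob and analyseRow() are hypothetical names, not the actual Arcomem classes.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class AnalysisJob {
    static class AnalysisMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row,
                           Context context)
                throws IOException, InterruptedException {
            // Each call receives one row, i.e. one crawled object
            analyseRow(row);
        }

        private void analyseRow(Result row) {
            // ... hypothetical hook into the document analysis and scoring ...
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "arcomem-online-analysis");
        job.setJarByClass(AnalysisJob.class);
        TableMapReduceUtil.initTableMapperJob("crawl", new Scan(),
                AnalysisMapper.class, NullWritable.class, NullWritable.class, job);
        job.setNumReduceTasks(0); // map-only: the analysis has no reduce step
        job.setOutputFormatClass(NullOutputFormat.class);
        job.waitForCompletion(true);
    }
}
```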
The procedure then follows this workflow:
Scores are delivered from the document scorer to the priority queue by a
UrlScoreUpdater. One implementation, the RestUrlScoreUpdater, sends scores
as JSON to the crawler's REST interface, which has been defined in Arcomem to connect
the IMF crawler, the Heritrix crawler and the Arcomem online analysis.
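As a sketch of what such an updater might do, the snippet below POSTs a score as JSON over HTTP; the endpoint and the JSON field names are assumptions for illustration, not the REST interface actually defined in Arcomem.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestScoreSketch {
    public static void sendScore(String endpoint, String url, double score)
            throws Exception {
        // JSON field names are assumptions, not the actual Arcomem payload
        String json = String.format("{\"url\": \"%s\", \"score\": %f}", url, score);

        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        OutputStream out = conn.getOutputStream();
        out.write(json.getBytes("UTF-8"));
        out.close();

        if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
            throw new RuntimeException("Score update failed: "
                    + conn.getResponseCode());
        }
        conn.disconnect();
    }
}
```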