The whole system, based on the Heritrix crawler, is released to the public as open source. Since many components and composite tools are also of interest for other areas and usage scenarios, the ARCOMEM consortium defined a number of pre-packaged tools that can be used independently of each other. By combining all packages, the full ARCOMEM system can be built. The following major packages will be released in the coming weeks as pre-compiled packages with source code.
The ARCOMEM system was tested under Debian Squeeze, with Cloudera's Hadoop and HBase CDH 4.5.0, Oracle's Java 1.7 JDK and Python 2.6.
This package includes an adapted version of the Heritrix crawler that allows prioritisation of the crawler queue as well as a basic online analysis of crawled content. The crawler takes an intelligent crawl specification (ICS), created by hand or, optionally, with the Crawler Cockpit (see Applications below), and uses the results of the online analysis for crawler guidance. The guidance is based on keywords and entities as specified in the ICS. Links identified on a page are ranked according to how well the page covers the crawl specification. The results of the crawl are focused Web archives.
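To illustrate the guidance step, the sketch below (hypothetical Python, not the adaptive Heritrix code itself) shows one way links could be prioritised by how well the page they were found on covers the keywords and entities of the crawl specification; all names and weights are illustrative.

# Hypothetical sketch of crawl-specification-based link prioritisation;
# names and structure are illustrative, not the adaptive Heritrix code.

def coverage(text, spec_terms):
    """Fraction of crawl-specification terms (keywords, entity labels)
    that occur in the text of the crawled page."""
    text = text.lower()
    hits = sum(1 for term in spec_terms if term.lower() in text)
    return hits / float(len(spec_terms)) if spec_terms else 0.0

def prioritise_links(pages, spec_terms):
    """Give every link extracted from a page the coverage score of that
    page; links from pages that match the ICS well are crawled first."""
    scored = []
    for page_text, links in pages:
        score = coverage(page_text, spec_terms)
        scored.extend((score, url) for url in links)
    return sorted(scored, reverse=True)

spec_terms = ["olympics", "london", "ceremony"]   # keywords/entities from the ICS
pages = [
    ("Report on the London Olympics opening ceremony", ["http://example.org/a"]),
    ("Unrelated weather news", ["http://example.org/b"]),
]
print(prioritise_links(pages, spec_terms))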
This package also contains the ARCOMEM API crawler for crawling social media sites like Twitter, YouTube or Facebook. Links mentioned in social media documents, tweets, etc. are extracted, analysed and used as an initial seed list for the adaptive Heritrix crawler described above.
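A minimal sketch of the seeding idea follows (illustrative Python; the real API crawler fetches posts through the platform APIs, which are not shown here): URLs mentioned in posts are collected and written out as a Heritrix seed list.

import re

# Illustrative only: the real API crawler retrieves posts via the
# Twitter/YouTube/Facebook APIs; here the posts are simply given.
posts = [
    "Great coverage of the games: http://example.org/olympics-2012",
    "See http://example.org/opening-ceremony and share!",
]

URL_RE = re.compile(r"https?://\S+")

seeds = sorted({url.rstrip(".,!") for post in posts for url in URL_RE.findall(post)})

# One seed per line, the format Heritrix expects in its seeds file.
with open("seeds.txt", "w") as out:
    out.write("\n".join(seeds) + "\n")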
The intelligent crawl specification can be created by hand (requires the rdfstore sources) or using the Crawler Cockpit (requires the cockpit sources).
Quick start: get the JDK and hadoop_hbase_setup.sh, put them under root's home and follow the set-up instructions of the first component above. Put zookeeper-3.4.3_light.tar.gz, ApiCalls.tgz and H2RDF.jar in hbase's home and follow the set-up instructions of the second component. You will also need the latest rdfstore or cockpit sources. Install the adaptive Heritrix and set up the cron job that moves the WARCs. Lastly, put arcomem-framework.online.tgz in hbase's home and follow the installation instructions (section "Analysis side"; no configuration modification is needed if you are running everything locally).
The Offline Analysis package performs various kinds of analysis on Web archive
content. The results of the analysis are collected in the H2RDF store and can
either be directly accessed by applications or exported as serialised RDF for
archiving or further usage. The following analysis modules are provided:
The packages described above have been developed with medium and large scale crawls and content collections in mind. The ARCOMEM system therefore follows the Map Reduce paradigm in order to easily distribute the processing among hosts within a computing cluster. Apache Hadoop is used as the Map Reduce framework, complemented with HBase for content handling. H2RDF is an RDF triple store developed on top of HBase that allows the storage and handling of large amounts of extracted meta information.
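To make the Map Reduce processing model concrete, here is a hedged sketch of a Hadoop Streaming mapper/reducer pair in Python that counts extracted entity labels. The actual offline analysis modules are not written this way; the job name, input/output paths and the streaming jar location are all illustrative assumptions.

#!/usr/bin/env python
# Illustrative Hadoop Streaming job (not an ARCOMEM module): counts
# entity labels, one label per input line, in a distributed fashion.
# Example invocation (paths are assumptions):
#   hadoop jar hadoop-streaming.jar \
#     -input /exchange/entities -output /exchange/entity_counts \
#     -mapper "count_entities.py map" -reducer "count_entities.py reduce" \
#     -file count_entities.py
import sys

def mapper():
    for line in sys.stdin:
        label = line.strip()
        if label:
            print("%s\t1" % label)

def reducer():
    current, count = None, 0
    for line in sys.stdin:
        label, n = line.rstrip("\n").split("\t")
        if label != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = label, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()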
Quick start: get the JDK and hadoop_hbase_setup.sh, put them under root's home and follow the set-up instructions of the first component below. Put zookeeper-3.4.3_light.tar.gz, H2RDF.jar and ApiCalls.tgz in hbase's home and follow the set-up instructions of the second component. Lastly, put sample_af_export.warc.gz and arcomem-framework.offline.tgz in hbase's home and run the commands below.
Components:
Offline quick start steps, as user hbase:
tar xzf arcomem-framework.offline.tgz
cd arcomem-framework
# load some WARC content into HBase
PATH=$PATH:$PWD/ingestion_scripts/hbase_side
hadoop fs -copyFromLocal ~/sample_af_export.warc.gz /exchange/import/test/
cd ingestion_scripts/hbase_side
./load_warcs test test test bulk_load
# build an off-line configuration and run it
cd ../../configurations
./build_conf.sh test localhost test_kb gate > test_gate.xml
cd ..
tools/run_combined_off-line configurations/test_gate.xml
You can follow the map-reduce job at http://localhost:50030/.
Within ARCOMEM two major applications have been developed for crawler handling
and Web archive access:
The Crawler Cockpit offers integrated features managed through a web interface. It manages the main part of the Web archiving process: creating and launching campaigns, and viewing statistics about the crawls. A campaign is described by an intelligent crawl definition, which associates a content target with crawl parameters (schedule and technical parameters). The content definition consists of distinct named entities (e.g. persons, places and organisations), a time period, free keywords, social media categories, etc. At the end of the crawls, users get access to an overview of the collected data through different widgets.
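For illustration, a campaign's content definition could be captured in a structure like the following hypothetical Python dictionary; the field names are made up for this example and do not reflect the Crawler Cockpit's actual data model.

# Hypothetical content definition for a crawl campaign; field names
# are illustrative, not the Crawler Cockpit's actual schema.
campaign = {
    "name": "London Olympics 2012",
    "entities": {
        "person": ["Sebastian Coe"],
        "place": ["London"],
        "organisation": ["IOC"],
    },
    "time_period": {"from": "2012-07-01", "to": "2012-09-30"},
    "keywords": ["olympics", "opening ceremony"],
    "social_media_categories": ["twitter", "youtube"],
    "schedule": {"start": "2012-07-01", "frequency": "daily"},
}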
The Search and Retrieval Application (SARA) is a web application that provides an intuitive user interface for search and retrieval of archived web documents. It enables users to run full-text searches and semantic queries against a fast, indexed archive. The raw content and the semantic metadata are indexed in Solr. The search functionality comprises free-text search as well as query-based semantic search. The returned results are web resources that match the search string; they can be further refined by various facets such as topics, entities, opinions, etc.
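The sketch below shows the kind of request such a faceted search translates to, using Solr's standard HTTP query API from Python; the core name "arcomem" and the field names (text, topic, entity, opinion) are assumptions for illustration, not SARA's actual schema.

# Illustrative faceted Solr query; core and field names are assumptions.
import requests

params = {
    "q": "text:olympics",                       # free-text part of the query
    "facet": "true",                            # ask Solr for facet counts
    "facet.field": ["topic", "entity", "opinion"],
    "fq": "topic:sports",                       # refine the result set by a facet value
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/arcomem/select", params=params)
data = resp.json()
print(data["response"]["numFound"], "matching web resources")
print(data["facet_counts"]["facet_fields"]["entity"])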
Since the major tools released by ARCOMEM are based on Apache Hadoop and HBase and therefore require the installation and handling of many additional libraries, the ARCOMEM consortium decided to release "light" versions of certain tools. This allows users to easily try some ARCOMEM analysis tools without the burden of complex system handling.
ARCOMEM Lightweight Semantic Analysis (ARCOLight) is an autonomous system for the extraction of semantic information and its enrichment with linked data. It consists of two main components, which manage the extraction of named entities from archived websites and the enrichment of these entities with semantic information. ARCOLight does not depend on any external systems, which allows users to easily install and run it on existing Web archives. ARCOLight takes a collection of WARC files as input and produces the extracted information in XML as well as RDF/Turtle format, so that users can integrate the generated information into their systems with existing tools. ARCOLight incorporates online knowledge bases during the enrichment process and does not depend on the availability of any offline versions. This makes ARCOLight a flexible, lightweight and portable system that is easy to deploy and works without technical prerequisites.
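As an illustration of the input side, the sketch below iterates over a WARC file with the warcio library and pulls out the response payloads that an entity extractor would then process. warcio is just one possible WARC reader and is not part of ARCOLight; the file name is the sample archive mentioned in the quick start above.

# Illustrative WARC reading with the warcio library (not part of ARCOLight);
# each HTTP response record would be fed to the named entity extractor.
from warcio.archiveiterator import ArchiveIterator

with open("sample_af_export.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read()
        print(uri, len(html), "bytes")   # entity extraction would start here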
NEER is an unsupervised method for named entity evolution recognition that is independent of external knowledge sources. It finds time periods with a high likelihood of evolution. By analysing only these time periods with a sliding-window co-occurrence method, it captures evolving terms in the same context. It thus avoids comparing terms from widely different periods in time and overcomes a severe limitation of existing methods for named entity evolution.
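A minimal sketch of the sliding-window co-occurrence idea follows (illustrative Python, not the NEER implementation): within a detected time period, terms that repeatedly occur within a small window around the entity name are collected as candidate co-references. The example documents and the window size are assumptions.

from collections import Counter

def cooccurring_terms(documents, entity, window=5):
    """Illustrative sliding-window co-occurrence: count terms appearing
    within `window` tokens of the entity in one time period's documents."""
    counts = Counter()
    target = entity.lower()
    for doc in documents:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

# Documents from a single time period with a high likelihood of evolution;
# NEER first detects such periods, then analyses them with this window.
docs_2005 = [
    "pope benedict formerly known as cardinal ratzinger was elected",
    "cardinal ratzinger becomes pope benedict xvi",
]
print(cooccurring_terms(docs_2005, "ratzinger").most_common(5))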
Wiki: APICrawler
Wiki: AdaptiveHeritrix
Wiki: Cockpit
Wiki: CrawlSpec
Wiki: HadoopHBase
Wiki: Home
Wiki: KB
Wiki: OfflineAnalysis
Wiki: OnlineQuickStart