The whole system, based on the Heritrix crawler, is released to the public as open source. Since many components and composite tools are also of interest for other areas and usage scenarios, the ARCOMEM consortium defined a number of pre-packaged tools that can be used independently of each other. By combining all packages, the full ARCOMEM system can be built. The following major packages will be released in the coming weeks as pre-compiled packages with source code.
The ARCOMEM system was tested under Debian Squeeze, with Cloudera's Hadoop and HBase CDH 4.5.0, Oracle's Java 1.7 JDK and Python 2.6.
This package includes an adapted version of the Heritrix crawler that allows prioritisation of the crawler queue as well as a basic online analysis of crawled content. The crawler takes an intelligent crawl specification (ICS), created by hand or, optionally, with the Crawler Cockpit (see Applications below), and uses the results of the online analysis for crawler guidance. The guidance is based on keywords and entities as specified in the ICS. Links identified on a page are ranked according to how well the page covers the crawl specification. The results of the crawl are focused Web archives.
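To illustrate the guidance step, the sketch below (hypothetical Python, not the adaptive Heritrix code itself) shows one way links could be prioritised by how well the page they were found on covers the keywords and entities of the crawl specification; all names and weights are illustrative.

# Hypothetical sketch of crawl-specification-based link prioritisation;
# names and structure are illustrative, not the adaptive Heritrix code.

def coverage(text, spec_terms):
    """Fraction of crawl-specification terms (keywords, entity labels)
    that occur in the text of the crawled page."""
    text = text.lower()
    hits = sum(1 for term in spec_terms if term.lower() in text)
    return hits / float(len(spec_terms)) if spec_terms else 0.0

def prioritise_links(pages, spec_terms):
    """Give every link extracted from a page the coverage score of that
    page; links from pages that match the ICS well are crawled first."""
    scored = []
    for page_text, links in pages:
        score = coverage(page_text, spec_terms)
        scored.extend((score, url) for url in links)
    return sorted(scored, reverse=True)

spec_terms = ["olympics", "london", "ceremony"]   # keywords/entities from the ICS
pages = [
    ("Report on the London Olympics opening ceremony", ["http://example.org/a"]),
    ("Unrelated weather news", ["http://example.org/b"]),
]
print(prioritise_links(pages, spec_terms))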
This package also contains the ARCOMEM API crawler for crawling social media sites like Twitter, YouTube or Facebook. Links mentioned in social media documents, tweets, etc. are extracted, analysed and used as an initial seed list for the adaptive Heritrix crawler described above.
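A minimal sketch of the seeding idea follows (illustrative Python; the real API crawler fetches posts through the platform APIs, which are not shown here): URLs mentioned in posts are collected and written out as a Heritrix seed list.

import re

# Illustrative only: the real API crawler retrieves posts via the
# Twitter/YouTube/Facebook APIs; here the posts are simply given.
posts = [
    "Great coverage of the games: http://example.org/olympics-2012",
    "See http://example.org/opening-ceremony and share!",
]

URL_RE = re.compile(r"https?://\S+")

seeds = sorted({url.rstrip(".,!") for post in posts for url in URL_RE.findall(post)})

# One seed per line, the format Heritrix expects in its seeds file.
with open("seeds.txt", "w") as out:
    out.write("\n".join(seeds) + "\n")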
The intelligent crawl specification can be created by hand (requires the rdfstore sources) or using the Crawler Cockpit (requires the cockpit sources).
Quick start: get the JDK and hadoop_hbase_setup.sh, put them under root's home and follow the set-up instructions of the first component above. Put zookeeper-3.4.3_light.tar.gz, ApiCalls.tgz and H2RDF.jar in hbase's home and follow the set-up instructions of the second component. You will also need the latest rdfstore or cockpit sources. Install the adaptive Heritrix and set up the cron job that moves the WARCs. Lastly, put arcomem-framework.online.tgz in hbase's home and follow the installation instructions (section "Analysis side"; no configuration modification is needed if you are running everything locally).
The Offline Analysis package performs various kinds of analysis on Web archive
content. The results of the analysis are collected in the H2RDF store and can
either be directly accessed by applications or exported as serialised RDF for
archiving or further usage. The following analysis modules are provided:
The packages described above have been developed with medium and large scale crawls and content collections in mind. The ARCOMEM system therefore follows the Map Reduce paradigm in order to easily distribute the processing among hosts within a computing cluster. Apache Hadoop is used as the Map Reduce framework, complemented with HBase for content handling. H2RDF is an RDF triple store developed on top of HBase that allows the storage and handling of large amounts of extracted meta information.
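To make the Map Reduce processing model concrete, here is a hedged sketch of a Hadoop Streaming mapper/reducer pair in Python that counts extracted entity labels. The actual offline analysis modules are not written this way; the job name, input/output paths and the streaming jar location are all illustrative assumptions.

#!/usr/bin/env python
# Illustrative Hadoop Streaming job (not an ARCOMEM module): counts
# entity labels, one label per input line, in a distributed fashion.
# Example invocation (paths are assumptions):
#   hadoop jar hadoop-streaming.jar \
#     -input /exchange/entities -output /exchange/entity_counts \
#     -mapper "count_entities.py map" -reducer "count_entities.py reduce" \
#     -file count_entities.py
import sys

def mapper():
    for line in sys.stdin:
        label = line.strip()
        if label:
            print("%s\t1" % label)

def reducer():
    current, count = None, 0
    for line in sys.stdin:
        label, n = line.rstrip("\n").split("\t")
        if label != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = label, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()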
Quick start: get the JDK and hadoop_hbase_setup.sh, put them under root's home and follow the set-up instructions of the first component below. Put zookeeper-3.4.3_light.tar.gz, H2RDF.jar and ApiCalls.tgz in hbase's home and follow the set-up instructions of the second component. Lastly, put sample_af_export.warc.gz and arcomem-framework.offline.tgz in hbase's home and run the commands below.
Components:
Offline quick start steps, as user hbase:
tar xzf arcomem-framework.offline.tgz
cd arcomem-framework
# load some WARC content into HBase
PATH=$PATH:$PWD/ingestion_scripts/hbase_side
hadoop fs -copyFromLocal ~/sample_af_export.warc.gz /exchange/import/test/
cd ingestion_scripts/hbase_side
./load_warcs test test test bulk_load
# build an off-line configuration and run it
cd ../../configurations
./build_conf.sh test localhost test_kb gate > test_gate.xml
cd ..
tools/run_combined_off-line configurations/test_gate.xml
You can follow the map-reduce job at http://localhost:50030/.
Within ARCOMEM two major applications have been developed for crawler handling
and Web archive access:
The Crawler Cockpit offers integrated features managed through a web interface. It manages the main part of the Web archiving process: creating and launching campaigns, and viewing statistics about the crawls. A campaign is described by an intelligent crawl definition, which associates a content target with crawl parameters (schedule and technical parameters). The content definition consists of distinct named entities (e.g. persons, places and organisations), a time period, free keywords, social media categories, etc. At the end of the crawls, users get access to an overview of the collected data through different widgets.
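For illustration, a campaign's content definition could be captured in a structure like the following hypothetical Python dictionary; the field names are made up for this example and do not reflect the Crawler Cockpit's actual data model.

# Hypothetical content definition for a crawl campaign; field names
# are illustrative, not the Crawler Cockpit's actual schema.
campaign = {
    "name": "London Olympics 2012",
    "entities": {
        "person": ["Sebastian Coe"],
        "place": ["London"],
        "organisation": ["IOC"],
    },
    "time_period": {"from": "2012-07-01", "to": "2012-09-30"},
    "keywords": ["olympics", "opening ceremony"],
    "social_media_categories": ["twitter", "youtube"],
    "schedule": {"start": "2012-07-01", "frequency": "daily"},
}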
The Search and Retrieval Application (SARA) is a web application that provides an intuitive user interface for search and retrieval of archived web documents. It enables users to run full-text searches and semantic queries against a fast, indexed archive. The raw content and the semantic metadata are indexed in Solr. The search functionality comprises free-text search as well as query-based semantic search. The returned results are web resources that match the search string; they can be further refined by various facets such as topics, entities, opinions, etc.
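The sketch below shows the kind of request such a faceted search translates to, using Solr's standard HTTP query API from Python; the core name "arcomem" and the field names (text, topic, entity, opinion) are assumptions for illustration, not SARA's actual schema.

# Illustrative faceted Solr query; core and field names are assumptions.
import requests

params = {
    "q": "text:olympics",                       # free-text part of the query
    "facet": "true",                            # ask Solr for facet counts
    "facet.field": ["topic", "entity", "opinion"],
    "fq": "topic:sports",                       # refine the result set by a facet value
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/arcomem/select", params=params)
data = resp.json()
print(data["response"]["numFound"], "matching web resources")
print(data["facet_counts"]["facet_fields"]["entity"])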
Since the major tools released by ARCOMEM are based on Apache Hadoop and HBase and therefore require the installation and handling of many additional libraries, the ARCOMEM consortium decided to release "light" versions of certain tools. This allows users to easily try some ARCOMEM analysis tools without the burden of complex system handling.
ARCOMEM Lightweight Semantic Analysis (ARCOLight) is an autonomous system for the extraction of semantic information and its enrichment with linked data. It consists of two main components, which manage the extraction of named entities from archived websites and the enrichment of these entities with semantic information. ARCOLight does not depend on any external systems, which allows users to easily install and run it on existing Web archives. ARCOLight takes a collection of WARC files as input and produces the extracted information in XML as well as RDF/Turtle format, so that users can integrate the generated information into their systems with existing tools. ARCOLight incorporates online knowledge bases during the enrichment process and does not depend on the availability of any offline versions. This makes ARCOLight a flexible, lightweight and portable system that is easy to deploy and works without technical prerequisites.
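As an illustration of the input side, the sketch below iterates over a WARC file with the warcio library and pulls out the response payloads that an entity extractor would then process. warcio is just one possible WARC reader and is not part of ARCOLight; the file name is the sample archive mentioned in the quick start above.

# Illustrative WARC reading with the warcio library (not part of ARCOLight);
# each HTTP response record would be fed to the named entity extractor.
from warcio.archiveiterator import ArchiveIterator

with open("sample_af_export.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read()
        print(uri, len(html), "bytes")   # entity extraction would start here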
NEER is an unsupervised method for named entity evolution recognition that is independent of external knowledge sources. It finds time periods with a high likelihood of evolution. By analysing only these time periods with a sliding-window co-occurrence method, it captures evolving terms in the same context. It thus avoids comparing terms from widely different periods in time and overcomes a severe limitation of existing methods for named entity evolution.
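A minimal sketch of the sliding-window co-occurrence idea follows (illustrative Python, not the NEER implementation): within a detected time period, terms that repeatedly occur within a small window around the entity name are collected as candidate co-references. The example documents and the window size are assumptions.

from collections import Counter

def cooccurring_terms(documents, entity, window=5):
    """Illustrative sliding-window co-occurrence: count terms appearing
    within `window` tokens of the entity in one time period's documents."""
    counts = Counter()
    target = entity.lower()
    for doc in documents:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

# Documents from a single time period with a high likelihood of evolution;
# NEER first detects such periods, then analyses them with this window.
docs_2005 = [
    "pope benedict formerly known as cardinal ratzinger was elected",
    "cardinal ratzinger becomes pope benedict xvi",
]
print(cooccurring_terms(docs_2005, "ratzinger").most_common(5))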
Wiki: APICrawler
Wiki: AdaptiveHeritrix
Wiki: Cockpit
Wiki: CrawlSpec
Wiki: HadoopHBase
Wiki: Home
Wiki: KB
Wiki: OfflineAnalysis
Wiki: OnlineQuickStart