The Adaptive Heritrix is a modified version of the open source crawler Heritrix
that supports dynamic reordering of queued URLs and accepts URLs from the
Online Analysis via the ARCOMEM URL enqueueing protocol.
It is implemented in Java and is released under GPLv2.
Do all of the following as user arcomem, or adapt the configurations as
needed.
Get a copy of heritrix-3.1.1_adaptive.tgz and run:

    mkdir log
    tar xzf heritrix-3.1.1_adaptive.tgz
    cd heritrix-3.1.1_adaptive
    cp -a jobs/template jobs/test
    vi jobs/test/crawler-beans.cxml
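
The editing step can also be scripted for unattended setups. The following is a minimal sketch, not part of the distribution: it assumes that the job template keeps the stock Heritrix 3 property line starting with metadata.operatorContactUrl= in crawler-beans.cxml, and the function name set_contact_url is hypothetical. The seeds still have to be edited separately.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the operator contact URL in a job's
# crawler-beans.cxml without opening an editor.
# Assumption: the file contains a line "metadata.operatorContactUrl=...",
# as in the stock Heritrix 3 job profile.
# Limitation: the URL must not contain "|" or "&", which sed would interpret.
set_contact_url() {
    # $1: path to crawler-beans.cxml, $2: contact URL
    tmp="$1.tmp.$$"
    sed "s|^\([[:space:]]*metadata\.operatorContactUrl=\).*|\1$2|" "$1" > "$tmp" \
        && mv "$tmp" "$1"
}
```

For example, set_contact_url jobs/test/crawler-beans.cxml http://example.org/contact would replace the placeholder value with that (made-up) URL and leave all other lines unchanged.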
Edit metadata.operatorContactUrl and the seeds. Then start Heritrix and
start the job normally. The adaptive crawler is now ready to receive URLs with
scores using the ARCOMEM URL enqueueing protocol.
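
Starting the job "normally" can be done in the web UI or from the command line. The sketch below illustrates the command-line route. It assumes that the adaptive build keeps the standard Heritrix 3 REST interface on https://localhost:8443/engine, that Heritrix was started with the placeholder credentials admin:admin, and that the job is named test as in the steps above. The function name job_action and the DRY_RUN switch are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Assumed defaults; override them in the environment to match your setup.
HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"                # placeholder credentials
HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443/engine}" # stock Heritrix 3 REST endpoint
JOB="${JOB:-test}"                                            # job directory created above

# Hypothetical helper: POST an action (build, launch, unpause, ...) to the job.
# With DRY_RUN set to a non-empty value, the curl command is printed, not run.
# -k accepts the self-signed certificate that Heritrix generates by default.
job_action() {
    ${DRY_RUN:+echo} curl -s -k -u "$HERITRIX_AUTH" --anyauth --location \
        -d "action=$1" "$HERITRIX_URL/job/$JOB"
}
```

Under these assumptions, Heritrix would first be started with bin/heritrix -a admin:admin from the heritrix-3.1.1_adaptive directory. Calling job_action build, job_action launch and job_action unpause in turn should then bring the job to a running state, since Heritrix 3 jobs typically launch paused.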
To enable the periodic WARC transfer to HDFS, follow the instructions for the
crawler side of the Online Analysis.