AdaptiveHeritrix

John Arcoman

Adaptive Heritrix

Adaptive Heritrix is a modified version of the open source crawler Heritrix.
It can dynamically reorder queued URLs and receive URLs from the Online
Analysis through the ARCOMEM URL enqueueing protocol.

It is implemented in Java and is released under GPLv2.

Set up

Do all of the following as user arcomem, or adapt the configurations as
needed.

Get a copy of heritrix-3.1.1_adaptive.tgz and run:

mkdir log
tar xzf heritrix-3.1.1_adaptive.tgz
cd heritrix-3.1.1_adaptive
cp -a jobs/template jobs/test
vi jobs/test/crawler-beans.cxml

In crawler-beans.cxml, set metadata.operatorContactUrl and the seeds. Then
start Heritrix and start the job as usual. The adaptive crawler is now ready to
receive scored URLs through the ARCOMEM URL enqueueing protocol.
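
If you would rather script the edit than open vi, the following is a minimal sketch. It assumes the adaptive job template keeps the override line that stock Heritrix 3 templates use (a line starting with metadata.operatorContactUrl= inside crawler-beans.cxml); check your copy first. The function name set_contact_url and the example URL are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes crawler-beans.cxml contains a line of the form
#   metadata.operatorContactUrl=<something>
# as stock Heritrix 3 job templates do. Verify this in your template.

# set_contact_url FILE URL
# Rewrites the operatorContactUrl override in FILE, leaving other lines alone.
# Limitation: URL must not contain '|' or '&', which are special to sed here.
set_contact_url() {
    file=$1
    url=$2
    # '|' is the sed delimiter because URLs contain '/'.
    sed "s|^\([[:space:]]*metadata\.operatorContactUrl=\).*|\1$url|" \
        "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example call (hypothetical URL), run from inside heritrix-3.1.1_adaptive:
#   set_contact_url jobs/test/crawler-beans.cxml http://example.org/crawl-contact
```

The seeds still have to be edited by hand, since their layout depends on how the template defines them.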

To enable the periodic WARC transfer to HDFS, follow the instructions for the
crawler side of the online analysis.


Related

Wiki: QueueUpdateItf
Wiki: TryIt
