The crawler simulator (''fake-crawler'') is a means for testing the online
phase of the system on a single machine without having to
install one of the larger crawlers. Its advantage is that the simulated crawler
can also read files from disk, so you can have your very own web to crawl
in a directory on your machine. It can crawl the real web as well, but it has
no politeness rules and does not adhere to robots.txt.
To run the crawler simulator you still need to install HBase to store the
documents. The simulator comes with a URL queue which receives results from
the online analysis.
Install and start HBase.
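If you use a standard standalone HBase download, it can be started from the unpacked HBase directory with:
bin/start-hbase.sh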
Change into the directory into which you downloaded the framework code and build the complete JAR:
cd arcomem-framework
mvn assembly:assembly -DskipTests=true
This will create the file target/ArcomemFramework.jar that contains all
dependencies needed for the online phase.
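As an optional sanity check, you can list the contents of the assembled JAR to confirm that the dependencies were bundled:
jar tf target/ArcomemFramework.jar | head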
Edit the file ''conf/hbase-env.sh'' in the HBase directory and add (or
uncomment and change) the following line:
export HBASE_CLASSPATH=$arcomem_framework/target/ArcomemFramework.jar
where ''$arcomem_framework'' is the directory into which you downloaded the
framework code.
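If you prefer to do this from the command line, the following sketch (run from the HBase directory, with ''$arcomem_framework'' set in your shell) appends the line; the variable is expanded when the line is written, so the file ends up containing the absolute path:
echo "export HBASE_CLASSPATH=$arcomem_framework/target/ArcomemFramework.jar" >> conf/hbase-env.sh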
Furthermore, edit the file ''conf/hbase-site.xml'' and add the following lines
after the opening <configuration> tag, so that the Arcomem online-analysis
trigger is loaded as a region coprocessor when HBase starts:
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>eu.arcomem.framework.hbase.OnlineAnalysisTrigger</value>
  <description>Trigger for the queueing of prioritized outlinks.</description>
</property>
The simplest crawl specification is a properties file, which the default
framework configuration reads from ''/crawlspec.properties'' (root
directory). To specify the keywords used for rating document relevance, use
this format:
seed.keyword.1=George Bush
seed.keyword.2=Bush
seed.keyword.3=Cameron
seed.keyword.4=Tony Blair
seed.keyword.5=Obama
The prefix ''seed.keyword.'' is significant; the number that follows is arbitrary
but must be unique. The keyword is the text after the equals sign.
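As a sketch, and assuming that "root directory" refers to the root of the local filesystem (writing there usually requires root privileges; adapt the path if your configuration reads the file from elsewhere), the crawl specification above could be created like this:
sudo tee /crawlspec.properties > /dev/null <<'EOF'
seed.keyword.1=George Bush
seed.keyword.2=Bush
seed.keyword.3=Cameron
seed.keyword.4=Tony Blair
seed.keyword.5=Obama
EOF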
After creating or changing the crawl specification, restart HBase using:
bin/stop-hbase.sh && bin/start-hbase.sh
Create the HBase table used by the fake crawler by starting the HBase shell
(from the HBase directory) using:
bin/hbase shell
and entering the command
create 'warc_contents', 'content', 'meta'
This creates the table ''warc_contents'' with the column families ''content''
and ''meta''. If successful, the shell will print:
0 row(s) in 1.8830 seconds
(time may differ).
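If you want to verify the table without an interactive session, one way is to pipe the command into the shell (output varies between HBase versions):
echo "describe 'warc_contents'" | bin/hbase shell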
Get the fake-crawler project and change into the directory.
Build the code using
mvn assembly:assembly -DskipTests=true
The FakeCrawler crawls files from a local directory. Prepare such a directory
of HTML files that link to each other and choose one or more starting
pages (a minimal example directory is sketched below). Now you can start the crawler using:
java -cp target/FakeCrawler.jar eu.arcomem.fakecrawler.FakeCrawler --disable-queueing-outlinks --hbase-site $hbase_dir/conf/hbase-site.xml --base-directory $document_dir $seed_urls
where ''$hbase_dir'' is the directory where you unpacked HBase,
''$document_dir'' is the directory containing the files to crawl, and
''$seed_urls'' is one or more pages in this directory (given relative
to that directory). The option ''--disable-queueing-outlinks'' makes the crawler
wait for the information returned by the online analysis.
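If you do not have a suitable set of HTML files yet, a tiny self-contained test web is enough. The following sketch (file names and contents are just an example) creates two pages in ''$document_dir'' that link to each other:
mkdir -p "$document_dir"
cat > "$document_dir/index.html" <<'EOF'
<html><body>
<p>George Bush and Tony Blair met in London.</p>
<a href="obama.html">More on Obama</a>
</body></html>
EOF
cat > "$document_dir/obama.html" <<'EOF'
<html><body>
<p>Obama and Cameron held a press conference.</p>
<a href="index.html">Back to the start page</a>
</body></html>
EOF
With this layout, ''$seed_urls'' is simply ''index.html''.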
When you run this command, you will see lots of information about the process
of the crawler and the crawler queue. To get more information about the
processing done by the online phase, use ''tail -f'' on the filename that HBase
printed when you ran ''bin/start-hbase.sh'' earlier (logs/hbase-*.out).
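For example, from the HBase directory (the exact file name contains your user and host name):
tail -f logs/hbase-*.out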
To demonstrate the effect of the prioritization, you can also add the parameter
''--disable-adaptive-queue'' at the start of the parameter list. This makes the
crawler behave like a standard crawler, i.e. crawl all links in FIFO order,
including links to irrelevant pages.
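Using the same placeholders as before, such an invocation could look like this:
java -cp target/FakeCrawler.jar eu.arcomem.fakecrawler.FakeCrawler --disable-adaptive-queue --disable-queueing-outlinks --hbase-site $hbase_dir/conf/hbase-site.xml --base-directory $document_dir $seed_urls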