The crawler simulator (''fake-crawler'') is a means for testing the online
phase of the system on a single machine without having to
install one of the larger crawlers. Its advantage is that the simulated crawler
can also read files from disk, so you can have your very own web to crawl
in a directory on your machine. It can crawl the real web as well, but it has
no politeness rules and does not adhere to robots.txt.
To run the crawler simulator you still need to install HBase to store the
documents. The simulator comes with a URL queue which receives results from
the online analysis.
Install and start HBase.
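If you use a standard standalone HBase download, it can be started from the unpacked HBase directory with:
bin/start-hbase.sh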
Change into the directory into which you downloaded the framework code and build the complete JAR:
cd arcomem-framework
mvn assembly:assembly -DskipTests=true
This will create the file target/ArcomemFramework.jar that contains all
dependencies needed for the online phase.
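As an optional sanity check, you can list the contents of the assembled JAR to confirm that the dependencies were bundled:
jar tf target/ArcomemFramework.jar | head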
Edit the file ''conf/hbase-env.sh'' in the HBase directory and add (or
uncomment and change) the following line:
export HBASE_CLASSPATH=$arcomem_framework/target/ArcomemFramework.jar
where ''$arcomem_framework'' is the directory into which you downloaded the
framework code.
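If you prefer to do this from the command line, the following sketch (run from the HBase directory, with ''$arcomem_framework'' set in your shell) appends the line; the variable is expanded when the line is written, so the file ends up containing the absolute path:
echo "export HBASE_CLASSPATH=$arcomem_framework/target/ArcomemFramework.jar" >> conf/hbase-env.sh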
Furthermore, edit the file ''conf/hbase-site.xml'' and add the following lines
after the opening <configuration> tag, so that the Arcomem online-analysis
trigger is loaded as a region coprocessor when HBase starts:
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>eu.arcomem.framework.hbase.OnlineAnalysisTrigger</value>
  <description>Trigger for the queueing of prioritized outlinks.</description>
</property>
The simplest crawl specification is a properties file, which the default
framework configuration reads from ''/crawlspec.properties'' (root
directory). To specify the keywords used for rating document relevance, use
this format:
seed.keyword.1=George Bush
seed.keyword.2=Bush
seed.keyword.3=Cameron
seed.keyword.4=Tony Blair
seed.keyword.5=Obama
The prefix ''seed.keyword.'' is significant; the number that follows is arbitrary
but must be unique. The keyword is the text after the equals sign.
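As a sketch, and assuming that "root directory" refers to the root of the local filesystem (writing there usually requires root privileges; adapt the path if your configuration reads the file from elsewhere), the crawl specification above could be created like this:
sudo tee /crawlspec.properties > /dev/null <<'EOF'
seed.keyword.1=George Bush
seed.keyword.2=Bush
seed.keyword.3=Cameron
seed.keyword.4=Tony Blair
seed.keyword.5=Obama
EOF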
After creating or changing the crawl specification, restart HBase using:
bin/stop-hbase.sh && bin/start-hbase.sh
Create the HBase table used by the fake crawler by starting the HBase shell
(from the HBase directory) using:
bin/hbase shell
and entering the command
create 'warc_contents', 'content', 'meta'
This creates the table ''warc_contents'' with the column families ''content''
and ''meta''. If successful, the shell will print:
0 row(s) in 1.8830 seconds
(time may differ).
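If you want to verify the table without an interactive session, one way is to pipe the command into the shell (output varies between HBase versions):
echo "describe 'warc_contents'" | bin/hbase shell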
Get the fake-crawler project and change into the directory.
Build the code using
mvn assembly:assembly -DskipTests=true
The FakeCrawler crawls files from a local directory. Prepare such a directory
of HTML files that link to each other and choose one or more starting
pages (a minimal example directory is sketched below). Now you can start the crawler using:
java -cp target/FakeCrawler.jar eu.arcomem.fakecrawler.FakeCrawler --disable-queueing-outlinks --hbase-site $hbase_dir/conf/hbase-site.xml --base-directory $document_dir $seed_urls
where ''$hbase_dir'' is the directory where you unpacked HBase,
''$document_dir'' is the directory containing the files to crawl, and
''$seed_urls'' is one or more pages in this directory (given relative
to that directory). The option ''--disable-queueing-outlinks'' makes the crawler
wait for the information returned by the online analysis.
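If you do not have a suitable set of HTML files yet, a tiny self-contained test web is enough. The following sketch (file names and contents are just an example) creates two pages in ''$document_dir'' that link to each other:
mkdir -p "$document_dir"
cat > "$document_dir/index.html" <<'EOF'
<html><body>
<p>George Bush and Tony Blair met in London.</p>
<a href="obama.html">More on Obama</a>
</body></html>
EOF
cat > "$document_dir/obama.html" <<'EOF'
<html><body>
<p>Obama and Cameron held a press conference.</p>
<a href="index.html">Back to the start page</a>
</body></html>
EOF
With this layout, ''$seed_urls'' is simply ''index.html''.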
When you run this command, you will see lots of information about the process
of the crawler and the crawler queue. To get more information about the
processing done by the online phase, use ''tail -f'' on the filename that HBase
printed when you ran ''bin/start-hbase.sh'' earlier (logs/hbase-*.out).
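For example, from the HBase directory (the exact file name contains your user and host name):
tail -f logs/hbase-*.out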
To demonstrate the effect of the prioritization, you can also add the parameter
''--disable-adaptive-queue'' at the start of the parameter list. This makes the
crawler behave like a standard crawler, i.e. crawl all links in FIFO order,
including links to irrelevant pages.
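Using the same placeholders as before, such an invocation could look like this:
java -cp target/FakeCrawler.jar eu.arcomem.fakecrawler.FakeCrawler --disable-adaptive-queue --disable-queueing-outlinks --hbase-site $hbase_dir/conf/hbase-site.xml --base-directory $document_dir $seed_urls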