As the hbase user:

    tar xzf arcomem-framework.online.tgz
    mkdir log

and put this into its crontab:
    JAVA_HOME=/opt/jdk1.7.0_51
    PATH=/var/lib/hbase/arcomem-framework/ingestion_scripts/common:/var/lib/hbase/arcomem-framework/ingestion_scripts/hbase_side:/usr/local/bin:/usr/bin:/bin:/opt/jdk1.7.0_51/bin
    # m h  dom mon dow  command
    * * * * * cd /var/lib/hbase/arcomem-framework/ingestion_scripts/hbase_side; load_warcs_run_online >> $HOME/log/load_warcs_run_online 2>&1
In arcomem-framework, under ingestion_scripts/hbase_side/, adapt arcomem_env to point to the correct jars, network endpoints and directories.
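For orientation, the sketch below shows the kind of settings such an environment file typically centralises. The variable names and values here are illustrative assumptions, not the actual contents of arcomem_env; adapt the variables already present in the shipped file.

    # Illustrative sketch only -- names and values are assumptions, not the
    # real contents of arcomem_env. Edit the shipped file instead.
    export JAVA_HOME=/opt/jdk1.7.0_51                            # JDK used by the ingestion jobs
    export HADOOP_JAR_DIR=/usr/lib/hadoop/client                 # jars needed on the classpath (assumed location)
    export HBASE_ZOOKEEPER_QUORUM=zk1.example.org                # HBase cluster endpoint (assumed)
    export HDFS_IMPORT_DIR=hdfs://localhost:9000/exchange/import # where new WARCs arrive
    export LOG_DIR=$HOME/log                                     # local directory for script logs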
load_warcs_run_online will randomly select a campaign to ingest among the campaigns that have new WARCs, and run the online analysis on it. It will not run concurrently with itself.
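The behaviour described above can be pictured roughly as follows. This is a hedged sketch, not the actual implementation of load_warcs_run_online; the lock-file path and the two helper commands are hypothetical.

    # Sketch of the described behaviour (assumed, not the real script):
    # take an exclusive lock so two cron invocations never overlap,
    # then pick one campaign with pending WARCs at random.
    (
      flock -n 9 || exit 0                                      # another run is active: do nothing
      campaign=$(list_campaigns_with_new_warcs | shuf -n 1)     # hypothetical helper
      [ -n "$campaign" ] && run_online_analysis "$campaign"     # hypothetical helper
    ) 9> /tmp/load_warcs_run_online.lock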
To have the WARCs automatically transferred to the HDFS, the following
instructions must be applied on all machines running the API crawler or the
adaptive Heritrix.
Install Cloudera's hadoop-hdfs package. Unpack arcomem-framework.online.tgz and, under arcomem-framework/ingestion_scripts/crawler_side/, adapt log_dir in transfer_to_hdfs.conf. Install bc and put this in the crontab:
    JAVA_HOME=/opt/jdk1.7.0_51
    PATH=/home/arcomem/arcomem-framework/ingestion_scripts/common:/home/arcomem/arcomem-framework/ingestion_scripts/crawler_side:/usr/local/bin:/usr/bin:/bin:/opt/jdk1.7.0_51/bin
    * * * * * transfer_all_jobs_to_hdfs -h $HOME/heritrix-3.1.1_adaptive/jobs hdfs://localhost:9000/exchange/import >> $HOME/log/transfer_all_jobs_to_hdfs 2>&1
adapting the path to the framework's scripts in PATH, the path to the crawler's WARC root directory, the HDFS host and the log path. Specify -h if the crawler is Heritrix, or -a for the API crawler.
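Conceptually, each cron run then performs something like the following. This is a hedged sketch of the transfer step, assuming the script walks the WARC directories and copies new files with the standard hdfs dfs client; the actual behaviour and options are those of transfer_all_jobs_to_hdfs and transfer_to_hdfs.conf.

    # Sketch only -- assumed behaviour, not the real transfer_all_jobs_to_hdfs.
    WARC_ROOT=$HOME/heritrix-3.1.1_adaptive/jobs       # crawler's WARC root (from the crontab above)
    HDFS_DEST=hdfs://localhost:9000/exchange/import    # HDFS import directory (from the crontab above)
    # Copy every WARC that has not been transferred yet.
    find "$WARC_ROOT" -name '*.warc.gz' | while read -r warc; do
      hdfs dfs -test -e "$HDFS_DEST/$(basename "$warc")" \
        || hdfs dfs -put "$warc" "$HDFS_DEST/"
    done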