As the hbase user:

    tar xzf arcomem-framework.online.tgz
    mkdir log

and put this into its crontab:
    JAVA_HOME=/opt/jdk1.7.0_51
    PATH=/var/lib/hbase/arcomem-framework/ingestion_scripts/common:/var/lib/hbase/arcomem-framework/ingestion_scripts/hbase_side:/usr/local/bin:/usr/bin:/bin:/opt/jdk1.7.0_51/bin
    # m h  dom mon dow  command
    * * * * * cd /var/lib/hbase/arcomem-framework/ingestion_scripts/hbase_side; load_warcs_run_online >> $HOME/log/load_warcs_run_online 2>&1
In arcomem-framework, under ingestion_scripts/hbase_side/, adapt arcomem_env to point to the correct jars, network endpoints and directories.
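For orientation, the sketch below shows the kind of settings such an environment file typically centralises. The variable names and values here are illustrative assumptions, not the actual contents of arcomem_env; adapt the variables already present in the shipped file.

    # Illustrative sketch only -- names and values are assumptions, not the
    # real contents of arcomem_env. Edit the shipped file instead.
    export JAVA_HOME=/opt/jdk1.7.0_51                            # JDK used by the ingestion jobs
    export HADOOP_JAR_DIR=/usr/lib/hadoop/client                 # jars needed on the classpath (assumed location)
    export HBASE_ZOOKEEPER_QUORUM=zk1.example.org                # HBase cluster endpoint (assumed)
    export HDFS_IMPORT_DIR=hdfs://localhost:9000/exchange/import # where new WARCs arrive
    export LOG_DIR=$HOME/log                                     # local directory for script logs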
load_warcs_run_online will randomly select a campaign to ingest among the campaigns that have new WARCs, and run the online analysis on it. It will not run concurrently with itself.
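The behaviour described above can be pictured roughly as follows. This is a hedged sketch, not the actual implementation of load_warcs_run_online; the lock-file path and the two helper commands are hypothetical.

    # Sketch of the described behaviour (assumed, not the real script):
    # take an exclusive lock so two cron invocations never overlap,
    # then pick one campaign with pending WARCs at random.
    (
      flock -n 9 || exit 0                                      # another run is active: do nothing
      campaign=$(list_campaigns_with_new_warcs | shuf -n 1)     # hypothetical helper
      [ -n "$campaign" ] && run_online_analysis "$campaign"     # hypothetical helper
    ) 9> /tmp/load_warcs_run_online.lock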
To have the WARCs automatically transferred to the HDFS, the following
instructions must be applied on all machines running the API crawler or the
adaptive Heritrix.
Install Cloudera's hadoop-hdfs package. Unpack arcomem-framework.online.tgz and, under arcomem-framework/ingestion_scripts/crawler_side/, adapt log_dir in transfer_to_hdfs.conf. Install bc and put this in the crontab:
    JAVA_HOME=/opt/jdk1.7.0_51
    PATH=/home/arcomem/arcomem-framework/ingestion_scripts/common:/home/arcomem/arcomem-framework/ingestion_scripts/crawler_side:/usr/local/bin:/usr/bin:/bin:/opt/jdk1.7.0_51/bin
    * * * * * transfer_all_jobs_to_hdfs -h $HOME/heritrix-3.1.1_adaptive/jobs hdfs://localhost:9000/exchange/import >> $HOME/log/transfer_all_jobs_to_hdfs 2>&1
adapting the path to the framework's scripts in PATH, the path to the crawler's WARC root directory, the HDFS host and the log path. Specify -h if the crawler is Heritrix, or -a for the API crawler.
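Conceptually, each cron run then performs something like the following. This is a hedged sketch of the transfer step, assuming the script walks the WARC directories and copies new files with the standard hdfs dfs client; the actual behaviour and options are those of transfer_all_jobs_to_hdfs and transfer_to_hdfs.conf.

    # Sketch only -- assumed behaviour, not the real transfer_all_jobs_to_hdfs.
    WARC_ROOT=$HOME/heritrix-3.1.1_adaptive/jobs       # crawler's WARC root (from the crontab above)
    HDFS_DEST=hdfs://localhost:9000/exchange/import    # HDFS import directory (from the crontab above)
    # Copy every WARC that has not been transferred yet.
    find "$WARC_ROOT" -name '*.warc.gz' | while read -r warc; do
      hdfs dfs -test -e "$HDFS_DEST/$(basename "$warc")" \
        || hdfs dfs -put "$warc" "$HDFS_DEST/"
    done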