First, install a JRE or JDK. ARCOMEM was tested with Oracle's Java SE 1.7 JDK,
which you can download from http://www.oracle.com/us/downloads/index.html.
As root:
tar -C /opt -xzf jdk-7u51-linux-x64.tar.gz
To install and start a pseudo-distributed Hadoop/HBase system on a clean Debian
Squeeze machine, run the following as root (note that this will overwrite some
configuration files if they already exist, so adapt it to your needs):
./hadoop_hbase_setup.sh
Then, install and start the Zookeeper we provide (see the "Set up" section on
the Triple Store page).
Lastly, start HBase:
/etc/init.d/hbase-master start
/etc/init.d/hbase-regionserver start
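If the services came up correctly, you can check them from the HBase shell;
status and list are built-in HBase shell commands (a fresh install will show no
user tables yet):
hbase shell
status
list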
If you want to use a different Hadoop/HBase setup, note that the ARCOMEM
system was tested with the Cloudera distribution CDH 4.4.0. Its installation is
documented at
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_4_4.html.
For development purposes, it is faster to set up a pseudo-distributed system.
It is recommended to increase some system limits, as described in the Cloudera
documentation.
Be sure to read about the HDFS shell commands and the web interfaces in the
HDFS user guide, and familiarise yourself with the HBase web interface. The
HDFS and HBase project documentation covers both in more detail.
Once you have a cluster set up and running, you can download the ARC files from
a crawl and copy them to your HDFS using:
hadoop dfs -copyFromLocal {from} {to}
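For example, assuming the ARC files sit in a local directory arcs/ and you want
them under /user/arcomem/arcs on HDFS (both paths are made up for
illustration; mkdir, copyFromLocal and ls are standard hadoop dfs
subcommands):
hadoop dfs -mkdir /user/arcomem/arcs
hadoop dfs -copyFromLocal arcs/*.arc.gz /user/arcomem/arcs/
hadoop dfs -ls /user/arcomem/arcs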
Once the ARC files are on HDFS, they can be imported into HBase using IM's
HBaseLoad tool, as described on its wiki page.
Once the ARC files are imported into the HBase instance, you can run any of the
modules on your local cluster.
You can scan HBase to find the records you want using the HBase shell. For
example, the following HBase shell command finds all videos:
scan 'crawl_table', {COLUMNS => "meta:mime",
  FILTER => org.apache.hadoop.hbase.filter.SingleColumnValueFilter.new(
    'meta'.to_java_bytes(),
    'mime'.to_java_bytes(),
    org.apache.hadoop.hbase.filter.CompareFilter::CompareOp.valueOf('EQUAL'),
    org.apache.hadoop.hbase.filter.SubstringComparator.new('video/'))}
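A scan like this touches every row, which can be slow on a large crawl table.
While experimenting, it can help to cap the number of rows returned with the
standard LIMIT option of the HBase shell's scan command:
scan 'crawl_table', {COLUMNS => "meta:mime", LIMIT => 10}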
However, it's clearly a bit of a hassle to type all of that into the HBase
shell, and even then it doesn't make it easy to actually get at and examine the
data.
So, the RecordInspector tool comes to the
rescue. This tool allows you to specify a URL (an HBase key, in fact) and will
provide access to both the content and the metadata.
By default the content of the resource is written to stdout, so it can be
redirected to a file or piped to other commands. If you do not want the tool
to write to stdout, use the -noout command-line flag.
You can display the resource directly using the -display flag, which currently
recognises MIME types beginning with text/* and image/*. The tool uses the
reported MIME type to decide this, but you can pass the -useDetectedMime flag
to make it use the detected MIME type of the resource instead.
The tool can write the resource directly to a file with the -file <file>
argument. It can also produce a metadata report for the resource when the
-metadata <file> argument is given; this writes a text file containing the
report to the given filename.
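Putting the flags together, the following saves a resource's content and
metadata report to local files. The record-inspector command name and the key
are stand-ins for illustration; use whatever launch script or jar invocation
your installation provides, with the flags documented above:
record-inspector http://example.com/video.mp4 -noout -file video.mp4 -metadata video-report.txt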