
MultiModules

John Arcoman

Running a Complete Offline Phase

To run any modules, you will need to have a Hadoop installation with the configuration of your
cluster on the machine you intend to launch the modules from. If you are using a local module
then it will run on the machine the module is launched from, whereas all other modules will
run on the cluster. Typically, you will either log in to one of the ARCOMEM machines at IMF
to run jobs or have a local cluster set up for development and testing.
The following instructions assume that Hadoop is properly configured and available on your PATH.
You also need an assembled version of ArcomemOffline.jar, which can be built by
running mvn assembly:assembly in the offline-process-modules sub-project.
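
As a rough sketch, the build step might look like the following (the checkout path is an
assumption based on the arcomem-framework layout mentioned later):

cd arcomem-framework/offline-process-modules
mvn assembly:assembly
# the assembled jar (referred to here as ArcomemOffline.jar) is produced under the sub-project's target/ directory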

Running a complete offline phase consists of operating a number of modules in a sequence.
The ARCOMEM OfflineProcessRunner tool allows the running of module sequences through a
flexible XML-based configuration file. The tool supports a number of powerful features:

  • Automatic dependency resolution: module configurations can specify dependencies on other modules
    and the tool will determine the correct order in which to run them.
  • Per-module JVM arguments: (Non-local) Modules can specify their own memory limits and other
    JVM/Hadoop related arguments; these do not interfere across modules. Local modules run in
    the same JVM as the tool and inherit its configuration.
  • Per-module configuration: modules have their own distinct configuration which includes
    module-defined options and common options such as filters (for standard modules).
  • Shared knowledge base configuration: All modules can share a common triple store configuration.

The OfflineProcessRunner tool

Running the tool is simple:

hadoop jar ArcomemOffline.jar OfflineProcessRunner [options]

There are currently only two options for the tool:

--config-file (-c) FILE : (Required) The configuration file
--dry-run               : (Optional) Perform a dry run, validating the
                          configuration and printing the order of the modules
                          without executing them.
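
For example, to validate a configuration file (here called offline-config.xml, an illustrative
name) and print the module order without running anything:

hadoop jar ArcomemOffline.jar OfflineProcessRunner -c offline-config.xml --dry-run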

The tool performs a number of checks on the configuration to ensure that it is valid.
If any problems are found they will be reported on the command line and the tool will abort.

The Configuration File

The configuration for the OfflineProcessRunner tool is a fairly simple XML file as shown in the example below:

<offlineConfiguration>
  <tripleStore>
    <H2RDF>
      <url>http://test.url</url>
      <user>jon</user>
      <table>the-table</table>
    </H2RDF>
  </tripleStore>
  <processes>
    <standardProcess>
      <id>image</id>
      <class>SampleImageProcess</class>
      <dependencies>
        <dependency>nlp</dependency>
      </dependencies>
      <configuration>
        <entry>
          <key>min.facedetect.size</key>
          <value>80</value>
        </entry>
      </configuration>
      <filter>
        <mimetype>
          <regex>^image/.*$</regex>
        </mimetype>
      </filter>
    </standardProcess>
    <standardProcess>
      <id>nlp</id>
      <class>SampleGateProcess</class>
    </standardProcess>
  </processes>
</offlineConfiguration>

The file defines the connection configuration for the tripleStore (see The Triple Store Connector)
and a list of processes (modules), each with its own configuration. Each process has an identifier
(which must be unique) and can optionally declare dependencies on other modules by listing their identifiers.
In the example above, the image module depends on nlp, so the runner will execute nlp before image
regardless of the order in which the two processes appear in the file.

Triple store configuration

Three types of tripleStore configuration are supported: H2RDF, sesameMemory and sesameRemote.
Each configuration has its own options as illustrated below:

<tripleStore>
    <sesameMemory>
        <destination>STD_OUT</destination> <!-- (Optional) Where to write RDF to. Either STD_OUT or FILE. Defaults to STD_OUT if not specified. -->
        <filename>VAL</filename> <!--(Optional) name of RDF output file, if <destination> is FILE. -->
    </sesameMemory>
</tripleStore>

<tripleStore>
    <sesameRemote>
        <url>VAL</url> <!-- (Required) url of the remote store. -->
    </sesameRemote>
</tripleStore>

<tripleStore>
    <H2RDF>
        <url>VAL</url> <!-- (Required) url of the store -->
        <user>VAL</user> <!-- (Required) username for connecting to the store -->
        <table>VAL</table> <!-- (Required) table name of the store -->
    </H2RDF>
</tripleStore>
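
As a quick illustration, a sesameMemory store that writes its RDF to a file (the filename below is
illustrative) would be configured as:

<tripleStore>
    <sesameMemory>
        <destination>FILE</destination>
        <filename>offline-output.rdf</filename>
    </sesameMemory>
</tripleStore>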

See The Triple Store Connector to understand how this information translates to the code.

Module configuration

Offline Modules are specified through <*Process> blocks. There are separate process blocks for each
of the four types of module (standard, local, hbase and hdfs). As mentioned previously, all processes
must specify a unique identifier (<id>...</id>) and can optionally specify dependencies on other
modules. All processes must specify the implementation class (<class>...</class>). Note that when
specifying the module class, if the module is in the eu.arcomem.framework.offline.processes package
you need only enter the name; otherwise you must enter the fully qualified classname. In addition,
processes can also optionally specify additional key-value configuration pairs which are available
to the module implementation:

<configuration>
    <entry>
        <key>key1</key>
        <value>value1</value>
    </entry>
    <entry>
        <key>key2</key>
        <value>value2</value>
    </entry>
</configuration>
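
To illustrate the class-name rule above, a module that lives outside the
eu.arcomem.framework.offline.processes package (the id and class below are hypothetical) would be
declared with its fully qualified class name:

<localProcess>
    <id>my-custom-step</id>
    <class>com.example.offline.MyCustomProcess</class>
</localProcess>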

Each type of module also has some additional parameters associated with it as shown below. For clarity, the class, id and configuration entries have been omitted:

<standardProcess>
    <jvmArgs>VAL</jvmArgs> <!-- (Optional) Additional hadoop/jvm arguments for the process-->
    <forceOverwrite>false</forceOverwrite> <!-- (Optional) [true|false] If true and output exists, remove it. If false and output exists, process will fail. -->
    <table>VAL</table> <!-- (Optional) The name of the HBase table to process. Defaults to warc_contents. -->
    <output>VAL</output> <!-- (Optional) Output path/table. Only required for modules that produce output. -->
    <filter> <!-- (Optional) specify a filter -->
        <mimetype> <!-- currently there is only a mimetype filter -->
            <regex>VAL</regex> <!-- (Required) the regex for the mimetype filter -->
            <useMime>false</useMime> <!-- (Optional) [true|false] use the mimetype from the header returned by the server rather than the detected mimetype. Defaults to false. -->
        </mimetype>
    </filter>
</standardProcess>

<localProcess>
    <!-- no additional configuration -->
</localProcess>

<hbaseProcess>
    <jvmArgs>VAL</jvmArgs> <!-- (Optional) Additional hadoop/jvm arguments for the process-->
    <forceOverwrite>false</forceOverwrite> <!-- (Optional) [true|false] If true and output exists, remove it. If false and output exists, process will fail. -->
    <table>VAL</table> <!-- (Required) The name of the HBase table to process.-->
    <output>VAL</output> <!-- (Optional) Output path/table. Only required for modules that produce output. -->
</hbaseProcess>

<hdfsProcess>
    <jvmArgs>VAL</jvmArgs> <!-- (Optional) Additional hadoop/jvm arguments for the process-->
    <forceOverwrite>false</forceOverwrite> <!-- (Optional) [true|false] If true and output exists, remove it. If false and output exists, process will fail. -->
    <output>VAL</output> <!-- (Optional) Output path/table. Only required for modules that produce output. -->
</hdfsProcess>
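
Putting these parameters together, an hbaseProcess entry might look like the following sketch (the
class name, output table and jvmArgs value are illustrative assumptions, not required values):

<hbaseProcess>
    <id>stats</id>
    <class>SampleHBaseProcess</class> <!-- hypothetical implementation class -->
    <jvmArgs>-Dmapred.child.java.opts=-Xmx2048m</jvmArgs> <!-- illustrative; adjust to your Hadoop setup -->
    <forceOverwrite>true</forceOverwrite>
    <table>warc_contents</table>
    <output>stats_output</output> <!-- illustrative output table name -->
</hbaseProcess>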

Combined configuration helper script

arcomem-framework/configurations/build_conf.sh creates a combined configuration file containing
all the modules specified on the command line, injecting the user-specified HBase and triple
store parameters. It uses the templates located in the same directory. The resulting
configuration file can, of course, be edited further by hand.


Related

Wiki: HadoopHBase
Wiki: Statistics
Wiki: TripleStoreConnector
