To run any modules, you will need a Hadoop installation on the launching machine, configured with the details of your cluster. A local module runs on the machine it is launched from, whereas all other modules run on the cluster. Typically, you will either log in to one of the ARCOMEM machines at IMF to run jobs, or you will have a local cluster set up for development and testing.
The following instructions assume that Hadoop is properly configured and available on your PATH.
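As a quick sanity check, the standard Hadoop commands below can be used to confirm that the client is on the PATH and can see the cluster (the exact output will depend on your installation):
hadoop version      # prints the Hadoop version if the launcher is on the PATH
hadoop fs -ls /     # lists the HDFS root, confirming the cluster configuration is picked up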
You also need an assembled version of ArcomemOffline.jar, which can be built by running
mvn assembly:assembly
in the offline-process-modules sub-project.
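For example, assuming the sub-project sits directly under your checkout of the framework and that Maven places the assembled jar under the module's target directory (both assumptions, not guaranteed by this page), the build might look like this:
cd offline-process-modules
mvn assembly:assembly
ls target/    # the assembled ArcomemOffline.jar is expected here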
Running a complete offline phase consists of running a number of modules in sequence. The ARCOMEM OfflineProcessRunner tool allows such module sequences to be run through a flexible XML-based configuration file, and supports features such as inter-module dependencies, per-module key-value configuration and content filters (all described below).
Running the tool is simple:
hadoop jar ArcomemOffline.jar OfflineProcessRunner options
There are currently only two options for the tool:
--config-file (-c) FILE : (Required) The configuration file.
--dry-run : (Optional) Perform a dry run, validating the configuration and printing the order of the modules without executing them.
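For example, to validate a configuration and print the module ordering without launching any jobs (the configuration file name here is only illustrative):
hadoop jar ArcomemOffline.jar OfflineProcessRunner --config-file offline-config.xml --dry-run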
The tool performs a number of checks on the configuration to ensure that it is valid. If any problems are found, they will be reported on the command line and the tool will abort.
The configuration for the OfflineProcessRunner tool is a fairly simple XML file, as shown in the example below:
<offlineConfiguration>
  <tripleStore>
    <H2RDF>
      <url>http://test.url</url>
      <user>jon</user>
      <table>the-table</table>
    </H2RDF>
  </tripleStore>
  <processes>
    <standardProcess>
      <id>image</id>
      <class>SampleImageProcess</class>
      <dependencies>
        <dependency>nlp</dependency>
      </dependencies>
      <configuration>
        <entry>
          <key>min.facedetect.size</key>
          <value>80</value>
        </entry>
      </configuration>
      <filter>
        <mimetype>
          <regex>^image/.*$</regex>
        </mimetype>
      </filter>
    </standardProcess>
    <standardProcess>
      <id>nlp</id>
      <class>SampleGateProcess</class>
    </standardProcess>
  </processes>
</offlineConfiguration>
The file defines the connection configuration for the tripleStore (see The Triple Store Connector) and a list of processes (modules), each with its own configuration. Each process has an identifier (which must be unique) and can optionally declare dependencies on other modules by listing their identifiers.
Three types of tripleStore configuration are supported: H2RDF, sesameMemory and sesameRemote. Each configuration has its own options, as illustrated below:
<tripleStore>
  <sesameMemory>
    <destination>STD_OUT</destination> <!-- (Optional) Where to write RDF to. Either STD_OUT or FILE. Defaults to STD_OUT if not specified. -->
    <filename>VAL</filename>           <!-- (Optional) Name of the RDF output file, if <destination> is FILE. -->
  </sesameMemory>
</tripleStore>

<tripleStore>
  <sesameRemote>
    <url>VAL</url> <!-- (Required) URL of the remote store. -->
  </sesameRemote>
</tripleStore>

<tripleStore>
  <H2RDF>
    <url>VAL</url>     <!-- (Required) URL of the store. -->
    <user>VAL</user>   <!-- (Required) Username for connecting to the store. -->
    <table>VAL</table> <!-- (Required) Table name of the store. -->
  </H2RDF>
</tripleStore>
See The Triple Store Connector to understand how this information translates to the code.
Offline modules are specified through <*Process> blocks. There is a separate process block for each of the four types of module (standard, local, hbase and hdfs). As mentioned previously, all processes must specify a unique identifier (<id>...</id>) and can optionally declare dependencies on other modules. All processes must also specify the implementation class (<class>...</class>). Note that when specifying the module class, if the module is in the eu.arcomem.framework.offline.processes package you need only enter the simple class name; otherwise you must enter the fully qualified classname (see the example after the configuration block below). In addition, processes can optionally specify additional key-value configuration pairs which are made available to the module implementation:
<configuration>
  <entry>
    <key>key1</key>
    <value>value1</value>
  </entry>
  <entry>
    <key>key2</key>
    <value>value2</value>
  </entry>
</configuration>
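As an illustration of the class-name shorthand mentioned above, the first declaration below uses the SampleImageProcess class from the earlier example (which lives in the eu.arcomem.framework.offline.processes package), while the second shows a hypothetical module in another package that must be fully qualified:
<!-- Module in the eu.arcomem.framework.offline.processes package: the simple name is enough -->
<class>SampleImageProcess</class>

<!-- Hypothetical module in another package: the fully qualified classname is required -->
<class>com.example.arcomem.MyCustomProcess</class>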
Each type of module also has some additional parameters associated with it, as shown below. For clarity, the class, id and configuration entries have been omitted:
<standardProcess>
  <jvmArgs>VAL</jvmArgs>                 <!-- (Optional) Additional hadoop/JVM arguments for the process. -->
  <forceOverwrite>false</forceOverwrite> <!-- (Optional) [true|false] If true and output exists, remove it. If false and output exists, the process will fail. -->
  <table>VAL</table>                     <!-- (Optional) The name of the HBase table to process. Defaults to warc_contents. -->
  <output>VAL</output>                   <!-- (Optional) Output path/table. Only required for modules that produce output. -->
  <filter>                               <!-- (Optional) Specify a filter. -->
    <mimetype>                           <!-- Currently there is only a mimetype filter. -->
      <regex>VAL</regex>                 <!-- (Required) The regex for the mimetype filter. -->
      <useMime>false</useMime>           <!-- (Optional) [true|false] Use the mimetype from the header returned by the server rather than the detected mimetype. Defaults to false. -->
    </mimetype>
  </filter>
</standardProcess>

<localProcess>
  <!-- No additional configuration. -->
</localProcess>

<hbaseProcess>
  <jvmArgs>VAL</jvmArgs>                 <!-- (Optional) Additional hadoop/JVM arguments for the process. -->
  <forceOverwrite>false</forceOverwrite> <!-- (Optional) [true|false] If true and output exists, remove it. If false and output exists, the process will fail. -->
  <table>VAL</table>                     <!-- (Required) The name of the HBase table to process. -->
  <output>VAL</output>                   <!-- (Optional) Output path/table. Only required for modules that produce output. -->
</hbaseProcess>

<hdfsProcess>
  <jvmArgs>VAL</jvmArgs>                 <!-- (Optional) Additional hadoop/JVM arguments for the process. -->
  <forceOverwrite>false</forceOverwrite> <!-- (Optional) [true|false] If true and output exists, remove it. If false and output exists, the process will fail. -->
  <output>VAL</output>                   <!-- (Optional) Output path/table. Only required for modules that produce output. -->
</hdfsProcess>
arcomem-framework/configurations/build_conf.sh will create a combined configuration file containing all the modules specified on the command line, injecting the user-specified HBase and triple store parameters. It uses the templates under the same directory. The resulting configuration file can, of course, be further edited.
Wiki: HadoopHBase
Wiki: Statistics
Wiki: TripleStoreConnector