Implementing a new offline analysis module is simply a matter of creating a new subclass of a particular existing class. The class you need to subclass depends on the type of module. Modules are normally implemented in the eu.arcomem.framework.offline.processes package of the offline-analysis-modules project.
Standard offline modules are subclasses of OfflineProcess. The OfflineProcess class is typed on the output key and value types of the Map function (MAP_KEY_OUT and MAP_VALUE_OUT), the output key and value types of the reducer (REDUCE_KEY_OUT and REDUCE_VALUE_OUT), and the Hadoop org.apache.hadoop.mapreduce.OutputFormat class (OUTPUT_FORMAT). For modules that are map-heavy and do not emit via the Hadoop framework from the mapper, an org.apache.hadoop.mapreduce.lib.output.NullOutputFormat should be used, and the keys and values can be the Null class.
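As a rough illustrative sketch (the class name is hypothetical and the ordering of the type parameters is assumed to follow the list above), a map-heavy module that emits nothing from its mapper might be declared like this:

// Hypothetical declaration of a map-heavy module. Nothing is emitted through
// the Hadoop framework, so NullOutputFormat is used and all key/value types
// are Null. The type-parameter order is assumed, not confirmed.
public class MyMapHeavyProcess
        extends OfflineProcess<Null, Null, Null, Null, NullOutputFormat<Null, Null>> {
    // getRequiredColumnFamilies(), getMapperClass() and getReducerClass()
    // are implemented as described below.
}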
The following methods of OfflineProcess must be implemented:

public String[] getRequiredColumnFamilies()
public Class<? extends OfflineProcessMapper<MAP_KEY_OUT, MAP_VALUE_OUT>> getMapperClass()
public Class<? extends Reducer<MAP_KEY_OUT, MAP_VALUE_OUT, REDUCE_KEY_OUT, REDUCE_VALUE_OUT>> getReducerClass() throws ClassNotFoundException
The getRequiredColumnFamilies() method must return the HBase column families that the module needs access to. Currently there are two such families: the metadata family (AMResource.METADATA_CF) and the content family (AMResource.CONTENT_CF). The implementation of this method should return one or both of the families depending on what information is required. A typical implementation is as follows:
@Override
public String[] getRequiredColumnFamilies() {
    return new String[] {AMResource.METADATA_CF, AMResource.CONTENT_CF};
}
The getMapperClass() method must return the class implementing the Map function for the module. This must be a subclass of OfflineProcessMapper. To implement a subclass of OfflineProcessMapper you need only implement a single method representing a Map function, which takes a URL (the key) and the document (the resource):

public void map(Text key, ResultResourceWrapper resource, Context context)
If the module is map-heavy, this method is where most of the work is done. The SampleImageProcess.SampleImageProcessingMapper class is a good demonstration of a map-heavy module. The URLResourceMapper class demonstrates how a Map function for a reduce-heavy task might be implemented.
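For illustration only, a map-heavy mapper skeleton might look like the sketch below; the class name is hypothetical and the Null key/value types assume the map-heavy declaration shown earlier. Real modules should follow SampleImageProcess.SampleImageProcessingMapper.

// Hypothetical mapper skeleton for a map-heavy module.
public static class MyProcessingMapper extends OfflineProcessMapper<Null, Null> {
    @Override
    public void map(Text key, ResultResourceWrapper resource, Context context) {
        // key holds the URL of the document and resource wraps the HBase row.
        // A map-heavy module does most of its work here, typically writing
        // results out directly rather than emitting key-value pairs through
        // the Hadoop framework.
    }
}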
The getReducerClass() method must return the class implementing the Reduce function. The returned class must be a subclass of org.apache.hadoop.mapreduce.Reducer. If the module doesn't require a reduce phase, then the NullReducer class should be returned rather than null (which will not work).
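As a sketch (assuming the hypothetical mapper above and a module with no reduce phase), the two accessor methods might be implemented as follows; the generic bounds simply mirror the Null types assumed in the declaration:

@Override
public Class<? extends OfflineProcessMapper<Null, Null>> getMapperClass() {
    return MyProcessingMapper.class;
}

@Override
public Class<? extends Reducer<Null, Null, Null, Null>> getReducerClass() throws ClassNotFoundException {
    // No reduce phase is required, so return NullReducer rather than null.
    return NullReducer.class;
}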
Subclasses of OfflineProcess can also override the setupJob(Job job) method in order to configure additional parts of the Map-Reduce framework. For example, a module could configure a Combine function and set the number of reducers:
@Override
public void setupJob(Job job) {
    super.setupJob(job);
    job.setCombinerClass(MimeTypeStatsReducer.class);
    job.setNumReduceTasks(1);
}
Both the mapper and reducer classes have setup(Context context) and cleanup(Context context) methods that can be overridden to make better use of resources across calls to the Map or Reduce functions. A common use-case is setting up a persistent connection to the triple store that can be re-used throughout the lifetime of the mapper or reducer:
private TripleStoreConnector tsc = null;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    try {
        tsc = TripleStoreConnector.newConnector(context.getConfiguration());
    } catch (Exception e) {
        logger.fatal("Error configuring TripleStoreConnector.");
        System.exit(1);
    }
}
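The corresponding cleanup(Context context) method can be used to release such resources when the mapper or reducer finishes. The sketch below assumes that TripleStoreConnector exposes a close() method; the actual API may differ:

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Release the connection opened in setup(). The close() call is an
    // assumption about the TripleStoreConnector API.
    if (tsc != null) {
        tsc.close();
    }
}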
Take a look at the module implementations in the eu.arcomem.framework.offline.processes package of the offline-analysis-modules project to see some more example uses of the setup(Context context) and cleanup(Context context) methods.
Another use of the setup(Context context) method is for configuring the mapper or reducer. Configuration information provided in the form of key-value pairs to the SingleOfflineProcessRunner and OfflineProcessRunner tools is passed automatically into the configuration of the context object, which can be accessed by calling context.getConfiguration(). Calling one of the various get methods on this configuration object with a key given to the tool will return the relevant value. For example, see this excerpt from the SampleImageProcess.SampleImageProcessingMapper:
public static final String CONF_MIN_FACE_DET_SIZE = "min.facedetect.size";

private FaceDetector<DetectedFace, FImage> detector;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    int minSize = context.getConfiguration().getInt(CONF_MIN_FACE_DET_SIZE, 80);
    detector = new HaarCascadeDetector(minSize);
}
All local modules are implemented as subclasses of the LocalOfflineProcess class. The run() method is the entry point for the actual work performed by the module.
The conf field from LocalOfflineProcess provides a convenient way to access the configuration of the module as passed through from the SingleOfflineProcessRunner or OfflineProcessRunner tools. A pre-configured connection to the knowledge base in the form of a TripleStoreConnector can be accessed by calling getConnector(). All local module implementations must have a public one-argument constructor with a LocalProcessConf parameter so that the framework can configure instances through reflection as required.
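To make these requirements concrete, a minimal local module might look like the following sketch; the class name and the body of run() are hypothetical, and the super(conf) call assumes a corresponding LocalOfflineProcess constructor:

// Hypothetical local module. The public one-argument constructor is required
// so that the framework can create instances through reflection.
public class MyLocalProcess extends LocalOfflineProcess {
    public MyLocalProcess(LocalProcessConf conf) {
        super(conf); // assumes LocalOfflineProcess has a matching constructor
    }

    @Override
    public void run() {
        // The inherited conf field exposes the key-value configuration passed
        // through from the runner tools; getConnector() returns a
        // pre-configured connection to the knowledge base.
        TripleStoreConnector connector = getConnector();
        // ... perform the module's work here ...
    }
}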
Generic HBase and HDFS modules should be implemented as subclasses of the net.internetmemory.mapred.HBaseMapReduceGeneric and net.internetmemory.mapred.MapReduceSpecGeneric classes respectively. Please see the Internet Memory documentation on these classes for more information.