Implementing a new offline analysis module is simply a matter of creating a new subclass of a particular existing class. The class you need to subclass depends on the type of module. Modules are normally implemented in the eu.arcomem.framework.offline.processes package of the offline-analysis-modules project.
Standard offline modules are subclasses of OfflineProcess. The OfflineProcess class is typed on the output key and value types of the Map function (MAP_KEY_OUT and MAP_VALUE_OUT), the output key and value types of the reducer (REDUCE_KEY_OUT and REDUCE_VALUE_OUT), and the Hadoop org.apache.hadoop.mapreduce.OutputFormat class (OUTPUT_FORMAT). For modules that are map-heavy and do not emit via the Hadoop framework from the mapper, an org.apache.hadoop.mapreduce.lib.output.NullOutputFormat should be used, and the keys and values can be the Null class.
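As a rough illustrative sketch (the class name is hypothetical and the ordering of the type parameters is assumed to follow the list above), a map-heavy module that emits nothing from its mapper might be declared like this:

// Hypothetical declaration of a map-heavy module. Nothing is emitted through
// the Hadoop framework, so NullOutputFormat is used and all key/value types
// are Null. The type-parameter order is assumed, not confirmed.
public class MyMapHeavyProcess
        extends OfflineProcess<Null, Null, Null, Null, NullOutputFormat<Null, Null>> {
    // getRequiredColumnFamilies(), getMapperClass() and getReducerClass()
    // are implemented as described below.
}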
The following methods of OfflineProcess must be implemented:

public String[] getRequiredColumnFamilies()
public Class<? extends OfflineProcessMapper<MAP_KEY_OUT, MAP_VALUE_OUT>> getMapperClass()
public Class<? extends Reducer<MAP_KEY_OUT, MAP_VALUE_OUT, REDUCE_KEY_OUT, REDUCE_VALUE_OUT>> getReducerClass() throws ClassNotFoundException
The getRequiredColumnFamilies() method must return the HBase column families that the module needs access to. Currently there are two such families: the metadata family (AMResource.METADATA_CF) and the content family (AMResource.CONTENT_CF). The implementation of this method should return one or both of the families depending on what information is required. A typical implementation is as follows:
@Override
public String[] getRequiredColumnFamilies() {
    return new String[] {AMResource.METADATA_CF, AMResource.CONTENT_CF};
}
The getMapperClass() method must return the class implementing the Map function for the module. This must be a subclass of OfflineProcessMapper. To implement a subclass of OfflineProcessMapper you need only implement a single method representing a Map function, which takes a URL (the key) and the document (the resource):

public void map(Text key, ResultResourceWrapper resource, Context context)
If the module is map-heavy, this method is where most of the work is done. The SampleImageProcess.SampleImageProcessingMapper class is a good demonstration of a map-heavy module. The URLResourceMapper class demonstrates how a Map function for a reduce-heavy task might be implemented.
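For illustration only, a map-heavy mapper skeleton might look like the sketch below; the class name is hypothetical and the Null key/value types assume the map-heavy declaration shown earlier. Real modules should follow SampleImageProcess.SampleImageProcessingMapper.

// Hypothetical mapper skeleton for a map-heavy module.
public static class MyProcessingMapper extends OfflineProcessMapper<Null, Null> {
    @Override
    public void map(Text key, ResultResourceWrapper resource, Context context) {
        // key holds the URL of the document and resource wraps the HBase row.
        // A map-heavy module does most of its work here, typically writing
        // results out directly rather than emitting key-value pairs through
        // the Hadoop framework.
    }
}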
The getReducerClass() method must return the class implementing the Reduce function. The returned class must be a subclass of org.apache.hadoop.mapreduce.Reducer. If the module doesn't require a reduce phase, then the NullReducer class should be returned rather than null (which will not work).
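As a sketch (assuming the hypothetical mapper above and a module with no reduce phase), the two accessor methods might be implemented as follows; the generic bounds simply mirror the Null types assumed in the declaration:

@Override
public Class<? extends OfflineProcessMapper<Null, Null>> getMapperClass() {
    return MyProcessingMapper.class;
}

@Override
public Class<? extends Reducer<Null, Null, Null, Null>> getReducerClass() throws ClassNotFoundException {
    // No reduce phase is required, so return NullReducer rather than null.
    return NullReducer.class;
}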
Subclasses of OfflineProcess can also override the setupJob(Job job) method in order to configure additional parts of the Map-Reduce framework. For example, a module could configure a Combine function and set the number of reducers:
@Override
public void setupJob(Job job) {
    super.setupJob(job);
    job.setCombinerClass(MimeTypeStatsReducer.class);
    job.setNumReduceTasks(1);
}
Both the mapper and reducer classes have setup(Context context) and cleanup(Context context) methods that can be overridden to make better use of resources across calls to the Map or Reduce functions. A common use-case is setting up a persistent connection to the triple store that can be re-used throughout the lifetime of the mapper or reducer:
private TripleStoreConnector tsc = null;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    try {
        tsc = TripleStoreConnector.newConnector(context.getConfiguration());
    } catch (Exception e) {
        logger.fatal("Error configuring TripleStoreConnector.");
        System.exit(1);
    }
}
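The corresponding cleanup(Context context) method can be used to release such resources when the mapper or reducer finishes. The sketch below assumes that TripleStoreConnector exposes a close() method; the actual API may differ:

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Release the connection opened in setup(). The close() call is an
    // assumption about the TripleStoreConnector API.
    if (tsc != null) {
        tsc.close();
    }
}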
Take a look at the module implementations in the eu.arcomem.framework.offline.processes package of the offline-analysis-modules project to see some more example uses of the setup(Context context) and cleanup(Context context) methods.
Another use of the setup(Context context) method is for configuring the mapper or reducer. Configuration information provided in the form of key-value pairs to the SingleOfflineProcessRunner and OfflineProcessRunner tools is passed automatically into the configuration of the context object, which can be accessed by calling context.getConfiguration(). Calling one of the various get methods on this configuration object with a key given to the tool will return the relevant value. For example, see this excerpt from the SampleImageProcess.SampleImageProcessingMapper:
public static final String CONF_MIN_FACE_DET_SIZE = "min.facedetect.size";

private FaceDetector<DetectedFace, FImage> detector;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    int minSize = context.getConfiguration().getInt(CONF_MIN_FACE_DET_SIZE, 80);
    detector = new HaarCascadeDetector(minSize);
}
All local modules are implemented as subclasses of the LocalOfflineProcess class. The run() method is the entry point for the actual work performed by the module.
The conf field from LocalOfflineProcess provides a convenient way to access the configuration of the module as passed through from the SingleOfflineProcessRunner or OfflineProcessRunner tools. A pre-configured connection to the knowledge base in the form of a TripleStoreConnector can be accessed by calling getConnector(). All local module implementations must have a public one-argument constructor with a LocalProcessConf parameter so that the framework can configure instances through reflection as required.
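To make these requirements concrete, a minimal local module might look like the following sketch; the class name and the body of run() are hypothetical, and the super(conf) call assumes a corresponding LocalOfflineProcess constructor:

// Hypothetical local module. The public one-argument constructor is required
// so that the framework can create instances through reflection.
public class MyLocalProcess extends LocalOfflineProcess {
    public MyLocalProcess(LocalProcessConf conf) {
        super(conf); // assumes LocalOfflineProcess has a matching constructor
    }

    @Override
    public void run() {
        // The inherited conf field exposes the key-value configuration passed
        // through from the runner tools; getConnector() returns a
        // pre-configured connection to the knowledge base.
        TripleStoreConnector connector = getConnector();
        // ... perform the module's work here ...
    }
}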
Generic HBase and HDFS modules should be implemented as subclasses of the net.internetmemory.mapred.HBaseMapReduceGeneric and net.internetmemory.mapred.MapReduceSpecGeneric classes respectively. Please see the Internet Memory documentation on these classes for more information.