ARCOMEM Wiki

Semantic and social web crawling

Brought to you by: arcomem

SampleImageProcess

The Sample Image Process Module

The SampleImageProcess module shows a simple example of how to process images
in an offline module. The class also shows the use of a map-heavy offline
module: one that has no reducer but produces output for every valid row of the
HBase table.

The class has one inner class, the SampleImageProcessMapper, which runs a face
detector on every image in the database. The getMapperClass() method returns
the class definition for this inner class; Hadoop will instantiate it. The
Reducer returned by the getReducerClass() method is the NullReducer.

Most offline processes will need the same sort of data for processing. The
required data is declared by the getRequiredColumnFamilies() method. We need
the metadata field from the HBase table (which contains the headers, mime types
and so on) and the actual content that was crawled (the image itself), so we
return a list naming those fields, like so:

    final String[] list = { AMResource.METADATA_CF, AMResource.CONTENT_CF };
    return list;

The class shows the best practice for creating counters. An internal enumeration
(called Counters) defines the counters that the process provides to the Hadoop
context. These are then updated through the Hadoop Context#getCounter() method.
For example, when we encounter a row of the database that is not an image, we
call:

context.getCounter(Counters.NOT_AN_IMAGE).increment(1);
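
The enum-keyed counter pattern can be tried out without a cluster. The sketch
below is only a stand-in and is not Hadoop's API: the CounterSketch class and
its EnumMap are hypothetical and merely mimic how an enumeration constant keys
a counter. FACES_FOUND is an assumed counter name; only NOT_AN_IMAGE appears in
the text above. In the real module the counter is obtained from
context.getCounter().

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical stand-in for the counters held by the Hadoop context. */
final class CounterSketch {
    /** NOT_AN_IMAGE comes from the module; FACES_FOUND is an assumed name. */
    enum Counters { NOT_AN_IMAGE, FACES_FOUND }

    // One running total per enum constant, created lazily on first increment.
    private final Map<Counters, Long> counts = new EnumMap<>(Counters.class);

    /** Adds amount to the given counter, starting from zero if it is new. */
    void increment(Counters counter, long amount) {
        counts.merge(counter, amount, Long::sum);
    }

    /** Returns the current total, or zero if the counter was never touched. */
    long get(Counters counter) {
        return counts.getOrDefault(counter, 0L);
    }
}
```

Keying counters by an enumeration rather than by free-form strings means a
misspelt counter name is caught by the compiler instead of silently creating a
second counter.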

The map method of the SampleImageProcessMapper class begins by retrieving the
row from the database:

AMResourceVersion resource = wrapper.getResource().getLatestRowVersion(AMResourceVersion.class);

We use the mime type of the row to determine whether we're trying to process
an image or not:

// Check if the incoming document appears to be an image
String mime = resource.getDetectedMime();
if( mime == null )
    mime = resource.getMime();
if( mime == null || !mime.startsWith("image/") ) {
    // Increase the "not an image" counter
    context.getCounter( Counters.NOT_AN_IMAGE ).increment(1);
    return;
}

Note that we first try the detected mime type. This is detected during crawling
and stored in the database. If no mime type was detected, we fall back to the
mime type that the server returned when the resource was crawled. If the mime
type doesn't begin with image/ then we don't try to process the resource. Of
course, a resource that passes this check may still turn out not to be an
image, but we carry on at this point anyway.
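
The fallback logic described above can be isolated into a small helper that is
easy to test on its own. This is a minimal sketch: the MimeCheckSketch class
and the looksLikeImage method are hypothetical names and are not part of the
module.

```java
/** Illustrative helper; the class and method names are hypothetical. */
final class MimeCheckSketch {
    private MimeCheckSketch() {}

    /**
     * Prefers the mime type detected during crawling and falls back to the
     * server-reported one only when nothing was detected.
     */
    static boolean looksLikeImage(String detectedMime, String serverMime) {
        String mime = (detectedMime != null) ? detectedMime : serverMime;
        return mime != null && mime.startsWith("image/");
    }
}
```

Note that the detected type takes precedence even when it disagrees with the
server: looksLikeImage("text/html", "image/png") is false, matching the order
of the checks in the mapper.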

We can then use resource.getContentAsInputStream() to get an input stream
to the bytes of the resource and read in an image (we use OpenIMAJ to read in
the images as we're processing them with the OpenIMAJ face detector, but you
can use any image loader that accepts an InputStream here).
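
As an example of an alternative loader, the sketch below reads the stream with
the JDK's javax.imageio.ImageIO instead of OpenIMAJ. It is an illustration
under that assumption, not what the module does; the ImageLoadSketch class and
tryReadImage method are hypothetical names. It also covers the case mentioned
earlier where the mime type claims an image but the bytes cannot be decoded.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;

/** Illustrative loader using the JDK's ImageIO; names are hypothetical. */
final class ImageLoadSketch {
    private ImageLoadSketch() {}

    /** Returns the decoded image, or null when the bytes cannot be decoded. */
    static BufferedImage tryReadImage(InputStream in) {
        try {
            // ImageIO.read returns null when no registered reader
            // recognises the data.
            return ImageIO.read(in);
        } catch (IOException e) {
            // Truncated or corrupt data: treat it as "not an image".
            return null;
        }
    }
}
```

A mapper using a helper like this could treat a null result the same way as a
failed mime check, for instance by incrementing a counter and returning.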

The processing that takes place from here on depends entirely on the analysis
you require. We simply run a face detection and increase some face counters. If
you want to write to the triple store, you can use the data model to do that.
Check the Writing Module Outputs section for details on how to write the
module analysis somewhere useful.

Wiki: DataModel
Wiki: OfflineOutputs
Wiki: SampleModules
