Running a Single Offline Module

To run any modules, you will need a Hadoop installation on the machine from which you intend to launch them, configured for your cluster. If you are using a local module it will run on the machine the module is launched from, whereas all other modules will run on the cluster. Typically, you will either log in to one of the ARCOMEM machines at IMF to run jobs, or have a local cluster set up for development and testing. The following instructions assume that Hadoop is properly configured and available on your PATH. You also need an assembled ArcomemOffline.jar, which can be built by running mvn assembly:assembly in the offline-process-modules sub-project.
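
For example, from a checkout of the framework source, the jar can be assembled as follows (the exact output location depends on the project's Maven configuration, but assembled jars are normally written to the target/ directory):

    cd offline-process-modules
    mvn assembly:assembly
    # ArcomemOffline.jar should now be available under target/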

It is possible to run a single offline module using the SingleOfflineProcessRunner tool as follows:

    hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner [extra-hadoop-options] runner-options [extra-args]

The runner-options specify information about the module you want to run; these are described in more detail below. The optional extra-hadoop-options allow you to pass additional settings to the Hadoop framework. A common use of the extra-hadoop-options is to set the heap size for the module process; for example, -Dmapred.child.java.opts="-Xmx500M -Djava.awt.headless=true" will set the maximum heap size of child processes to 500 megabytes and set the JVM to headless mode. The extra-args allow additional configuration key-value pairs to be written into the configuration of the module. Modules can access these pairs programmatically. Pairs are specified on the command line in the form key=value.
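
For example, the following invocation raises the child heap size and passes an extra key-value pair; the key my.module.threshold is purely illustrative and only matters to a module that actually reads it:

    hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner \
        -Dmapred.child.java.opts="-Xmx500M -Djava.awt.headless=true" \
        -c MimeTypeStatsProcess -o /mimetypes \
        my.module.threshold=0.5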

The runner-options vary depending on the type of module being used. A common set of options flags applies to all module types:

    --output (-o) VAL                      : (Optional) Output path/table. Only required for modules that produce output.
    --process-class (-c) CLASS             : (Required) The Java class representing the module.
    --remove-existing-output (-f)          : (Optional) If existing output exists, remove it.
    --triple-store-connector (-tsc)        : (Optional) The triple store connector. Defaults to an in-memory Sesame store.
            [sesameMemory | sesameRemote | H2RDF]

Note that when specifying the module class, if the module is in the eu.arcomem.framework.offline.processes package you need only enter the name; otherwise you must enter the fully qualified classname. If you specify a triple store connector, that too has options depending on the particular connector:

    ***sesameMemory:***
    --rdf-destination [FILE | STD_OUT] : (Optional) output type. Defaults to STD_OUT.
    --rdf-out-file VAL                 : (Optional) name of the RDF output file.

    ***sesameRemote:***
    --rdfstore-url VAL : (Required) url of the remote store.

    ***H2RDF:***
    --h2rdf-table VAL : (Required) table name of the store
    --h2rdf-url VAL   : (Required) url of the store
    --h2rdf-user VAL  : (Required) username for connecting to the store
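
As an illustration, the command below attaches a remote Sesame store to a module; both the module class and the repository URL are placeholders that you would replace with your own:

    hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner \
        -c com.example.MyRdfProcess -o /rdf_output \
        -tsc sesameRemote --rdfstore-url http://rdf.example.org:8080/openrdf-sesame/repositories/arcomem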

Local modules do not have any additional options beyond these base options. Standard modules have the following additional options:

    --filter [mimetype]   : (Optional) name of the filter
    --table-name (-t) VAL : (Optional) The name of the HBase table to process.
                    Defaults to warc_contents.

The filter option allows the module to be applied to only a subset of documents (for example only those with a specific mimetype). The different filters have additional associated options:

    ***mimetype:***
    --regex (-r) VAL : (Required) regular expression to match against mimetype.
    --use-header     : (Optional) use the mimetype from the header returned by the
                       server rather than the detected mimetype.
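
Putting these options together, a standard module could, for instance, be restricted to image documents (using the server-reported mimetype) on a non-default table; the output path and table name here are only examples:

    hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner \
        -c MimeTypeStatsProcess -o /image_mimetypes \
        -t my_crawl_contents \
        --filter mimetype -r "^image/.*$" --use-header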

HBase modules require the table to be specified:

    --table-name (-t) VAL : (Required) The name of the HBase table to process.
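
For example, an HBase module (the class name below is hypothetical) must be pointed at its table explicitly:

    hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner -c com.example.MyHBaseProcess -o /hbase_output -t warc_contents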

HDFS modules require a list of locations on the HDFS from which to get their input:

    --input (-i) VAL : (Required) The location on the HDFS of the mapper input(s).
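
Similarly, an HDFS module (again with a hypothetical class and input path) reads its input from the given HDFS location:

    hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner -c com.example.MyHdfsProcess -o /hdfs_output -i /input/data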

Examples

The following command would run the MimeTypeStatsProcess module over all the documents in the default (warc_contents) table, and output the results to a directory called mimetypes in the root of the HDFS:

hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner -c MimeTypeStatsProcess -o /mimetypes

Once the command has started, it will print information about its progress towards completion on the terminal:

homer:target jsh2$ hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner -c MimeTypeStatsProcess -o /mimetypes
2012-05-16 14:49:27.141 java[46548:1903] Unable to load realm info from SCDynamicStore
12/05/16 14:49:27 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.3-1240972, built on 02/06/2012 10:48 GMT
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:host.name=homer.ecs.soton.ac.uk
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_31
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Apple Inc.
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.home=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/usr/local/hadoop-0.20.2-cdh3u3/bin/../conf:/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home/lib/tools.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/..:/usr/local/hadoop-0.20.2-cdh3u3/bin/../hadoop-core-0.20.2-cdh3u3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/ant-contrib-1.0b3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/aspectjrt-1.6.5.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/aspectjtools-1.6.5.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-cli-1.2.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-codec-1.4.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-daemon-1.0.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-el-1.0.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-httpclient-3.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-lang-2.4.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-logging-1.0.4.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-logging-api-1.0.4.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-net-1.4.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/core-3.1.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/event-publish-3.7.3-shaded.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/guava-r09-jarjar.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hadoop-capacity-scheduler-0.20.2-cdh3u1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hadoop-fairscheduler-0.20.2-cdh3u3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hadoop-lzo-0.4.15.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hbase.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hsqldb-1.8.0.10.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hue-auth-plugin-3.7.3.3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hue-plugins-1.2.0-cdh3u3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jackson-core-asl-1.0.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jackson-core-asl-1.5.2.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jackson-mapper-asl-1.0.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jackson-mapper-asl-1.5.2.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jasper-compiler-5.5.12.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jasper-runtime-5.5.12.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jets3t-0.6.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jetty-6.1.26.cloudera.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jetty-servlet-tester-6.1.26.cloudera.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jetty-util-6.1.26.cloudera.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jsch-0.1.42.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/junit-4.5.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/kfs-0.2.2.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/log4j-1.2.15.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/mockito-all-1.8.2.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/oro-2.0.8.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/servlet-api-2.5-20081211.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/servlet-api-2.5-6.1.14.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/slf4j-api-1.4.3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/slf4j-log4j12-1.4.3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/tt-instrumentation-3.7.3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/xmlenc-0.52.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/zookeeper.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jsp-2.1/jsp-2.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jsp-2.1/jsp-api-2.1.jar
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/native/Mac_OS_X-x86_64-64
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/var/folders/n0/92v5cvfj16s7kp1_jh8yq2zc0000gn/T/
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.compiler=N/A
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:os.name=Mac OS X
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:os.arch=x86_64
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:os.version=10.7.3
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:user.name=jsh2
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:user.home=/Users/jsh2
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:user.dir=/Users/jsh2/Work/arcomem/arcomem-framework/offline-analysis-modules/target
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=seurat.ecs.soton.ac.uk:2181 sessionTimeout=180000 watcher=hconnection
12/05/16 14:49:27 INFO zookeeper.ClientCnxn: Opening socket connection to server /152.78.65.162:2181
12/05/16 14:49:27 INFO client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client' could not be found. If you are not using SASL, you may ignore this. On the other hand, if you expected SASL to work, please fix your JAAS configuration.
12/05/16 14:49:27 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 46548@homer.ecs.soton.ac.uk
12/05/16 14:49:27 INFO zookeeper.ClientCnxn: Socket connection established to seurat.ecs.soton.ac.uk/152.78.65.162:2181, initiating session
12/05/16 14:49:27 WARN zookeeper.ClientCnxnSocket: Connected to an old server; r-o mode will be unavailable
12/05/16 14:49:27 INFO zookeeper.ClientCnxn: Session establishment complete on server seurat.ecs.soton.ac.uk/152.78.65.162:2181, sessionid = 0x13727c836ec1b9f, negotiated timeout = 40000
12/05/16 14:49:27 INFO client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x13727c836ec1b9f
12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Session: 0x13727c836ec1b9f closed
12/05/16 14:49:27 INFO zookeeper.ClientCnxn: EventThread shut down
Setting up compression for intermediate results..
Executing task: MimeTypeStatsProcess.
Output will be found in /mimetypes
12/05/16 14:49:30 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=seurat.ecs.soton.ac.uk:2181 sessionTimeout=180000 watcher=hconnection
12/05/16 14:49:30 INFO zookeeper.ClientCnxn: Opening socket connection to server /152.78.65.162:2181
12/05/16 14:49:30 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 46548@homer.ecs.soton.ac.uk
12/05/16 14:49:30 INFO client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client' could not be found. If you are not using SASL, you may ignore this. On the other hand, if you expected SASL to work, please fix your JAAS configuration.
12/05/16 14:49:30 INFO zookeeper.ClientCnxn: Socket connection established to seurat.ecs.soton.ac.uk/152.78.65.162:2181, initiating session
12/05/16 14:49:30 WARN zookeeper.ClientCnxnSocket: Connected to an old server; r-o mode will be unavailable
12/05/16 14:49:30 INFO zookeeper.ClientCnxn: Session establishment complete on server seurat.ecs.soton.ac.uk/152.78.65.162:2181, sessionid = 0x13727c836ec1ba0, negotiated timeout = 40000
12/05/16 14:49:31 INFO mapred.JobClient: Running job: job_201205101503_0007
12/05/16 14:49:32 INFO mapred.JobClient:  map 0% reduce 0%
12/05/16 14:49:55 INFO mapred.JobClient:  map 1% reduce 0%
12/05/16 14:49:56 INFO mapred.JobClient:  map 2% reduce 0%
12/05/16 14:49:57 INFO mapred.JobClient:  map 3% reduce 0%
12/05/16 14:49:58 INFO mapred.JobClient:  map 4% reduce 0%
12/05/16 14:50:00 INFO mapred.JobClient:  map 5% reduce 0%
12/05/16 14:50:01 INFO mapred.JobClient:  map 6% reduce 0%
12/05/16 14:50:07 INFO mapred.JobClient:  map 7% reduce 0%
12/05/16 14:50:08 INFO mapred.JobClient:  map 9% reduce 0%
12/05/16 14:50:09 INFO mapred.JobClient:  map 10% reduce 0%
12/05/16 14:50:11 INFO mapred.JobClient:  map 11% reduce 0%
12/05/16 14:50:13 INFO mapred.JobClient:  map 12% reduce 0%
12/05/16 14:50:15 INFO mapred.JobClient:  map 13% reduce 0%
12/05/16 14:50:17 INFO mapred.JobClient:  map 13% reduce 2%
12/05/16 14:50:19 INFO mapred.JobClient:  map 15% reduce 2%
12/05/16 14:50:20 INFO mapred.JobClient:  map 16% reduce 2%
12/05/16 14:50:22 INFO mapred.JobClient:  map 17% reduce 2%
12/05/16 14:50:23 INFO mapred.JobClient:  map 17% reduce 3%
12/05/16 14:50:25 INFO mapred.JobClient:  map 18% reduce 3%
12/05/16 14:50:26 INFO mapred.JobClient:  map 19% reduce 4%
12/05/16 14:50:27 INFO mapred.JobClient:  map 20% reduce 4%
12/05/16 14:50:29 INFO mapred.JobClient:  map 20% reduce 5%
12/05/16 14:50:31 INFO mapred.JobClient:  map 21% reduce 5%
12/05/16 14:50:32 INFO mapred.JobClient:  map 22% reduce 5%
12/05/16 14:50:34 INFO mapred.JobClient:  map 22% reduce 7%
12/05/16 14:50:35 INFO mapred.JobClient:  map 23% reduce 7%
12/05/16 14:50:37 INFO mapred.JobClient:  map 24% reduce 7%
12/05/16 14:50:41 INFO mapred.JobClient:  map 25% reduce 7%
12/05/16 14:50:44 INFO mapred.JobClient:  map 26% reduce 8%
12/05/16 14:50:46 INFO mapred.JobClient:  map 27% reduce 8%
12/05/16 14:50:49 INFO mapred.JobClient:  map 28% reduce 8%
12/05/16 14:50:50 INFO mapred.JobClient:  map 28% reduce 9%
12/05/16 14:50:51 INFO mapred.JobClient:  map 29% reduce 9%
12/05/16 14:50:55 INFO mapred.JobClient:  map 30% reduce 9%
12/05/16 14:50:57 INFO mapred.JobClient:  map 31% reduce 9%
12/05/16 14:50:59 INFO mapred.JobClient:  map 32% reduce 10%
12/05/16 14:51:00 INFO mapred.JobClient:  map 33% reduce 10%
12/05/16 14:51:02 INFO mapred.JobClient:  map 34% reduce 10%
12/05/16 14:51:08 INFO mapred.JobClient:  map 35% reduce 10%
12/05/16 14:51:09 INFO mapred.JobClient:  map 35% reduce 11%
12/05/16 14:51:16 INFO mapred.JobClient:  map 36% reduce 11%
12/05/16 14:51:27 INFO mapred.JobClient:  map 37% reduce 11%
12/05/16 14:51:43 INFO mapred.JobClient:  map 38% reduce 11%
12/05/16 14:51:44 INFO mapred.JobClient:  map 39% reduce 11%
12/05/16 14:51:45 INFO mapred.JobClient:  map 39% reduce 12%
12/05/16 14:51:46 INFO mapred.JobClient:  map 40% reduce 12%
12/05/16 14:51:47 INFO mapred.JobClient:  map 41% reduce 13%
12/05/16 14:51:49 INFO mapred.JobClient:  map 42% reduce 13%
12/05/16 14:51:50 INFO mapred.JobClient:  map 43% reduce 13%
12/05/16 14:51:52 INFO mapred.JobClient:  map 44% reduce 13%
12/05/16 14:51:53 INFO mapred.JobClient:  map 45% reduce 13%
12/05/16 14:51:57 INFO mapred.JobClient:  map 45% reduce 14%
12/05/16 14:52:03 INFO mapred.JobClient:  map 45% reduce 15%
12/05/16 14:52:05 INFO mapred.JobClient:  map 46% reduce 15%
12/05/16 14:52:07 INFO mapred.JobClient:  map 47% reduce 15%
12/05/16 14:52:10 INFO mapred.JobClient:  map 48% reduce 15%
12/05/16 14:52:12 INFO mapred.JobClient:  map 49% reduce 15%
12/05/16 14:52:15 INFO mapred.JobClient:  map 50% reduce 15%
12/05/16 14:52:18 INFO mapred.JobClient:  map 50% reduce 16%
12/05/16 14:52:21 INFO mapred.JobClient:  map 51% reduce 16%
12/05/16 14:52:23 INFO mapred.JobClient:  map 52% reduce 16%
12/05/16 14:52:27 INFO mapred.JobClient:  map 52% reduce 17%
12/05/16 14:52:29 INFO mapred.JobClient:  map 53% reduce 17%
12/05/16 14:52:32 INFO mapred.JobClient:  map 54% reduce 17%
12/05/16 14:52:33 INFO mapred.JobClient:  map 54% reduce 18%
12/05/16 14:52:35 INFO mapred.JobClient:  map 55% reduce 18%
12/05/16 14:52:38 INFO mapred.JobClient:  map 56% reduce 18%
12/05/16 14:52:40 INFO mapred.JobClient:  map 57% reduce 18%
12/05/16 14:52:43 INFO mapred.JobClient:  map 58% reduce 18%
12/05/16 14:52:45 INFO mapred.JobClient:  map 59% reduce 18%
12/05/16 14:52:47 INFO mapred.JobClient:  map 60% reduce 18%
12/05/16 14:52:48 INFO mapred.JobClient:  map 61% reduce 19%
12/05/16 14:52:51 INFO mapred.JobClient:  map 61% reduce 20%
12/05/16 14:52:53 INFO mapred.JobClient:  map 62% reduce 20%
12/05/16 14:52:55 INFO mapred.JobClient:  map 63% reduce 20%
12/05/16 14:52:58 INFO mapred.JobClient:  map 64% reduce 20%
12/05/16 14:53:00 INFO mapred.JobClient:  map 65% reduce 20%
12/05/16 14:53:03 INFO mapred.JobClient:  map 65% reduce 21%
12/05/16 14:53:05 INFO mapred.JobClient:  map 66% reduce 21%
12/05/16 14:53:08 INFO mapred.JobClient:  map 67% reduce 21%
12/05/16 14:53:11 INFO mapred.JobClient:  map 68% reduce 21%
12/05/16 14:53:12 INFO mapred.JobClient:  map 68% reduce 22%
12/05/16 14:53:13 INFO mapred.JobClient:  map 69% reduce 22%
12/05/16 14:53:18 INFO mapred.JobClient:  map 70% reduce 23%
12/05/16 14:53:19 INFO mapred.JobClient:  map 71% reduce 23%
12/05/16 14:53:21 INFO mapred.JobClient:  map 72% reduce 23%
12/05/16 14:53:23 INFO mapred.JobClient:  map 73% reduce 23%
12/05/16 14:53:27 INFO mapred.JobClient:  map 75% reduce 23%
12/05/16 14:53:28 INFO mapred.JobClient:  map 75% reduce 24%
12/05/16 14:53:30 INFO mapred.JobClient:  map 76% reduce 24%
12/05/16 14:53:33 INFO mapred.JobClient:  map 76% reduce 25%
12/05/16 14:53:35 INFO mapred.JobClient:  map 77% reduce 25%
12/05/16 14:53:36 INFO mapred.JobClient:  map 78% reduce 25%
12/05/16 14:53:38 INFO mapred.JobClient:  map 79% reduce 25%
12/05/16 14:53:40 INFO mapred.JobClient:  map 80% reduce 25%
12/05/16 14:53:41 INFO mapred.JobClient:  map 81% reduce 25%
12/05/16 14:53:42 INFO mapred.JobClient:  map 82% reduce 26%
12/05/16 14:53:47 INFO mapred.JobClient:  map 83% reduce 26%
12/05/16 14:53:48 INFO mapred.JobClient:  map 84% reduce 27%
12/05/16 14:53:50 INFO mapred.JobClient:  map 85% reduce 27%
12/05/16 14:53:52 INFO mapred.JobClient:  map 86% reduce 27%
12/05/16 14:53:56 INFO mapred.JobClient:  map 87% reduce 27%
12/05/16 14:53:57 INFO mapred.JobClient:  map 87% reduce 28%
12/05/16 14:54:00 INFO mapred.JobClient:  map 88% reduce 28%
12/05/16 14:54:03 INFO mapred.JobClient:  map 89% reduce 29%
12/05/16 14:54:07 INFO mapred.JobClient:  map 90% reduce 29%
12/05/16 14:54:10 INFO mapred.JobClient:  map 91% reduce 29%
12/05/16 14:54:12 INFO mapred.JobClient:  map 91% reduce 30%
12/05/16 14:54:17 INFO mapred.JobClient:  map 92% reduce 30%
12/05/16 14:54:19 INFO mapred.JobClient:  map 93% reduce 30%
12/05/16 14:54:20 INFO mapred.JobClient:  map 94% reduce 30%
12/05/16 14:54:24 INFO mapred.JobClient:  map 95% reduce 30%
12/05/16 14:54:28 INFO mapred.JobClient:  map 95% reduce 31%
12/05/16 14:54:30 INFO mapred.JobClient:  map 96% reduce 31%
12/05/16 14:54:33 INFO mapred.JobClient:  map 96% reduce 32%
12/05/16 14:54:35 INFO mapred.JobClient:  map 97% reduce 32%
12/05/16 14:54:50 INFO mapred.JobClient:  map 98% reduce 32%
12/05/16 14:55:00 INFO mapred.JobClient:  map 99% reduce 32%
12/05/16 14:55:03 INFO mapred.JobClient:  map 99% reduce 33%
12/05/16 14:55:55 INFO mapred.JobClient:  map 100% reduce 33%
12/05/16 14:56:01 INFO mapred.JobClient:  map 100% reduce 100%
12/05/16 14:56:02 INFO mapred.JobClient: Job complete: job_201205101503_0007
12/05/16 14:56:02 INFO mapred.JobClient: Counters: 27
12/05/16 14:56:02 INFO mapred.JobClient:   Job Counters
12/05/16 14:56:02 INFO mapred.JobClient:     Launched reduce tasks=1
12/05/16 14:56:02 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=19618198
12/05/16 14:56:02 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/05/16 14:56:02 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/05/16 14:56:02 INFO mapred.JobClient:     Rack-local map tasks=130
12/05/16 14:56:02 INFO mapred.JobClient:     Launched map tasks=539
12/05/16 14:56:02 INFO mapred.JobClient:     Data-local map tasks=409
12/05/16 14:56:02 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=361163
12/05/16 14:56:02 INFO mapred.JobClient:   FileSystemCounters
12/05/16 14:56:02 INFO mapred.JobClient:     FILE_BYTES_READ=35224
12/05/16 14:56:02 INFO mapred.JobClient:     HDFS_BYTES_READ=169501
12/05/16 14:56:02 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=33915785
12/05/16 14:56:02 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=4860
12/05/16 14:56:02 INFO mapred.JobClient:   Map-Reduce Framework
12/05/16 14:56:02 INFO mapred.JobClient:     Map input records=8757931
12/05/16 14:56:02 INFO mapred.JobClient:     Reduce shuffle bytes=148813
12/05/16 14:56:02 INFO mapred.JobClient:     Spilled Records=16522
12/05/16 14:56:02 INFO mapred.JobClient:     Map output bytes=148765748
12/05/16 14:56:02 INFO mapred.JobClient:     CPU time spent (ms)=8271190
12/05/16 14:56:02 INFO mapred.JobClient:     Total committed heap usage (bytes)=103280607232
12/05/16 14:56:02 INFO mapred.JobClient:     Combine input records=5812783
12/05/16 14:56:02 INFO mapred.JobClient:     SPLIT_RAW_BYTES=169501
12/05/16 14:56:02 INFO mapred.JobClient:     Reduce input records=8261
12/05/16 14:56:02 INFO mapred.JobClient:     Reduce input groups=151
12/05/16 14:56:02 INFO mapred.JobClient:     Combine output records=8261
12/05/16 14:56:02 INFO mapred.JobClient:     Physical memory (bytes) snapshot=139026960384
12/05/16 14:56:02 INFO mapred.JobClient:     Reduce output records=151
12/05/16 14:56:02 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=354349391872
12/05/16 14:56:02 INFO mapred.JobClient:     Map output records=5812783

Once the command has finished, you can look at the resulting mimetype distribution:

hadoop dfs -cat /mimetypes/part-r-00000

On the financial crisis crawl, this produces something like the following:

homer:trunk jsh2$ hadoop dfs -cat /mimetypes/part-r-00000
2012-05-16 15:14:35.659 java[46907:1903] Unable to load realm info from SCDynamicStore
12/05/16 15:14:35 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
2011    2
;?php bloginfo('html_type'); ?>; charset=utf-8  84
=iso-8859-1 1
application/atom+xml    143
application/java-vm 63
application/msword  530
application/octet-stream    21
application/ogg 23
application/pdf 21438
application/rdf+xml 1217
application/rss+xml 303938
application/rtf 8
application/vnd.ms-excel    14
application/vnd.ms-powerpoint   35
application/vnd.oasis.opendocument.text 1
application/vnd.openxmlformats-officedocument.presentationml.presentation   5
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet   3
application/vnd.openxmlformats-officedocument.wordprocessingml.document    3
application/vnd.rn-realmedia    7
application/x-123   105
application/x-bibtex-text-file  1
application/x-bzip  1
application/x-elc   468
application/x-font-ttf  1149
application/x-font-type1    11
application/x-gzip  937
application/x-rar-compressed    2
application/x-shockwave-flash   44097
application/x-tar   9
application/x-tika-msoffice 2
application/xhtml+xml   227881
application/xhtml+xml charset=shift_jis 1
application/xhtml+xml; charset='utf-8'  4
application/xhtml+xml; charset=iso-8859-1   517
application/xhtml+xml; charset=shift_jis    4601
application/xhtml+xml; charset=utf-8    1656
application/xhtml; charset=utf-8    3
application/xml 50504
application/xslt+xml    28
application/zip 88
audio/midi  4
audio/mpeg  7875
audio/ogg   38
audio/x-ms-wma  4
audio/x-wav 46
deutsche welle blogs; charset=utf-8 11
employment, litigation, discrimination, sexual harrasment, contracts, torts, unpaid wages, overtime federal, civil service, wrongful termination, civil rights, fair labor, standards severence 2
generation change; charset=utf-8    44
image/gif   537833
image/jpeg  1369888
image/png   435659
image/svg+xml   926
image/tiff  59
image/x-icon    7032
image/x-ms-bmp  1175
image/x-xcf 1
message/news    2
message/rfc822  2
noindex, nofollow   1
pieter wagemans,flower painting,flower paintings,oil painting,bloemen schilderij,flowers,   1
tect/html charset=utf-8 1
text-html; charset=iso-8859-1   1
text-html; charset=utf-8    16
text-html; charset=windows-1252 20
text.php; charset=utf-8 1
text/ht ml;charset=utf-8    2
text/html   341621
text/html charset=utf-8 96
text/html, charset=iso-8859-1   1
text/html, utf-8    1
text/html/  5
text/html; <link rel=   2
text/html; >charset=iso-8859-1  6
text/html; _iso=    1
text/html; charset= 11
text/html; charset="'                    + this.input_enc                    + '"   2
text/html; charset="\"'                    + this.input_enc                    + '\""   1
text/html; charset="\"+this.dataencoding+\""    2
text/html; charset="\"\\\"+this.dataencoding+\\\"\""    1
text/html; charset="\"iso 8859-1\"" 1
text/html; charset="\"utf-830 days to save 300 acres. chip in before it's too late\""   1
text/html; charset="\[\(modx_charset\)\]"   3
text/html; charset="charset\=utf-8" 1
text/html; charset="iso 8859-1" 2
text/html; charset="iso-10646\/unicode" 1
text/html; charset="utf-830 days to save 300 acres. chip in before it's too late"   1
text/html; charset='+this.request_params.encoding+' 1
text/html; charset='windows-1252'   1
text/html; charset=8859-1   3
text/html; charset=__encoding__ 13
text/html; charset=ansi 2
text/html; charset=ansi_x3.110-1983 2
text/html; charset=big5 15
text/html; charset=cp1251   1
text/html; charset=euc  2
text/html; charset=euc-jp   19126
text/html; charset=euc-kr   7
text/html; charset=gb2312   207
text/html; charset=gb_2312-80   3
text/html; charset=gbk  48
text/html; charset=iso-2022-jp  3
text/html; charset=iso-8859 1
text/html; charset=iso-8859-1   156099
text/html; charset=iso-8859-13  7
text/html; charset=iso-8859-15  31484
text/html; charset=iso-8859-16  1
text/html; charset=iso-8859-2   38
text/html; charset=iso-8859-7   4384
text/html; charset=iso-8859-8   6
text/html; charset=iso-8859-9   33
text/html; charset=iso8859-1    189
text/html; charset=iso_8859-1   1
text/html; charset=koi8-r   1
text/html; charset=latin-1  1
text/html; charset=latin1   11
text/html; charset=latin9   1
text/html; charset=macintosh    3
text/html; charset=shift-jis    94
text/html; charset=shift_jis    4472
text/html; charset=unicode  1
text/html; charset=us-ascii 111
text/html; charset=utf-8    2033264
text/html; charset=utf8 38
text/html; charset=windows-1250 27
text/html; charset=windows-1251 212
text/html; charset=windows-1252 1538
text/html; charset=windows-1253 2
text/html; charset=windows-1254 10
text/html; charset=windows-1255 3868
text/html; charset=windows-1256 64
text/html; charset=windows-874  1
text/html; charset=x-mac-greek  1
text/html; charset={charset}    74
text/html; en=  1
text/html; iso-8859-1=  1878
text/html; utf-8=   120
text/htmlcharset=iso-8859-1 1
text/javascript; charset=utf-8  1
text/plain  174627
text/xhtml; charset=iso-8859-1  24
text/xhtml; charset=utf-8   16230
text/xml; charset=utf-8 2
text\/html; charset=utf-8   1
text\html; charset=iso-8859-1   1
utf-8   12
video/mpeg  2
video/quicktime 2055
video/x-flv 128
video/x-ms-asf  10
video/x-ms-wmv  201
video/x-msvideo 13

We could also run the previous module with a mimetype filter specified:

hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner -c MimeTypeStatsProcess -o /mimetypes --filter mimetype -r "^video/.*$"

This results in a much shorter output file:

homer:trunk jsh2$ hadoop dfs -cat /mimetypes/part-r-00000
2012-05-16 15:26:43.096 java[47058:1903] Unable to load realm info from SCDynamicStore
12/05/16 15:26:43 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
video/mpeg  2
video/quicktime 2055
video/x-flv 128
video/x-ms-asf  10
video/x-ms-wmv  201
video/x-msvideo 13

Related

Wiki: HadoopHBase
