To run any module, you need a Hadoop installation, configured for your cluster, on the machine from which you intend to launch the modules. A local module runs on the machine it is launched from, whereas all other modules run on the cluster. Typically, you will either log in to one of the ARCOMEM machines at IMF to run jobs, or have a local cluster set up for development and testing. The following instructions assume that Hadoop is properly configured and available on your PATH. You also need an assembled version of ArcomemOffline.jar, which can be built by running mvn assembly:assembly in the offline-process-modules sub-project.
It is possible to run a single offline module using the SingleOfflineProcessRunner tool as follows:
hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner [extra-hadoop-options] runner-options [extra-args]
The runner-options specify information about the module you want to run; these are described in more detail below. The optional extra-hadoop-options pass additional information to the Hadoop framework. A common use of the extra-hadoop-options is to set the heap size for the module process; for example, -Dmapred.child.java.opts="-Xmx500M -Djava.awt.headless=true" sets the maximum heap size of child processes to 500 megabytes and puts the JVM in headless mode. The extra-args allow additional configuration key-value pairs to be written into the configuration of the module. Modules can access these pairs programmatically. Pairs are specified on the command line in the form key=value.
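The argument order described above can be sketched in Python as follows. This is only an illustration, not part of the ARCOMEM tooling: the helper `build_command`, its parameters, and the pair `example.key=example-value` are hypothetical, and the sketch assembles the argument list without invoking Hadoop. The `-c` and `-o` flags are the ones listed in the runner-options in this section.

```python
import shlex


def build_command(process_class, output=None, hadoop_opts=(), extra_args=None):
    """Assemble the argument list for SingleOfflineProcessRunner.

    The order follows the synopsis: extra-hadoop-options, then
    runner-options, then extra-args written as key=value pairs.
    The function name and parameters are hypothetical; only the
    command-line layout comes from the documentation.
    """
    cmd = ["hadoop", "jar", "ArcomemOffline.jar", "SingleOfflineProcessRunner"]
    cmd.extend(hadoop_opts)               # e.g. -Dmapred.child.java.opts=...
    cmd.extend(["-c", process_class])     # required: the module class
    if output is not None:
        cmd.extend(["-o", output])        # only for modules that produce output
    for key, value in (extra_args or {}).items():
        cmd.append(f"{key}={value}")      # extra-args take the form key=value
    return cmd


cmd = build_command(
    "MimeTypeStatsProcess",
    output="/mimetypes",
    # One list element per argument, so no shell quoting is needed here.
    hadoop_opts=["-Dmapred.child.java.opts=-Xmx500M -Djava.awt.headless=true"],
    extra_args={"example.key": "example-value"},  # hypothetical pair
)
# shlex.join adds the quoting a shell would need for the -D argument.
print(shlex.join(cmd))
```

Keeping the command as a list makes the role of the quotes in the heap-size example visible: the whole `-D...` option, including the space, must reach Hadoop as a single argument.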
The runner-options vary depending on the type of module being used. A common set of option flags applies to all module types:
--output (-o) VAL : (Optional) Output path/table. Only required for modules that produce output.
--process-class (-c) CLASS : (Required) The Java class representing the module.
--remove-existing-output (-f) : (Optional) If existing output exists, remove it.
--triple-store-connector (-tsc) : (Optional) The triple store connector. Defaults to an in-memory Sesame store. [sesameMemory | sesameRemote | H2RDF]
Note that when specifying the module class, if the module is in the eu.arcomem.framework.offline.processes package you need only enter the class name; otherwise you must enter the fully qualified class name. If you specify a triple store connector, that too has options depending on the particular connector:
***sesameMemory:***
--rdf-destination [FILE | STD_OUT] : (Optional) output type. Defaults to STDOUT.
--rdf-out-file VAL : (Optional) name of rdf output file
***sesameRemote:***
--rdfstore-url VAL : (Required) url of the remote store.
***H2RDF:***
--h2rdf-table VAL : (Required) table name of the store
--h2rdf-url VAL : (Required) url of the store
--h2rdf-user VAL : (Required) username for connecting to the store
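As a rough illustration of the connector options listed above, the following Python sketch records which options are marked (Required) for each connector and reports any that are missing from an argument list. The helper `missing_connector_options` and the table name `triples` are hypothetical; the real runner performs its own argument checking, whose behaviour is not described here.

```python
# Options marked (Required) for each triple store connector, as listed above.
REQUIRED_CONNECTOR_OPTIONS = {
    "sesameMemory": [],
    "sesameRemote": ["--rdfstore-url"],
    "H2RDF": ["--h2rdf-table", "--h2rdf-url", "--h2rdf-user"],
}


def missing_connector_options(connector, args):
    """Return the required options for `connector` that are absent from `args`.

    Hypothetical pre-flight check; it only looks for the long option names.
    """
    if connector not in REQUIRED_CONNECTOR_OPTIONS:
        raise ValueError(f"unknown connector: {connector}")
    return [opt for opt in REQUIRED_CONNECTOR_OPTIONS[connector] if opt not in args]


# "triples" is a made-up table name used only for this example.
args = ["-c", "MimeTypeStatsProcess", "-tsc", "H2RDF", "--h2rdf-table", "triples"]
print(missing_connector_options("H2RDF", args))
```

A check like this can catch an incomplete command line before a job is submitted to the cluster.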
Local modules do not have any additional options beyond these base options. Standard modules have the following additional options:
--filter [mimetype] : (Optional) name of the filter
--table-name (-t) VAL : (Optional) The name of the HBase table to process. Defaults to warc_contents.
The filter option allows the module to be applied to only a subset of documents (for example only those with a specific mimetype). The different filters have additional associated options:
***mimetype:***
--regex (-r) VAL : (Required) regular expression to match against mimetype.
--use-header : (Optional) use the mimetype from the header returned by the server rather than the detected mimetype.
HBase modules require the table to be specified:
--table-name (-t) VAL : (Required) The name of the HBase table to process
HDFS modules require a list of locations on the HDFS from which to get their input:
--input (-i) VAL : (Required) The location on the hdfs of the mapper input(s)
The following command would run the MimeTypeStatsProcess module over all the documents in the default (warc_contents) table, and output the results to a directory called mimetypes in the root of the HDFS:
hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner -c MimeTypeStatsProcess -o /mimetypes
Once the command has started, it will print information about its progress towards completion on the terminal:
homer:target jsh2$ hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner -c MimeTypeStatsProcess -o /mimetypes 2012-05-16 14:49:27.141 java[46548:1903] Unable to load realm info from SCDynamicStore 12/05/16 14:49:27 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing. 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.3-1240972, built on 02/06/2012 10:48 GMT 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:host.name=homer.ecs.soton.ac.uk 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_31 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Apple Inc. 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.home=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/usr/local/hadoop-0.20.2-cdh3u3/bin/../conf:/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home/lib/tools.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/..:/usr/local/hadoop-0.20.2-cdh3u3/bin/../hadoop-core-0.20.2-cdh3u3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/ant-contrib-1.0b3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/aspectjrt-1.6.5.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/aspectjtools-1.6.5.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-cli-1.2.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-codec-1.4.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-daemon-1.0.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-el-1.0.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-httpclient-3.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-lang-2.4.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-logging-1.0.4.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-logging-api-1.0.4.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/commons-net-1.4.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/core-
3.1.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/event-publish-3.7.3-shaded.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/guava-r09-jarjar.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hadoop-capacity-scheduler-0.20.2-cdh3u1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hadoop-fairscheduler-0.20.2-cdh3u3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hadoop-lzo-0.4.15.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hbase.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hsqldb-1.8.0.10.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hue-auth-plugin-3.7.3.3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/hue-plugins-1.2.0-cdh3u3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jackson-core-asl-1.0.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jackson-core-asl-1.5.2.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jackson-mapper-asl-1.0.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jackson-mapper-asl-1.5.2.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jasper-compiler-5.5.12.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jasper-runtime-5.5.12.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jets3t-0.6.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jetty-6.1.26.cloudera.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jetty-servlet-tester-6.1.26.cloudera.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jetty-util-6.1.26.cloudera.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jsch-0.1.42.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/junit-4.5.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/kfs-0.2.2.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/log4j-1.2.15.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/mockito-all-1.8.2.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/oro-2.0.8.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/servlet-api-2.5-20081211.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/servlet-api-2.5-6.1.14.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/slf4j-api-1.4.3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/slf4j-log4j12-1.4.3.jar:/usr/local/hadoo
p-0.20.2-cdh3u3/bin/../lib/tt-instrumentation-3.7.3.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/xmlenc-0.52.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/zookeeper.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jsp-2.1/jsp-2.1.jar:/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/jsp-2.1/jsp-api-2.1.jar 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/usr/local/hadoop-0.20.2-cdh3u3/bin/../lib/native/Mac_OS_X-x86_64-64 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/var/folders/n0/92v5cvfj16s7kp1_jh8yq2zc0000gn/T/ 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:java.compiler=N/A 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:os.name=Mac OS X 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:os.arch=x86_64 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:os.version=10.7.3 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:user.name=jsh2 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:user.home=/Users/jsh2 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Client environment:user.dir=/Users/jsh2/Work/arcomem/arcomem-framework/offline-analysis-modules/target 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=seurat.ecs.soton.ac.uk:2181 sessionTimeout=180000 watcher=hconnection 12/05/16 14:49:27 INFO zookeeper.ClientCnxn: Opening socket connection to server /152.78.65.162:2181 12/05/16 14:49:27 INFO client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client' could not be found. If you are not using SASL, you may ignore this. On the other hand, if you expected SASL to work, please fix your JAAS configuration. 
12/05/16 14:49:27 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 46548@homer.ecs.soton.ac.uk 12/05/16 14:49:27 INFO zookeeper.ClientCnxn: Socket connection established to seurat.ecs.soton.ac.uk/152.78.65.162:2181, initiating session 12/05/16 14:49:27 WARN zookeeper.ClientCnxnSocket: Connected to an old server; r-o mode will be unavailable 12/05/16 14:49:27 INFO zookeeper.ClientCnxn: Session establishment complete on server seurat.ecs.soton.ac.uk/152.78.65.162:2181, sessionid = 0x13727c836ec1b9f, negotiated timeout = 40000 12/05/16 14:49:27 INFO client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x13727c836ec1b9f 12/05/16 14:49:27 INFO zookeeper.ZooKeeper: Session: 0x13727c836ec1b9f closed 12/05/16 14:49:27 INFO zookeeper.ClientCnxn: EventThread shut down Setting up compression for intermediate results.. Executing task: MimeTypeStatsProcess. Output will be found in /mimetypes 12/05/16 14:49:30 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=seurat.ecs.soton.ac.uk:2181 sessionTimeout=180000 watcher=hconnection 12/05/16 14:49:30 INFO zookeeper.ClientCnxn: Opening socket connection to server /152.78.65.162:2181 12/05/16 14:49:30 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 46548@homer.ecs.soton.ac.uk 12/05/16 14:49:30 INFO client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client' could not be found. If you are not using SASL, you may ignore this. On the other hand, if you expected SASL to work, please fix your JAAS configuration. 
12/05/16 14:49:30 INFO zookeeper.ClientCnxn: Socket connection established to seurat.ecs.soton.ac.uk/152.78.65.162:2181, initiating session 12/05/16 14:49:30 WARN zookeeper.ClientCnxnSocket: Connected to an old server; r-o mode will be unavailable 12/05/16 14:49:30 INFO zookeeper.ClientCnxn: Session establishment complete on server seurat.ecs.soton.ac.uk/152.78.65.162:2181, sessionid = 0x13727c836ec1ba0, negotiated timeout = 40000 12/05/16 14:49:31 INFO mapred.JobClient: Running job: job_201205101503_0007 12/05/16 14:49:32 INFO mapred.JobClient: map 0% reduce 0% 12/05/16 14:49:55 INFO mapred.JobClient: map 1% reduce 0% 12/05/16 14:49:56 INFO mapred.JobClient: map 2% reduce 0% 12/05/16 14:49:57 INFO mapred.JobClient: map 3% reduce 0% 12/05/16 14:49:58 INFO mapred.JobClient: map 4% reduce 0% 12/05/16 14:50:00 INFO mapred.JobClient: map 5% reduce 0% 12/05/16 14:50:01 INFO mapred.JobClient: map 6% reduce 0% 12/05/16 14:50:07 INFO mapred.JobClient: map 7% reduce 0% 12/05/16 14:50:08 INFO mapred.JobClient: map 9% reduce 0% 12/05/16 14:50:09 INFO mapred.JobClient: map 10% reduce 0% 12/05/16 14:50:11 INFO mapred.JobClient: map 11% reduce 0% 12/05/16 14:50:13 INFO mapred.JobClient: map 12% reduce 0% 12/05/16 14:50:15 INFO mapred.JobClient: map 13% reduce 0% 12/05/16 14:50:17 INFO mapred.JobClient: map 13% reduce 2% 12/05/16 14:50:19 INFO mapred.JobClient: map 15% reduce 2% 12/05/16 14:50:20 INFO mapred.JobClient: map 16% reduce 2% 12/05/16 14:50:22 INFO mapred.JobClient: map 17% reduce 2% 12/05/16 14:50:23 INFO mapred.JobClient: map 17% reduce 3% 12/05/16 14:50:25 INFO mapred.JobClient: map 18% reduce 3% 12/05/16 14:50:26 INFO mapred.JobClient: map 19% reduce 4% 12/05/16 14:50:27 INFO mapred.JobClient: map 20% reduce 4% 12/05/16 14:50:29 INFO mapred.JobClient: map 20% reduce 5% 12/05/16 14:50:31 INFO mapred.JobClient: map 21% reduce 5% 12/05/16 14:50:32 INFO mapred.JobClient: map 22% reduce 5% 12/05/16 14:50:34 INFO mapred.JobClient: map 22% reduce 7% 12/05/16 14:50:35 INFO 
mapred.JobClient: map 23% reduce 7% 12/05/16 14:50:37 INFO mapred.JobClient: map 24% reduce 7% 12/05/16 14:50:41 INFO mapred.JobClient: map 25% reduce 7% 12/05/16 14:50:44 INFO mapred.JobClient: map 26% reduce 8% 12/05/16 14:50:46 INFO mapred.JobClient: map 27% reduce 8% 12/05/16 14:50:49 INFO mapred.JobClient: map 28% reduce 8% 12/05/16 14:50:50 INFO mapred.JobClient: map 28% reduce 9% 12/05/16 14:50:51 INFO mapred.JobClient: map 29% reduce 9% 12/05/16 14:50:55 INFO mapred.JobClient: map 30% reduce 9% 12/05/16 14:50:57 INFO mapred.JobClient: map 31% reduce 9% 12/05/16 14:50:59 INFO mapred.JobClient: map 32% reduce 10% 12/05/16 14:51:00 INFO mapred.JobClient: map 33% reduce 10% 12/05/16 14:51:02 INFO mapred.JobClient: map 34% reduce 10% 12/05/16 14:51:08 INFO mapred.JobClient: map 35% reduce 10% 12/05/16 14:51:09 INFO mapred.JobClient: map 35% reduce 11% 12/05/16 14:51:16 INFO mapred.JobClient: map 36% reduce 11% 12/05/16 14:51:27 INFO mapred.JobClient: map 37% reduce 11% 12/05/16 14:51:43 INFO mapred.JobClient: map 38% reduce 11% 12/05/16 14:51:44 INFO mapred.JobClient: map 39% reduce 11% 12/05/16 14:51:45 INFO mapred.JobClient: map 39% reduce 12% 12/05/16 14:51:46 INFO mapred.JobClient: map 40% reduce 12% 12/05/16 14:51:47 INFO mapred.JobClient: map 41% reduce 13% 12/05/16 14:51:49 INFO mapred.JobClient: map 42% reduce 13% 12/05/16 14:51:50 INFO mapred.JobClient: map 43% reduce 13% 12/05/16 14:51:52 INFO mapred.JobClient: map 44% reduce 13% 12/05/16 14:51:53 INFO mapred.JobClient: map 45% reduce 13% 12/05/16 14:51:57 INFO mapred.JobClient: map 45% reduce 14% 12/05/16 14:52:03 INFO mapred.JobClient: map 45% reduce 15% 12/05/16 14:52:05 INFO mapred.JobClient: map 46% reduce 15% 12/05/16 14:52:07 INFO mapred.JobClient: map 47% reduce 15% 12/05/16 14:52:10 INFO mapred.JobClient: map 48% reduce 15% 12/05/16 14:52:12 INFO mapred.JobClient: map 49% reduce 15% 12/05/16 14:52:15 INFO mapred.JobClient: map 50% reduce 15% 12/05/16 14:52:18 INFO mapred.JobClient: map 50% 
reduce 16% 12/05/16 14:52:21 INFO mapred.JobClient: map 51% reduce 16% 12/05/16 14:52:23 INFO mapred.JobClient: map 52% reduce 16% 12/05/16 14:52:27 INFO mapred.JobClient: map 52% reduce 17% 12/05/16 14:52:29 INFO mapred.JobClient: map 53% reduce 17% 12/05/16 14:52:32 INFO mapred.JobClient: map 54% reduce 17% 12/05/16 14:52:33 INFO mapred.JobClient: map 54% reduce 18% 12/05/16 14:52:35 INFO mapred.JobClient: map 55% reduce 18% 12/05/16 14:52:38 INFO mapred.JobClient: map 56% reduce 18% 12/05/16 14:52:40 INFO mapred.JobClient: map 57% reduce 18% 12/05/16 14:52:43 INFO mapred.JobClient: map 58% reduce 18% 12/05/16 14:52:45 INFO mapred.JobClient: map 59% reduce 18% 12/05/16 14:52:47 INFO mapred.JobClient: map 60% reduce 18% 12/05/16 14:52:48 INFO mapred.JobClient: map 61% reduce 19% 12/05/16 14:52:51 INFO mapred.JobClient: map 61% reduce 20% 12/05/16 14:52:53 INFO mapred.JobClient: map 62% reduce 20% 12/05/16 14:52:55 INFO mapred.JobClient: map 63% reduce 20% 12/05/16 14:52:58 INFO mapred.JobClient: map 64% reduce 20% 12/05/16 14:53:00 INFO mapred.JobClient: map 65% reduce 20% 12/05/16 14:53:03 INFO mapred.JobClient: map 65% reduce 21% 12/05/16 14:53:05 INFO mapred.JobClient: map 66% reduce 21% 12/05/16 14:53:08 INFO mapred.JobClient: map 67% reduce 21% 12/05/16 14:53:11 INFO mapred.JobClient: map 68% reduce 21% 12/05/16 14:53:12 INFO mapred.JobClient: map 68% reduce 22% 12/05/16 14:53:13 INFO mapred.JobClient: map 69% reduce 22% 12/05/16 14:53:18 INFO mapred.JobClient: map 70% reduce 23% 12/05/16 14:53:19 INFO mapred.JobClient: map 71% reduce 23% 12/05/16 14:53:21 INFO mapred.JobClient: map 72% reduce 23% 12/05/16 14:53:23 INFO mapred.JobClient: map 73% reduce 23% 12/05/16 14:53:27 INFO mapred.JobClient: map 75% reduce 23% 12/05/16 14:53:28 INFO mapred.JobClient: map 75% reduce 24% 12/05/16 14:53:30 INFO mapred.JobClient: map 76% reduce 24% 12/05/16 14:53:33 INFO mapred.JobClient: map 76% reduce 25% 12/05/16 14:53:35 INFO mapred.JobClient: map 77% reduce 25% 12/05/16 
14:53:36 INFO mapred.JobClient: map 78% reduce 25% 12/05/16 14:53:38 INFO mapred.JobClient: map 79% reduce 25% 12/05/16 14:53:40 INFO mapred.JobClient: map 80% reduce 25% 12/05/16 14:53:41 INFO mapred.JobClient: map 81% reduce 25% 12/05/16 14:53:42 INFO mapred.JobClient: map 82% reduce 26% 12/05/16 14:53:47 INFO mapred.JobClient: map 83% reduce 26% 12/05/16 14:53:48 INFO mapred.JobClient: map 84% reduce 27% 12/05/16 14:53:50 INFO mapred.JobClient: map 85% reduce 27% 12/05/16 14:53:52 INFO mapred.JobClient: map 86% reduce 27% 12/05/16 14:53:56 INFO mapred.JobClient: map 87% reduce 27% 12/05/16 14:53:57 INFO mapred.JobClient: map 87% reduce 28% 12/05/16 14:54:00 INFO mapred.JobClient: map 88% reduce 28% 12/05/16 14:54:03 INFO mapred.JobClient: map 89% reduce 29% 12/05/16 14:54:07 INFO mapred.JobClient: map 90% reduce 29% 12/05/16 14:54:10 INFO mapred.JobClient: map 91% reduce 29% 12/05/16 14:54:12 INFO mapred.JobClient: map 91% reduce 30% 12/05/16 14:54:17 INFO mapred.JobClient: map 92% reduce 30% 12/05/16 14:54:19 INFO mapred.JobClient: map 93% reduce 30% 12/05/16 14:54:20 INFO mapred.JobClient: map 94% reduce 30% 12/05/16 14:54:24 INFO mapred.JobClient: map 95% reduce 30% 12/05/16 14:54:28 INFO mapred.JobClient: map 95% reduce 31% 12/05/16 14:54:30 INFO mapred.JobClient: map 96% reduce 31% 12/05/16 14:54:33 INFO mapred.JobClient: map 96% reduce 32% 12/05/16 14:54:35 INFO mapred.JobClient: map 97% reduce 32% 12/05/16 14:54:50 INFO mapred.JobClient: map 98% reduce 32% 12/05/16 14:55:00 INFO mapred.JobClient: map 99% reduce 32% 12/05/16 14:55:03 INFO mapred.JobClient: map 99% reduce 33% 12/05/16 14:55:55 INFO mapred.JobClient: map 100% reduce 33% 12/05/16 14:56:01 INFO mapred.JobClient: map 100% reduce 100% 12/05/16 14:56:02 INFO mapred.JobClient: Job complete: job_201205101503_0007 12/05/16 14:56:02 INFO mapred.JobClient: Counters: 27 12/05/16 14:56:02 INFO mapred.JobClient: Job Counters 12/05/16 14:56:02 INFO mapred.JobClient: Launched reduce tasks=1 12/05/16 
14:56:02 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=19618198 12/05/16 14:56:02 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/05/16 14:56:02 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/05/16 14:56:02 INFO mapred.JobClient: Rack-local map tasks=130 12/05/16 14:56:02 INFO mapred.JobClient: Launched map tasks=539 12/05/16 14:56:02 INFO mapred.JobClient: Data-local map tasks=409 12/05/16 14:56:02 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=361163 12/05/16 14:56:02 INFO mapred.JobClient: FileSystemCounters 12/05/16 14:56:02 INFO mapred.JobClient: FILE_BYTES_READ=35224 12/05/16 14:56:02 INFO mapred.JobClient: HDFS_BYTES_READ=169501 12/05/16 14:56:02 INFO mapred.JobClient: FILE_BYTES_WRITTEN=33915785 12/05/16 14:56:02 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=4860 12/05/16 14:56:02 INFO mapred.JobClient: Map-Reduce Framework 12/05/16 14:56:02 INFO mapred.JobClient: Map input records=8757931 12/05/16 14:56:02 INFO mapred.JobClient: Reduce shuffle bytes=148813 12/05/16 14:56:02 INFO mapred.JobClient: Spilled Records=16522 12/05/16 14:56:02 INFO mapred.JobClient: Map output bytes=148765748 12/05/16 14:56:02 INFO mapred.JobClient: CPU time spent (ms)=8271190 12/05/16 14:56:02 INFO mapred.JobClient: Total committed heap usage (bytes)=103280607232 12/05/16 14:56:02 INFO mapred.JobClient: Combine input records=5812783 12/05/16 14:56:02 INFO mapred.JobClient: SPLIT_RAW_BYTES=169501 12/05/16 14:56:02 INFO mapred.JobClient: Reduce input records=8261 12/05/16 14:56:02 INFO mapred.JobClient: Reduce input groups=151 12/05/16 14:56:02 INFO mapred.JobClient: Combine output records=8261 12/05/16 14:56:02 INFO mapred.JobClient: Physical memory (bytes) snapshot=139026960384 12/05/16 14:56:02 INFO mapred.JobClient: Reduce output records=151 12/05/16 14:56:02 INFO mapred.JobClient: Virtual memory (bytes) snapshot=354349391872 12/05/16 14:56:02 INFO mapred.JobClient: Map output records=5812783
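If the progress output needs to be monitored by a script rather than read by eye, the map and reduce percentages can be pulled out of the JobClient lines. The sketch below assumes the line layout shown in the log above; `parse_progress` is a hypothetical helper, not part of Hadoop or ARCOMEM.

```python
import re

# Matches the progress part of a JobClient line such as
# "12/05/16 14:50:17 INFO mapred.JobClient: map 13% reduce 2%".
# \s+ tolerates one or more spaces after the colon.
PROGRESS = re.compile(r"mapred\.JobClient:\s+map (\d+)% reduce (\d+)%")


def parse_progress(line):
    """Return (map_percent, reduce_percent), or None for non-progress lines."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = [
    "12/05/16 14:49:31 INFO mapred.JobClient: Running job: job_201205101503_0007",
    "12/05/16 14:50:17 INFO mapred.JobClient: map 13% reduce 2%",
    "12/05/16 14:56:01 INFO mapred.JobClient: map 100% reduce 100%",
]
for line in sample:
    print(parse_progress(line))
```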
Once the command has finished, you could look at the resulting mime distribution:
hadoop dfs -cat /mimetypes/part-r-00000
On the financial crisis crawl, this would produce something like this:
homer:trunk jsh2$ hadoop dfs -cat /mimetypes/part-r-00000
2012-05-16 15:14:35.659 java[46907:1903] Unable to load realm info from SCDynamicStore
12/05/16 15:14:35 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
2011 2
;?php bloginfo('html_type'); ?>; charset=utf-8 84
=iso-8859-1 1
application/atom+xml 143
application/java-vm 63
application/msword 530
application/octet-stream 21
application/ogg 23
application/pdf 21438
application/rdf+xml 1217
application/rss+xml 303938
application/rtf 8
application/vnd.ms-excel 14
application/vnd.ms-powerpoint 35
application/vnd.oasis.opendocument.text 1
application/vnd.openxmlformats-officedocument.presentationml.presentation 5
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 3
application/vnd.openxmlformats-officedocument.wordprocessingml.document 3
application/vnd.rn-realmedia 7
application/x-123 105
application/x-bibtex-text-file 1
application/x-bzip 1
application/x-elc 468
application/x-font-ttf 1149
application/x-font-type1 11
application/x-gzip 937
application/x-rar-compressed 2
application/x-shockwave-flash 44097
application/x-tar 9
application/x-tika-msoffice 2
application/xhtml+xml 227881
application/xhtml+xml charset=shift_jis 1
application/xhtml+xml; charset='utf-8' 4
application/xhtml+xml; charset=iso-8859-1 517
application/xhtml+xml; charset=shift_jis 4601
application/xhtml+xml; charset=utf-8 1656
application/xhtml; charset=utf-8 3
application/xml 50504
application/xslt+xml 28
application/zip 88
audio/midi 4
audio/mpeg 7875
audio/ogg 38
audio/x-ms-wma 4
audio/x-wav 46
deutsche welle blogs; charset=utf-8 11
employment, litigation, discrimination, sexual harrasment, contracts, torts, unpaid wages, overtime federal, civil service, wrongful termination, civil rights, fair labor, standards severence 2
generation change; charset=utf-8 44
image/gif 537833
image/jpeg 1369888
image/png 435659
image/svg+xml 926
image/tiff 59
image/x-icon 7032
image/x-ms-bmp 1175
image/x-xcf 1
message/news 2
message/rfc822 2
noindex, nofollow 1
pieter wagemans,flower painting,flower paintings,oil painting,bloemen schilderij,flowers, 1
tect/html charset=utf-8 1
text-html; charset=iso-8859-1 1
text-html; charset=utf-8 16
text-html; charset=windows-1252 20
text.php; charset=utf-8 1
text/ht ml;charset=utf-8 2
text/html 341621
text/html charset=utf-8 96
text/html, charset=iso-8859-1 1
text/html, utf-8 1
text/html/ 5
text/html; <link rel= 2
text/html; >charset=iso-8859-1 6
text/html; _iso= 1
text/html; charset= 11
text/html; charset="' + this.input_enc + '" 2
text/html; charset="\"' + this.input_enc + '\"" 1
text/html; charset="\"+this.dataencoding+\"" 2
text/html; charset="\"\\\"+this.dataencoding+\\\"\"" 1
text/html; charset="\"iso 8859-1\"" 1
text/html; charset="\"utf-830 days to save 300 acres. chip in before it's too late\"" 1
text/html; charset="\[\(modx_charset\)\]" 3
text/html; charset="charset\=utf-8" 1
text/html; charset="iso 8859-1" 2
text/html; charset="iso-10646\/unicode" 1
text/html; charset="utf-830 days to save 300 acres. chip in before it's too late" 1
text/html; charset='+this.request_params.encoding+' 1
text/html; charset='windows-1252' 1
text/html; charset=8859-1 3
text/html; charset=__encoding__ 13
text/html; charset=ansi 2
text/html; charset=ansi_x3.110-1983 2
text/html; charset=big5 15
text/html; charset=cp1251 1
text/html; charset=euc 2
text/html; charset=euc-jp 19126
text/html; charset=euc-kr 7
text/html; charset=gb2312 207
text/html; charset=gb_2312-80 3
text/html; charset=gbk 48
text/html; charset=iso-2022-jp 3
text/html; charset=iso-8859 1
text/html; charset=iso-8859-1 156099
text/html; charset=iso-8859-13 7
text/html; charset=iso-8859-15 31484
text/html; charset=iso-8859-16 1
text/html; charset=iso-8859-2 38
text/html; charset=iso-8859-7 4384
text/html; charset=iso-8859-8 6
text/html; charset=iso-8859-9 33
text/html; charset=iso8859-1 189
text/html; charset=iso_8859-1 1
text/html; charset=koi8-r 1
text/html; charset=latin-1 1
text/html; charset=latin1 11
text/html; charset=latin9 1
text/html; charset=macintosh 3
text/html; charset=shift-jis 94
text/html; charset=shift_jis 4472
text/html; charset=unicode 1
text/html; charset=us-ascii 111
text/html; charset=utf-8 2033264
text/html; charset=utf8 38
text/html; charset=windows-1250 27
text/html; charset=windows-1251 212
text/html; charset=windows-1252 1538
text/html; charset=windows-1253 2
text/html; charset=windows-1254 10
text/html; charset=windows-1255 3868
text/html; charset=windows-1256 64
text/html; charset=windows-874 1
text/html; charset=x-mac-greek 1
text/html; charset={charset} 74
text/html; en= 1
text/html; iso-8859-1= 1878
text/html; utf-8= 120
text/htmlcharset=iso-8859-1 1
text/javascript; charset=utf-8 1
text/plain 174627
text/xhtml; charset=iso-8859-1 24
text/xhtml; charset=utf-8 16230
text/xml; charset=utf-8 2
text\/html; charset=utf-8 1
text\html; charset=iso-8859-1 1
utf-8 12
video/mpeg 2
video/quicktime 2055
video/x-flv 128
video/x-ms-asf 10
video/x-ms-wmv 201
video/x-msvideo 13
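Because many entries in this distribution differ only in their charset parameter, it can be useful to fold them together. The following Python sketch is one way to do that, assuming each result line ends with the count as its last whitespace-separated field (the exact separator used in the part file is an assumption). The helpers `parse_counts` and `totals_by_base_type` are hypothetical names; the sample lines are taken from the output above.

```python
from collections import Counter


def parse_counts(lines):
    """Yield (mimetype, count) pairs from lines of the form '<mimetype> <count>'.

    Lines whose last field is not an integer (for example the Hadoop
    log lines printed before the results) are skipped.
    """
    for line in lines:
        parts = line.strip().rsplit(None, 1)  # split off the last field only
        if len(parts) != 2:
            continue
        try:
            yield parts[0], int(parts[1])
        except ValueError:
            continue


def totals_by_base_type(pairs):
    """Sum counts per mimetype, ignoring any parameters after the first ';'."""
    totals = Counter()
    for mimetype, count in pairs:
        base = mimetype.split(";", 1)[0].strip()
        totals[base] += count
    return totals


sample = [
    "12/05/16 15:14:35 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.",
    "text/html 341621",
    "text/html; charset=utf-8 2033264",
    "image/jpeg 1369888",
    "video/mpeg 2",
]
totals = totals_by_base_type(parse_counts(sample))
for mimetype, count in totals.most_common():
    print(mimetype, count)
```

Splitting from the right keeps mimetype strings that themselves contain spaces intact, which matters for the malformed header values visible in the crawl.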
We could also run the previous module with a mimetype filter specified:
hadoop jar ArcomemOffline.jar SingleOfflineProcessRunner -c MimeTypeStatsProcess -o /mimetypes --filter mimetype -r "^video/.*$"
This results in a much shorter output file:
homer:trunk jsh2$ hadoop dfs -cat /mimetypes/part-r-00000
2012-05-16 15:26:43.096 java[47058:1903] Unable to load realm info from SCDynamicStore
12/05/16 15:26:43 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
video/mpeg 2
video/quicktime 2055
video/x-flv 128
video/x-ms-asf 10
video/x-ms-wmv 201
video/x-msvideo 13
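The effect of the filter expression can be previewed locally before launching a job. The sketch below applies the same pattern to a handful of mimetypes with Python's `re` module. It is an approximation only: the module itself runs in Java, whose regular expression dialect differs from Python's in some details, and whether the filter requires a full match or a partial one is not stated here. The anchored pattern `^video/.*$` behaves the same either way. `filter_mimetypes` is a hypothetical helper.

```python
import re


def filter_mimetypes(mimetypes, regex):
    """Return the mimetypes that the regular expression matches."""
    pattern = re.compile(regex)
    return [m for m in mimetypes if pattern.search(m)]


candidates = [
    "video/mpeg",
    "video/x-flv",
    "image/png",
    "text/html; charset=utf-8",
    "application/x-shockwave-flash",
]
print(filter_mimetypes(candidates, r"^video/.*$"))
```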