MapReduce is a programming model for processing large datasets. Although useful across many domains, one specific application for which the MapReduce paradigm has been explored is knowledge discovery from nuclear reactor simulation data, since such data is expected to grow to large scale in the near future. Hadoop, an open-source implementation of MapReduce, was used for this study.
For a preliminary investigation, the k-Means clustering implementation available in Mahout was employed. The results were similar to those we published in the paper, "Knowledge Discovery from Nuclear Reactor Simulation Data".
Here are the steps we used for working with Hadoop:
First, convert the data into the format k-Means expects. Our data files are in .csv format, while Mahout's k-Means clustering expects its input in a Mahout-specific SequenceFile format. A Java utility was written to convert our data into the dense sequence-file format needed by the k-Means clustering algorithm. Note that the k-Means examples shipped with Mahout mostly cover converting text data into a sparse vector format; that did not fit our data, which required dense vectors instead.
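The CSV-parsing half of such a utility might look like the sketch below. The class and method names are our own invention, and the final step of writing the vectors out, which requires Hadoop's SequenceFile.Writer and Mahout's VectorWritable classes, is omitted so the sketch stays self-contained:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: parse CSV rows of numeric fields into dense double[] vectors.
// In the real utility, each vector would then be wrapped in a Mahout
// VectorWritable and appended to a Hadoop SequenceFile (not shown here).
public class CsvToDense {

    // Parse one CSV line of numeric fields into a dense vector.
    public static double[] parseRow(String line) {
        String[] fields = line.split(",");
        double[] vector = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            vector[i] = Double.parseDouble(fields[i].trim());
        }
        return vector;
    }

    // Parse every line of a CSV file (already read into memory).
    public static List<double[]> parseAll(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(parseRow(line));
        }
        return rows;
    }
}
```

Because each row becomes a fully populated double[], this naturally produces dense vectors, matching what our numeric simulation data looks like.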
Next, run k-Means with the -cl option, which, after computing the clusters, also classifies the input points and writes them to a clusteredPoints directory. For example:
mahout kmeans -i /data/diff_unc_trans_seq/part-m-00000 -o /data/diff_unc_trans_kmeans_op -x 20 -k 2 -c /data/analysis_clusters -cl
Information on the various command line options for k-Means can be found at http://mahout.apache.org/users/cluste.../k-means-commandline.html.
Then run mahout seqdumper to dump the clustered points as text. For example:
mahout seqdumper -i /data/diff_unc_trans_kmeans_op/clusteredPoints/part-m-00000 -o ~/Data_orig/diff_unc_trans_hadoop_op/.
The dumped text output can then be post-processed for analysis.
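As one example of such post-processing, a small parser could pull the cluster id out of each dumped line. This is only a sketch under the assumption that seqdumper prints lines beginning with `Key: <clusterId>: Value: ...`; the exact value format varies between Mahout versions, so treat the layout here as illustrative:

```java
// Sketch: extract the cluster id from one line of seqdumper output,
// assuming lines of the form "Key: <clusterId>: Value: ...".
public class SeqDumpParser {

    public static int clusterId(String line) {
        String prefix = "Key: ";
        int start = line.indexOf(prefix) + prefix.length();
        int end = line.indexOf(':', start);  // colon terminating the key
        return Integer.parseInt(line.substring(start, end).trim());
    }
}
```

Tallying these ids over the whole dump gives per-cluster point counts, a quick sanity check before deeper analysis.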
For more information on Hadoop, see the Apache Hadoop website, which contains many useful resources, including a tutorial on how to use Hadoop.
More details on Mahout and its uses can be found at the Apache Mahout website. The developers keep the website very well maintained, and you can find a great introduction to k-Means clustering, in addition to information on many other methods.
Some further helpful materials on MapReduce include: