Clustering_with_Hadoop

Neeti Pokhriyal Dasha

Introduction

The MapReduce paradigm was explored for knowledge discovery from nuclear reactor simulation data, because this data is expected to grow to large scale quickly. Hadoop, an open-source implementation of MapReduce, was used for this study.

For the preliminary investigation, the k-Means clustering implementation available in Mahout was employed. The results were similar to the ones we published in the paper "Knowledge Discovery from Nuclear Reactor Simulation Data".
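To make the connection between k-Means and MapReduce concrete, here is a minimal single-machine sketch in Python. It is not Mahout's implementation and does not use Hadoop. It only illustrates the usual way one k-Means iteration is split into a map phase (assign each point to its nearest centroid) and a reduce phase (average the points in each group). All function names, the convergence tolerance, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One k-Means pass: a map phase followed by a reduce phase."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)  # stands in for the shuffle between map and reduce
    # A centroid with no assigned points keeps its old position (an illustrative choice).
    return [
        reduce_mean(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=10, tol=1e-9):
    """Repeat iterations until no centroid moves more than tol, or max_iter is reached."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(updated, centroids)):
            return updated
        centroids = updated
    return centroids


# Hypothetical toy data: two well-separated groups in two dimensions.
sample_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
initial_centroids = [(1, 1), (9, 1)]
final_centroids = kmeans(sample_points, initial_centroids)
```

On a cluster, the map and reduce steps would run in parallel over partitions of the data, and each iteration would typically be a separate job. The sketch keeps everything in memory so the logic is easy to follow.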

Process

Help for the various command line options for kmeans can be found at: https://cwiki.apache.org/MAHOUT/k-means-commandline.html....

    • Copy the clusteredPoints directory out of HDFS to a local directory.
    • Run mahout seqdumper to dump its contents as text. An example is: mahout seqdumper -i /data/diff_unc_trans_kmeans_op/clusteredPoints/part-m-00000 -o ~/Data_orig/diff_unc_trans_hadoop_op/.
    • The dumped text can then be processed further for analysis.
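As one possible form of the further processing mentioned in the last step, the sketch below counts how many points were assigned to each cluster in a seqdumper text dump. It assumes that each record is written on one line beginning with "Key: <cluster id>: Value:", which should be checked against the actual output because the exact layout can differ between Mahout versions. The function name, the pattern, and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed record layout: "Key: <cluster id>: Value: ...". Verify against your dump.
KEY_PATTERN = re.compile(r"^Key:\s*(\S+?):\s*Value:")


def count_points_per_cluster(lines):
    """Count records per cluster id; lines that do not match the pattern are skipped."""
    counts = Counter()
    for line in lines:
        match = KEY_PATTERN.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, made up to show the assumed layout.
sample_dump = [
    "Input Path: part-m-00000",
    "Key: 3: Value: 1.0: [0.1, 0.2]",
    "Key: 3: Value: 1.0: [0.2, 0.1]",
    "Key: 7: Value: 1.0: [5.0, 5.1]",
    "Count: 3",
]
cluster_sizes = count_points_per_cluster(sample_dump)
```

To use it on a real dump, open the text file written by seqdumper and pass the file object to count_points_per_cluster, since the function accepts any iterable of lines.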

Resources

For more information on Hadoop, see the Apache Hadoop website, which contains many useful resources, including a tutorial on how to use Hadoop.

More details on Mahout and its uses can be found at the Apache Mahout website. The developers keep the website very well maintained, and you can find a great introduction to k-Means clustering, in addition to information on many other methods.

Some helpful materials on MapReduce include:

