NiCE Wiki

Modeling and Simulation made NiCE!

Brought to you by: amccaskey, jayjaybillings

Clustering_with_Hadoop


The Map-Reduce paradigm http://cacm.acm.org/magazines/2010/1/.../fulltext was explored for knowledge discovery from nuclear reactor simulation data, because this data is expected to grow to large scale quickly. Hadoop[<http://hadoop.apache.org/>], an open-source implementation of Map-Reduce, was used for this study.
For a preliminary investigation, the k-means clustering http://nlp.stanford.edu/IR-book/html/.../k-means-1.html implementation available in Mahout[<http://mahout.apache.org/>] was employed. The results were similar to the ones we published in the paper "Knowledge Discovery from Nuclear Reactor Simulation Data".
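To make the connection between k-means and Map-Reduce concrete, here is a minimal pure-Python sketch of how a single k-means iteration can be phrased as a map step (assign each point to its nearest centroid) followed by a reduce step (average the points per centroid). It is not Mahout's implementation and does not use Hadoop. The function names, the toy points, and the starting centroids are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from math import dist  # requires Python 3.8+


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step: average all points that share one centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One pass: map every point, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # A centroid that attracted no points keeps its previous position.
    return [
        reduce_cluster(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iterations=20):
    """Iterate until the centroids stop moving or the cap is reached."""
    for _ in range(max_iterations):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical toy data: two well-separated groups in 2-D.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
final_centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
```

In a real Map-Reduce job the map calls run in parallel across the data blocks and the framework performs the grouping, which is why the algorithm scales to data that does not fit on one machine.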
A few important resources to learn Map-Reduce are:
- Chapter 2 of Mining of Massive Datasets by Anand Rajaraman and Jeffrey David Ullman http://infolab.stanford.edu/~ullman/m.../ch2.pdf.
- The seminal paper http://cacm.acm.org/magazines/2010/1/.../fulltext and the wiki article [<http://en.wikipedia.org/wiki/MapReduce>].
A few important resources for Hadoop are:
- The tutorial at http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html....
- The Hadoop website at [<http://hadoop.apache.org/>].
Some resources for learning Mahout are:
- The Mahout website [<http://mahout.apache.org/>].
- Because we used the k-means clustering in Mahout, https://cwiki.apache.org/confluence/d.../K-Means+Clustering is a good introduction.
Here are the steps we used for working with Hadoop:
- Install Hadoop from [<http://hadoop.apache.org/>]. I used the instructions from http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-singl... to install Hadoop on my laptop as a single-node cluster. My laptop was running Fedora 17.
- The data was stored in HDFS, the Hadoop Distributed File System http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html.... A helpful tutorial on HDFS is http://developer.yahoo.com/hadoop/tut.../module2.html.
- Install Mahout from [<http://mahout.apache.org/>].
- Examples of how to run various machine learning algorithms in Mahout are found in /mahout/examples/bin.
- Put the data required for the analysis on HDFS.
- My data files are in .csv format. Mahout's k-means clustering expects the data in a Mahout-specific SequenceFile format, so a Java utility was written to convert our data into the dense sequence file format that the k-means clustering algorithm needs. Please note that most of the k-means examples in Mahout work with text data and convert it into a sparse vector format https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-th.... This was not the case with our data.
- Run kmeans with the -cl option, so that the input points are assigned to the final clusters. An example is: mahout kmeans -i /data/diff_unc_trans_seq/part-m-00000 -o /data/diff_unc_trans_kmeans_op -x 20 -k 2 -c /data/analysis_clusters -cl

Help for the various command line options for kmeans can be found at: https://cwiki.apache.org/MAHOUT/k-means-commandline.html....
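The CSV conversion step mentioned above can be illustrated with a short sketch. The actual utility was written in Java and is not reproduced here. This Python sketch covers only the first half of that job, under the assumption that every CSV row holds the same number of numeric columns: it parses the rows into equal-length dense vectors keyed by row number. Writing those vectors into a SequenceFile would still have to go through the Hadoop and Mahout Java APIs. The function name, the `has_header` parameter, and the sample column names are hypothetical.

```python
import csv
import io


def csv_to_dense_vectors(csv_text, has_header=True):
    """Parse CSV text into (row_key, vector) pairs with equal-length vectors."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the column names
    records = []
    width = None
    for row_number, row in enumerate(reader):
        if not row:
            continue  # ignore blank lines
        vector = [float(cell) for cell in row]
        if width is None:
            width = len(vector)
        elif len(vector) != width:
            # A dense representation needs every row to have the same length.
            raise ValueError(
                f"row {row_number} has {len(vector)} values, expected {width}"
            )
        records.append((row_number, vector))
    return records


# Hypothetical sample with made-up column names and values.
sample = "x,y,z\n0.1,0.2,0.3\n1.5,2.5,3.5\n"
dense_records = csv_to_dense_vectors(sample)
print(dense_records)
```

Dense vectors store a value for every column, which suits numeric simulation output. The sparse format used by the text examples stores only the non-zero entries.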

- Copy the clusteredPoints directory from HDFS to a local directory.
- Run mahout seqdumper to dump its contents as text. An example is: mahout seqdumper -i /data/diff_unc_trans_kmeans_op/clusteredPoints/part-m-00000 -o ~/Data_from_Andrew/diff_unc_trans_hadoop_op/.
- The resulting text output can be processed further for analysis.
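As one possible form of that further processing, the sketch below groups dumped points by cluster and counts the members of each cluster. It assumes each record appears on one line shaped like `Key: <cluster id>: Value: <weight>: [v1, v2, ...]`. That line format is an assumption made for illustration, and the real seqdumper output may differ between Mahout versions, so the regular expression would need adjusting against an actual dump. The sample text is invented and is not real output.

```python
import re
from collections import defaultdict

# Assumed record shape (hypothetical): "Key: 0: Value: 1.0: [1.0, 2.0]"
LINE_PATTERN = re.compile(r"^Key: (\d+): Value: ([\d.eE+-]+): \[(.*)\]$")


def parse_clustered_points(dump_text):
    """Group dumped vectors by cluster id, skipping unrecognized lines."""
    clusters = defaultdict(list)
    for line in dump_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # headers, summaries, or lines in another format
        cluster_id = int(match.group(1))
        vector = [float(v) for v in match.group(3).split(",") if v.strip()]
        clusters[cluster_id].append(vector)
    return dict(clusters)


# Invented sample text in the assumed format, not real seqdumper output.
sample_dump = """\
(a line that does not match the assumed format is skipped)
Key: 0: Value: 1.0: [1.0, 2.0]
Key: 0: Value: 1.0: [1.5, 2.5]
Key: 1: Value: 1.0: [9.0, 9.5]
"""

clusters = parse_clustered_points(sample_dump)
cluster_sizes = {cid: len(vectors) for cid, vectors in clusters.items()}
print(cluster_sizes)
```

Once the points are grouped this way, per-cluster statistics or plots can be produced with ordinary analysis tools.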