Text-Analysis implements hierarchical cluster analysis (based on Fortran code contributed to STATLIB by F. Murtagh) and the following K-Means clustering algorithms: MacQueen [1], Hartigan and Wong [2], and Neural Gas [3].
The Hartigan and Wong algorithm usually provides the best results, but like all K-Means methods it generally has a high variance, since it depends strongly on the initialization. Neural Gas overcomes this problem by using a ranked list of neighbouring data points instead of using the data points directly; it is more stable, at the cost of additional computation time.
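As a rough sketch, assuming the algorithms share the Kmeans interface and the same constructor pattern as the KmeansHW example below (the class name NeuralGas is an assumption, not taken from this page):

// Hypothetical sketch: the implementations can be swapped behind the common Kmeans interface.
// KmeansHW appears in the example below; the NeuralGas class name is an assumption.
Kmeans hartiganWong = new KmeansHW(new ConsumesDataset());  // usually the best results, initialization-sensitive
Kmeans neuralGas = new NeuralGas(new ConsumesDataset());    // more stable, but computationally more expensive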
The solutions obtained from K-Means are strongly dependent on the initialization of the cluster centers. According to [4], the best method is the one proposed by Kaufmann and Rousseeuw [5]. Besides Kaufmann and Rousseeuw, other initialization methods are also implemented in the class eu.kostia.clustering.kmeans.InitialCentroids. Except for Kaufmann and Rousseeuw, which is deterministic, all these methods are partially random and produce different results on each run.
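For illustration, here is a standalone sketch of the k-means++ seeding described in [7], one of the randomized initialization strategies; it is plain Java and not the code of InitialCentroids:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Standalone illustration of k-means++ seeding [7]; not the library's InitialCentroids implementation.
public class KmeansPlusPlusSeeding {

    // Chooses k initial centroids from the data points.
    static double[][] seed(double[][] points, int k, Random rnd) {
        List<double[]> centroids = new ArrayList<>();
        // 1. Pick the first centroid uniformly at random.
        centroids.add(points[rnd.nextInt(points.length)]);
        // 2. Pick each further centroid with probability proportional to the squared
        //    distance from the nearest centroid chosen so far.
        while (centroids.size() < k) {
            double[] d2 = new double[points.length];
            double sum = 0.0;
            for (int i = 0; i < points.length; i++) {
                d2[i] = squaredDistanceToNearest(points[i], centroids);
                sum += d2[i];
            }
            double r = rnd.nextDouble() * sum;
            int chosen = 0;
            double acc = d2[0];
            while (acc < r && chosen < points.length - 1) {
                chosen++;
                acc += d2[chosen];
            }
            centroids.add(points[chosen]);
        }
        return centroids.toArray(new double[0][]);
    }

    // Squared Euclidean distance from p to the nearest of the already chosen centroids.
    static double squaredDistanceToNearest(double[] p, List<double[]> centroids) {
        double best = Double.MAX_VALUE;
        for (double[] c : centroids) {
            double d = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - c[j];
                d += diff * diff;
            }
            best = Math.min(best, d);
        }
        return best;
    }
}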
We want to group the dataset of food consumption in Europe (eu.kostia.statistics.dataset.ConsumesDataset) into 3 clusters:
int k = 3;
// Hartigan-Wong K-Means on the European food consumption dataset
Kmeans kmeans = new KmeansHW(new ConsumesDataset());
// partition the data points into k clusters
ClusterSpace cs = kmeans.getClusterSpace(k);
System.out.println(cs);
Where cs is the set of clusters:
3 clusters; CH=18.779599998078446
Cluster 0: [IT, SP, GR] - Centroid: IT - WSS: 27733.06025600517
Cluster 1: [GE, FR, OL, AU, PO, BE, GB] - Centroid: GE - WSS: 40211.030911364855
Cluster 2: [NO, FI, SV, IS, DA, IR] - Centroid: NO - WSS: 62521.04457966973
CH is the Calinski and Harabasz index [8]. The best number of clusters (k) is the one that maximizes the CH index. If a cluster contains just one element, the index is set to -1. A good cluster space has clusters with a small WITHIN-cluster sum-of-squares and a high BETWEEN-cluster sum-of-squares.
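A sketch of how the best k could be searched by maximizing CH (the accessor ClusterSpace.getCH() is an assumption; this page only shows CH in the printed output):

// Hypothetical sketch: try several values of k and keep the cluster space with the highest CH index.
// ClusterSpace.getCH() is an assumed accessor, not confirmed by this page.
Kmeans kmeans = new KmeansHW(new ConsumesDataset());
ClusterSpace best = null;
for (int k = 2; k <= 6; k++) {
    ClusterSpace cs = kmeans.getClusterSpace(k);
    if (best == null || cs.getCH() > best.getCH()) {
        best = cs;
    }
}
System.out.println(best);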
WSS is the WITHIN-cluster sum-of-squares, a measure of the compactness of the cluster (the lower, the better). The first element in each cluster is always the most central one (for example IT in the first cluster), and the remaining elements are sorted by their distance from the centroid. By default the Euclidean distance between the data points is used, but other metrics can also be passed to the constructor:
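As an illustrative sketch (the two-argument constructor and the ManhattanDistance class are assumptions; only the Euclidean default is shown on this page):

// Hypothetical: pass a different metric instead of the default Euclidean distance.
// The constructor overload and the ManhattanDistance class are assumptions.
Kmeans kmeans = new KmeansHW(new ConsumesDataset(), new ManhattanDistance());
ClusterSpace cs = kmeans.getClusterSpace(3);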
It is often a good idea to standardise the data before clustering.
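For illustration, a self-contained z-score standardization (subtract the column mean, divide by the column standard deviation); this is plain Java, not a documented helper of the library:

// Standalone illustration: z-score standardization, so that no variable dominates the distance computation.
static double[][] standardize(double[][] data) {
    int rows = data.length;
    int cols = data[0].length;
    double[][] z = new double[rows][cols];
    for (int j = 0; j < cols; j++) {
        double mean = 0.0;
        for (int i = 0; i < rows; i++) {
            mean += data[i][j];
        }
        mean /= rows;
        double var = 0.0;
        for (int i = 0; i < rows; i++) {
            var += (data[i][j] - mean) * (data[i][j] - mean);
        }
        double sd = Math.sqrt(var / rows);
        for (int i = 0; i < rows; i++) {
            z[i][j] = (sd == 0.0) ? 0.0 : (data[i][j] - mean) / sd;
        }
    }
    return z;
}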
[1] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297. Berkeley, CA: University of California Press.
[2] Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics 28, 100–108.
[3] Martinetz, T., Berkovich, S. and Schulten, K. (1993). 'Neural-Gas' Network for Vector Quantization and its Application to Time-Series Prediction. IEEE Transactions on Neural Networks, 4 (4), pp. 558–569.
[4] Peña, J.M., Lozano, J.A. and Larrañaga, P. An empirical comparison of four initialization methods for the k-means algorithm, <http://www.sc.ehu.es/ccwbayes/publications/postscript/kmrev2.ps>
[5] Kaufmann and Rousseeuw (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley-Interscience, 9th edition (March 1990).
[6] Moth'd Belal Al-Daoud (2005). "A New Algorithm for Cluster Initialization". World Academy of Science, Engineering and Technology, 4, <http://www.waset.org/journals/waset/v4/v4-20.pdf>
[7] Arthur, D. and Vassilvitskii, S. (2007). "k-means++: the advantages of careful seeding". Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035.
[8] Calinski, T. and Harabasz, J. (1974). "A dendrite method for cluster analysis". Communications in Statistics, 3(1): 1–27.