Clustering

Text-Analysis implements hierarchical cluster analysis (based on Fortran code contributed to STATLIB by F. Murtagh) and the following K-means clustering algorithms:

  • Mac Queen, J. (1967) [1].
  • Hartigan, J. A. and Wong, M. A. (1979) [2].
  • Neural-Gas [3].

Hartigan and Wong usually gives the best results, but like all K-means methods it generally has a high variance, since it strongly depends on the initialization. Neural-Gas overcomes this problem by using a ranked list of neighbouring data points instead of using the data points directly; it is more stable, at the cost of additional computation time.
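To make the ranked-neighbour idea concrete, here is a minimal, self-contained sketch of one Neural-Gas adaptation step. It is not the library's implementation; the class and parameter names (prototypes, epsilon, lambda) are chosen only for illustration.

import java.util.Arrays;
import java.util.Comparator;

public class NeuralGasStep {

    // One adaptation step: rank every prototype by its distance to the data
    // point x, then move each prototype towards x with a strength that decays
    // exponentially with its rank.
    static void adapt(double[][] prototypes, double[] x, double epsilon, double lambda) {
        Integer[] order = new Integer[prototypes.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> squaredDistance(prototypes[i], x)));
        for (int rank = 0; rank < order.length; rank++) {
            double[] w = prototypes[order[rank]];
            double h = epsilon * Math.exp(-rank / lambda); // rank-based learning rate
            for (int d = 0; d < w.length; d++) {
                w[d] += h * (x[d] - w[d]);
            }
        }
    }

    static double squaredDistance(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            s += diff * diff;
        }
        return s;
    }
}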

The solutions obtained from k-means depend strongly on the initialization of the cluster centers. According to [4], the best method is the one proposed by Kaufmann and Rousseeuw [5]. Besides Kaufmann and Rousseeuw, the following methods are also implemented in the class eu.kostia.clustering.kmeans.InitialCentroids:

  • Random
  • Moth’d Belal. Al-Daoud [6]
  • k-means++ [7]
  • Uniform (data point m/k, 2*m/k, ... 1)

All these methods are partially random and produce different results on each run; only Kaufmann and Rousseeuw is deterministic.
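As an illustration of how such a seeding rule works, the following self-contained sketch implements k-means++ [7]: the first center is drawn uniformly at random, and every further center is drawn with probability proportional to its squared distance from the nearest center chosen so far. This is not the code of InitialCentroids; all names are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KmeansPlusPlusSeeding {

    // Picks k initial centers from the data points using the k-means++ rule.
    static List<double[]> chooseCenters(double[][] points, int k, Random rnd) {
        List<double[]> centers = new ArrayList<>();
        centers.add(points[rnd.nextInt(points.length)]); // first center: uniform at random
        while (centers.size() < k) {
            // squared distance of every point to its nearest chosen center
            double[] d2 = new double[points.length];
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                double best = Double.MAX_VALUE;
                for (double[] c : centers) {
                    best = Math.min(best, squaredDistance(points[i], c));
                }
                d2[i] = best;
                total += best;
            }
            // draw the next center with probability proportional to d2
            double r = rnd.nextDouble() * total;
            int next = 0;
            double acc = d2[0];
            while (acc < r && next < points.length - 1) {
                next++;
                acc += d2[next];
            }
            centers.add(points[next]);
        }
        return centers;
    }

    static double squaredDistance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            s += diff * diff;
        }
        return s;
    }
}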

Example

We want to group the dataset of food consumption in Europe (eu.kostia.statistics.dataset.ConsumesDataset) into 3 clusters:

int k = 3;
Kmeans kmeans = new KmeansHW(new ConsumesDataset());
ClusterSpace cs = kmeans.getClusterSpace(k);

System.out.println(cs);

Where cs is the set of clusters:

3 clusters; CH=18.779599998078446
  Cluster 0: [IT, SP, GR] - Centroid: IT - WSS: 27733.06025600517
  Cluster 1: [GE, FR, OL, AU, PO, BE, GB] - Centroid: GE - WSS: 40211.030911364855
  Cluster 2: [NO, FI, SV, IS, DA, IR] - Centroid: NO - WSS: 62521.04457966973

CH is the Calinski & Harabasz index [8]. The best number of clusters (k) is the one that maximizes the CH index. If a cluster contains just one element, the index is set to -1. A good cluster space has clusters with a small WITHIN-cluster sum of squares and a high BETWEEN-cluster sum of squares.
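For reference, for n data points partitioned into k clusters the index defined in [8] is CH = (BSS / (k - 1)) / (WSS / (n - k)), where BSS is the between-cluster and WSS the total within-cluster sum of squares, so compact and well-separated clusters give a large value.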

WSS is the WITHIN-cluster sum of squares, a measure of the compactness of the cluster (the lower, the better). The first data point listed in each cluster is always the most central one (for example IT in the first cluster), and the remaining points are sorted by their distance from the centroid. By default the Euclidean distance between the data points is used, but other metrics can also be passed to the constructor (a small reference implementation of each is sketched after the list):

  • Manhattan: The distance between two points measured as the sum of the (absolute) differences of their coordinates, like the path of a taxi in the streets of Manhattan.
  • Maximum: The distance between two points measured as the maximum interval between them.
  • Minkowski: The Minkowski distance of order p between two points x and y is defined as d(x, y) = (sum_{i=1..n} |x_i - y_i|^p)^(1/p). When p = 2 it is the Euclidean distance, and when p = 1 the Manhattan distance.
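Purely as an illustration (these are not the library's metric classes), the three distances can be computed as follows:

public class Distances {

    // Manhattan distance: sum of absolute coordinate differences.
    static double manhattan(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
        return s;
    }

    // Maximum (Chebyshev) distance: largest absolute coordinate difference.
    static double maximum(double[] a, double[] b) {
        double m = 0.0;
        for (int i = 0; i < a.length; i++) m = Math.max(m, Math.abs(a[i] - b[i]));
        return m;
    }

    // Minkowski distance of order p: (sum_i |a_i - b_i|^p)^(1/p).
    static double minkowski(double[] a, double[] b, double p) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += Math.pow(Math.abs(a[i] - b[i]), p);
        return Math.pow(s, 1.0 / p);
    }
}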

It is often a good idea to standardise the data before clustering.
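A common choice is z-score standardisation: subtract the column mean and divide by the column standard deviation. The sketch below is independent of the library and only illustrates the transformation.

public class Standardise {

    // Standardises each column to zero mean and unit standard deviation (z-score).
    static double[][] zScore(double[][] data) {
        int rows = data.length;
        int cols = data[0].length;
        double[][] out = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            double mean = 0.0;
            for (int i = 0; i < rows; i++) mean += data[i][j];
            mean /= rows;
            double var = 0.0;
            for (int i = 0; i < rows; i++) var += (data[i][j] - mean) * (data[i][j] - mean);
            double sd = Math.sqrt(var / (rows - 1));
            for (int i = 0; i < rows; i++) {
                out[i][j] = sd > 0 ? (data[i][j] - mean) / sd : 0.0;
            }
        }
        return out;
    }
}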

References

[1] Mac Queen, J. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297. Berkeley, CA: University of California Press.
[2] Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics 28, 100–108.
[3] Martinetz T., Berkovich S., and Schulten K (1993). ‘Neural-Gas’ Network for Vector Quantization and its Application to Time-Series Prediction. IEEE Transactions on Neural Networks, 4 (4), pp. 558–569.
[4] Pena, Lozano, Larranaga. An empirical comparison of four initialization methods for the k-means algorithm, <http://www.sc.ehu.es/ccwbayes/publications/postscript/kmrev2.ps>
[5] Kaufmann and Rousseeuw, "Finding Groups in Data. Introduction to Cluster Analysis", Wiley-Interscience; 9th edition (March 1990).
[6] Moth’d Belal. Al-Daoud, "A New Algorithm for Cluster Initialization", World Academy of Science, Engineering and Technology 4, 2005, <http://www.waset.org/journals/waset/v4/v4-20.pdf>
[7] Arthur, D. and Vassilvitskii, S. (2007). "k-means++: the advantages of careful seeding". Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035.
[8] Calinski, R.B. and Harabasz, J., 1974, "A dendrite method for cluster analysis". Communications in Statistics, 3(1): 1-27.

