From: Chris Thorp <thorp@sp...> - 2004-09-13 22:36:07
There is an issue with the k-means initialization, and I'm looking for
suggestions on how to choose the initial cluster centers. The
problem is as follows: the current initialization of k-means, which is
the "textbook" initialization, is to choose random starting locations for
the clusters somewhere within the range of the dataset. Unfortunately,
the range is made artificially large by a few outliers, i.e., most points
lie between -10 and 10 in each dimension, but there are a few artifacts
that range from -1000 to 1000. This places almost every useful point
within one cluster -- probably not a very useful way to start doing CEM.
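
To make the problem concrete, here is a minimal Python sketch of the
range-based initialization described above. The data, the seed, the choice
of k = 5, and the helper names range_init and nearest are all made up for
illustration; this is not the project's actual code.

```python
import random

def range_init(points, k, rng):
    """Draw k centers uniformly inside the per-dimension min/max of the data."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    return [tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
            for _ in range(k)]

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(point, centers[i])))

rng = random.Random(0)
# Hypothetical data: 1000 useful points in [-10, 10]^2 plus four artifacts.
bulk = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(1000)]
artifacts = [(-1000.0, -1000.0), (1000.0, 1000.0),
             (-1000.0, 1000.0), (1000.0, -1000.0)]
data = bulk + artifacts

# The artifacts stretch the sampling box to [-1000, 1000] in each dimension.
centers = range_init(data, 5, rng)

# Count how the useful points split across the 5 initial clusters.
counts = [0] * len(centers)
for p in bulk:
    counts[nearest(p, centers)] += 1
print(counts)
```

Because the centers are spread over a box roughly 100 times wider than the
useful data, the counts will usually be dominated by a single cluster.
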
A couple of alternatives to truly random starting location selection
(though others are more than welcome):
1. Choose a random point from the dataset as the center of each k-means
cluster.
2. Strip the outliers, at some sigma, before choosing the initial
k-means cluster centers.
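
A rough Python sketch of both alternatives follows. The function names, the
3-sigma cutoff, the per-dimension single-pass clipping, and the sample data
are assumptions chosen for illustration, not an existing implementation.

```python
import random
import statistics

def init_from_data(points, k, rng):
    """Alternative 1: use k distinct points of the dataset as initial centers."""
    return rng.sample(points, k)

def strip_outliers(points, n_sigma=3.0):
    """Keep points within n_sigma standard deviations of the mean
    in every dimension (single pass, assumed cutoff)."""
    dims = len(points[0])
    means = [statistics.fmean([p[d] for p in points]) for d in range(dims)]
    stds = [statistics.pstdev([p[d] for p in points]) for d in range(dims)]
    return [p for p in points
            if all(abs(p[d] - means[d]) <= n_sigma * stds[d]
                   for d in range(dims))]

def init_after_clipping(points, k, rng, n_sigma=3.0):
    """Alternative 2: strip outliers, then draw centers within the
    range of the remaining points."""
    kept = strip_outliers(points, n_sigma)
    dims = len(kept[0])
    lows = [min(p[d] for p in kept) for d in range(dims)]
    highs = [max(p[d] for p in kept) for d in range(dims)]
    return [tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
            for _ in range(k)]

rng = random.Random(0)
# Hypothetical data: 1000 useful points in [-10, 10]^2 plus four artifacts.
bulk = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(1000)]
artifacts = [(-1000.0, -1000.0), (1000.0, 1000.0),
             (-1000.0, 1000.0), (1000.0, -1000.0)]
data = bulk + artifacts

centers_1 = init_from_data(data, 5, rng)
clipped = strip_outliers(data)
centers_2 = init_after_clipping(data, 5, rng)
```

Alternative 1 follows the density of the data, so most centers land among
the useful points, though an artifact can still be picked occasionally.
Alternative 2 depends on the cutoff: the outliers inflate the standard
deviation themselves, so a single pass may not be enough on other datasets.
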