From: Jonas M. <jo...@ma...> - 2006-03-07 01:17:10
I merged a couple of patches that had been applied to the RC branch back to trunk. This subsequently required making sure that the data center calculation happens at a later point in time, more specifically after the data transformation is done.

Apart from that, I finally settled on a way to handle the case of 0 responsibilities of individual data samples in soft k-means and mixture models. This problem matters more in higher-dimensional data, because there our usual Euclidean distance yields larger relative differences between data samples and cluster centers, which leads to a more uneven distribution of responsibilities among the data samples. When the total responsibility of an individual sample, summed over all clusters, gets rounded down to 0, we run into numerical problems.

As this is a non-trivial issue with these algorithms, I decided that we should at least prompt the user and point out the issue. Possibilities in this case include: restarting the algorithm and hoping for the best, or using hard k-means, which seemingly does not suffer from this problem.

We might consider implementing algorithmic remedies to deal with this problem and / or the problem of tiny cluster center standard deviations. For the latter we could maybe enforce minimal values for the standard deviations. This might even help in some of the former cases. For the former cases we could possibly also resort to a technique along the lines of pseudo counts, but I must say I have not thought this through yet.

Feedback highly appreciated - as always!

With this issue settled for the moment, I think the time has come to release version 0.2 of Clusterviz - at this time still without Karsten's new additions in trunk. What are your thoughts? Give this RC a test and tell me if there are any issues left in your opinion.

Kind regards,
Jonas.
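
To make the zero-responsibility case above concrete, here is a minimal sketch in plain Python. It is not Clusterviz code: the function names, the stiffness parameter `beta`, and the default floor `min_std` are all assumptions chosen for illustration. It shows how the naive computation can be detected as having failed (so the user can be prompted), how a floor on standard deviations might look, and, as one possible remedy not proposed in this message, how shifting the exponents by their maximum before exponentiating avoids the underflow.

```python
import math


def naive_responsibilities(sample, centers, beta):
    """Soft k-means responsibilities of one sample, computed directly.

    Returns (responsibilities, ok). ok is False when every weight
    exp(-beta * d^2) underflowed to 0.0, i.e. the total responsibility
    was rounded down to 0 and cannot be normalised.
    """
    weights = []
    for center in centers:
        sq_dist = sum((x - m) ** 2 for x, m in zip(sample, center))
        weights.append(math.exp(-beta * sq_dist))
    total = sum(weights)
    if total == 0.0:
        # Caller should warn the user here instead of dividing by zero.
        return [0.0] * len(centers), False
    return [w / total for w in weights], True


def shifted_responsibilities(sample, centers, beta):
    """Same quantity, with the exponents shifted by their maximum.

    The largest weight becomes exp(0) = 1.0, so the sum is never 0.
    """
    exponents = [
        -beta * sum((x - m) ** 2 for x, m in zip(sample, center))
        for center in centers
    ]
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def floor_stds(stds, min_std=1e-3):
    """Enforce a minimal value on each cluster standard deviation."""
    return [max(s, min_std) for s in stds]


# A 50-dimensional sample that is far from both cluster centers.
sample = [0.0] * 50
centers = [[5.0] * 50, [6.0] * 50]

naive, ok = naive_responsibilities(sample, centers, beta=1.0)
stable = shifted_responsibilities(sample, centers, beta=1.0)
floored = floor_stds([0.5, 1e-9, 0.0])
```

With these inputs the squared distances are 1250 and 1800, so both naive weights underflow and `ok` is False, while the shifted version still assigns essentially all responsibility to the nearer center. Whether such a shift or a pseudo-count style correction suits the mixture-model code is an open question.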