Fixed a bug caused by using numpy arrays with KMeansClustering.
The KMeansClustering constructor now accepts an optional function to test for item equality. If the clusters contain numpy arrays, you can pass "numpy.array_equal".
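To illustrate why such an equality hook is needed, here is a minimal sketch. The `contains` helper and its `equality` keyword are made-up names for illustration, not the library's actual API:

```python
# Why a pluggable equality test matters: numpy arrays overload ``==`` to
# return element-wise arrays, which cannot be used as a truth value, so a
# plain ``item in collection`` check fails. A clustering implementation
# can accept an equality callback instead.

def contains(collection, item, equality=None):
    """Return True if *item* is in *collection*, using *equality* if given."""
    if equality is None:
        return item in collection
    return any(equality(item, candidate) for candidate in collection)

# With hashable scalars the default test suffices:
print(contains([1, 2, 3], 2))   # True

# With numpy arrays, one would pass the comparison function explicitly,
# e.g. contains(list_of_arrays, arr, equality=numpy.array_equal).
```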
The project is now available on GitHub here:
The news-feed and project page will remain on sourceforge for now.
Finally migrated to SVN (https://sourceforge.net/svn/?group_id=170665).
I did not bother migrating the history. It's not really necessary for such a small project. I will leave CVS access enabled for a while but will take it down eventually. So if you pull from CVS, make sure you switch to the SVN repo as soon as possible.
A helpful soul discovered a broken link in the source.
Alternative link: http://mail.python.org/pipermail/python-list/2004-December/294990.html
It's not the same link, but it contains similar information. I hope this one stays up.
- Applied patch (thanks ajaksu)
  --> Topology output supported
  --> data and raw_data are now properties.
K-Means clustering is now implemented
This is a bug-fix release!
It fixes bug #1516204, which caused the clustering algorithm to raise an exception if an empty list or a list of only one item was supplied as an argument.
In addition I added some unit tests. This makes development and bug-tracking a lot easier.
Eeeks. The last header in the news read "new release for python-ngram". Although that is somewhat true, it really should have read "python-cluster".
Finally I got around to building the dist files and uploading them. Enjoy.
You might notice that there are only the source distribution and the Windows binary distribution. I decided to drop the rest because, in my opinion, they are unneeded. They all behave exactly the same as the source distribution anyway. Yes, I could have created the other files as well for convenience, but how hard is it really to type "python setup.py install"? ;)
I finished a new version today. Now the hierarchical clustering works twice(!) as fast. This is possible because the distance matrix that is generated internally is symmetric, so I only need to calculate half of the possible combinations.
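The symmetry trick can be sketched like this (a simplified illustration, not the module's actual internals):

```python
# Because dist(a, b) == dist(b, a), only the upper triangle of the
# distance matrix needs to be computed; the lower half is mirrored.

def distance_matrix(items, dist):
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):    # only j > i: half the combinations
            d = dist(items[i], items[j])
            matrix[i][j] = d
            matrix[j][i] = d         # mirror instead of recomputing
    return matrix

m = distance_matrix([1, 4, 9], lambda a, b: abs(a - b))
# m is [[0, 3, 8], [3, 0, 5], [8, 5, 0]]
```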
More improvement is possible. But I gave up on that today. Too complex.
I also started working out the details for a K-Means algorithm at the airport this weekend. On paper it looks sensible, and I believe it should work as I wrote it down. Now I only need to beam my scribblings onto the hard disk. And hope it all works ;)
Late last night I started having a look at the K-Means algorithm. Seems easy as such. But as I want to keep a general approach so one can cluster any kind of object, it becomes a less trivial task.
The underlying assumption of the K-Means algorithm is that the data elements need to be representable in vector space. This is something I cannot get around.
It does not look too complicated. I just have to find a general way to calculate the centroid of a set of objects. Maybe this requires the user to supply a utility function (similar to the distance function in hierarchical clustering). But I'll try to avoid that.
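As a sketch of that idea, a centroid helper taking an optional user-supplied "mean" function might look like this (names and signature are illustrative only, not the package's API):

```python
# For plain numbers the centroid is just the arithmetic mean; for
# arbitrary objects the caller would provide an equivalent "mean" of
# their own type via the optional callback.

def centroid(items, mean=None):
    if mean is not None:
        return mean(items)
    return sum(items) / len(items)   # default: arithmetic mean

print(centroid([1.0, 2.0, 6.0]))    # 3.0

# For 2-D points, the caller could supply something like:
mean_2d = lambda pts: (sum(x for x, _ in pts) / len(pts),
                       sum(y for _, y in pts) / len(pts))
print(centroid([(0, 0), (2, 4)], mean=mean_2d))   # (1.0, 2.0)
```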
1.0.1b1 is released.
This now supports different linkage algorithms.
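The common linkage methods boil down to how the pairwise distances between two clusters are reduced to a single number. A rough sketch (illustrative only; see the package docs for the actual option names):

```python
# Three standard linkage methods for hierarchical clustering.

def single_linkage(a, b, dist):
    """Distance between the closest pair across the two clusters."""
    return min(dist(x, y) for x in a for y in b)

def complete_linkage(a, b, dist):
    """Distance between the farthest pair across the two clusters."""
    return max(dist(x, y) for x in a for y in b)

def average_linkage(a, b, dist):
    """Mean of all pairwise distances across the two clusters."""
    return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))

d = lambda x, y: abs(x - y)
print(single_linkage([1, 2], [5, 8], d))    # 3
print(complete_linkage([1, 2], [5, 8], d))  # 7
print(average_linkage([1, 2], [5, 8], d))   # 5.0
```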
It is becoming obvious though that some rethinking is needed soon to implement other clustering algorithms. We will see.
I am hoping that this won't be the case, as I intend to enable the different algorithms first and worry about optimizing later.
I gave up on SVN for now. Somehow SourceForge does not like me. As this module has a pretty simple file structure anyway, I can live with that.
For now it's available in CVS. Information on accessing CVS can be found here: http://sourceforge.net/cvs/?group_id=170665
Alright. The project home page is up. There's not much on it yet, but it's there so people aren't presented with an empty dir listing anymore.
I tried to get the project into SVN, but I get some errors. I am still investigating that.
The first algorithm is finished. Next I will implement the different methods of calculating the distance between one cluster and another. Once that is done I will implement the other clustering algorithms. This second part could take a while, as I don't need it myself; I would only do it to make this package more complete.
I identified two opportunities for optimisation.
- Every iteration during clustering, the matrix is completely re-generated. Instead, when clustering a pair of items, it should only remove those two elements from the list and append the new cluster. This would save an awful lot of operations.
- The distance from A to B is the same as from B to A. That means that the matrix is symmetric. Therefore, we only need to generate and examine half of the matrix. Again, that would be a massive speedup.
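The first point can be sketched roughly like this (a simplified illustration, not the module's real code):

```python
# One agglomeration step: instead of rebuilding everything, find the
# closest pair, remove those two clusters from the list and append the
# merged cluster.

def merge_closest(clusters, dist):
    """Merge the two closest clusters in place and return the list."""
    pairs = [(dist(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    _, i, j = min(pairs)
    merged = clusters[i] + clusters[j]
    # delete the higher index first so the lower one stays valid
    del clusters[j]
    del clusters[i]
    clusters.append(merged)
    return clusters

items = [[1], [2], [10]]
d = lambda a, b: abs(a[0] - b[0])
print(merge_closest(items, d))   # [[10], [1, 2]]
```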
Good good. It's working.
One last test with a larger data set is currently running. Once that's done and shows proper results, I'll submit new files. Due to the horrid complexity of most clustering algorithms, and because I did not yet worry about optimizing this stuff, it runs terribly slowly on large data sets.
The crash bug is now solved.
But now something else crept up. Somehow the data returned is not quite correct. Fiddling around with that method again makes me shiver. I've tried many times, but somehow I always screw up that one, although it should be really easy.
Great. Now I hammered together a quick webpage for this thing, only to realise that the only way to upload it is via SSH. That means no webpage for this project before I get home. I'm disappointed :|
Oh well. All you need is in the Python docs anyway. If you need to know how things work, run a Python shell and do this:
from cluster import *