MySQL extension functions / Feature Requests / #3 AGGLOM - Simple agglomerative clustering for MySQL?

#3 AGGLOM - Simple agglomerative clustering for MySQL?

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2005-03-24

Created: 2005-03-24

Creator: Dan Bolser

Private: No

The UDF would work on any NUMBER column, and return the
number of 'clusters' using agglomerative clustering
with a certain threshold as an input.

Agglomerative clustering merges any two numbers that
are within the 'threshold', and replaces those numbers
with the average of the two. The clustering proceedes
smallest 'gap' first, and stops when no two numbers are
within the threshold.

The result would be the number (or value) of the
remaining clusters.

For example

C1 C2
A 1
A 2
A 3
A 4
A 5
A 6
A 7
B 10
B 11
B 12
B 56
B 57
B 58
B 99
B 101

SELECT C1, AGGLOM(C2,1) AS C3;

C1 C3
A 4
B 6

SELECT C1, AGGLOM(C2,2) AS C3;

C1 C3
A 3
B 3

SELECT C1, AGGLOM(C2,3) AS C3;

C1 C3
A 2
B 3

SELECT C1, AGGLOM(C2,4) AS C3;

C1 C3
A 1
B 3

SELECT C1, AGGLOM(C2,50) AS C3;

C1 C3
A 1
B 1

Remember, merge numbers with the smallest difference
first, and replace each pair with the average of the
two. Recalculate the differences for the new number,
and repeat until no distance is smaller than the threshold.

This is a usefull 'hack' to see if a distribution is bi
or multi modal for example. It is very quick to
calculate using a hash table, and could be a great
function to add.

Discussion

Dan Bolser - 2005-03-24

Logged In: YES
user_id=810134

I think I meant a *heap* not a *hash*. Use the heap to store
the smallest difference between the two numbers, then take
that pair off the heap, cluster (if < thresh), stick the new
number in the heap and re-sort the heap.

I think that makes it O(n log n) in the worst case, and you
could optimize against the worst case if the threshold is
bigger than the average value (in which case I think their
is only ever one cluster (not sure).

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

AGGLOM - Simple agglomerative clustering for MySQL?

Group

Searches

Help

#3 AGGLOM - Simple agglomerative clustering for MySQL?

Discussion