The UDF would work on any NUMBER column, and return the
number of 'clusters' using agglomerative clustering
with a certain threshold as an input.
Agglomerative clustering merges any two numbers that
are within the 'threshold', and replaces those numbers
with the average of the two. The clustering proceedes
smallest 'gap' first, and stops when no two numbers are
within the threshold.
The result would be the number (or value) of the
remaining clusters.
For example
C1 C2
A 1
A 2
A 3
A 4
A 5
A 6
A 7
B 10
B 11
B 12
B 56
B 57
B 58
B 99
B 101
SELECT C1, AGGLOM(C2,1) AS C3;
C1 C3
A 4
B 6
SELECT C1, AGGLOM(C2,2) AS C3;
C1 C3
A 3
B 3
SELECT C1, AGGLOM(C2,3) AS C3;
C1 C3
A 2
B 3
SELECT C1, AGGLOM(C2,4) AS C3;
C1 C3
A 1
B 3
SELECT C1, AGGLOM(C2,50) AS C3;
C1 C3
A 1
B 1
Remember, merge numbers with the smallest difference
first, and replace each pair with the average of the
two. Recalculate the differences for the new number,
and repeat until no distance is smaller than the threshold.
This is a usefull 'hack' to see if a distribution is bi
or multi modal for example. It is very quick to
calculate using a hash table, and could be a great
function to add.
Logged In: YES
user_id=810134
I think I meant a *heap* not a *hash*. Use the heap to store
the smallest difference between the two numbers, then take
that pair off the heap, cluster (if < thresh), stick the new
number in the heap and re-sort the heap.
I think that makes it O(n log n) in the worst case, and you
could optimize against the worst case if the threshold is
bigger than the average value (in which case I think their
is only ever one cluster (not sure).