Re: [Rdkit-discuss] Complete Link clustering in RDKit
From: Gonzalo C. <col...@gm...> - 2016-07-19 14:21:00
Thanks for this, Greg.

In my experience, Butina's (sphere exclusion) clustering produces more coherent clusters but adapts worse to small variations within a given chemotype. Complete linkage is more flexible and robust, and it is the one I prefer, although it is slower.

Btw, what's the difference between ML.Clustering and Chem.Fingerprints.ClusterMols?

Thanks
Gonzalo

On Tue, Jul 19, 2016 at 1:44 PM, Greg Landrum <gre...@gm...> wrote:

> Hi Gonzalo,
>
> On Mon, Jul 18, 2016 at 9:54 AM, Gonzalo Colmenarejo <
> col...@gm...> wrote:
>
>> I have succeeded in running a clustering of a set of molecules with the
>> Complete Link Hierarchical clustering algorithm in RDKit. However, what I
>> obtain is a cluster hierarchy object. I'd like to figure out now how to
>> assign molecules to clusters for a particular similarity cutoff in the
>> Complete Link algorithm (rather than provide the system with the number of
>> clusters). Does anyone know how to do it?
>
> That's a good question, and one I had to think about for a bit in order to
> come up with an answer.
>
> Here's a notebook showing how I solved the problem:
> https://gist.github.com/greglandrum/6ff63e602b33d3c90d5b41325a4791ce
>
> The key is to know that the Cluster object's GetMetric() method returns
> whatever the merge metric was for that particular cluster. For Complete
> Linkage this corresponds to the largest distance (lowest similarity)
> between points in the cluster. You can recurse through the cluster tree
> using GetMetric() to pick out the sub-trees that are within your desired
> cutoff value (this is the look() function in my notebook). Recursing
> through those trees to get the leaves (the get_leaves() function in my
> notebook) allows you to get the indices of the molecules.
>
> This is likely to turn into an RDKit blog post (probably comparing the
> sk-learn clustering with the RDKit clustering); it's an interesting little
> problem and the solution could be pretty useful for comparing the output of
> hierarchical methods with things like Butina clustering.
>
> Best,
> -greg
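
For readers without the notebook at hand, here is a minimal sketch of the recursion Greg describes: walk the tree from the root, keep any sub-tree whose merge metric is within the cutoff, and collect the leaf indices of each kept sub-tree. It is not the code from the notebook. The Node class is a hypothetical stand-in so the example runs without RDKit; only GetMetric() is named in the message above, and the other method names (GetChildren(), GetIndex(), IsTerminal()) are assumptions that should be checked against the RDKit Cluster class before use. The function name cut_tree is likewise made up for this example.

```python
class Node:
    """Hypothetical stand-in for a cluster-tree node (not the RDKit class)."""

    def __init__(self, metric=0.0, children=(), index=None):
        self._metric = metric          # merge distance; 0.0 for a leaf
        self._children = list(children)
        self._index = index            # molecule index, set on leaves only

    def GetMetric(self):
        return self._metric

    def GetChildren(self):
        return self._children

    def GetIndex(self):
        return self._index

    def IsTerminal(self):
        return not self._children


def get_leaves(node):
    """Return the indices of all leaves at or below this node."""
    if node.IsTerminal():
        return [node.GetIndex()]
    leaves = []
    for child in node.GetChildren():
        leaves.extend(get_leaves(child))
    return leaves


def cut_tree(node, cutoff):
    """Return one list of leaf indices per sub-tree with metric <= cutoff."""
    # A leaf is always its own cluster; an internal node is kept whole
    # when its merge metric (largest intra-cluster distance) is in range.
    if node.IsTerminal() or node.GetMetric() <= cutoff:
        return [get_leaves(node)]
    clusters = []
    for child in node.GetChildren():
        clusters.extend(cut_tree(child, cutoff))
    return clusters


# Toy tree: leaves 0 and 1 merge at distance 0.2, then leaf 2 joins at 0.7.
tree = Node(0.7, [Node(0.2, [Node(index=0), Node(index=1)]), Node(index=2)])
clusters = cut_tree(tree, cutoff=0.3)
print(clusters)
```

The cutoff here is a distance. If the tree was built from distances defined as 1 - similarity (an assumption about how the input matrix was prepared), a similarity cutoff s translates to a distance cutoff of 1 - s. The top-down cut relies on the merge metric never decreasing toward the root, which holds for complete linkage.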