Re: [Rdkit-discuss] Butina clustering with additional output
From: Andrew D. <da...@da...> - 2018-09-26 19:08:07
On Sep 26, 2018, at 20:26, Peter S. Shenkin <sh...@gm...> wrote:
> Ah, David, but how do you define a "real" singleton?

There can be many different definitions of what a '"real" singleton' might be, but we are specifically talking about Butina clustering.

The Butina paper defines the term "false singleton", which Dave quoted. The relevant text from DOI: 10.1021/ci9803381 is:

"""The molecules that have not been flagged by the end of the clustering process, either as a cluster centroid or as a cluster member, become singletons. It is important to emphasize at this stage that one of the consequences of this approach is that some molecules defined as singletons may have neighbors at the given Tanimoto similarity index, but those neighbors have been excluded by a ‘stronger’ cluster centroid, i.e., one with more neighbors in its list. .... the problem with the creation of a number of false singletons that do in fact have similar compounds within the set is easily offset by the final quality of the clusters that this approach generates."""

As you can see, there are two types of singletons, and one is called "false singleton". No specific name is used for the other type of singleton, but it's easy to see how they can be called "real" singletons, without confusion or misunderstanding.

(FWIW, my implementation, mentioned in an earlier email, uses the term "true singleton" for a singleton which is not a "false singleton", but the difference is only in the label.)

To confirm that this is what Dave means, I'll quote from his paper:

Blomberg, N., Cosgrove, D. A., Kenny, P. W., & Kolmodin, K. (2009). Design of compound libraries for fragment screening. Journal of Computer-Aided Molecular Design, 23(8), 513–525. doi:10.1007/s10822-009-9264-5

"""The clustering program flush_clus is an implementation of the sphere-exclusion algorithm of Taylor [41], which has also been reported independently by Butina ...

One consequence of the algorithm is the production of ‘false singleton clusters.’ The final clusters in the output are invariably singleton clusters, where the only member is the seed. Some of these will be true singletons, i.e. molecules lacking neighbors within the clustering threshold, but others (the false singletons) will be singletons by virtue of the fact that their neighbors were placed in other larger clusters in a previous iteration of the algorithm. The flush_clus program offers the opportunity of performing a final sweep through the clusters using a larger similarity threshold and placing the singleton molecules within the cluster for which it has the greatest similarity with the seed, so long as this is within the threshold."""

Cheers,

Andrew
da...@da...
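
The true/false singleton distinction quoted above can be made concrete with a minimal Python sketch. This is not RDKit's Butina code, not flush_clus, and not any implementation mentioned in this thread. The function name butina_singletons, the neighbor-list input format, the tie-breaking rule, and the toy data are all assumptions made for illustration. The sketch visits molecules in order of decreasing neighbor count, builds clusters from the still-unassigned neighbors, and labels each leftover singleton as "true" (no neighbors at the threshold) or "false" (has neighbors, but all of them were claimed by an earlier centroid).

```python
def butina_singletons(neighbors):
    """Butina-style sweep that also labels singletons as true or false.

    neighbors: dict mapping each molecule index to the set of OTHER indices
    whose similarity to it meets the clustering threshold. This input
    format is a hypothetical choice for the sketch.
    """
    # Visit candidates with the most neighbors first. Breaking ties by
    # lower index is an assumption; real implementations may differ.
    order = sorted(neighbors, key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters, true_singletons, false_singletons = [], [], []
    for centroid in order:
        if centroid in assigned:
            continue  # already a member of an earlier, "stronger" cluster
        members = sorted(j for j in neighbors[centroid] if j not in assigned)
        assigned.add(centroid)
        assigned.update(members)
        if members:
            clusters.append((centroid, members))
        elif neighbors[centroid]:
            # Has neighbors at the threshold, but all were taken already.
            false_singletons.append(centroid)
        else:
            # No neighbors at the threshold at all.
            true_singletons.append(centroid)
    return clusters, true_singletons, false_singletons


# Toy data: molecule 4 is similar only to 3, but 3 is claimed by centroid 0.
# Molecule 5 has no neighbors at all.
toy = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0, 4}, 4: {3}, 5: set()}
clusters, true_s, false_s = butina_singletons(toy)
print("clusters:", clusters)
print("true singletons:", true_s)
print("false singletons:", false_s)
```

In the toy data, molecule 4 ends up as a false singleton and molecule 5 as a true one. A final sweep like the one described for flush_clus would then try to place the false singletons into existing clusters using a larger threshold; that step is not shown here.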