Re: [Rdkit-discuss] Butina clustering with additional output
From: Philipp T. <th...@in...> - 2018-09-21 12:53:21
Hi,

you probably read about the Tanimoto being a proper metric for binary data in Leach and Gillet, 'Introduction to Chemoinformatics', chapter 5.3.1 of the revised edition.

Best,
Philipp Thiel

> From: "David Cosgrove" <dav...@gm...>
> To: "Chris Earnshaw" <cge...@gm...>
> Cc: "Rdk...@li..." <rdk...@li...>,
> "James T. Metz" <jam...@ao...>
> Sent: Friday, 21 September, 2018 13:45:18
> Subject: Re: [Rdkit-discuss] Butina clustering with additional output
>
> I used to have a paper that demonstrated that the Tanimoto coefficient does, in
> fact, obey the triangle inequality. I fear I lost access to it when I retired
> but maybe a determined google expert could rediscover it.
>
> I expect James means what we used to call the cluster seed, i.e. the molecule
> the cluster was based on, rather than the mathematical centroid. Calculating
> distances from each cluster member to that would be quite straightforward as a
> post-processing step although that would roughly double the time taken.
>
> Regards,
> Dave
>
> On Fri, 21 Sep 2018 at 09:55, Chris Earnshaw <cge...@gm...> wrote:
>> Hi
>>
>> I'm afraid I can't help with an RDKit solution to your question, but there are a
>> couple of issues which should be borne in mind:
>>
>> 1) The centroid of a cluster is a vector mean of the fingerprints of all the
>> members of the cluster and probably will not be represented exactly by any
>> member of the cluster; in this case no structures will have a distance of 0.0
>> from the centroid. Do you want to calculate the distances from the true
>> centroid or from the structure(s) closest to the centroid?
>>
>> 2) The Tanimoto metric doesn't obey the triangle inequality and is therefore
>> sub-optimal for this kind of analysis. It's better to use an alternative which
>> does obey the triangle inequality - e.g. the Cosine metric.
>>
>> Regards,
>> Chris Earnshaw
>>
>> On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss <rdk...@li...> wrote:
>>> RDKit Discussion Group,
>>>
>>> I note that RDKit can perform Butina clustering. Given an SDF of
>>> small molecules I would like to cluster the ligands, but obtain additional
>>> information from the clustering algorithm. In particular, I would like to obtain
>>> the cluster number and Tanimoto distance from the centroid for every ligand
>>> in the SDF. The centroid would obviously have a distance of 0.00.
>>>
>>> Has anyone written additional RDKit code to extract this additional information?
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Jim Metz
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdk...@li...
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
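
For what it's worth, the post-processing step Dave describes (distance from every cluster member to the cluster seed) could be sketched roughly as below. This is only an illustration in plain Python, not code from the thread or from RDKit. It assumes fingerprints are available as sets of on-bit indices and that the clusters are a tuple of index tuples whose first entry is the seed, which is the layout RDKit's Butina.ClusterData (in rdkit.ML.Cluster) returns. The helper names tanimoto_distance and distances_to_seed and the toy data are made up for the example.

```python
from typing import List, Sequence, Set, Tuple


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - Tanimoto similarity for two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints count as identical.
        return 0.0
    return 1.0 - len(a & b) / union


def distances_to_seed(
    fps: Sequence[Set[int]],
    clusters: Sequence[Sequence[int]],
) -> List[Tuple[int, int, float]]:
    """Return (molecule index, cluster number, distance to seed) for every molecule.

    Assumes the first index in each cluster is the seed ("centroid"),
    so each seed gets a distance of 0.0.
    """
    rows = []
    for cluster_id, members in enumerate(clusters):
        seed = members[0]
        for idx in members:
            rows.append((idx, cluster_id, tanimoto_distance(fps[idx], fps[seed])))
    # Sort by molecule index so the output follows the input (SDF) order.
    return sorted(rows)


# Toy data standing in for real fingerprints and a real clustering result.
fps = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2, 3, 4, 5}, {10, 11}, {10, 11, 12}]
clusters = ((0, 1, 2), (3, 4))

rows = distances_to_seed(fps, clusters)
for mol_idx, cluster_id, dist in rows:
    print(f"molecule {mol_idx}  cluster {cluster_id}  distance {dist:.2f}")
```

With real RDKit bit vectors one would presumably replace tanimoto_distance with 1 minus the values from DataStructs.BulkTanimotoSimilarity, called once per cluster with the seed fingerprint against the members' fingerprints.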