Re: [Rdkit-discuss] Butina clustering with additional output
From: Andrew D. <da...@da...> - 2018-09-25 12:09:21
On Sep 21, 2018, at 14:53, Philipp Thiel <th...@in...> wrote:
> you probably read about the Tanimoto being a proper metric in case of having binary data
> in Leach and Gillet 'Introduction to Chemoinformatics' chapter 5.3.1 in the revised edition.

What we call Tanimoto is more broadly known as the Jaccard index. Various sites demonstrate that the Jaccard distance = 1-Jaccard = 1-Tanimoto is a metric, such as https://mathoverflow.net/questions/18084/is-the-jaccard-distance-a-distance and https://arxiv.org/abs/1612.02696 .

Going back to James T. Metz's original question, one alternative might be to use chemfp and the Taylor-Butina clustering implementation available at:

  http://dalkescientific.com/writings/taylor_butina.py

Following Dave Cosgrove's advice:

> I expect James means what we used to call the cluster seed, i.e. the molecule
> the cluster was based on, rather than the mathematical centroid. Calculating
> distances from each cluster member to that would be quite straightforward as
> a post-processing step although that would roughly double the time taken.

it's possible to change the reporting code from:

    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile)
        print("=>", " ".join(arena.ids[idx] for idx in members), file=outfile)

so it does the post-processing:

    print(len(clusters), "clusters", file=outfile)
    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile)
        # Build an arena holding only this cluster's members
        subarena = arena.copy(indices=members)
        centroid_fp = arena.get_fingerprint(centroid_idx)
        # A threshold of 0.0 returns every member along with its score
        result = subarena.threshold_tanimoto_search_fp(centroid_fp, threshold=0.0)
        result.reorder()  # sort so the highest scores come first
        for id, score in result.get_ids_and_scores():
            print("=>", id, "score:", score, file=outfile)

Cheers,

Andrew
da...@da...
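
For readers without chemfp at hand, the two ideas in this message (1-Tanimoto behaving as a distance, and scoring each cluster member against the cluster seed) can be illustrated with a minimal pure-Python sketch. It is not the chemfp implementation: fingerprints are modelled as sets of on-bit indices, the names `tanimoto`, `jaccard_distance`, `rank_members`, and the toy fingerprints are all hypothetical, and treating two empty fingerprints as identical is an assumption made here (conventions differ between toolkits).

```python
from itertools import permutations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Assumption: two empty fingerprints count as identical, so that
        # the derived distance of a fingerprint to itself is always 0.
        return 1.0
    return len(a & b) / union


def jaccard_distance(a, b):
    """Jaccard distance, i.e. 1 - Tanimoto."""
    return 1.0 - tanimoto(a, b)


def rank_members(seed, members, fps):
    """Return (id, score) pairs for the members, most similar to the seed first."""
    scored = [(m, tanimoto(fps[seed], fps[m])) for m in members]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy fingerprints (made up for illustration only).
fps = {
    "seed": frozenset({1, 2, 3, 4}),
    "m1": frozenset({1, 2, 3}),
    "m2": frozenset({1, 2, 5, 6}),
    "m3": frozenset({1, 2, 3, 4, 5}),
}

# Spot-check the triangle inequality on every ordered triple of toy fingerprints.
# This is a check on four examples, not a proof; see the links above for proofs.
triangle_ok = all(
    jaccard_distance(fps[x], fps[z])
    <= jaccard_distance(fps[x], fps[y]) + jaccard_distance(fps[y], fps[z]) + 1e-12
    for x, y, z in permutations(fps, 3)
)
print("triangle inequality holds on toy data:", triangle_ok)

# Post-processing step: score each member of one cluster against its seed.
ranked = rank_members("seed", ["m1", "m2", "m3"], fps)
for name, score in ranked:
    print("=>", name, "score:", round(score, 3))
```

The ranking step mirrors what the chemfp search plus `result.reorder()` does for one cluster, only without the arena machinery, so it is fine for a handful of fingerprints but will not scale the way chemfp does.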