Re: [Rdkit-discuss] Butina clustering with additional output


On Sep 21, 2018, at 14:53, Philipp Thiel <th...@in...> wrote:
> you probably read about the Tanimoto being a proper metric in case of having binary data
> in Leach and Gillet 'Introduction to Chemoinformatics' chapter 5.3.1 in the revised edition.

What we call Tanimoto is more broadly known as the Jaccard index. Various sources demonstrate that the Jaccard distance = 1 - Jaccard = 1 - Tanimoto is a metric, such as https://mathoverflow.net/questions/18084/is-the-jaccard-distance-a-distance and https://arxiv.org/abs/1612.02696 .
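
As a small sanity check of that claim, here is a sketch in plain Python (no RDKit or chemfp involved) that treats each fingerprint as a set of on-bit positions and looks for triangle-inequality violations over a handful of random fingerprints. The helper names `tanimoto` and `jaccard_distance` are made up for this example, and a brute-force check like this is only an illustration, not a proof.

```python
import itertools
import random

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit positions."""
    if not a and not b:
        # Convention assumed for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """Jaccard distance, i.e. 1 - Tanimoto."""
    return 1.0 - tanimoto(a, b)

# Build 30 random "fingerprints", each with 1 to 20 bits set out of 64.
rng = random.Random(0)
fps = [frozenset(rng.sample(range(64), rng.randint(1, 20))) for _ in range(30)]

# Count triples where d(x, z) > d(x, y) + d(y, z), allowing for float rounding.
violations = 0
for x, y, z in itertools.permutations(fps, 3):
    if jaccard_distance(x, z) > jaccard_distance(x, y) + jaccard_distance(y, z) + 1e-12:
        violations += 1

print("triangle inequality violations:", violations)
```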

Going back to James T. Metz's original question, one alternative might be to use chemfp and the Taylor-Butina clustering implementation available at: 

  http://dalkescientific.com/writings/taylor_butina.py

Following Dave Cosgrove's advice: 

> I expect James means what we used to call the cluster seed, i.e. the molecule the cluster was based on, rather than the mathematical centroid. Calculating distances from each cluster member to that would be quite straightforward as a post-processing step although that would roughly double the time taken. 

it's possible to change the reporting code from:

    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile)
        print("=>", " ".join(arena.ids[idx] for idx in members), file=outfile)

to the following, which does the post-processing:

    print(len(clusters), "clusters", file=outfile)
    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile)
        subarena = arena.copy(indices=members)
        centroid_fp = arena.get_fingerprint(centroid_idx)
        result = subarena.threshold_tanimoto_search_fp(centroid_fp, threshold=0.0)
        result.reorder()  # sort so the highest scores come first
        for member_id, score in result.get_ids_and_scores():
            # write to the same output file as the rest of the report
            print("=>", member_id, "score:", score, file=outfile)
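
For anyone who wants to see the idea without installing chemfp, the same post-processing step can be sketched in plain Python. This is only an illustration under some assumptions: fingerprints are represented as sets of on-bit positions, `clusters` is a list of (seed index, member indices) pairs like the one the script produces, and the names `tanimoto`, `report_clusters` and the toy molecule ids are made up for the example.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def report_clusters(clusters, ids, fps):
    """Build report lines, scoring each member against its cluster seed.

    clusters: list of (seed_idx, member_indices) pairs (assumed layout)
    ids:      list of identifiers, one per fingerprint
    fps:      list of sets of on-bit positions, same order as ids
    """
    lines = [f"{len(clusters)} clusters"]
    for seed_idx, members in clusters:
        lines.append(f"{ids[seed_idx]} has {len(members)} other members")
        # Similarity of each member to the seed, highest score first.
        scored = [(ids[i], tanimoto(fps[seed_idx], fps[i])) for i in members]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        for member_id, score in scored:
            lines.append(f"=> {member_id} score: {score:.3f}")
    return lines

# Toy data: four hypothetical molecules and two clusters.
ids = ["mol_A", "mol_B", "mol_C", "mol_D"]
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3}),
    frozenset({1, 2, 3, 4, 5, 6, 7, 8}),
    frozenset({20, 21}),
]
clusters = [(0, [2, 1]), (3, [])]

report = report_clusters(clusters, ids, fps)
print("\n".join(report))
```

This is one extra similarity calculation per cluster member, which matches Dave's estimate that the post-processing roughly doubles the work.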

Cheers,

				Andrew
				da...@da...
