[Rdkit-discuss] history of RDKit's count Tanimoto
From: Andrew D. <da...@da...> - 2025-09-04 16:40:13
Hi all,

RDKit implements Tanimoto similarity for count fingerprints. Only last week did I realize that the meaning of "Tanimoto similarity" for count fingerprints has changed, and RDKit seems to be the reason for the shift. I'm curious to know the history.

* Tanimoto #1 is Σaᵢbᵢ/(Σaᵢ²+Σbᵢ²-Σaᵢbᵢ), that is, it interprets a count fingerprint as a vector.

  The oldest citation I have is Bawden, "Browsing and Clustering of Chemical Structures", on p. 147 of "Chemical structures" (1988), from the first ICCS. A more accessible citation is Willett, "Chemical Similarity Searching", JCICS (1998) 38, 983-996, available at https://web.archive.org/web/20040218213916/http://www-personal.engin.umich.edu:80/~wildd/che697/willett98.pdf . See page 987, the "formula for continuous values" under "Tanimoto Coefficient". My literature search shows it was the main definition for almost 30 years.

* Tanimoto #2 is Σmin(aᵢ,bᵢ)/Σmax(aᵢ,bᵢ), that is, what Wikipedia calls the "weighted Jaccard similarity."

  This is what RDKit uses. It was committed to Code/DataStructs/SparseIntVect.h on 2009-Jun-18, as part of adding Tversky similarity, and a couple of years after adding Dice similarity. I believe that as a result of RDKit's popularity, recent papers have taken to describing this as, for example, "the counted Tanimoto similarity", as in https://jcheminf.biomedcentral.com/articles/10.1186/s13321-025-01081-6 ("also known as the multiset coefficient calculation").

Does anyone here know how RDKit came to be the way it is?

From my literature search, I believe the similarity function for Tanimoto #2 was first proposed by Henry Allan Gleason, "Some Applications of the Quadrat Method", Bulletin of the Torrey Botanical Club, Vol. 47, No. 1 (Jan., 1920), pp. 21-33, starting on page 31, where he proposes adding species abundance to Jaccard's similarity.
See https://archive.org/details/jstor-2480223/page/n11/mode/2up

Some people (and https://en.wikipedia.org/wiki/Jaccard_index) refer to this as Ruzicka similarity, from Ruzicka (1958), but in the Mastodon discussion at https://mstdn.science/@molecule/115142680945701031 you'll see that wim (@mol...@ms...) got a copy of the relevant part of Ruzicka's paper, and it appears to be identical to Gleason's extension to Jaccard similarity -- not even in the cool looking min/max formulation as attributed in, eg, https://archive.org/details/dictionaryofdist0000deza/mode/2up?q=Ruzicka .

The first paper to apply Tanimoto #2 to fingerprints appears to be Swamidass et al., "Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity", Bioinformatics, Volume 21, Issue suppl_1, June 2005, Pages i359–i368, https://doi.org/10.1093/bioinformatics/bti1055 where they call it the "MinMax" kernel and explicitly compare it to Tanimoto #1. Some papers since then refer to Tanimoto #2 as MinMax.

Now, I was able to find a use of (1-Tanimoto #2) as a similarity measure ("measure" used in its mathematical meaning) in Thomas Ott, Albert Kern, Ansgar Schuffenhauer, Maxim Popov, Pierre Acklin, Edgar Jacoby, and Ruedi Stoop, "Sequential Superparamagnetic Clustering for Unbiased Classification of High-Dimensional Chemical Data", J. Chem. Inf. Comput. Sci. 2004, 44, 1358-1364, available from https://tilde.ini.uzh.ch/users/tott/public_html/jcheminf.pdf but it is unnamed -- and a measure, not a similarity.

That makes me quite curious about how RDKit ended up the way it is.

To be clear, I prefer the similarity function given in #2 over that of #1, though I think having two "Tanimoto" definitions is going to be confusing. If only the Sheffield folks back in the 1980s had known. But hey, that's how we ended up with "Tanimoto" instead of "Jaccard". :)

Best regards,

Andrew
da...@da...

P.S. If anyone knows of an older citation, please let me know. There aren't good search tools for finding this formula, so it's a lot of tedious manual work.
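
To make the difference between the two definitions concrete, here is a minimal pure-Python sketch of both formulas as written above. It does not use RDKit; the names `tanimoto_vector` and `tanimoto_minmax`, the dict-based fingerprint representation, and the choice to return 0.0 for two empty fingerprints are all assumptions made for illustration only.

```python
from typing import Dict

# Hypothetical sparse count fingerprint: feature id -> count (missing ids count as 0).
Counts = Dict[int, int]


def tanimoto_vector(a: Counts, b: Counts) -> float:
    """Tanimoto #1: sum(a*b) / (sum(a^2) + sum(b^2) - sum(a*b))."""
    dot = sum(count * b.get(key, 0) for key, count in a.items())
    denom = sum(v * v for v in a.values()) + sum(v * v for v in b.values()) - dot
    # Assumption: two empty fingerprints score 0.0 rather than raising an error.
    return dot / denom if denom else 0.0


def tanimoto_minmax(a: Counts, b: Counts) -> float:
    """Tanimoto #2 (MinMax / weighted Jaccard): sum(min) / sum(max)."""
    keys = a.keys() | b.keys()
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Same assumption as above for the empty case.
    return num / den if den else 0.0


# Count fingerprints: the two definitions disagree.
fp_a = {1: 2, 2: 1}
fp_b = {1: 1, 2: 1, 3: 1}
print(tanimoto_vector(fp_a, fp_b))  # 3 / (5 + 3 - 3) = 0.6
print(tanimoto_minmax(fp_a, fp_b))  # 2 / 4 = 0.5

# Binary (0/1) fingerprints: both reduce to the usual Tanimoto/Jaccard value.
bin_a = {1: 1, 2: 1}
bin_b = {2: 1, 3: 1}
print(tanimoto_vector(bin_a, bin_b) == tanimoto_minmax(bin_a, bin_b))  # True
```

For 0/1 fingerprints the two functions agree, which is why the distinction only shows up once counts larger than 1 are involved.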