suggested fix
The Tanimoto calculation for raw fingerprints was broken. This is an implementation of the continuous Tanimoto score used in DOI: 10.1021/ci800326z
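For readers without the patch at hand, here is a minimal sketch of a continuous Tanimoto score on count fingerprints. It assumes the commonly used form T = Σxy / (Σx² + Σy² − Σxy), which matches the remark later in this thread that the XY term multiplies the feature counts; whether it is term-for-term the formula in the cited paper has not been checked here. The class name, the method name, and the choice to return 0 for two empty fingerprints are all hypothetical and are not taken from the patch or from CDK.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the attached patch and not the CDK API.
final class ContinuousTanimotoSketch {

    /**
     * Continuous Tanimoto on feature -> count maps, assuming the form
     * sum(x*y) / (sum(x*x) + sum(y*y) - sum(x*y)).
     */
    static double tanimoto(Map<String, Integer> a, Map<String, Integer> b) {
        double xx = 0.0, yy = 0.0, xy = 0.0;
        for (int count : a.values()) {
            xx += (double) count * count;
        }
        for (int count : b.values()) {
            yy += (double) count * count;
        }
        // Only features present in both fingerprints contribute to the XY term.
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                xy += (double) e.getValue() * other;
            }
        }
        double denominator = xx + yy - xy;
        // Assumption: two empty fingerprints score 0 rather than raising an error.
        return denominator == 0.0 ? 0.0 : xy / denominator;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = new HashMap<>();
        a.put("A", 2);
        a.put("B", 1);
        Map<String, Integer> b = new HashMap<>();
        b.put("A", 1);
        b.put("C", 3);
        // xx = 5, yy = 10, xy = 2, so the score is 2 / 13.
        System.out.println(tanimoto(a, b));
    }
}
```

With this form, identical fingerprints score 1 and fingerprints with no shared features score 0.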
A quick comment:
As to the change of algorithm, the original author should comment on that.
I think this formula is OK, but what about just using the one described at http://blueobelisk.shapado.com/questions/similarity-calculations-between-circular-fingerprints? I suggest an alternative because it is not clear why the proposed formula multiplies the feature counts in the XY term. The answer on the BO xchange site seems more logical.
I sort of see your point, but when choosing between something that is published and something found on a web page, I must say I lean towards the published one...
The poster on Shapado mentions using Tanimoto in another answer on that page, but it is unclear whether that is the same formula or a different one. A quick skim of that publication also did not reveal any Tanimoto equation (maybe I didn't look closely enough?). Since I do not have enough points to comment on Shapado, I cannot ask the poster... :(
However, I have now collected three different ways of doing this and I am beginning to wonder if maybe CDK should implement all of them? I mean who are we to make the choice for the user? What's the general CDK philosophy in these cases?
I am beginning to wonder if maybe CDK should implement all of them?
Yes, please! The CDK philosophy is in fact to provide alternative algorithms, unless it makes absolutely no sense (e.g. when the algorithm is of zero interest, like having two 2D layout engines).
An example where it does make sense is where alternative implementations have educational value, e.g. with SSSR algorithms. The CDK has two now, and a third is being developed.
I am looking forward to your analysis of the three math equations / algorithms for calculating the Tanimoto distance! And to their matching CDK implementations too!
An interesting plot would be to compare the Tanimoto distances from one algorithm with those from another. Are they about the same? Does either algorithm consistently predict larger or smaller distances? Is any difference linear over the full range of [0,1]? I assume you mean that the algorithms give different values. If they do not, the fastest is the only interesting one, although the same equation with a slow and a fast algorithm does have educational value too. If they do give different values, this is important to know, as it matters for selecting the more appropriate statistical method (as explained in my thesis).
OK, then this patch should go in
I had another look at the equation given on that Shapado site. I don't see how to apply it to count fingerprints. It looks to me like it is for bit fingerprints.
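The Shapado equation is not reproduced in this thread, so the following is only a sketch of what a bit-fingerprint Tanimoto usually looks like: c / (a + b − c), where a and b are the numbers of bits set in each fingerprint and c is the number of bits set in both. That the Shapado answer uses exactly this form is an assumption. The class and method names are hypothetical, and returning 0 for two empty fingerprints is a convention chosen here.

```java
import java.util.BitSet;

// Hypothetical sketch of the usual bit-fingerprint Tanimoto, c / (a + b - c).
final class BitTanimotoSketch {

    static double tanimoto(BitSet first, BitSet second) {
        // Work on a copy so the caller's fingerprint is not modified.
        BitSet common = (BitSet) first.clone();
        common.and(second);
        int c = common.cardinality();
        int union = first.cardinality() + second.cardinality() - c;
        // Assumption: two empty fingerprints score 0.
        return union == 0 ? 0.0 : (double) c / union;
    }

    public static void main(String[] args) {
        BitSet first = new BitSet();
        first.set(0, 3);   // bits 0, 1, 2
        BitSet second = new BitSet();
        second.set(1, 4);  // bits 1, 2, 3
        // Two bits in common, four bits in the union.
        System.out.println(tanimoto(first, second));
    }
}
```

This form only asks whether a bit is set, so it has no place for feature counts. When every count is 0 or 1, x² = x, and a count-based formula of the form Σxy / (Σx² + Σy² − Σxy) reduces to this bit formula.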
Jonathan, what should happen with this patch?
The attached patch is broken. The fix exists in another location. Closing this one.