Re: [Rdkit-discuss] topological fingerprints
From: Greg L. <gre...@gm...> - 2010-12-28 05:24:14
Dear TJ,

On Mon, Dec 27, 2010 at 11:41 PM, TJ O'Donnell <tj...@ac...> wrote:
> I was surprised that, using topological fingerprints, the tanimoto
> similarity between benzene and toluene is 0.32
> Examining the fp bits, I can see why. But I don't understand why so
> many paths are repeated for toluene.
> To my way of thinking, paths that trace the same types of atoms should
> not be considered different, and therefore
> set new bits. Am I missing something?

Maybe one point: using the default arguments to RDKFingerprint, each path (really each subgraph since they can be branched) sets multiple bits. This is controlled by the nBitsPerHash argument to RDKFingerprint.

Here's a demonstration:

In [10]: bz = Chem.MolFromSmiles('c1ccccc1')
In [11]: tl = Chem.MolFromSmiles('Cc1ccccc1')
In [12]: fp1 = Chem.RDKFingerprint(bz,nBitsPerHash=1)
In [13]: fp2 = Chem.RDKFingerprint(tl,nBitsPerHash=1)
In [14]: fp1.GetNumOnBits()
Out[14]: 6
In [15]: fp2.GetNumOnBits()
Out[15]: 19
In [16]: iBits=fp1&fp2
In [17]: iBits.GetNumOnBits()
Out[17]: 6
In [18]: fp12 = Chem.RDKFingerprint(bz,nBitsPerHash=2)
In [19]: fp22 = Chem.RDKFingerprint(tl,nBitsPerHash=2)
In [20]: fp12.GetNumOnBits()
Out[20]: 12
In [21]: fp22.GetNumOnBits()
Out[21]: 38
In [22]: iBits=fp12&fp22
In [23]: iBits.GetNumOnBits()
Out[23]: 12

The default is to set 4 bits per subgraph:

In [36]: fp14 = Chem.RDKFingerprint(bz)
In [37]: fp14.GetNumOnBits()
Out[37]: 24
In [38]: fp24 = Chem.RDKFingerprint(tl)
In [39]: fp24.GetNumOnBits()
Out[39]: 75

That last value is 75 instead of 76 because of a bit collision.

In some other validation work I've done recently, it's become pretty clear that the default value for nBitsPerHash is too high: the bit densities for drug-like molecules get really high, which leads to a general increase in calculated similarities and too many molecules that have high calculated similarities but that don't look much alike (due to bit collisions).
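
For anyone who wants to check the arithmetic, here is a rough sketch in plain Python that works only from the on-bit counts quoted in the session above. It makes no RDKit calls. The helper names tanimoto and expected_on_bits are hypothetical, the 2048-bit fingerprint size is an assumption about the default, and the uniform-random model of bit setting is an idealization rather than a description of the actual hashing. The common-bit count of 24 for the default case is also assumed, since the session does not show it.

```python
# Sketch only: arithmetic on the on-bit counts quoted in the session above.
# The helper names are hypothetical and nothing here calls RDKit.

def tanimoto(n_a, n_b, n_common):
    """Tanimoto similarity from on-bit counts: c / (a + b - c)."""
    return n_common / (n_a + n_b - n_common)


def expected_on_bits(n_subgraphs, bits_per_hash, fp_size=2048):
    """Expected number of distinct on bits if each subgraph sets
    bits_per_hash bits drawn uniformly at random (idealized model;
    fp_size=2048 is an assumed fingerprint length)."""
    n_draws = n_subgraphs * bits_per_hash
    return fp_size * (1.0 - (1.0 - 1.0 / fp_size) ** n_draws)


# nBitsPerHash -> (benzene on bits, toluene on bits, common on bits)
counts = {1: (6, 19, 6), 2: (12, 38, 12)}
for bits_per_hash, (n_bz, n_tl, n_common) in counts.items():
    print(bits_per_hash, round(tanimoto(n_bz, n_tl, n_common), 3))

# Default of 4 bits per subgraph: the common-bit count is not shown in the
# session; 24 is assumed because every benzene bit was shared for 1 and 2.
print(4, round(tanimoto(24, 75, 24), 3))

# Toluene: 19 subgraphs x 4 bits = 76 draws into the assumed 2048 bits.
print(round(expected_on_bits(19, 4), 1))
```

Under these assumptions the similarity comes out near 0.32 for all three settings, which matches the value TJ reported, and the idealized model predicts a little under 76 distinct bits for toluene, in line with the single collision seen above. The benzene/toluene ratio is insensitive to nBitsPerHash because both fingerprints are sparse; the inflation in similarity shows up for larger molecules where the bit density is much higher.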
I've already changed the default value in the database cartridge to 2 bits per hash instead of 4 and am considering doing the same from Python as well. I'll cover that in a separate post (thanks for the reminder that I should bring it up).

Best Regards,
-greg