Re: [Rdkit-discuss] Circular Morgan Fingerprints
From: Greg L. <gre...@gm...> - 2013-08-06 03:03:23
Dear Isidro,

On Mon, Aug 5, 2013 at 12:37 PM, Isidro Cortés <isi...@gm...> wrote:

> Hi Greg, Hi All,
>
> Concerning the Morgan fingerprints in RDKit, I have several questions:
>
> - I am using
>
> fp = AllChem.GetMorganFingerprintAsBitVect(mol,2,512,bitInfo=info)
>
> to calculate the fingerprints. I need them for machine learning. Therefore,
> I would like to confirm what follows: each bit corresponds to a particular
> chemical substructure, which can be mapped back to a sketch -plot- of the
> substructure within the molecule.
> If I calculate the fingerprint for a dataset, will each bit correspond to
> the same chemical substructure for all the compounds? And, does each bit
> correspond to a unique chemical substructure? I mean that there are no
> clashes.
>
> In that case, which is the procedure to select which features will finally
> appear in the fixed-length fingerprint? Is there a numerical or chemical
> criterion?
>

The code that generates the Morgan fingerprints identifies the atom
environments ("circular" substructures) using the algorithm described in
this paper:

Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. JCIM 50:742-54 (2010)
http://dx.doi.org/10.1021/ci100050t

Each environment is then hashed to give an unsigned 32-bit integer. These
hashes are used as the bit ids if you call AllChem.GetMorganFingerprint().
If you are using AllChem.GetMorganFingerprintAsBitVect(), the 32-bit
unsigned bit id is divided by the bit vector size and the remainder is used
as the new bit id (in pseudocode: newBitId = bitId % numBits).

The original hashing process can certainly generate collisions (different
substructures that map to the same bit), but I'm not aware of examples of
it happening and I haven't actively gone looking for collisions. Hashing
into the smaller space of a bit vector is much more likely to yield
collisions. I have seen a couple of specific examples of these at a
fingerprint size of 1024 bits.
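
The folding step described above (newBitId = bitId % numBits) can be sketched in plain Python. This is only an illustration, not RDKit's own code: the environment ids below are made-up 32-bit values chosen so that two of them collide at 512 bits but not at 1024, and `fold_ids` and `find_collisions` are hypothetical helper names.

```python
from collections import defaultdict


def fold_ids(bit_ids, num_bits):
    """Map each 32-bit environment id to its position in a num_bits-long vector."""
    return {bit_id: bit_id % num_bits for bit_id in bit_ids}


def find_collisions(bit_ids, num_bits):
    """Return {folded_bit: [original ids]} for folded bits shared by several ids."""
    buckets = defaultdict(list)
    for bit_id in sorted(set(bit_ids)):
        buckets[bit_id % num_bits].append(bit_id)
    return {bit: ids for bit, ids in buckets.items() if len(ids) > 1}


# Made-up ids (assumption): the first two differ by exactly 512, so they share
# a remainder modulo 512 but not modulo 1024.
example_ids = [98513984, 98514496, 2246728737, 3218693969]

for size in (512, 1024):
    print(size, "bits -> collisions:", find_collisions(example_ids, size))
```

With real molecules, the unfolded ids could be taken from the keys of the sparse fingerprint returned by AllChem.GetMorganFingerprint() and passed to the same helpers.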
512 bits, as you are using above, is definitely going to have collisions.

To answer your specific questions:

1) The same substructure will always set the same bit, regardless of which
   molecule it comes from. Which bit it sets depends on the size of the
   fingerprint.

2) Because of the hashing, it is possible that different substructures can
   set the same bit. The risk of this goes up as you hash into a smaller
   space.

The RDKit implementation of the Morgan fingerprint is definitely well
suited to machine learning; several examples have been posted here.

If you are not happy with the hashing and want to have a pre-defined space
of substructures to use for learning, the RDKit offers another possibility
using the molecular fragmenter. There's documentation for this in the
"Getting Started" guide:
http://www.rdkit.org/docs/GettingStartedInPython.html#molecular-fragments

I hope this helps,
-greg
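
As a rough illustration of point 2 (the collision risk growing as the bit vector shrinks), here is a back-of-the-envelope estimate. It rests on an assumption that is not part of the message above: that the hashed ids behave like independent, uniformly distributed values. Real substructure hashes need not behave that way, so treat the numbers as a guide only. The function names and the figure of 50 distinct environments per molecule are hypothetical.

```python
def expected_bits_set(num_features, num_bits):
    """Expected number of distinct bits set when num_features ids are folded
    into num_bits positions, assuming uniform and independent hashing."""
    return num_bits * (1.0 - (1.0 - 1.0 / num_bits) ** num_features)


def expected_merged(num_features, num_bits):
    """Expected number of features that land on an already occupied bit."""
    return num_features - expected_bits_set(num_features, num_bits)


# 50 distinct environments per molecule is an arbitrary illustrative value.
for num_bits in (512, 1024, 2048):
    merged = expected_merged(50, num_bits)
    print(f"{num_bits:5d} bits: about {merged:.2f} of 50 features merged")
```

Across a whole dataset the number of distinct environments is usually far larger than in a single molecule, so the chance that one bit stands for several substructures somewhere in the dataset is correspondingly higher.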