Re: [Rdkit-discuss] How do rdFingerprintGenerator.GetMorganGenerator and AllChem.GetMorganFingerpr
From: Lewis M. <lew...@gm...> - 2019-07-10 22:02:50
Thanks Greg! Would you mind giving a blurb or a link to a paper on how count simulation works? I looked through the GSOC pull request but unfortunately don't understand it.

Agreed re. your comments. Usually 256 bits is for playing around and then larger FPs are for 'production' runs. Although in my use case, for example logistic regression / naive bayes classifier on protein activity records in chembl, I really don't see a big difference despite collisions! That was prior to count simulation.

cheers
lewis

On Wed, Jul 10, 2019 at 2:39 PM Greg Landrum <gre...@gm...> wrote:

> Hi,
>
> By default the new fingerprint generators do "count simulation": adding
> extra bits to a bit vector fingerprint in order to get bit-vector
> similarities that are more similar to count-vector similarities.
> You can turn this off by passing the useCountSimulation=False argument to
> GetMorganGenerator().
>
> Two comments about your sample code:
> 1) 256 bits is really not very many for a Morgan fingerprint. Maybe you
> were just using the small number for this question, but if you are really
> using fingerprints that short you should be aware that you are going to
> have a lot of collisions (blog post on this here:
> http://rdkit.blogspot.com/2016/02/colliding-bits-iii.html)
> 2) In case you aren't aware of it: you can calculate similarities and do
> fingerprint stats a lot more simply with builtin code like the
> GetNumOnBits() method on bit vectors and the similarity calculation code
> in rdkit.DataStructs. Take a look at DataStructs.DiceSimilarity()
>
> Hope this helps,
> -greg
>
> On Wed, Jul 10, 2019 at 3:53 AM Lewis Martin <lew...@gm...>
> wrote:
>
>> Hi all,
>> Quick question on truncated fingerprints, any help is really appreciated.
>>
>> I think I've missed a trick on how the new fingerprint generator works. I
>> thought the below should produce equivalent fingerprints but they are
>> totally different. Has the implementation changed, or maybe I'm getting the
>> kwargs incorrect? See below code or this link for a quick visual:
>> https://github.com/ljmartin/snippets/blob/master/truncated_fingerprints.ipynb
>> Thanks!
>>
>> import rdkit
>> from rdkit import Chem
>> from rdkit.Chem import Draw, AllChem
>> from rdkit.Chem import rdFingerprintGenerator
>> from rdkit.Chem.Draw import IPythonConsole
>> import numpy as np
>> from scipy.spatial import distance
>>
>> mol = Chem.MolFromSmiles('CN1C(=O)CN=C(C2=C1C=CC(=C2)Cl)C3=CC=CC=C3')
>> #diazepam
>>
>> gen_mo = rdFingerprintGenerator.GetMorganGenerator(fpSize=256, radius=2)
>> a = gen_mo.GetFingerprint(mol)
>> b = AllChem.GetMorganFingerprintAsBitVect(mol,2,256,useFeatures=False)
>> a_f = [int(i) for i in a.ToBitString()]
>> b_f = [int(i) for i in b.ToBitString()]
>> print('NumBits a: %s, NumBits b: %s' % (np.sum(a_f), np.sum(b_f)))
>> print('Dice Distance %s' % distance.dice(a_f,b_f))
>>
>>
>> NumBits a: 47, NumBits b: 38
>> Dice Distance 0.9058823529411765
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdk...@li...
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
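
To make the "count simulation" idea from Greg's reply a bit more concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not RDKit's actual code: the thresholds in `COUNT_BOUNDS`, the helper names (`simulate_counts`, `plain_bits`, `dice`), and the toy feature ids are all hypothetical. The assumed scheme is that each feature gets a small block of bits, and one more bit in the block is switched on each time the feature's count reaches the next threshold.

```python
# Assumed count thresholds for this sketch (hypothetical, not read from RDKit).
COUNT_BOUNDS = (1, 2, 4, 8)


def simulate_counts(feature_counts, fp_size=256, count_bounds=COUNT_BOUNDS):
    """Fold {feature_id: count} into a set of on-bit indices.

    Each feature owns a block of len(count_bounds) consecutive bits, so only
    fp_size // len(count_bounds) distinct blocks are available.
    """
    n_bounds = len(count_bounds)
    effective_size = fp_size // n_bounds
    on_bits = set()
    for feature_id, count in feature_counts.items():
        block = feature_id % effective_size
        for i, bound in enumerate(count_bounds):
            # Set one extra bit for every threshold the count reaches.
            if count >= bound:
                on_bits.add(block * n_bounds + i)
    return on_bits


def plain_bits(feature_counts, fp_size=256):
    """Ordinary folding: one bit per feature, counts ignored."""
    return {feature_id % fp_size for feature_id in feature_counts}


def dice(a, b):
    """Dice similarity between two sets of on-bit indices."""
    total = len(a) + len(b)
    return 2.0 * len(a & b) / total if total else 0.0


# Two toy "molecules" with the same features but different counts for feature 42.
mol_a = {11: 1, 42: 5, 97: 2}
mol_b = {11: 1, 42: 1, 97: 2}

print(dice(plain_bits(mol_a), plain_bits(mol_b)))            # 1.0
print(dice(simulate_counts(mol_a), simulate_counts(mol_b)))  # 0.8
```

With plain folding the two toy molecules look identical, while the simulated version separates them because feature 42 occurs five times in one and once in the other. Under these assumptions a 256-bit fingerprint has only 64 feature blocks and usually more bits set, which would be consistent with the generator output in the original question having more on bits (47 vs 38) and sharing few positions with the `GetMorganFingerprintAsBitVect` result.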