Re: [Rdkit-discuss] 64 bit Morgan Fingerprints
From: Gareth J. <jav...@gm...> - 2021-04-22 19:56:53
Hi Wojtek,

Our findings are the same. There is a Morgan fingerprint generator for 64 bits, which Python uses by default. When you call it, the functions that actually set the bits in the 64 bit fingerprint (MorganFingerprints::getConnectivityInvariants and MorganFingerprints::getFeatureInvariants) will only ever set the first 32 bits. So you have a 64 bit fingerprint, but only the first 32 bits are ever set.

On 4/22/2021 12:20 PM, Wojtek Plonka wrote:
> Hi Gareth,
>
> Your findings are a bit contrary to mine, so the truth must be
> somewhere in between :)
> I downloaded the RDKit sources and some support for 64 bit Morgan
> Fingerprints seems to be there:
>
> Search "getMorganGenerator<std::uint64_t>" (7 hits in 4 files of 661 searched)
>   C:\RDKit\rdkit\Code\GraphMol\Fingerprints\catch_tests.cpp (1 hit)
>     Line 152: MorganFingerprint::getMorganGenerator<std::uint64_t>(radius));
>   C:\RDKit\rdkit\Code\GraphMol\Fingerprints\FingerprintGenerator.cpp (4 hits)
>     Line 461: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
>     Line 497: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
>     Line 533: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
>     Line 569: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
>   C:\RDKit\rdkit\Code\GraphMol\Fingerprints\testFingerprintGenerators.cpp (1 hit)
>     Line 2387: MorganFingerprint::getMorganGenerator<std::uint64_t>(2),
>   C:\RDKit\rdkit\Code\GraphMol\Fingerprints\Wrap\MorganWrapper.cpp (1 hit)
>     Line 78: "GetMorganGenerator", getMorganGenerator<std::uint64_t>,
>
> I will have a closer look at that.
> I don't need to write my code in Python; C++ (with Google's help) is
> fine, too, as long as I can compile it with Linux tools or MSVC
> Community Edition.
> Maybe the 64 bit stuff is simply not complete or not interfaced to Python yet?
> Thanks!
>
> Wojtek Plonka
> +48885756652
> wojtekplonka.com <http://www.wojtekplonka.com>
> fb.com/wojtek.plonka <https://fb.com/wojtek.plonka>
>
> On Thu, Apr 22, 2021 at 7:17 PM Gareth Jones <jav...@gm...> wrote:
>
>     Hi Wojtek,
>
>     From looking at the RDKit code base, my take is that it is
>     currently not possible to generate 64 bit Morgan fingerprints.
>
>     The Python fingerprint generator defaults to 64 bit:
>
>     In [36]: fp.GetLength()
>     Out[36]: 18446744073709551615
>
>     Unfortunately, the C++ Morgan fingerprint generator only ever sets
>     the first 32 bits even if the fingerprint is 64 bit. If you look
>     at MorganFingerprints::getConnectivityInvariants and
>     MorganFingerprints::getFeatureInvariants in
>     Code/GraphMol/Fingerprints/FingerprintUtil.cpp, the generated
>     invariants (that are used to set the fingerprint bits) are
>     unsigned 32 bit ints.
>
>     Some RDKit development would be needed to template those functions
>     so that they would work with both 32 and 64 bit fingerprints.
>
>     Cheers,
>
>     Gareth
>
>     On 4/21/2021 10:10 PM, Wojtek Plonka wrote:
>> Hi Gareth,
>>
>> Thank you. I do exactly as you wrote. That's not the issue.
>> Please note that all the keys in elements are in the range of 2**32;
>> the main hash function used is definitely 32 bit.
>>
>> According to
>> https://www.rdkit.org/docs/source/rdkit.Chem.rdFingerprintGenerator.html
>> both class rdkit.Chem.rdFingerprintGenerator.FingerprintGenerator32
>> and class rdkit.Chem.rdFingerprintGenerator.FingerprintGenerator64
>> exist.
>>
>> However, with my limited knowledge I don't know how to access the
>> 64 bit version, and that is my problem.
>> Kindest regards,
>>
>> Wojtek
>>
>> On Thu, Apr 22, 2021 at 1:27 AM Gareth Jones <jav...@gm...> wrote:
>>
>>     Wojtek,
>>
>>     You can use GetNonzeroElements() to convert the sparse
>>     fingerprint to a Python dict of hash to count.
>>
>>     Cheers,
>>     Gareth
>>
>>     In [7]: mol = Chem.MolFromSmiles('Cn1cnc2n(C)c(=O)n(C)c(=O)c12')
>>
>>     In [8]: fp = AllChem.GetMorganFingerprint(mol, 2)
>>
>>     In [9]: elements = fp.GetNonzeroElements();
>>
>>     In [10]: elements
>>     Out[10]:
>>     {10565946: 2,
>>      348155210: 1,
>>      476388586: 1,
>>      540046244: 1,
>>      553412256: 1,
>>      864942730: 2,
>>      909857231: 1,
>>      1100037548: 1,
>>      1333761024: 1,
>>      1512818157: 1,
>>      1981181107: 1,
>>      2030573601: 1,
>>      2041434490: 1,
>>      2092489639: 3,
>>      2246728737: 3,
>>      2370996728: 1,
>>      2877515035: 1,
>>      2971716993: 1,
>>      2975126068: 2,
>>      3140581776: 1,
>>      3217380708: 4,
>>      3218693969: 1,
>>      3462333187: 1,
>>      3657471097: 3,
>>      3796970912: 1}
>>
>>     On 4/21/2021 5:44 AM, Wojtek Plonka wrote:
>>> Dear All
>>>
>>> Do any of you have a working example of getting Morgan
>>> Fingerprints, as sparse bit vector (non-hashed) in the 64
>>> bit version using Python?
>>> I'm looking into the issue of collisions on the "main hash"
>>> on large (100+ million molecules) data.
>>> Thank you very much!
>>> Kindest regards,
>>>
>>> Wojtek Plonka
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdk...@li...
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
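
A minimal sketch of the check Wojtek describes ("all the keys in elements are in range of 2**32"), using the keys from the GetNonzeroElements() output quoted above so that it runs without RDKit. The helper names fits_in_bits and expected_colliding_pairs are made up for illustration, and so is the figure of one million distinct hash values. The collision estimate also assumes the hash values are uniformly distributed, which the thread does not establish, so treat the result as a rough order of magnitude only.

```python
# Keys and counts copied from the GetNonzeroElements() output in the thread.
elements = {
    10565946: 2, 348155210: 1, 476388586: 1, 540046244: 1, 553412256: 1,
    864942730: 2, 909857231: 1, 1100037548: 1, 1333761024: 1, 1512818157: 1,
    1981181107: 1, 2030573601: 1, 2041434490: 1, 2092489639: 3, 2246728737: 3,
    2370996728: 1, 2877515035: 1, 2971716993: 1, 2975126068: 2, 3140581776: 1,
    3217380708: 4, 3218693969: 1, 3462333187: 1, 3657471097: 3, 3796970912: 1,
}


def fits_in_bits(keys, bits=32):
    """Return True if every key is a non-negative integer below 2**bits."""
    limit = 1 << bits
    return all(0 <= key < limit for key in keys)


def expected_colliding_pairs(n_distinct, bits=32):
    """Birthday-style estimate of how many pairs share a hash value.

    Assumes n_distinct items hashed uniformly into 2**bits slots
    (an assumption, not something the thread confirms).
    """
    return n_distinct * (n_distinct - 1) / 2 / (1 << bits)


# Every key from the example molecule fits in an unsigned 32 bit integer.
print("all keys < 2**32:", fits_in_bits(elements))
print("bits needed for largest key:", max(elements).bit_length())

# Hypothetical input: one million distinct atom environments.
n_hypothetical = 1_000_000
print("expected colliding pairs, 32 bit:",
      expected_colliding_pairs(n_hypothetical, bits=32))
print("expected colliding pairs, 64 bit:",
      expected_colliding_pairs(n_hypothetical, bits=64))
```

With RDKit installed, the same fits_in_bits check could be applied to the dict returned by fp.GetNonzeroElements() for any molecule, as in the session above.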