From: Syed A. R. <as...@eb...> - 2011-12-19 12:19:46
Hi,

I have encountered similar problems in the past.

a) The present CDK hashed fingerprint does not discriminate between
   open and closed ring systems, but it is very fast.
b) The extended fingerprint does the job, but the code needs some
   tweaking.

Apart from this, all fingerprint codes suffer from bit clashes (a
universal truth...), hence it is hard to get a one-to-one mapping
between graph isomorphism and fingerprints.

The best strategy for screening is to use fingerprints (generic in
nature, not pharmacophore-based) to generate an ensemble of potential
hits, and then to run graph isomorphism on this ensemble to eliminate
false positives.

I have optimised the CDK hashed fingerprint: it is fast, minimises bit
clashes, and discriminates between ring systems if asked to do so.
Here is the code:

https://github.com/asad/CDKHashFingerPrint/blob/master/src/fingerprints/HashedFingerprinter.java

All you need are the following steps:

a) Global:

    // Declare this as a static field and call the setter once during
    // initialisation, since the function in (b) is static.
    fingerprints.interfaces.IFingerprinter fingerprint1 = new fingerprints.HashedFingerprinter(1024);
    fingerprint1.setRespectRingMatches(true);

b) Function:

    private static BitSet getHashedFingerprint(IAtomContainer ac) throws CDKException {
        return fingerprint1.getFingerprint(ac);
    }

If you are interested in the benchmark code, you can find it at
https://github.com/asad/CDKHashFingerPrint

Hope this helps.

Asad

On 19 Dec 2011, at 10:40, cdk...@li... wrote:

> Send Cdk-user mailing list submissions to
>     cdk...@li...
>
> To subscribe or unsubscribe via the World Wide Web, visit
>     https://lists.sourceforge.net/lists/listinfo/cdk-user
> or, via email, send a message with subject or body 'help' to
>     cdk...@li...
>
> You can reach the person managing the list at
>     cdk...@li...
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Cdk-user digest..."
>
>
> Today's Topics:
>
>    1. SDF Problem (lochana menikarachchi)
>    2. Re: Correctness of Fingerprinters and
>       UniversalisomorphismTester (Joos Kiener)
>    3. Re: Correctness of Fingerprinters and
>       UniversalisomorphismTester (Egon Willighagen)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 18 Dec 2011 09:17:35 -0800 (PST)
> From: lochana menikarachchi <loc...@ya...>
> Subject: [Cdk-user] SDF Problem
> To: "cdk...@li..." <cdk...@li...>
> Message-ID:
>     <132...@we...>
> Content-Type: text/plain; charset="us-ascii"
>
>> Can you create a MDL V2000 molfile with FC(Cl)BrI as compound, make
>> sure it has the stereo field, *and* let me know if that file
>> represents the R or S species? Then I will write a patch to add this
>> functionality.
>
> The attached compound CID_79058.sdf is the S isomer of FC(Cl)BrH,
> downloaded from PubChem. However, it is not just the stereo
> information that is missing. Look at the second compound (73393) and
> see how CDK writes it: both column 6 (charge) and column 7 (stereo
> parity) get lost. Some programs rely on the information in these
> columns, and it would be nice to have SDF output in the same format as
> OEChem and Marvin. Also, CDK writes CHG cards differently (check
> 73393).
>
> Thanks.
>
> Lochana
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: CID_73393.sdf
> Type: application/octet-stream
> Size: 6227 bytes
> Desc: not available
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: CID_79058.sdf
> Type: application/octet-stream
> Size: 2017 bytes
> Desc: not available
>
> ------------------------------
>
> Message: 2
> Date: Mon, 19 Dec 2011 11:10:20 +0100
> From: Joos Kiener <jo...@su...>
> Subject: Re: [Cdk-user] Correctness of Fingerprinters and
>     UniversalisomorphismTester
> To: cdk...@li...
> Message-ID:
>     <CAH...@ma...>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi all,
>
> any comments on this?
>
> 2011/12/8 Joos Kiener <jo...@su...>
>
>> Hi all,
>>
>> I have a question regarding how it is verified that the Fingerprinters,
>> as well as the UniversalIsomorphismTester, actually work correctly.
>>
>> The question is related to the CDK-based project I'm working on, which
>> I will "officially release" once I believe it is usable enough.
>>
>> I use UIT for subgraph matching and the ExtendedFingerprinter. I had
>> the feeling that the fingerprint wasn't especially great, at least for
>> the dataset used (part of Subset 13 of ZINC), and hence I wanted to
>> try out the PubchemFingerprinter, which I did, but now I am getting a
>> different number of search hits than before. See the tables below. I'm
>> now wondering if it is a bug on my part or in the fingerprints and/or
>> UIT. How can I determine the actually correct result? Especially since
>> the reference also disagrees with UIT.
>>
>> PubchemFingerprinter:
>>
>> SMILES                    Screening Hits    Hits
>> CCC(C)C(C)C(C)C                     8599     344
>> O(C)C(C)C(C)C(C)C                    938      28
>> CCCCCC(C)CC                         9227    1547
>> N(C)(C)CC(C)C                      15861    8893
>> O(CC)C(N(C)C)C                      1365      83
>> CC(C)C(C)C(C(C)C)C(C)C              8599       0
>>
>> ExtendedFingerprinter:
>>
>> SMILES                    Screening Hits    Hits
>> CCC(C)C(C)C(C)C                    22488     429
>> O(C)C(C)C(C)C(C)C                   9398      77
>> CCCCCC(C)CC                         3955    1603
>> N(C)(C)CC(C)C                      88301   10917
>> O(CC)C(N(C)C)C                      1588      74
>> CC(C)C(C)C(C(C)C)C(C)C             22488       0
>>
>> No screening, just UIT:
>>
>> SMILES                      Hits
>> CCC(C)C(C)C(C)C              436
>> O(C)C(C)C(C)C(C)C             77
>> CCCCCC(C)CC                 2171
>> N(C)(C)CC(C)C              11412
>> O(CC)C(N(C)C)C               139
>> CC(C)C(C)C(C(C)C)C(C)C         0
>>
>> As a reference, the same searches were done in ChemFinder over the
>> same data set:
>>
>> SMILES                    Hits Found in ChemFinder
>> CCC(C)C(C)C(C)C                                427
>> O(C)C(C)C(C)C(C)C                               77
>> CCCCCC(C)CC                                   1825
>> N(C)(C)CC(C)C                                11412
>> O(CC)C(N(C)C)C                                 109
>> CC(C)C(C)C(C(C)C)C(C)C                           0
>>
>> Best Regards,
>>
>> Joos
>
> ------------------------------
>
> Message: 3
> Date: Mon, 19 Dec 2011 11:40:10 +0100
> From: Egon Willighagen <ego...@gm...>
> Subject: Re: [Cdk-user] Correctness of Fingerprinters and
>     UniversalisomorphismTester
> To: Joos Kiener <jo...@su...>
> Cc: cdk...@li...
> Message-ID:
>     <CAM...@ma...>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi Joos,
>
> a short, quick reply... I will not have time to look in detail into
> the issue in the next two weeks...
>
> On Thu, Dec 8, 2011 at 12:47 PM, Joos Kiener <jo...@su...> wrote:
>> The question is related to the CDK-based project I'm working on, which
>> I will "officially release" once I believe it is usable enough.
>
> That would be the 1.4 series.
>
>> I use UIT for subgraph matching and the ExtendedFingerprinter. I had
>> the feeling that the fingerprint wasn't especially great, at least for
>> the dataset used (part of Subset 13 of ZINC), and hence I wanted to
>> try out the PubchemFingerprinter, which I did, but now I am getting a
>> different number of search hits than before. See the tables below. I'm
>> now wondering if it is a bug on my part or in the fingerprints and/or
>> UIT. How can I determine the actually correct result? Especially since
>> the reference also disagrees with UIT.
>>
>> PubchemFingerprinter:
>>
>> SMILES                    Screening Hits    Hits
>> CCC(C)C(C)C(C)C                     8599     344
>>
>> ExtendedFingerprinter:
>>
>> SMILES                    Screening Hits    Hits
>> CCC(C)C(C)C(C)C                    22488     429
>>
>> No screening, just UIT:
>>
>> SMILES                      Hits
>> CCC(C)C(C)C(C)C              436
>>
>> As a reference, the same searches were done in ChemFinder over the
>> same data set:
>>
>> SMILES                    Hits Found in ChemFinder
>> CCC(C)C(C)C(C)C                                427
>
> So, one would expect to find 436 with the CDK for each of the three
> approaches. The difference with 427 in ChemFinder can have many
> reasons (preprocessing, their substructure matching, ...) and I am not
> eager to hypothesize on why that is different.
>
> It is indeed worrying to see that apparently the PubchemFingerprinter
> and ExtendedFingerprinter miss out on true positives. Can you
> identify those structures? Maybe start with the seven that the
> ExtendedFingerprinter doesn't find. Then we can start debugging why
> those are not found...
>
> Egon
>
> --
> Dr E.L. Willighagen
> Postdoctoral Researcher
> Institutet för miljömedicin
> Karolinska Institutet (http://ki.se/imm)
> Homepage: http://egonw.github.com/
> LinkedIn: http://se.linkedin.com/in/egonw
> Blog: http://chem-bla-ics.blogspot.com/
> PubList: http://www.citeulike.org/user/egonw/tag/papers
>
> ------------------------------
>
> End of Cdk-user Digest, Vol 67, Issue 8
> ***************************************
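
For readers who want to try the screen-then-verify strategy recommended in the reply at the top of this message, here is a minimal sketch using only java.util.BitSet. It is an illustration under assumptions, not code from CDK or from the linked repository: the class ScreenThenVerify and the methods passesScreen, search and bits are hypothetical names, the fingerprints are toy bit sets, and the exactMatch predicate merely stands in for a real graph isomorphism check such as the UIT discussed in the thread.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

/** Two-stage substructure search: fingerprint screen, then exact verification. */
final class ScreenThenVerify {

    /** True if every bit set in the query fingerprint is also set in the target's. */
    static boolean passesScreen(BitSet query, BitSet target) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target); // bits the query needs but the target lacks
        return missing.isEmpty();
    }

    /**
     * Returns the indices of targets that pass the screen AND the exact check.
     * exactMatch is a hypothetical stand-in for a graph isomorphism test;
     * it is only called for targets that survive the cheap screen.
     */
    static List<Integer> search(BitSet query, List<BitSet> targets, IntPredicate exactMatch) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < targets.size(); i++) {
            if (passesScreen(query, targets.get(i)) && exactMatch.test(i)) {
                hits.add(i);
            }
        }
        return hits;
    }

    /** Helper that builds a toy fingerprint from bit positions. */
    static BitSet bits(int... positions) {
        BitSet b = new BitSet();
        for (int p : positions) {
            b.set(p);
        }
        return b;
    }

    public static void main(String[] args) {
        BitSet query = bits(1, 5);
        List<BitSet> targets = List.of(bits(1, 5, 9), bits(1, 9), bits(1, 5));
        // Pretend target 2 is a bit-clash false positive that the exact check rejects.
        List<Integer> hits = search(query, targets, i -> i != 2);
        System.out.println(hits); // prints [0]
    }
}
```

The screen can only be trusted if it never rejects a true match, which is the concern Egon raises: with a correct fingerprint, the hits after screening should equal the hits from the exact check alone, and any shortfall points to molecules whose fingerprint lacks a bit that the query sets.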