From: Thomas S. <beg...@ho...> - 2011-02-24 10:30:25
Hi all,

me again. I have now tried two new things:

1) Store the counts of certain common atoms (C, H, O, N, P, S) in the database and use them for filtering (e.g. in the SQL WHERE clause).
Problem: I can't filter on H atoms because not all P and S atoms can be "typed" correctly, so the CDKHydrogenAdder fails and the H count for the whole molecule is wrong -> ignore the H count.
Still, it doesn't help much. The ExtendedFingerprinter is good enough that in most cases only very few additional molecules are filtered out. I'm aware that the UIT itself includes such a filter, but since the actual Molecule creation is slow, I thought it might be good to filter before creating a Molecule at all.

2) Store all bonds and atoms plus their fields in the database and create Molecules from those rather than from the mol file.
Kind of a hassle, and it seems to be slower than creating from the mol file. It could of course be improved, because I saw that a lot of these fields are null in my case and do not affect the search. But even then I doubt the gain would be huge. I tried this with MySQL too, which, not unexpectedly, is slower than even cached HSQLDB tables.

The general way I do this is:

Molecule table
Atom table
Bond table

The Bond table has 2 foreign keys referencing the Atom table and 1 referencing the Molecule table. The Atom table links to the Molecule table.

Select all Atoms, create them and put them in a HashMap keyed by the database primary key.
Select all Bonds, create them and set each Bond's atoms using the above HashMap.
Add the Atoms and Bonds to a new Molecule instance.

An issue is that JDBC returns 0 for null values of doubles. So the code includes tons of snippets like this:

    double bondOrderSum = rs.getDouble("bondOrderSum");
    if (bondOrderSum == 0) {
        if (!rs.wasNull()) {
            atom.setBondOrderSum(bondOrderSum);
        }
    } else {
        atom.setBondOrderSum(bondOrderSum);
    }

Removing all of these for fields that in my case seem to be null anyway might help. But I doubt it would make it a lot faster (like more than 2x faster).

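The three assembly steps described above (atoms into a HashMap keyed by primary key, bonds resolved through that map, everything added to a new molecule) could be sketched roughly as follows. This is only an illustration under assumptions: AtomRow, BondRow, Atom, Bond and Molecule are hypothetical stand-in classes, not the CDK types, and the rows are assumed to have been read from the Atom and Bond tables already.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RowAssembly {
    // Hypothetical stand-ins for rows of the Atom and Bond tables.
    static final class AtomRow {
        final long id;
        final String symbol;
        AtomRow(long id, String symbol) { this.id = id; this.symbol = symbol; }
    }

    static final class BondRow {
        final long atom1Id;
        final long atom2Id;
        final int order;
        BondRow(long atom1Id, long atom2Id, int order) {
            this.atom1Id = atom1Id;
            this.atom2Id = atom2Id;
            this.order = order;
        }
    }

    // Hypothetical stand-ins for the in-memory objects (NOT the CDK classes).
    static final class Atom {
        final String symbol;
        Atom(String symbol) { this.symbol = symbol; }
    }

    static final class Bond {
        final Atom first;
        final Atom second;
        final int order;
        Bond(Atom first, Atom second, int order) {
            this.first = first;
            this.second = second;
            this.order = order;
        }
    }

    static final class Molecule {
        final List<Atom> atoms;
        final List<Bond> bonds;
        Molecule(List<Atom> atoms, List<Bond> bonds) { this.atoms = atoms; this.bonds = bonds; }
    }

    static Molecule assemble(List<AtomRow> atomRows, List<BondRow> bondRows) {
        // Step 1: create every atom and remember it under its primary key.
        Map<Long, Atom> atomsByKey = new HashMap<>();
        List<Atom> atoms = new ArrayList<>();
        for (AtomRow row : atomRows) {
            Atom atom = new Atom(row.symbol);
            atomsByKey.put(row.id, atom);
            atoms.add(atom);
        }
        // Step 2: create every bond, resolving both foreign keys through the map.
        List<Bond> bonds = new ArrayList<>();
        for (BondRow row : bondRows) {
            Atom first = atomsByKey.get(row.atom1Id);
            Atom second = atomsByKey.get(row.atom2Id);
            if (first == null || second == null) {
                throw new IllegalStateException("Bond references an atom that was not loaded");
            }
            bonds.add(new Bond(first, second, row.order));
        }
        // Step 3: hand atoms and bonds to a new molecule instance.
        return new Molecule(atoms, bonds);
    }
}
```

The map lookup keeps bond resolution at one hash access per foreign key, so the assembly itself is probably not where the time goes; the per-field reads from the ResultSet are the more likely cost.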
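The repeated null check shown above can be folded into one small helper. A minimal sketch, assuming only the standard java.sql.ResultSet API; the class and method names (NullableColumns, getNullableDouble, setIfPresent) are made up for illustration. Because wasNull() reports on the column read last, the "== 0" test is not needed: a non-zero value can never have come from a NULL.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.function.DoubleConsumer;

final class NullableColumns {
    private NullableColumns() {}

    /** Reads a DOUBLE column and returns null if the column was SQL NULL. */
    static Double getNullableDouble(ResultSet rs, String column) throws SQLException {
        double value = rs.getDouble(column); // JDBC yields 0.0 for SQL NULL
        if (rs.wasNull()) {                  // refers to the getDouble call just made
            return null;
        }
        return Double.valueOf(value);
    }

    /** Calls the setter only if the column held a real (non-NULL) value. */
    static void setIfPresent(ResultSet rs, String column, DoubleConsumer setter)
            throws SQLException {
        double value = rs.getDouble(column);
        if (!rs.wasNull()) {
            setter.accept(value);
        }
    }
}
```

With such a helper, each snippet would shrink to a single call along the lines of NullableColumns.setIfPresent(rs, "bondOrderSum", v -> atom.setBondOrderSum(v)). This tidies the code but should not be expected to change the speed much, since the same getDouble and wasNull calls are still made.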
Regards,
Thomas

> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 Isomorphism Class
> From: ste...@eb...
> Date: Fri, 21 Jan 2011 13:35:24 +0100
> CC: j.k...@cm...; cdk...@li...; jel...@gm...
> To: beg...@ho...
>
> Hi Thomas,
>
> very interesting indeed. Thanks a lot.
>
> Cheers,
>
> Chris
>
>
> On 21 Jan 2011, at 12:06, Thomas Strunz wrote:
>
> >
> > Hi all,
> >
> > just a few short comments if anyone is interested:
> >
> > • I'm using multiple threads to read + create the Molecules. On an i7 870 @stock CPU (my personal desktop, 4 cores) search speed basically scales 100% going from 2 to 4 of such reader threads (while having only 1 that does the graph matching). In numbers: I'm searching for benzene in 65'000 records (a subset of a subset of the Zinc Database, http://zinc.docking.org/). The search returns 37914 hits in about 22 s (2 threads) and about 11 s (4 threads).
> > The issue is that the search only returns the ID so all hits will have to be created a second time for displaying.
> >
> > • I created SMILES from above Molecules. Creating Molecules from SMILES is slightly slower than from Molfiles, hence search speed is slower too. The difference is around 15%.
> >
> > • I will probably make the source available some time. Have never done this and not sure how to proceed especially concerning licenses of all the used libraries and which one to use myself. Also I would probably need to create some kind of QuickStart Guide / Documentation to make it usable/understandable. So this could take some time.
> >
> > Regards,
> >
> > Thomas
> >
> > _______________________________________________
> > Cdk-user mailing list
> > Cdk...@li...
> > https://lists.sourceforge.net/lists/listinfo/cdk-user
>
>
> --
> Dr. Christoph Steinbeck
> Head of Chemoinformatics and Metabolism
> European Bioinformatics Institute (EBI)
> Wellcome Trust Genome Campus
> Hinxton, Cambridge CB10 1SD UK
> Phone +44 1223 49 2640
>
> Video meliora proboque deteriora sequor.
> ... Ovid, Metamorphoses VII, 20/21
>