[Rdkit-devel] some nice performance improvements
From: Greg L. <gre...@gm...> - 2012-06-29 05:24:56
Dear all,

I've been making some changes to the SMILES canonicalization code (more on this later) that have also led to some nice (IMO) performance improvements. Here are the numbers.

My usual benchmarking operations (http://code.google.com/p/rdkit/wiki/Benchmarking) don't really help here: 1000 molecules just aren't enough to see reliable differences. Here I'm using 25K molecules from the ZINC ZNP subset (http://zinc.docking.org/subsets/znp). This is a nice test set since the molecules are of reasonable size and contain plenty of stereochemistry (double bonds with stereochemistry and chiral centers).

My tests were:
  build1:  generate molecules from the SDF
  smiles1: generate canonical SMILES
  smiles2: generate non-canonical SMILES
  build2:  generate molecules from the SMILES
  build3:  generate molecules from the SMILES without stereochemistry cleanup
  build4:  generate molecules from the SMILES with very minimal sanitization
           (just UpdatePropertyCache() and FastFindRings())

Here's the timing information comparing the new code (still on a branch) with a couple of previous releases, run on my Linux box. This looks like crap unless you're using a fixed-width font:

|           | build1 | smiles1 | smiles2 | build2 | build3 | build4 |
| 2011_06_1 |   15.4 |     8.1 |     7.0 |   12.5 |        |        |
| 2012_03_1 |   14.6 |     8.0 |     6.9 |    9.9 |    6.9 |    3.8 |
| branch    |   14.3 |     5.9 |     4.4 |    9.7 |    6.6 |    3.5 |

I'm pretty happy with the progress that's being made here. Canonical SMILES generation is substantially faster than it used to be, and the other operations are showing steady improvement.

I'll be merging the branch back to the trunk in the next few days.

-greg
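For anyone who wants to try something similar, here is a rough sketch of how most of the steps listed above could be timed from Python. It is not the script used for the numbers in this message. The names `time_step`, `minimal_build`, and `run_benchmark`, and the `sdf_path` argument, are made up for illustration. build3 is left out because the sketch does not assume a particular way of switching off only the stereochemistry cleanup. RDKit is imported lazily inside the functions that need it, so the timing helper can be reused on its own.

```python
import time


def time_step(func, items):
    """Apply func to every item and return (elapsed_seconds, results)."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    return time.perf_counter() - start, results


def minimal_build(smiles):
    """build4-style parse: no full sanitization, just the property cache
    update and the fast ring finder mentioned in the post."""
    from rdkit import Chem  # lazy import: only needed when this is called

    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is not None:
        mol.UpdatePropertyCache()
        Chem.FastFindRings(mol)
    return mol


def run_benchmark(sdf_path):
    """Time build1, smiles1, smiles2, build2 and build4 for one SD file.

    sdf_path is a hypothetical path to a local copy of the test set.
    """
    from rdkit import Chem

    def load(path):
        # Drop records that RDKit could not parse.
        return [m for m in Chem.SDMolSupplier(path) if m is not None]

    def noncanonical(mol):
        return Chem.MolToSmiles(mol, canonical=False)

    timings = {}
    timings["build1"], (mols,) = time_step(load, [sdf_path])
    timings["smiles1"], canonical = time_step(Chem.MolToSmiles, mols)
    timings["smiles2"], _ = time_step(noncanonical, mols)
    timings["build2"], _ = time_step(Chem.MolFromSmiles, canonical)
    timings["build4"], _ = time_step(minimal_build, canonical)
    return timings
```

Calling `run_benchmark` with the path to an SD file returns a dict of elapsed seconds keyed by the step names used in the table. A single pass like this is noisy, so repeating the run a few times and taking the best value per step gives more stable comparisons.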