[Rdkit-devel] some nice performance improvements

Dear all,

I've been making some changes to the SMILES canonicalization code
(more on this later) that have also led to some nice (IMO) performance
improvements. Here are the numbers.

My usual benchmarking operations
(http://code.google.com/p/rdkit/wiki/Benchmarking) don't really help
here: 1000 molecules just aren't enough to see reliable differences.
Here I'm using 25K molecules from the ZINC ZNP subset
(http://zinc.docking.org/subsets/znp). This is a nice test set since
the molecules are of reasonable size and contain plenty of
stereochemistry (double bonds with stereochemistry and chiral
centers).

My tests were:
build1: generate molecules from the SDF file
smiles1: generate canonical SMILES
smiles2: generate non-canonical SMILES
build2: generate molecules from the SMILES
build3: generate molecules from the SMILES without stereochemistry cleanup
build4: generate molecules from the SMILES with very minimal
sanitization (just UpdatePropertyCache() and FastFindRings())
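
For readers who want to try something similar, here is a rough sketch of how a few of these stages could be timed from Python. It is not the script used for the numbers in this post: the helper names (time_stage, run_stages) and the SDF path are hypothetical, and build3 is left out because the post does not say how the stereochemistry cleanup was skipped. RDKit is imported inside run_stages so the timing helper itself works without it.

```python
import time


def time_stage(func, items):
    """Apply func to every item; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    elapsed = time.perf_counter() - start
    return results, elapsed


def run_stages(sdf_path):
    """Time build1, smiles1, smiles2, build2 and build4 for one SDF file.

    sdf_path is a hypothetical path to a local copy of the test set.
    """
    from rdkit import Chem  # imported lazily; requires RDKit

    timings = {}

    # build1: molecules from the SDF (unparseable records come back as None)
    start = time.perf_counter()
    mols = [m for m in Chem.SDMolSupplier(sdf_path) if m is not None]
    timings["build1"] = time.perf_counter() - start

    # smiles1: canonical SMILES, keeping stereochemistry
    smiles, timings["smiles1"] = time_stage(
        lambda m: Chem.MolToSmiles(m, isomericSmiles=True), mols
    )

    # smiles2: non-canonical SMILES
    _, timings["smiles2"] = time_stage(
        lambda m: Chem.MolToSmiles(m, isomericSmiles=True, canonical=False),
        mols,
    )

    # build2: molecules from SMILES with full default sanitization
    _, timings["build2"] = time_stage(Chem.MolFromSmiles, smiles)

    # build4: minimal sanitization only
    def minimal_build(smi):
        mol = Chem.MolFromSmiles(smi, sanitize=False)
        mol.UpdatePropertyCache()
        Chem.FastFindRings(mol)
        return mol

    _, timings["build4"] = time_stage(minimal_build, smiles)
    return timings
```

Calling run_stages("znp_25k.sdf") (file name made up) would return a dict of elapsed seconds per stage. Absolute numbers will depend heavily on the machine and RDKit version, so only comparisons made on the same box are meaningful.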

Here's the timing information comparing the new code (still on a
branch) with a couple of previous releases, run on my Linux box. The
table only lines up if you're using a fixed-width font:

|           | build1 | smiles1 | smiles2 | build2 | build3 | build4 |
| 2011_06_1 |   15.4 |     8.1 |     7.0 |   12.5 |        |        |
| 2012_03_1 |   14.6 |     8.0 |     6.9 |    9.9 |    6.9 |    3.8 |
| branch    |   14.3 |     5.9 |     4.4 |    9.7 |    6.6 |    3.5 |

I'm pretty happy with the progress that's being made here. Relative to
2012_03_1, canonical SMILES generation is about 26% faster (8.0 -> 5.9)
and non-canonical SMILES generation is about 36% faster (6.9 -> 4.4);
the other operations are showing small but steady improvements.
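
To put numbers on the comparison, the relative change for each test can be worked out directly from the table. The snippet below is just that arithmetic; the dict layout and function names are illustrative, and the values are copied from the table as given (the post does not state the units, which cancel out in the percentages anyway).

```python
# Timings copied from the table; tests missing for a release are omitted.
timings = {
    "2011_06_1": {"build1": 15.4, "smiles1": 8.1, "smiles2": 7.0,
                  "build2": 12.5},
    "2012_03_1": {"build1": 14.6, "smiles1": 8.0, "smiles2": 6.9,
                  "build2": 9.9, "build3": 6.9, "build4": 3.8},
    "branch": {"build1": 14.3, "smiles1": 5.9, "smiles2": 4.4,
               "build2": 9.7, "build3": 6.6, "build4": 3.5},
}


def percent_faster(old, new):
    """Reduction in run time, as a percentage of the old time."""
    return 100.0 * (old - new) / old


def compare(baseline, candidate):
    """Percent reduction for every test present in both result sets."""
    return {
        test: percent_faster(old, candidate[test])
        for test, old in baseline.items()
        if test in candidate
    }


changes = compare(timings["2012_03_1"], timings["branch"])
for test, pct in changes.items():
    print(f"{test:8s} {pct:5.1f}% less time than 2012_03_1")
```

On these figures smiles1 and smiles2 drop by roughly 26% and 36% against 2012_03_1, while the build tests move by single-digit percentages.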

I'll be merging the branch back to the trunk in the next few days.

-greg
