[Rdkit-devel] Speeding up SMILES parsing
From: Greg L. <gre...@gm...> - 2012-08-29 03:24:40
Over the past couple of days I've spent some time tuning the RDKit's SMILES parser. I made a few minor changes here and there and saw some improvement, then changed the YACC grammar used to generate the parser. That change makes the parser source a bit harder to read, but it had a pretty significant impact on performance.

To measure the performance of the SMILES parser alone, I ran a benchmark on ~560K molecules from ZINC, generating a molecule from each SMILES without any sanitization. Here are the timings on my Linux box for that benchmark:

RDKit_2011_06_1: 50.6s
RDKit_2012_03_1: 49.6s
RDKit_2012_06_1: 57.6s  [ <- I'm not sure I understand this outlier]
svn: 30.6s

I'm pretty pleased about that last number. :-)

For those who are interested, here's the commit:
https://sourceforge.net/p/rdkit/code/2159/
and the specific grammar changes that made the difference:
https://sourceforge.net/p/rdkit/code/2159/tree//trunk/Code/GraphMol/SmilesParse/smiles.yy?diff=502dda6571b75b41b4b10063:2158

-greg
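For anyone who wants to try a similar measurement, here is a minimal sketch of how such a benchmark might be set up from Python. It is not the script used for the timings above. The function names and the input file layout (one SMILES per line, optionally followed by a name) are assumptions made for illustration, and the sketch relies on Chem.MolFromSmiles accepting sanitize=False to skip sanitization.

```python
import time


def time_smiles_parsing(smiles_iter, parse):
    """Parse every SMILES with `parse` and return (n_ok, n_failed, seconds).

    `parse` is any callable that returns None when parsing fails.
    """
    n_ok = n_failed = 0
    start = time.perf_counter()
    for smi in smiles_iter:
        if parse(smi) is None:
            n_failed += 1
        else:
            n_ok += 1
    elapsed = time.perf_counter() - start
    return n_ok, n_failed, elapsed


def make_rdkit_parser():
    """Build a parser that skips sanitization (needs RDKit installed)."""
    from rdkit import Chem  # imported here so the timing helper works without RDKit

    return lambda smi: Chem.MolFromSmiles(smi, sanitize=False)


def benchmark_file(path):
    """Hypothetical driver: time parsing of a file with one SMILES per line."""
    with open(path) as handle:
        # Keep only the first whitespace-separated token of each non-blank line.
        smiles = [line.split()[0] for line in handle if line.strip()]
    return time_smiles_parsing(smiles, make_rdkit_parser())
```

Calling benchmark_file on a local SMILES file returns the number of molecules parsed, the number of failures, and the elapsed time in seconds. The file is read into memory before the timer starts, so only the parsing itself is measured.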