This patch adds stereo encoding to the hash code module for the CDK. It currently encodes 2D and 3D stereo configurations for tetrahedral and double bond stereo centres. There are a couple of things I need to add later on, but these classes cover the majority of cases.
These commits require part one (patches/611).
commits: 50303dd0de - c249512c48
There are some bits which can be added on later:
- encoding using IStereoElements, allows hashing of stereo chemistry without 2D/3D coordinates (i.e. from SMILES)
- additional geometries, e.g. cumulated double bonds, octahedral and bipyramidal. The cumulated bonds are pretty easy; the others might be a bit more difficult but would be good to include.
- sulfoxides: the CDK atom type reports Sp2 instead of Sp3 hybridisation, so they are not encoded
- encoders could be more selective for 3D. This isn't really a problem for the 2D stereo as it looks for wedge/hatch bonds but the 3D perception is quite loose at the moment and as such will perceive Sp3 atoms which are not stereogenic (e.g. due to delocalised bonds). InChI has a list of what it accepts but I'm trying to think of an elegant way to do this and not use a decision tree.
- wedge/hatch 2D double bonds: the current code does not check whether a double bond is in a rigid ring, as it is much faster simply to ignore that. I found a rare case in a bridged system where double bonds sit next to two chiral atoms and the E/Z configuration is therefore calculated wrongly. In the figure below, if you calculate the configuration of the bond on the left it would be E (trans), whilst on the right it would be Z (cis). This issue doesn't exist in 3D, and it also disappears if you simply ignore all bonds in rigid rings, but it is easily fixable and so we can still avoid the ring perception :-).
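The in-plane E/Z calculation that the last bullet refers to can be sketched as below. This is a minimal illustration and not the code in the patch: `DoubleBondSketch`, `side` and `configuration` are hypothetical names, and atoms are modelled as bare {x, y} arrays rather than CDK objects. It does no ring check and ignores wedge/hatch labels on neighbouring atoms, so it shares the limitation described above.

```java
/** Hypothetical helper for illustration only; not part of the CDK API. */
final class DoubleBondSketch {

    /**
     * Which side of the directed line a->b the point c lies on, from the
     * sign of the 2D cross product (b - a) x (c - a).
     * Returns +1 (left), -1 (right) or 0 (collinear).
     */
    static int side(double[] a, double[] b, double[] c) {
        double cross = (b[0] - a[0]) * (c[1] - a[1])
                     - (b[1] - a[1]) * (c[0] - a[0]);
        return (int) Math.signum(cross);
    }

    /**
     * Relative arrangement of substituent s1 (attached to u) and
     * substituent s2 (attached to v) around the double bond u=v.
     * Returns +1 if both lie on the same side (cis / together),
     * -1 if they lie on opposite sides (trans / opposite) and
     * 0 if either substituent is collinear with the bond (undefined).
     */
    static int configuration(double[] s1, double[] u, double[] v, double[] s2) {
        return side(u, v, s1) * side(u, v, s2);
    }
}
```

Calling `configuration` with the two substituents drawn on the same side of the bond gives +1, on opposite sides gives -1. Mapping that result onto E or Z additionally needs the substituent priorities, which this sketch leaves out.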
Unfortunately, adding the stereo hashing adds a lot of time to the computation. The results below are for 20,000 structures, |V| < 200, depth 8:
non-chiral 251.60 ± 5.90 ms (246.95 - 298.61) 1.26e-05 s per structure
chiral 477.43 ± 48.98 ms (450.85 - 957.84) 2.39e-05 s per structure
If we remove the actual perception part (i.e. clockwise/anticlockwise) and just do the detection, we see that the detection is the bottleneck, although the perception is the bit that actually matters.
chiral' 442.42 ± 28.10 ms (425.09 - 654.25) 2.21e-05 s per structure
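For reference, the clockwise/anticlockwise perception step is cheap because it usually boils down to the sign of one 3x3 determinant. The sketch below shows that standard signed-volume test; it is an assumption about the general technique, not a copy of the encoders in this patch. `TetrahedralSketch` and `parity` are hypothetical names, and the four neighbours are plain {x, y, z} arrays.

```java
/** Hypothetical helper for illustration only; not part of the CDK API. */
final class TetrahedralSketch {

    /**
     * Sign of the signed volume of the tetrahedron spanned by the four
     * neighbour positions a, b, c, d (each {x, y, z}), computed as the
     * determinant of the rows (b - a), (c - a), (d - a).
     * Returns +1 or -1 for the two possible windings and 0 when the
     * points are coplanar (no configuration can be perceived).
     * Which sign is called "clockwise" is purely a convention.
     */
    static int parity(double[] a, double[] b, double[] c, double[] d) {
        double ux = b[0] - a[0], uy = b[1] - a[1], uz = b[2] - a[2];
        double vx = c[0] - a[0], vy = c[1] - a[1], vz = c[2] - a[2];
        double wx = d[0] - a[0], wy = d[1] - a[1], wz = d[2] - a[2];
        double det = ux * (vy * wz - vz * wy)
                   - uy * (vx * wz - vz * wx)
                   + uz * (vx * wy - vy * wx);
        return (int) Math.signum(det);
    }
}
```

Swapping any two neighbours flips the sign, which is what makes the value usable as a parity. For 2D input the same test can be applied after giving wedge and hatch neighbours a positive or negative z offset, though that mapping is not shown here.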
I think this time hit is mainly due to the quadratic time required to get the connected atoms/bonds from an atom container. There's not much we can do about that but I thought it was interesting.
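To illustrate where the quadratic cost comes from, the sketch below compares a neighbour lookup that scans every bond with a one-off conversion to adjacency lists. It is only a model under stated assumptions: bonds are pairs of atom indices in an `int[][]`, and `AdjacencySketch`, `neighboursByScan` and `toAdjacencyList` are hypothetical names, not CDK methods. Whether a conversion like this fits the hash code module is left open here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the CDK API. */
final class AdjacencySketch {

    /**
     * Neighbours of one atom found by scanning the whole bond list.
     * Each call is O(|E|), so asking for every atom costs O(|V| * |E|).
     */
    static List<Integer> neighboursByScan(int[][] bonds, int atom) {
        List<Integer> result = new ArrayList<>();
        for (int[] bond : bonds) {
            if (bond[0] == atom) result.add(bond[1]);
            else if (bond[1] == atom) result.add(bond[0]);
        }
        return result;
    }

    /**
     * Builds all neighbour lists in a single O(|V| + |E|) pass;
     * afterwards adjacency[atom] is a direct array lookup.
     */
    static int[][] toAdjacencyList(int[][] bonds, int atomCount) {
        int[] degree = new int[atomCount];
        for (int[] bond : bonds) {
            degree[bond[0]]++;
            degree[bond[1]]++;
        }
        int[][] adjacency = new int[atomCount][];
        for (int i = 0; i < atomCount; i++) {
            adjacency[i] = new int[degree[i]];
        }
        int[] filled = new int[atomCount];
        for (int[] bond : bonds) {
            adjacency[bond[0]][filled[bond[0]]++] = bond[1];
            adjacency[bond[1]][filled[bond[1]]++] = bond[0];
        }
        return adjacency;
    }
}
```

Both methods return the same neighbours in bond-list order; the difference is only in how often the bond list is traversed.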