#612 Hash Code Module [part two]

John May

This patch adds stereo encoding to the hash code module for the CDK. It currently encodes 2D and 3D stereo configurations for tetrahedral and double-bond stereocentres. There are a couple of things I need to add later on, but these classes cover the majority of cases.

These commits require part one (patches/611).

branch: feature/hash-stereo
commits: 50303dd0de - c249512c48

There are some bits which can be added on later:

  • encoding using IStereoElements, which allows hashing of stereochemistry without 2D/3D coordinates (e.g. from SMILES)
  • additional geometries, e.g. cumulated double bonds, octahedral and bipyramidal. The cumulated bonds are pretty easy; the others might be a bit more difficult but would be good to include.

Known issues:

  • sulfoxide: the CDK atom typing reports Sp2 instead of Sp3 hybridisation, so the centre is not encoded
  • encoders could be more selective for 3D. This isn't really a problem for the 2D stereo, as it looks for wedge/hatch bonds, but the 3D perception is quite loose at the moment and will perceive Sp3 atoms which are not stereogenic (e.g. due to delocalised bonds). InChI has a list of what it accepts, but I'm trying to think of an elegant way to do this without a decision tree.
  • wedge/hatch 2D double bonds: the current code does not check whether a double bond is in a rigid ring, as it is much faster simply to ignore that. I found a rare case in a bridged system where double bonds sit next to two chiral atoms and the E/Z configuration is therefore calculated wrongly. In the figure below, the configuration calculated for the bond on the left would be E (trans) whilst on the right it would be Z (cis). This issue doesn't exist in 3D, nor if you simply ignore all bonds in rigid rings, but it is easily fixable and so we can still avoid the ring perception :-).
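
As a rough illustration of the 2D double-bond calculation discussed in the last item, here is a minimal sketch that decides whether two chosen substituents lie on the same side of a double bond from plain 2D coordinates. The class name DoubleBond2D, the Config enum and the tolerance are hypothetical and are not taken from the patch; the sketch does no ring or wedge/hatch handling, so it would show the same weakness in the bridged case described above.

```java
// Hypothetical sketch, not the CDK implementation: side-of-line test for a
// double bond a=b with one substituent u on a and one substituent v on b.
final class DoubleBond2D {

    enum Config { TOGETHER, OPPOSITE, UNSPECIFIED }

    // Assumed tolerance below which a substituent counts as collinear with the bond.
    private static final double EPS = 1e-6;

    // z-component of (b - a) x (p - a); its sign says which side of the
    // directed line a -> b the point p lies on.
    static double side(double[] a, double[] b, double[] p) {
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]);
    }

    // TOGETHER (cis-like) if u and v are on the same side of a=b,
    // OPPOSITE (trans-like) if they are on different sides.
    static Config configuration(double[] u, double[] a, double[] b, double[] v) {
        double su = side(a, b, u);
        double sv = side(a, b, v);
        if (Math.abs(su) < EPS || Math.abs(sv) < EPS) {
            return Config.UNSPECIFIED; // linear depiction, nothing to encode
        }
        return (su > 0) == (sv > 0) ? Config.TOGETHER : Config.OPPOSITE;
    }

    public static void main(String[] args) {
        double[] a = {0, 0}, b = {1, 0};
        System.out.println(configuration(new double[]{-0.5, 0.87}, a, b, new double[]{1.5, 0.87}));
        System.out.println(configuration(new double[]{-0.5, 0.87}, a, b, new double[]{1.5, -0.87}));
    }
}
```

Turning TOGETHER/OPPOSITE into a Z/E label would additionally need the substituent priorities, which this sketch leaves out.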


Time Penalty:

Unfortunately, adding the stereo hashing adds a lot of time to the computation, roughly doubling it. The results below are for 20,000 structures, |V| < 200, depth 8:

              time                  range                   per structure
non-chiral    251.60 ± 5.90 ms      (246.95 - 298.61)       1.26e-05 s
chiral        477.43 ± 48.98 ms     (450.85 - 957.84)       2.39e-05 s

If we remove the actual perception part (i.e. clockwise/anticlockwise) and just do the detection, we see that the detection is the bottleneck: it accounts for roughly 190 ms of the 226 ms added, although the perception is the bit that actually matters.

chiral'       442.42 ± 28.10 ms     (425.09 - 654.25)       2.21e-05 s
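
For readers unfamiliar with the perception step (clockwise/anticlockwise), here is a minimal sketch of one common way to get it from 3D coordinates: the sign of the signed volume spanned by the four neighbours of a tetrahedral centre. The class name TetrahedralParity3D, the sign convention and the planarity threshold are assumptions made for illustration and are not taken from the patch.

```java
// Hypothetical sketch, not the CDK implementation: winding of neighbours
// b, c, d as seen from neighbour a, from the sign of a triple product.
final class TetrahedralParity3D {

    // Assumed threshold below which the four points are treated as coplanar.
    private static final double EPS = 1e-6;

    // (b - a) . ((c - a) x (d - a)), i.e. six times the signed tetrahedron volume.
    static double signedVolume(double[] a, double[] b, double[] c, double[] d) {
        double ux = b[0] - a[0], uy = b[1] - a[1], uz = b[2] - a[2];
        double vx = c[0] - a[0], vy = c[1] - a[1], vz = c[2] - a[2];
        double wx = d[0] - a[0], wy = d[1] - a[1], wz = d[2] - a[2];
        return ux * (vy * wz - vz * wy)
             + uy * (vz * wx - vx * wz)
             + uz * (vx * wy - vy * wx);
    }

    // +1: b -> c -> d runs clockwise when viewed from a,
    // -1: anticlockwise, 0: (near) planar, so no configuration.
    static int parity(double[] a, double[] b, double[] c, double[] d) {
        double vol = signedVolume(a, b, c, d);
        if (Math.abs(vol) < EPS) {
            return 0;
        }
        return vol > 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[] a = {0, 0, 1};   // viewpoint above the plane z = 0
        double[] b = {1, 0, 0};
        double[] c = {0, 1, 0};
        double[] d = {-1, -1, 0};
        System.out.println(parity(a, b, c, d)); // -1, anticlockwise from a
        System.out.println(parity(a, c, b, d)); // 1, swapping two neighbours inverts it
    }
}
```

The arithmetic itself is a handful of multiplications per centre, which fits with the perception step adding little on top of the detection.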

I think this time hit is mainly due to the quadratic time required to get the connected atoms/bonds from an atom container. There's not much we can do about that but I thought it was interesting.
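
To make the quadratic cost concrete, here is a minimal sketch assuming a container that stores its bonds as a flat list of atom-index pairs (an assumption about the data layout; the class AdjacencySketch and its methods are hypothetical and not part of the CDK API). Finding the neighbours of one atom scans every bond, so doing it for every atom costs on the order of |V| x |E|. The second method is shown only for contrast: it builds all neighbour lists in a single pass over the bonds.

```java
// Hypothetical sketch, not CDK code: neighbour lookup on a flat bond list.
final class AdjacencySketch {

    // Scans all bonds to find the neighbours of a single atom: O(|E|) per call,
    // hence O(|V| * |E|) when called once for every atom.
    static int[] connectedByScan(int[][] bonds, int atom) {
        int[] tmp = new int[bonds.length];
        int n = 0;
        for (int[] bond : bonds) {
            if (bond[0] == atom) {
                tmp[n++] = bond[1];
            } else if (bond[1] == atom) {
                tmp[n++] = bond[0];
            }
        }
        return java.util.Arrays.copyOf(tmp, n);
    }

    // Builds the neighbour lists of all atoms in O(|V| + |E|).
    static int[][] adjacency(int nAtoms, int[][] bonds) {
        int[] degree = new int[nAtoms];
        for (int[] bond : bonds) {
            degree[bond[0]]++;
            degree[bond[1]]++;
        }
        int[][] adj = new int[nAtoms][];
        for (int v = 0; v < nAtoms; v++) {
            adj[v] = new int[degree[v]];
        }
        int[] filled = new int[nAtoms];
        for (int[] bond : bonds) {
            adj[bond[0]][filled[bond[0]]++] = bond[1];
            adj[bond[1]][filled[bond[1]]++] = bond[0];
        }
        return adj;
    }

    public static void main(String[] args) {
        int[][] bonds = {{0, 1}, {1, 2}, {2, 3}}; // a four-atom chain
        System.out.println(java.util.Arrays.toString(connectedByScan(bonds, 1)));
        System.out.println(java.util.Arrays.toString(adjacency(4, bonds)[1]));
    }
}
```

Both calls in main give the same neighbours for atom 1; only the amount of work per lookup differs.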



  • John May

    John May - 2013-02-20

    I have added an extra commit which provides perturbed hash codes. These hash codes allow you to discriminate between molecules which contain atoms with uniform atom environments, and to resolve symmetric stereochemistry.

    same branch: feature/hash-stereo
    commit: d32f9e282f

    Last edit: John May 2013-02-20
  • John May

    John May - 2013-03-07

    I've added another commit on feature/hash-stereo which allows us to do an identity hash on PubChem-Compound.

  • Egon Willighagen

    Applied, pushed, and released as part of 1.5.2.

  • Egon Willighagen

    • status: open --> closed
    • Group: Needs_Review --> Accepted
