[Rdkit-devel] Rethinking the RDKit's implicit hydrogen handling

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi,

This is an unusually long email for me, so here's the TL;DR version:
The RDKit handling of implicit hydrogens is confusing. This leads to
needlessly complicated code and, almost certainly, bugs. I'm proposing to
fix it, but doing so requires an API change. This work will be done on a
branch until we decide that it should move over to the trunk.

# Background

Like most cheminformatics toolkits, the RDKit by default stores molecules
without hydrogens being explicitly present in the molecular graph. This is
done for efficency reasons: there's no need to store all those extra atoms
if you can implicitly know that they're there. Thanks to the octet rule,
it's pretty easy to know when Hs should be attached to the common elements
encountered in organic chemistry. This leads to the concept of the
"implicit hydrogen": an H that you know is there but which isn't explicitly
represented.

This is wonderfully clear in SMILES: you can write ethane as "CC" instead
of "[H]C([H])([H])C([H])([H])[H]" or "[CH3][CH3]". For elements out of the
"organic subset" -- where the octet rule may be a bit more relaxed -- or in
cases where the octet rule is violated, you have to provide the H count in
the SMILES. For example: "[SiH3][SiH3]" or the radical "[CH2]C". The rules
for SMILES stipulate that if the atom is in square brackets, for whatever
reason, you must provide the H count. For example "Cl[C@H](C)F". As an
aside: you also sometimes need to do this in aromatic heterocycles, where
it's not always clear where if an H should be present. So pyrrole is
written as "c1cccc[nH]1", not "c1ccccn1".

Back in the very early days of designing and implementing the RDKit, for
reasons I can no longer really remember (give me a break! it was more than
10 years ago!), I decided that there should be a difference between
"specified implicit Hs", those that are inside square brackets, and
"default implicit Hs", those that don't show up in the SMILES at all
(thanks to Andrew Dalke for the terms). Thus there are the methods
Atom.GetNumImplicitHs and Atom.GetNumExplicitHs (I will use the Python API
names throughout this since I'm showing examples from Python). These don't
necessarily behave the way one would expect them to:

     In [2]: m = Chem.MolFromSmiles('C[CH3]')

     In [3]: m.GetAtomWithIdx(0).GetNumImplicitHs()
     Out[3]: 3

     In [4]: m.GetAtomWithIdx(1).GetNumImplicitHs()
     Out[4]: 0

     In [5]: m.GetAtomWithIdx(0).GetNumExplicitHs()
     Out[5]: 0

     In [6]: m.GetAtomWithIdx(1).GetNumExplicitHs()
     Out[6]: 3

You really need to use Atom.GetTotalNumHs() to safely figure out how many
Hs an atom has:

    In [7]: m.GetAtomWithIdx(0).GetTotalNumHs()
    Out[7]: 3

    In [8]: m.GetAtomWithIdx(1).GetTotalNumHs()
    Out[8]: 3

This has been bothering me for a long time, but fixing it requires a change
to the API which will break code. I'm really, really resistant to doing
that, but I don't see any way to clean things up without an API change.

# Fixing the problem

Here's a preliminary list of API changes:

The following methods will be removed from the API:
  - Atom.{Get,Set}NumExplicitHs()
  - Atom.{Get,Set}NoImplicit()
The following methods will be added:
  - Atom.{Get,Set}NoImplicitCalculation(): equivalent to the current
Atom.{Get,Set}NoImplicit()
  - Atom.SetNumImplicitHs()
The following methods will be modified:
  - Atom.GetTotalNumHs(): the optional "includeNeighbors" argument will be
removed; it will always include neighbors.

Things may be added to this over time.

# Practical considerations

It's a pretty small API change, but there's a huge amount of code that
needs to be changed in the back and lots and lots of testing that has to be
done, so this is going to take a while.

I'm currently working on a branch (note: this is definitely not yet stable
enough to rely on it working at all):
https://svn.code.sf.net/p/rdkit/code/branches/AlternateValence_21Nov2012

My current plan is to keep this branch in sync with the trunk through (at
least) the Q4 2012 release and then consider moving it onto the trunk and
creating a v1 API branch in Q1 of next year. I guess I should be able to
keep the v1 API branch in sync, at least in terms of bug fixes, for another
2-3 releases.

Please let me know if this sounds reasonable or if you have suggestions on
how to handle the logistics more cleanly.

# Thanks

Roger Sayle, perhaps inadvertently, provide the final bit of impetus
required to make this happen by submitting a nice little patch to update
the way implicit valence is calculated. Integrating this simple patch
forced me to confront how ugly the current handling of implicit Hs is.

Best Regards,
-greg

[Rdkit-devel] Rethinking the RDKit's implicit hydrogen handling

Open-Source Cheminformatics and Machine Learning

[Rdkit-devel] Rethinking the RDKit's implicit hydrogen handling