[Rdkit-devel] Rethinking the RDKit's implicit hydrogen handling
Open-Source Cheminformatics and Machine Learning
Brought to you by:
glandrum
From: Greg L. <gre...@gm...> - 2012-12-08 06:01:55
|
Hi, This is an unusually long email for me, so here's the TL;DR version: The RDKit handling of implicit hydrogens is confusing. This leads to needlessly complicated code and, almost certainly, bugs. I'm proposing to fix it, but doing so requires an API change. This work will be done on a branch until we decide that it should move over to the trunk. # Background Like most cheminformatics toolkits, the RDKit by default stores molecules without hydrogens being explicitly present in the molecular graph. This is done for efficency reasons: there's no need to store all those extra atoms if you can implicitly know that they're there. Thanks to the octet rule, it's pretty easy to know when Hs should be attached to the common elements encountered in organic chemistry. This leads to the concept of the "implicit hydrogen": an H that you know is there but which isn't explicitly represented. This is wonderfully clear in SMILES: you can write ethane as "CC" instead of "[H]C([H])([H])C([H])([H])[H]" or "[CH3][CH3]". For elements out of the "organic subset" -- where the octet rule may be a bit more relaxed -- or in cases where the octet rule is violated, you have to provide the H count in the SMILES. For example: "[SiH3][SiH3]" or the radical "[CH2]C". The rules for SMILES stipulate that if the atom is in square brackets, for whatever reason, you must provide the H count. For example "Cl[C@H](C)F". As an aside: you also sometimes need to do this in aromatic heterocycles, where it's not always clear where if an H should be present. So pyrrole is written as "c1cccc[nH]1", not "c1ccccn1". Back in the very early days of designing and implementing the RDKit, for reasons I can no longer really remember (give me a break! it was more than 10 years ago!), I decided that there should be a difference between "specified implicit Hs", those that are inside square brackets, and "default implicit Hs", those that don't show up in the SMILES at all (thanks to Andrew Dalke for the terms). Thus there are the methods Atom.GetNumImplicitHs and Atom.GetNumExplicitHs (I will use the Python API names throughout this since I'm showing examples from Python). These don't necessarily behave the way one would expect them to: In [2]: m = Chem.MolFromSmiles('C[CH3]') In [3]: m.GetAtomWithIdx(0).GetNumImplicitHs() Out[3]: 3 In [4]: m.GetAtomWithIdx(1).GetNumImplicitHs() Out[4]: 0 In [5]: m.GetAtomWithIdx(0).GetNumExplicitHs() Out[5]: 0 In [6]: m.GetAtomWithIdx(1).GetNumExplicitHs() Out[6]: 3 You really need to use Atom.GetTotalNumHs() to safely figure out how many Hs an atom has: In [7]: m.GetAtomWithIdx(0).GetTotalNumHs() Out[7]: 3 In [8]: m.GetAtomWithIdx(1).GetTotalNumHs() Out[8]: 3 This has been bothering me for a long time, but fixing it requires a change to the API which will break code. I'm really, really resistant to doing that, but I don't see any way to clean things up without an API change. # Fixing the problem Here's a preliminary list of API changes: The following methods will be removed from the API: - Atom.{Get,Set}NumExplicitHs() - Atom.{Get,Set}NoImplicit() The following methods will be added: - Atom.{Get,Set}NoImplicitCalculation(): equivalent to the current Atom.{Get,Set}NoImplicit() - Atom.SetNumImplicitHs() The following methods will be modified: - Atom.GetTotalNumHs(): the optional "includeNeighbors" argument will be removed; it will always include neighbors. Things may be added to this over time. # Practical considerations It's a pretty small API change, but there's a huge amount of code that needs to be changed in the back and lots and lots of testing that has to be done, so this is going to take a while. I'm currently working on a branch (note: this is definitely not yet stable enough to rely on it working at all): https://svn.code.sf.net/p/rdkit/code/branches/AlternateValence_21Nov2012 My current plan is to keep this branch in sync with the trunk through (at least) the Q4 2012 release and then consider moving it onto the trunk and creating a v1 API branch in Q1 of next year. I guess I should be able to keep the v1 API branch in sync, at least in terms of bug fixes, for another 2-3 releases. Please let me know if this sounds reasonable or if you have suggestions on how to handle the logistics more cleanly. # Thanks Roger Sayle, perhaps inadvertently, provide the final bit of impetus required to make this happen by submitting a nice little patch to update the way implicit valence is calculated. Integrating this simple patch forced me to confront how ugly the current handling of implicit Hs is. Best Regards, -greg |