When doing some round tripping testing in OpenBabel 2.3.0 I came across the following cases where hydrogen was erroneously added to atoms with explicitly no hydrogen in the conversion from SMILES to InChI
Group 1/2 metals and In/Sn/Sb/Tl/Pb/Bi/Po/At.
e.g. [Li] -->InChI=1S/Li.H instead of InChI=1S/Li
[Bi] --> InChI=1S/Bi.3H instead of InChI=1S/Bi
Possibly relatedly [BiH4] gives InChI=1S/Bi.5H instead of InChI=1S/Bi.4H
I(=O)=O also ends up with an extra hydrogen on conversion to InChI
S(C)(C)C exhibits a fault in SMILES reading, the S is interpreted as having 0 rather than 1 hydrogen.
[SeH-] seems to erroneously pick up H+ when converted to InChI
[Se]=P([O-])([O-])[O-] produces InChI=1S/O3PSe/c1-4(2,3)5/q-3 but should produce InChI=1S/H3O3PSe/c1-4(2,3)5/h(H3,1,2,3,5)/p-3
Not too sure what's gone wrong in that case. There were a few other similar cases.
At least some of these problems don't exist with the current dev version. Can you try these with the dev version?
BTW, how exactly are you finding these problem cases? If we could monitor our performance on the dataset with time we could ensure we don't cause regressions....
I will give it a try on the latest version and report back. I suppose really I should have a bash at fixing the problems rather than giving you work :-P
The set I was using was based on the ChEBI database as of about 9 months ago. What I was actually hoping to round trip was not OpenBabel, but a SMILES writer I have just written. The hope being to convert from SMILES to InChI and check that its output is equivalent to the InChI output from my program. I did indeed find a few corner cases my SMILES writer failed which I've now fixed but I haven't gone through all the discrepancies....mainly as most of them were coming from OpenBabel :-P
Such a set could be created directly from ChEBI although I have a feeling that for some records the InChI and SMILES don't quite correspond to each other e.g. differences like connected vs disconnected [H+] on acids, which may give false failures.
The problems with atoms having explicitly no hydrogen has been fixed.
The following issues remain:
[BiH4] is interpreted as [BiH5]. The reason is on line 344 of inchiformat.cpp InChI is being allowed to implicitly add hydrogens.
The [Se]=P([O-])([O-])[O-] , [SeH-] and S(C)(C)C problems remain.
[BiH4] nows gives InChI=1S/Bi.4H
[SeH-] can be roundtripped through InChI
This leaves only [Se]=P([O-])([O-])[O-] (see below).
The S(C)(C)C I will file as a separate bug as it has nothing to do with all of this.