I think there is a general problem in dealing with stereochemistry with SMILES strings, and that this bug is tied to [ 1742876 ] and [ 1792878 ], and possibly to a bunch of closed bugs. We are looking to use OpenBabel to generate database primary keys. We are parsing a vendor-supplied database, which includes both molfile and SMILES descriptors for compounds. For chiral compounds, OpenBabel appears to perform well when converting the vendor-supplied molfile data to either InChI or canonical SMILES.
OpenBabel seems (caveat below) to have a very high failure rate when it uses the vendor-supplied SMILES string, though. I tested 53 small (SMILES string <= 20 characters) chiral molecules, and about half (26, included as attachment) appear to have their stereochemistry disrupted when input as SMILES. The disruption was visible when converting to InChI only (10 cases), to canonical SMILES only (8 cases), or to both (8 cases):
Failed both InChI and Canonical SMILES:
C[C@@H]1Cc2ccccc2CN1
C[C@H]1Cc2ccccc2CN1
FC[C@H]1Cc2ccccc2CN1
N[C@@H]1CCSC1=O
N[C@H]1CCSC1=O
OC(=O)C[C@@H]1CCNC1
OC(=O)C[C@H]1CCNC1
OC[C@H]1Cc2ccccc2CN1
Failed InChI:
C\C=N/OC[C@H]1CCCN1C
CN(C)CC#C[C@H]1CCCN1
CN1CCC[C@H]1c2cccnc2
CP(O)(=O)C[C@H](O)CN
FC[C@@H]1CCCN1
N[C@H]1CCn2cccc2C1
NC[C@@H]1CC(=O)NO1
NC[C@H]1CC(=O)NO1
NC[C@H]1COc2ccccc2O1
OC(=O)[C@@H]1CCCN1
Failed SMILES:
CC(C)C@HC(O)=O
CCC@HC([O-])=O
COC@@Hc1ccccc1
COC@Hc1ccccc1
NC@@HC(O)=O
NC@HC(O)=O
NCCCC@HC(O)=O
NCCSCC@HC(O)=O
Command being used: babel ---errorlevel 2 -ocan -ismi <string>
(or -oinchi for inchi output)
Open Babel 2.1.1 -- Sep 28 2007 -- 18:13:29
Altix Itanium
Intel(R) C++ Itanium(R) Compiler for Itanium(R)-based applications
Version 8.0 Build 20031017 Package ID: l_cc_p_8.0.056
Copyright (C) 1985-2003 Intel Corporation. All rights reserved.
To be honest, I don't have many independent mechanisms to verify that OpenBabel is the point of failure in all cases; I can't read the SMILES 'by eye', and I don't have access to other internal tools to test the chirality. I detect 'disruption' when the OpenBabel-generated key (either 'can' or 'inchi' format) differs depending on whether I use the molfile or the SMILES string supplied by the vendor; it's possible that the vendor has not done the molfile-to-SMILES mapping properly, and that the chirality differs within the vendor DB.
So I am passing judgment on the molecules based on graphical visualization provided by PubChem:
http://pubchem.ncbi.nlm.nih.gov/edit/
... and whenever I've spot-checked a disruption by this (unwieldy) mechanism, the chirality of the vendor-supplied SMILES matches that of the key generated by OB via the vendor-supplied molfile, but not the OB key made from the vendor-supplied SMILES (I can't be sure that there is not a bug within the PubChem tool, though).
In a few cases, converting to InChI seems to demonstrate indisputable loss of stereochemical information. The following examples completely lack the InChI stereo layer after OpenBabel conversion:
CN1CCC[C@H]1c2cccnc2
InChI=1/C10H14N2/c1-12-7-3-5-10(12)9-4-2-6-11-8-9/h2,4,6,8,10H,3,5,7H2,1H3
CP(O)(=O)C[C@H](O)CN
InChI=1/C4H12NO3P/c1-9(7,8)3-4(6)2-5/h4,6H,2-3,5H2,1H3,(H,7,8)
Each throws "Omitted undefined stereo" during conversion, yet AFAIK the SMILES string is explicitly defining a chiral center - [C@H] in both cases - which is reliably recognized by PubChem.
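A quick way to flag this kind of loss in bulk is to compare the number of chiral atoms declared in the SMILES with the number of defined centers in the InChI /t layer. The following is only a rough sketch in plain Python: the helper names are made up for illustration, the regex simply counts bracket atoms containing '@', and it assumes a simple InChI with at most one /t layer.

```python
import re

# Bracket atoms that carry a chirality mark, e.g. [C@H] or [C@@H].
CHIRAL_ATOM = re.compile(r"\[[^\]]*@[^\]]*\]")


def declared_centers(smiles):
    """Count bracket atoms in the SMILES that carry '@' or '@@'."""
    return len(CHIRAL_ATOM.findall(smiles))


def inchi_sp3_centers(inchi):
    """Count centers with a defined sign (+ or -) in the InChI /t layer."""
    for layer in inchi.split("/")[2:]:  # skip version and formula
        if layer.startswith("t"):
            entries = layer[1:].split(",")
            return sum(1 for e in entries if e.endswith(("+", "-")))
    return 0  # no /t layer at all


smiles = "CN1CCC[C@H]1c2cccnc2"
inchi = ("InChI=1/C10H14N2/c1-12-7-3-5-10(12)9-4-2-6-11-8-9"
         "/h2,4,6,8,10H,3,5,7H2,1H3")

lost = declared_centers(smiles) - inchi_sp3_centers(inchi)
print("centers declared:", declared_centers(smiles), "lost:", lost)
```

This only detects dropped centers; an inverted center (m0 vs m1) keeps the same count and needs a layer-by-layer comparison instead.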
It is possible that these problems are due to our build (my colleague who compiled the package had to wrestle with it quite a bit), but I see the same errors on an older public OpenBabel (2.0.0) server (Murray-Rust Research Group):
http://wwmm-svc.ch.cam.ac.uk/wwmm/html/observer.html
... for example, the two InChI examples above also lose the stereo layer when run through Murray-Rust's build (I tried to locate an open OB 2.1.1 web portal but only came across some SOAP services, which I still haven't fully grokked).
The problems are seen with larger molecules as well; I'm providing examples in small molecules under the assumption that it would make debugging easier.
I'm seeing two general discrepancies: the total loss of a chiral center, or the flipping of a chiral center (CW <-> CCW). A nice feature of using InChI as the primary key is that you can isolate the layer in which two molecules disagree; the list below gives each vendor SMILES string, followed by the layer(s) that differed between the InChI generated from the vendor's molfile and the InChI generated from that SMILES string:
                         MolFile   SMILES
FC[C@H]1Cc2ccccc2CN1
  stereo:sp3:inverted    m1        m0
C[C@H]1Cc2ccccc2CN1
  stereo:sp3:inverted    m0        m1
C[C@@H]1Cc2ccccc2CN1
  stereo:sp3:inverted    m1        m0
CN1CCC[C@H]1c2cccnc2
  stereo:sp3             t10-      -UNDEF-
  stereo:sp3:inverted    m0        -UNDEF-
  stereo:type            s1        -UNDEF-
CN(C)CC#C[C@H]1CCCN1
  stereo:sp3:inverted    m1        m0
NC[C@H]1COc2ccccc2O1
  stereo:sp3:inverted    m0        m1
OC(=O)[C@@H]1CCCN1
  stereo:sp3:inverted    m0        m1
OC[C@H]1Cc2ccccc2CN1
  stereo:sp3:inverted    m1        m0
N[C@H]1CCn2cccc2C1
  stereo:sp3:inverted    m0        m1
CP(O)(=O)C[C@H](O)CN
  stereo:sp3             t4-       -UNDEF-
  stereo:sp3:inverted    m1        -UNDEF-
  stereo:type            s1        -UNDEF-
NC[C@H]1CC(=O)NO1
  stereo:sp3:inverted    m1        m0
NC[C@@H]1CC(=O)NO1
  stereo:sp3:inverted    m0        m1
N[C@H]1CCSC1=O
  stereo:sp3:inverted    m0        m1
N[C@@H]1CCSC1=O
  stereo:sp3:inverted    m1        m0
OC(=O)C[C@H]1CCNC1
  stereo:sp3:inverted    m1        m0
OC(=O)C[C@@H]1CCNC1
  stereo:sp3:inverted    m0        m1
C\C=N/OC[C@H]1CCCN1C
  stereo:sp3:inverted    m1        m0
FC[C@@H]1CCCN1
  stereo:sp3:inverted    m0        m1
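For anyone who wants to reproduce the per-layer comparison above, here is a minimal sketch of the idea in plain Python. It is not the script I actually used; the function names are invented for illustration, the layer labels just mirror the ones in the list, and it assumes simple InChI strings in which no layer prefix is repeated (no fixed-H or isotopic sections). The molfile-derived InChI in the example is reconstructed from the t10-/m0/s1 entries listed above.

```python
# Labels for the layer prefixes; anything else is reported by its raw letter.
LAYER_NAMES = {
    "c": "connections",
    "h": "hydrogens",
    "q": "charge",
    "t": "stereo:sp3",
    "m": "stereo:sp3:inverted",
    "s": "stereo:type",
}


def inchi_layers(inchi):
    """Split an InChI string into a {layer name: layer text} dict."""
    parts = inchi.split("=", 1)[1].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            layers[LAYER_NAMES.get(part[0], part[0])] = part
    return layers


def diff_layers(inchi_a, inchi_b, missing="-UNDEF-"):
    """Return {layer name: (value in a, value in b)} for layers that differ."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    result = {}
    for name in sorted(set(a) | set(b)):
        pair = (a.get(name, missing), b.get(name, missing))
        if pair[0] != pair[1]:
            result[name] = pair
    return result


base = ("InChI=1/C10H14N2/c1-12-7-3-5-10(12)9-4-2-6-11-8-9"
        "/h2,4,6,8,10H,3,5,7H2,1H3")
from_molfile = base + "/t10-/m0/s1"  # reconstructed from the list above
from_smiles = base                   # what OB 2.1.1 gave for the SMILES

for name, (mol_value, smi_value) in diff_layers(from_molfile, from_smiles).items():
    print(name, mol_value, smi_value)
```

A difference only in stereo:sp3:inverted (m0 vs m1) corresponds to a flipped center, while a missing stereo:sp3 layer corresponds to a lost one.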
I'm happy to provide more technical details of our build; it would be nice if someone can verify (or disprove) the behavior on a non-Altix 2.1.1 install.
26 small molecules that alter (2.1.1) stereochemistry on export to 'can' and/or 'inchi'
Logged In: YES
user_id=1189615
Originator: NO
Thank you for taking the trouble to provide such a detailed description.
There is more than one bug here.
The SMILES parser was not ordering the atoms correctly when one or more of the atoms attached to a chiral center was a ring closure.
The loss of chiral info occurred in FindChiralCenters(). Calling this was unnecessary and the SMILES parser now avoids this by calling SetChiralityPerceived().
These two corrections are in smilesformat.cpp rev 2065 (which is compatible with v2.1.1) and SMILES to InChI should now be ok.
There still seems to be a bug in Canonical SMILES output and this has not yet been corrected. The third set of molecules listed in the original bug report are examples.
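To illustrate why neighbour ordering around a ring closure matters, here is a small Python sketch. It is not the code from smilesformat.cpp; it only assumes the usual SMILES convention that a ring-closure digit on a chiral atom counts as a neighbour at the position where the digit is written, and the "bond creation order" list is a hypothetical mis-ordering in which the ring-closure neighbour is appended last. Exchanging two neighbours is an odd permutation, which turns @ into @@ and would produce exactly the m0/m1 inversions reported.

```python
def permutation_is_odd(reference, observed):
    """True if `observed` is an odd permutation of `reference`."""
    position = {label: i for i, label in enumerate(reference)}
    perm = [position[label] for label in observed]
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 1


def flip(symbol):
    """Swap the two tetrahedral chirality marks."""
    return "@@" if symbol == "@" else "@"


# Neighbours of the chiral carbon in C[C@H]1Cc2ccccc2CN1, in the order
# they are written: methyl, implicit H, ring closure 1 (the N), then CH2.
written_order = ["CH3", "H", "N via ring closure 1", "CH2"]

# Hypothetical mis-ordering: the ring-closure bond is only added when the
# ring is closed, so its neighbour ends up last.
bond_creation_order = ["CH3", "H", "CH2", "N via ring closure 1"]

symbol = "@"
if permutation_is_odd(written_order, bond_creation_order):
    effective = flip(symbol)
else:
    effective = symbol

print("written as", symbol, "but interpreted as", effective)
```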
Logged In: YES
user_id=1514992
Originator: YES
Excellent, thank you! We applied smilesformat.cpp rev 2065 and now all 26 test cases provided pass without complaint.
I expanded my testing net to include slightly larger molecules ( <= 30 SMILES characters, 891 chiral molecules tested) and came up with a new set of discrepancies (total of 12). As before I can't be certain in all cases that it's not a discrepancy hard-coded in the MolFile / SMILES columns of the database, but I hand checked a couple that seem to be a problem within OpenBabel.
In the first case (attached as OB01_DoubleRing.mol) the SMILES string from the vendor database converted fine, apparently to both canonical SMILES and InChI. The MolFile converted to canonical SMILES OK, but lost the stereochemistry on one of the two chiral centers when converted to InChI. If I read the connection table in the MolFile correctly, the two carbons in question are atom 2 (between the two oxygens, which are atoms 3 and 11) and atom 1 (methyl group); that bond appears to be explicitly specified as "down" chirality in the 2-1 bond row ('6'). For reference, the vendor-supplied SMILES string is C[C@@H]1OC[C@]2(CCN(C)C2)O1 while the OpenBabel canonical form is CN1CC[C@]2(CO[C@@H](C)O2)C1
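For anyone checking the bond block by hand, this is roughly how I read it, as a small Python sketch. It assumes the V2000 molfile layout (3-character fixed-width fields: first atom, second atom, bond order, stereo flag) and the usual single-bond stereo codes (0 = none, 1 = up, 4 = either, 6 = down). The example line is made up to resemble the 2-1 row described above rather than copied from the attachment.

```python
# Stereo codes for single bonds in a V2000 bond block.
BOND_STEREO = {0: "none", 1: "up", 4: "either", 6: "down"}


def parse_bond_line(line):
    """Decode the first four fixed-width fields of a V2000 bond line."""
    fields = [line[i:i + 3].strip() for i in range(0, 12, 3)]
    first, second, order = (int(f) for f in fields[:3])
    stereo = int(fields[3]) if fields[3] else 0  # field may be absent
    return {
        "atoms": (first, second),
        "order": order,
        "stereo": BOND_STEREO.get(stereo, "unknown (%d)" % stereo),
    }


# Hypothetical bond row: atom 2 to atom 1, single bond, stereo flag 6.
example = "  2  1  1  6"
print(parse_bond_line(example))
```

As far as I understand the convention, the narrow end of the wedge sits on the first atom listed, so a row starting with atom 2 would put the stereo mark on atom 2.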
The second problem is with a fused ring (not sure that's the correct terminology), molfile attached as OB02_FusedRing.mol. The molecule technically has three chiral centers, but two of them may only have one R/L form sterically accessible (?). The molfile specifies chirality for all three. OB export to InChI results in loss of chirality information on all three centers ( InChI=1/C8H15NO/c1-9-6-2-3-7(9)5-8(10)4-6/h6-8,10H,2-5H2,1H3 ). OB export to canonical SMILES loses chirality on the hydroxyl. Working with the vendor-supplied SMILES string ( CN1[C@@H]2CC[C@H]1C[C@@H](O)C2 ) as input instead of the molfile results in a chirality flip on the hydroxyl when output as canonical SMILES ( O[C@@H]3C[C@H]4CC[C@@H](C3)N4C ), but proper chirality is conserved with InChI output ( InChI=1/C8H15NO/c1-9-6-2-3-7(9)5-8(10)4-6/h6-8,10H,2-5H2,1H3/t6-,7+,8+ ).
It's a funky structure; it has a plane of symmetry through the hydroxyl, so I suppose it's a structural stereoisomer rather than specifically chiral, but I'm not sure given that the ring is constrained (? - would it be able to invert back-and-forth?). For that matter, I'm not sure that the stereochemistry describing the bonds to the nitrogen is totally sane in the molfile.
I have other examples, but from scanning I found a few cases that I'm pretty certain are vendor-embedded differences (charge variation in a couple cases). Anyway if needed I can try to dredge up other examples.
Chiral center lost on MolFile import
Logged In: YES
user_id=1514992
Originator: YES
File Added: OB01_DoubleRing.mol
All chiral specification lost when converting MolFile to InChI, chiral flip converting to canonical SMILES
Logged In: YES
user_id=1514992
Originator: YES
File Added: OB02_FusedRing.mol
Chris said "There still seems to be a bug in Canonical SMILES output and this has not yet been corrected. The third set of molecules listed in the original bug report are examples."
This bug is fixed for those SMILES strings that do not have a ring on the chiral centre (r2929). The others need more work. I'm on it.
Noel, are these cases fixed in the current 2.2.2? I'm curious if we can finally close this.
This bug report covers every aspect of stereochemistry in OB. For example, there's reading 2D MOL files, InChI and SMILES. The SMILES should all be sorted in 2.2.2, 2D Mol are not supported (AFAIK), and I'm not sure how well reading InChI stereochem works.
I'm leaving it open but marking 2.3.x.
Noel, I think certainly the Molfile / SMILES bugs should be fixed with the stereo code. Should we make this bug into a new testcase?
I'll revisit this now.
Finally closed. The final hurdle was 2D -> InChI. Regarding test cases, I'd prefer to start off with simpler structures and be methodical about it.
For the record, I now have the following SMILES and InChI for the two test cases, OB01_DoubleRing.mol and OB02_FusedRing.mol respectively:
CN1CC[C@]2(CO[C@@H](C)O2)C1
InChI=1S/C8H15NO2/c1-7-10-6-8(11-7)3-4-9(2)5-8/h7H,3-6H2,1-2H3/t7-,8-/m1/s1
O[C@H]1C[C@H]2CC[C@@H](C1)N2C
InChI=1S/C8H15NO/c1-9-6-2-3-7(9)5-8(10)4-6/h6-8,10H,2-5H2,1H3/t6-,7+,8+