Thread: [Rdkit-discuss] Isomeric smiles and explicit hydrogens
From: Noel O'B. <bao...@gm...> - 2008-04-14 10:50:25
I've been trying to get my head around what's happening when I read and write isomeric smiles. As a user, I hope that the same molecule will also have the same isomeric SMILES. However, look at the following examples using cinfony which read a SMILES string and write an isomeric SMILES string...

I'm trying to specify the chirality of the carbon in chlorobromomethane, but RDKit is not picking up on the chirality:

>>> rdk.readstring("smi", "[C](Cl)Br").write("iso")
'ClCBr'
(No chirality, as expected)

>>> rdk.readstring("smi", "[C@@H](Cl)Br").write("iso")
'Cl[CH]Br'
>>> rdk.readstring("smi", "[C@](Cl)Br").write("iso")
'ClCBr'
>>> rdk.readstring("smi", "Cl[C@]Br").write("iso")
'ClCBr'
>>> rdk.readstring("smi", "Cl[C@@H]Br").write("iso")
'Cl[CH]Br'
(Expected chirality, but didn't get it)

Let's try 1-chloro,1-bromoethane:

>>> rdk.readstring("smi", "Cl[C@@](Br)C").write("iso")
'CC(Cl)Br'
(Expected chirality, but didn't get it)
>>> rdk.readstring("smi", "Cl[C@@H](Br)C").write("iso")
'C[C@@H](Cl)Br'
(Expected chirality, and got it)

Is the problem with me or with RDKit?

On a related note, I have found that RDKit, when reading SDF files, turns all of the hydrogens into implicit hydrogens. However, when reading SMILES strings, it retains any explicit hydrogens specified in C@@H expressions. This doesn't seem to be consistent and requires the user to remove hydrogens if he/she wants to create a canonical smiles string.

Apologies in advance if my understanding of SMILES is shaky.

Regards,

Noel
From: Noel O'B. <bao...@gm...> - 2008-04-14 14:48:02
I think I've been misunderstanding the square brackets. I need to RTFM, I think, after which I'll post here again if still confused.

Noel
From: Noel O'B. <bao...@gm...> - 2008-04-14 14:51:49
And (egg on face) chlorobromomethane isn't chiral in the first place...what was I thinking?
From: Greg L. <gre...@gm...> - 2008-04-14 16:25:46
Hi Noel,

You already figured out the problem with the chirality of chlorobromomethane, but I want to clarify a couple of things below.

On Mon, Apr 14, 2008 at 12:50 PM, Noel O'Boyle <bao...@gm...> wrote:
>
> I'm trying to specify the chirality of the carbon in
> chlorobromomethane, but RDKit is not picking up on the chirality:
>
> >>> rdk.readstring("smi", "[C](Cl)Br").write("iso")
> 'ClCBr'
> (No chirality, as expected)

Just to be clear on this one, the output here is not technically correct; you've input a molecule with the formula CClBr (you told the software that the C has no implicit Hs by putting it in square brackets), the output however is for something with the formula CH2ClBr. This is actually a bug; thanks for finding it. :-)
https://sourceforge.net/tracker/index.php?func=detail&aid=1942220&group_id=160139&atid=814650

> >>> rdk.readstring("smi", "[C@@H](Cl)Br").write("iso")
> 'Cl[CH]Br'
> >>> rdk.readstring("smi", "[C@](Cl)Br").write("iso")
> 'ClCBr'
> >>> rdk.readstring("smi", "Cl[C@]Br").write("iso")
> 'ClCBr'
> >>> rdk.readstring("smi", "Cl[C@@H]Br").write("iso")
> 'Cl[CH]Br'
> (Expected chirality, but didn't get it)

As you've realized: this molecule isn't chiral, so the RDKit is doing the right thing by not marking chirality. It's doing something arguable with the canonical smiles though, because it's showing the explicit H (inside the square brackets). If you input exactly the same molecule as ClCBr, you'd get a different canonical smiles. This is a known oddity of the way things are currently handled internally and I haven't quite figured out a solution yet. Basically explicit Hs remain always explicit, even if they don't need to be.

> Let's try 1-chloro,1-bromoethane:
>
> >>> rdk.readstring("smi", "Cl[C@@](Br)C").write("iso")
> 'CC(Cl)Br'
> (Expected chirality, but didn't get it)

Again, the molecule as provided isn't chiral because carbon 1 only has three neighbors (you've told it that there are no implicit Hs).

> >>> rdk.readstring("smi", "Cl[C@@H](Br)C").write("iso")
> 'C[C@@H](Cl)Br'
> (Expected chirality, and got it)

It's even the right chirality, which is good to see. :-)

> Is the problem with me or with RDKit?

I'll answer that "or" question with a "yes", because it's a little of both. :-)

> On a related note, I have found that RDKit, when reading SDF files,
> turns all of the hydrogens into implicit hydrogens.

correct.

> However, when
> reading SMILES strings, it retains any explicit hydrogens specified in
> C@@H expressions. This doesn't seem to be consistent and requires the
> user to remove hydrogens if he/she wants to create a canonical smiles
> string.

I commented on this above. It's a known problem and I've been stewing over how to solve it for a while. Now that someone other than me is complaining I'll bump it up a bit in priority.

-greg
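The bracket rule Greg describes (an atom in square brackets carries only the hydrogens written inside the brackets) can be sketched in plain Python without RDKit. The regular expression below is a simplified, illustrative subset of the SMILES bracket-atom syntax, and the names `bracket_hydrogens` and `has_four_neighbours` are hypothetical helpers, not part of RDKit or cinfony. The neighbour count is supplied by hand rather than parsed from the SMILES string.

```python
import re

# Simplified pattern for one SMILES bracket atom: optional isotope, element
# symbol, optional @/@@ tag, optional H count, optional charge.
# Illustrative subset only, not a full SMILES grammar.
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z])"
    r"(?P<chiral>@{1,2})?"
    r"(?P<hcount>H\d*)?"
    r"(?P<charge>[+-]\d*)?\]"
)


def bracket_hydrogens(token):
    """Return the number of hydrogens written inside a bracket atom."""
    match = BRACKET_ATOM.fullmatch(token)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {token!r}")
    hcount = match.group("hcount")
    if hcount is None:
        return 0  # brackets without an H mean zero hydrogens on this atom
    return int(hcount[1:] or 1)  # a bare "H" means exactly one hydrogen


def has_four_neighbours(token, explicit_neighbours):
    """Necessary (not sufficient) condition for a tetrahedral stereocentre.

    explicit_neighbours is the number of atoms bonded to the bracket atom
    elsewhere in the SMILES string, counted by hand for this sketch.
    """
    return explicit_neighbours + bracket_hydrogens(token) == 4
```

Applied to the cases in this thread: `[C@@]` in `Cl[C@@](Br)C` has three neighbours and no H, so it cannot be a tetrahedral centre; `[C@@H]` in `Cl[C@@H](Br)C` has three neighbours plus one H, so it can. Four neighbours are only a precondition; the four substituents must also differ.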
From: Noel O'B. <bao...@gm...> - 2008-04-14 19:12:57
If I found a bug earlier, it was completely by accident. The following though I think is also a bug. I find that I can invert the stereocenter by adding and removing Hs.

>>> mol = rdk.readstring("smi", "C[C@@H](O)(Cl)c1ccccc1")
>>> mol.write("iso")
'C[C@@H](O)(Cl)c1ccccc1'
>>> mol.addh()
>>> mol.write("iso")
'[H]O[C@](Cl)(C([H])([H])[H])([H])c1c([H])c([H])c([H])c([H])c1[H]'
>>> mol.removeh()
>>> mol.write("iso")
'C[C@H](O)(Cl)c1ccccc1'

Can you tell whether the problem is when I add the Hs, or when I remove them? I might be able to workaround if the adding is working okay.

Noel
From: Greg L. <gre...@gm...> - 2008-04-14 20:04:05
On Mon, Apr 14, 2008 at 9:12 PM, Noel O'Boyle <bao...@gm...> wrote:
> If I found a bug earlier, it was completely by accident. The following
> though I think is also a bug. I find that I can invert the
> stereocenter by adding and removing Hs.
>
> >>> mol = rdk.readstring("smi", "C[C@@H](O)(Cl)c1ccccc1")
> >>> mol.write("iso")
> 'C[C@@H](O)(Cl)c1ccccc1'
> >>> mol.addh()
> >>> mol.write("iso")
> '[H]O[C@](Cl)(C([H])([H])[H])([H])c1c([H])c([H])c([H])c([H])c1[H]'
> >>> mol.removeh()
> >>> mol.write("iso")
> 'C[C@H](O)(Cl)c1ccccc1'
>
> Can you tell whether the problem is when I add the Hs, or when I
> remove them? I might be able to workaround if the adding is working
> okay.

As discussed in your later message, this molecule has a 5-coordinate C, so it probably shouldn't have the @ in the output SMILES at all. (Sarcasm doesn't work in email: that "probably" is a joke, it definitely shouldn't be in there; that's another nice bug).

I'm prepared to believe that there could be a bug that causes inversion of chirality when Hs are added and removed (I wouldn't be overly surprised), but it definitely doesn't always happen, as this case demonstrates:

>>> m = Chem.MolFromSmiles('O[C@H](F)Cl')
>>> Chem.MolToSmiles(m,1)
'O[C@H](F)Cl'
>>> m2=Chem.AddHs(m)
>>> Chem.MolToSmiles(m2)
'[H]OC(F)(Cl)[H]'
>>> Chem.MolToSmiles(m2,True)
'[H]O[C@](F)(Cl)[H]'
>>> m3 = Chem.RemoveHs(m2)
>>> Chem.MolToSmiles(m3,True)
'O[C@H](F)Cl'

After playing around a bit with a model, I think this is also ok:

>>> m = Chem.MolFromSmiles('C[C@@H](O)Cl')
>>> Chem.MolToSmiles(m,True)
'C[C@@H](O)Cl'
>>> m2 = Chem.AddHs(m)
>>> Chem.MolToSmiles(m2,True)
'[H]O[C@@](Cl)(C([H])([H])[H])[H]'
>>> m3 = Chem.RemoveHs(m2)
>>> Chem.MolToSmiles(m3,True)
'C[C@@H](O)Cl'

-greg
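Whether two SMILES strings describe the same handedness at a centre can be checked by hand with a permutation-parity argument, which is one way to confirm the "this is also ok" conclusion above without a physical model. The sketch below is plain Python and does not call RDKit; `same_chirality` and `permutation_is_even` are hypothetical helper names. It assumes the SMILES convention that the neighbours are listed in the order they appear in the string, with a bracket H (as in `[C@H]`) counted immediately after the preceding atom, and it assumes the four neighbour labels are distinct.

```python
def permutation_is_even(source, target):
    """Return True if reordering source into target is an even permutation."""
    position = {label: i for i, label in enumerate(source)}
    perm = [position[label] for label in target]
    # Count inversions; an even count means an even permutation.
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 0


def same_chirality(order_a, tag_a, order_b, tag_b):
    """Compare two (neighbour order, '@' or '@@') descriptions of one centre.

    An even reordering of the neighbours keeps the tag; an odd reordering
    flips it. The two descriptions agree when those facts are consistent.
    """
    even = permutation_is_even(order_a, order_b)
    return (tag_a == tag_b) == even


# 'O[C@H](F)Cl' versus '[H]O[C@](F)(Cl)[H]' from the session above
first = same_chirality(("O", "H", "F", "Cl"), "@", ("O", "F", "Cl", "H"), "@")

# 'C[C@@H](O)Cl' versus '[H]O[C@@](Cl)(C([H])([H])[H])[H]'
second = same_chirality(("C", "H", "O", "Cl"), "@@", ("O", "Cl", "C", "H"), "@@")
```

Both comparisons come out consistent, so adding the hydrogens did not invert either centre. The same check applied to `Cl[C@@H](Br)C` and `C[C@@H](Cl)Br` from earlier in the thread also agrees.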
From: Noel O'B. <bao...@gm...> - 2008-04-14 19:33:26
Wait a second, that molecule has five substituents on the isomeric C. But I think we share the blame again this time, Greg, because I took that structure from the RDKit Python tutorial Section 2.3. :-)

Noel
From: Greg L. <gre...@gm...> - 2008-04-14 19:54:30
On Mon, Apr 14, 2008 at 9:33 PM, Noel O'Boyle <bao...@gm...> wrote:
> Wait a second, that molecule has five substituents on the isomeric C.
> But I think we share the blame again this time, Greg, because I took
> that structure from the RDKit Python tutorial Section 2.3. :-)

Indeed. That's a documentation bug. I'll fix it.

There's also something bad in general going on with the handling of organic-subset atoms in square brackets that I'm going to have to track down (I think the five coordinate neutral C should have caused an error).

Thanks for reporting it.

-greg
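The kind of check that would reject the five-coordinate carbon can be sketched as follows. This is a toy illustration in plain Python, not how RDKit's sanitization works; `check_neutral_carbon` is a hypothetical name, and the bond orders and hydrogen count are entered by hand rather than read from a molecule.

```python
def check_neutral_carbon(bond_orders, bracket_h_count):
    """Raise ValueError if a neutral carbon would exceed a valence of four.

    bond_orders: bond orders to the explicitly written neighbours.
    bracket_h_count: hydrogens written inside the square brackets.
    """
    total = sum(bond_orders) + bracket_h_count
    if total > 4:
        raise ValueError(
            f"neutral carbon with valence {total} exceeds the maximum of 4"
        )
    return total


# Central carbon of 'C[C@@H](O)(Cl)c1ccccc1': four single bonds plus one H.
try:
    check_neutral_carbon([1, 1, 1, 1], 1)
    five_coordinate_rejected = False
except ValueError:
    five_coordinate_rejected = True
```

The central carbon of `C[C@@H](O)Cl`, by contrast, has three single bonds plus one H and passes the check.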