#729 Aromatic P not recognised in SMILES

open
nobody
None
5
2015-02-18
2011-06-02
No

From Andrew Dalke on list:

Perhaps I'm missing something after staring at fingerprint SMARTS definitions for the last few days. I'm validating the MACCS substructure keys from RDKit, which are also used in OpenBabel and CDK.

I'm writing a test suite, which will be public when done. (Actually, it is public now, if you know where the version control repository is.)

I'm having a very difficult time generating an aromatic ring with a "P" in it in OpenBabel.

>>> import pybel
>>> pybel.readstring("smi", "c1cccp1").write()
'C1CCCP1\t\n'
>>> pybel.readstring("smi", "c1ccccp1").write()
'C1=CC=CC=P1\t\n'

Since P is in the same group and has the same valence levels as N, I expected the first of these to return "c1cccp1", similar to

>>> pybel.readstring("smi", "c1cccn1").write()
'c1ccc[nH]1\t\n'

Both RDKit and OEChem have no problem dealing with "c1cccp1" and interpreting it as an aromatic ring.

I processed about 50K structures from PubChem to find a number with aromatic "p" in them. Since PubChem doesn't have aromaticity information, what I did was use another program to perceive the aromaticity. Below I show the RDKit SMILES for a structure and the OpenBabel equivalent for it.

You can see that of the 38 structures where RDKit has no problems with a "p" in an aromatic ring, 36 of them are converted into aliphatic form by OpenBabel.
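
The counts can be re-derived from the listing at the end of this report. The following is only a sketch of how that might be done, assuming each row has the form shown there (a Python bool followed by two repr()'d SMILES strings). The names `parse_row` and `tally` are made up for illustration and are not part of any toolkit.

```python
import re

# One row of the listing: a Python bool, then two repr()'d SMILES strings.
ROW = re.compile(r"^(True|False) '([^']*)' '([^']*)'$")

def parse_row(line):
    """Return (kept_p, rdkit_smiles, openbabel_smiles), or None for non-row lines."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    flag, rdkit_smi, ob_smi = m.groups()
    # The OpenBabel column ends with the four literal characters \t\n
    # (the repr of the tab and newline that write() appends); drop them.
    if ob_smi.endswith("\\t\\n"):
        ob_smi = ob_smi[:-4]
    return flag == "True", rdkit_smi, ob_smi

def tally(lines):
    """Return (number of rows, number of rows where OpenBabel kept a 'p')."""
    rows = [r for r in map(parse_row, lines) if r is not None]
    kept = sum(1 for r in rows if r[0])
    return len(rows), kept

# Two rows copied from the listing, plus one non-row line that is skipped.
sample = [
    r"False 'Cc1cp(-c2ccccc2)c(Br)c1C' 'CC1CP(C2CCCCC2)C(Br)C1C\t\n'",
    r"True 'Cc1np(C(Si(C)C)Si(C)C)nc1N1CCCCC1' 'Cc1[nH]p(C(Si(C)C)Si(C)C)nc1N1CCCCC1\t\n'",
    "Columns are",
]
print(tally(sample))  # (2, 1)
```

Feeding the full listing to `tally` gives the total number of rows and the number flagged True; the difference is the number of structures in which OpenBabel dropped the aromatic "p".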

Is there a chemical reason or a design reason why OpenBabel does this? Perhaps it's something subtle about aromaticity perception (which I sadly admit I still don't have a good grasp on).

This is with OpenBabel OBReleaseVersion() '2.3.0' which I built a couple of days ago.

                           Andrew
                           dalke@dalkescientific.com

Columns are
column 1: whether OpenBabel's SMILES contains an aromatic "p" (True or False)
column 2: the SMILES string from RDKit
column 3: the SMILES string from OpenBabel
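
The report does not show how column 1 was computed. One plausible way, sketched here with only the standard library, is to look for a lowercase "p" atom symbol in the SMILES string. The function name `has_aromatic_p` is hypothetical. A plain `"p" in smiles` test gives the same answer for the rows in this listing; the bracket handling below only guards against element symbols such as [Np], whose second letter is a lowercase p.

```python
import re

# A bracket atom such as [pH], [p-], [PH-] or [Np].
_BRACKET_ATOM = re.compile(r"\[[^\]]*\]")

def has_aromatic_p(smiles):
    """Return True if the SMILES string contains an aromatic phosphorus atom."""
    outside = []  # pieces of the string that lie outside bracket atoms
    pos = 0
    for m in _BRACKET_ATOM.finditer(smiles):
        outside.append(smiles[pos:m.start()])
        # Inside brackets, the symbol follows an optional isotope number.
        inner = m.group(0)[1:-1].lstrip("0123456789")
        if inner.startswith("p"):
            return True
        pos = m.end()
    outside.append(smiles[pos:])
    # Outside brackets no element symbol has "p" as its second letter,
    # so any lowercase p there is an aromatic phosphorus.
    return "p" in "".join(outside)

print(has_aromatic_p("c1cccp1"))  # True
print(has_aromatic_p("C1CCCP1"))  # False
```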

False 'CCc1c(CC)p(-c2ccccc2)c(-c2ccccc2)c1-c1ccccc1' 'CCC1C(CC)P(C2CCCCC2)C(C2CCCCC2)C1C1CCCCC1\t\n'
True '[W].Cc1np(C(Si(C)C)Si(C)C)nc1N1CCCCC1.[O+]#[C-].[C-]#[O+].[O+]#[C-].[C-]#[O+].[C-]#[O+]' '[W].Cc1[nH]p(C(Si(C)C)Si(C)C)nc1N1CCCCC1.[O+]#[C-].[C-]#[O+].[O+]#[C-].[C-]#[O+].[C-]#[O+]\t\n'
True 'Cc1np(C(Si(C)C)Si(C)C)nc1N1CCCCC1' 'Cc1[nH]p(C(Si(C)C)Si(C)C)nc1N1CCCCC1\t\n'
False 'c1ccc2c(c1)ccc1op(OC(C)CC(C)Op3oc4ccc5ccccc5c4c4c5ccccc5ccc4o3)oc3ccc4ccccc4c3c21' 'C1CCC2C(C1)CCC1OP(OC(C)CC(C)OP3OC4CCC5CCCCC5C4C4C5CCCCC5CCC4O3)OC3CCC4CCCCC4C3C21\t\n'
False 'Cc1cp(-c2ccccc2)c(Br)c1C' 'CC1CP(C2CCCCC2)C(Br)C1C\t\n'
False 'CCC(C)(C)c1c2c(pc(C(OC)=O)c1C(OC)=O)CCCCCC2' 'CCC(C)(C)C1=C2C(=PC(=C1C(=O)OC)C(=O)OC)CCCCCC2\t\n'
False '[Zr+2].CCC(C)(C)[c-]1p2c-p12.[CH]1[CH][CH][CH][CH]1.[CH]1[CH][CH][CH][CH]1' '[Zr+2].CCC(C)(C)[C-]1P2=P1[C-]2C(CC)(C)C.[CH]1[CH][CH][CH][CH]1.[CH]1[CH][CH][CH][CH]1\t\n'
False 'Cc1cccc2c1op(OC1COC3C(Op4oc5c(C)cccc5c5c(c(C)ccc5)o4)COC13)oc1c2cccc1C' 'CC1CCCC2C1OP(OC1COC3C(OP4OC5C(C)CCCC5C5C(C(C)CCC5)O4)COC13)OC1C2CCCC1C\t\n'
False 'c1cc2c(cc1)c(=O)op+o2' 'c1cc2c(cc1)C(=O)OP+O2\t\n'
False 'c1ccc(-c2cc(-c3ccccn3)cpc2)nc1' 'c1ccc(C2=CC(=CP=C2)c2ccccn2)nc1\t\n'
False 'c1csc(-c2psc(-c3ccccc3)c2)c1' 'c1csc(C2=PSC(=C2)c2ccccc2)c1\t\n'
False 'CC(Np1oc2ccc3c(cccc3)c2c2c(o1)ccc1c2cccc1)c1ccccc1' 'CC(NP1OC2CCC3C(CCCC3)C2C2C(O1)CCC1C2CCCC1)C1CCCCC1\t\n'
False '[Zr+2].[CH]1[CH][CH][CH][CH]1.[CH]1[CH][CH][CH][CH]1.C1C2CC3CC1CC([c-]1p4c-C6)p14)(C2)C3' '[Zr+2].[CH]1[CH][CH][CH][CH]1.[CH]1[CH][CH][CH][CH]1.C1C2CC3CC1CC([C-]1P4=P1[C-]4C14CC5CC(CC(C5)C1)C4)(C2)C3\t\n'
False 'c1ccc(P(C2C(Op3oc4ccc5c(cccc5)c4c4c(o3)ccc3c4cccc3)COC2)c2ccccc2)cc1' 'c1ccc(P(C2C(OP3OC4CCC5C(CCCC5)C4C4C(O3)CCC3C4CCCC3)COC2)C2CCCCC2)cc1\t\n'
False 'Cc1c(C)c(C)p(Cc2ccccc2Cp2c(C)c(C)c(C)c2C)c1C' 'CC1C(C)C(C)P(CC2CCCCC2CP2C(C)C(C)C(C)C2C)C1C\t\n'
False 'CCCN(C)p1oc2ccc3c(c2c2c(ccc4c2CCCC4)o1)CCCC3' 'CCCN(C)P1OC2CCC3C(C2C2C(CCC4C2CCCC4)O1)CCCC3\t\n'
False 'c1ccc2c(c1)cc(C)c1op(NN3CCCCC3)oc3c(C)cc4ccccc4c3c21' 'C1CCC2C(C1)CC(C)C1OP(NN3CCCCC3)OC3C(C)CC4CCCCC4C3C21\t\n'
False 'CCOC(=O)C=C(C)Np1oc2ccc3c(c2c2c(ccc4c2CCCC4)o1)CCCC3' 'CCOC(=O)C=C(C)NP1OC2CCC3C(C2C2C(CCC4C2CCCC4)O1)CCCC3\t\n'
False 'CCCCN(p1oc2ccc3c(c2c2c(o1)ccc1c2CCCC1)CCCC3)CCCC' 'CCCCN(P1OC2CCC3C(C2C2C(O1)CCC1C2CCCC1)CCCC3)CCCC\t\n'
False 'c1ccc2c(c1)cccc2CNp1oc2ccc3c(c2c2c(o1)ccc1c2CCCC1)CCCC3' 'c1ccc2c(c1)cccc2CNP1OC2CCC3C(C2C2C(O1)CCC1C2CCCC1)CCCC3\t\n'
False 'Cc1cc(C)c2op(N(C(C)c3ccccc3)C(C)c3ccccc3)oc3c(C)cc(C)cc3c2c1' 'CC1CC(C)C2OP(N(C(C)C3CCCCC3)C(C)C3CCCCC3)OC3C(C)CC(C)CC3C2C1\t\n'
False 'COc1cc(C)cc2c1op(N(C(C)c1ccccc1)C(C)c1ccccc1)oc1c(OC)cc(C)cc12' 'COC1CC(C)CC2C1OP(N(C(C)C1CCCCC1)C(C)C1CCCCC1)OC1C(OC)CC(C)CC21\t\n'
False 'Cc1cc(C)cc(P(CCOp2oc3c(C(C)(C)C)cc(C)c(C)c3c3c(C)c(C)cc(C(C)(C)C)c3o2)c2cc(C)cc(C)c2)c1' 'Cc1cc(C)cc(P(CCOP2OC3C(C(C)(C)C)CC(C)C(C)C3C3C(C)C(C)CC(C(C)(C)C)C3O2)C2CC(C)CC(C)C2)c1\t\n'
False 'CCN(CC)[p+]1c(P(=S)(c2ccccc2)c2ccccc2)c(-c2ccccc2)cc(-c2ccccc2)c1P(=S)(c1ccccc1)c1ccccc1' 'CCN(CC)[P+]1=C(P(=S)(c2ccccc2)c2ccccc2)C(=CC(=C1P(=S)(c1ccccc1)c1ccccc1)c1ccccc1)c1ccccc1\t\n'
False 'c1ccc(CCNp2oc3c(C)cc4ccccc4c3c3c(o2)c(C)cc2ccccc23)nc1' 'c1ccc(CCNP2OC3C(C)CC4CCCCC4C3C3C(O2)C(C)CC2CCCCC32)nc1\t\n'
False 'CN(C)p1n(S(C)(=O)=O)c2ccc3ccccc3c2c2c(ccc3ccccc23)n1S(C)(=O)=O' 'CN(C)P1N(S(=O)(=O)C)C2CCC3CCCCC3C2C2C(CCC3CCCCC23)N1S(=O)(=O)C\t\n'
False 'Cc1cc(C)c2op(N(C(C)c3ccccc3)C(C)c3ccccc3)oc3c(C)cc(C)c(C)c3c2c1C' 'CC1CC(C)C2OP(N(C(C)C3CCCCC3)C(C)C3CCCCC3)OC3C(C)CC(C)C(C)C3C2C1C\t\n'
False 'CC(=C)Cc1cccc2c1op(N(C(C)c1ccccc1)C(C)c1ccccc1)oc1c(CC(C)=C)cccc12' 'CC(=C)CC1CCCC2C1OP(N(C(C)C1CCCCC1)C(C)C1CCCCC1)OC1C(CC(=C)C)CCCC21\t\n'
False 'COC1COC(c2ccccc2)OC1C1OC(c2ccccc2)OCC1Op1oc2ccc3ccccc3c2c2c3ccccc3ccc2o1' 'COC1COC(c2ccccc2)OC1C1OC(c2ccccc2)OCC1OP1OC2CCC3CCCCC3C2C2C3CCCCC3CCC2O1\t\n'
False 'CC(N(p1n(S(C)(=O)=O)c2ccc3ccccc3c2c2c(ccc3ccccc23)n1S(C)(=O)=O)C(C)c1ccccc1)c1ccccc1' 'CC(N(P1N(S(=O)(=O)C)C2CCC3CCCCC3C2C2C(CCC3CCCCC23)N1S(=O)(=O)C)C(C)c1ccccc1)c1ccccc1\t\n'
False 'CC(C)N(C(C)C)p1n(S(c2ccc(C)cc2)(=O)=O)c2ccc3ccccc3c2c2c(ccc3ccccc23)n1S(c1ccc(C)cc1)(=O)=O' 'CC(C)N(C(C)C)P1N(S(=O)(=O)c2ccc(C)cc2)C2CCC3CCCCC3C2C2C(CCC3CCCCC23)N1S(=O)(=O)c1ccc(C)cc1\t\n'
False '[Pd+2].[CH2][CH][CH2].FC(F)(F)S([O-])(=O)=O.c1ccc(P(COp2oc3ccc4c(cccc4)c3c3c(o2)ccc2c3cccc2)c2ccccc2)cc1' '[Pd+2].[CH2][CH][CH2].FC(F)(F)S(=O)(=O)[O-].c1ccc(P(COP2OC3CCC4C(CCCC4)C3C3C(O2)CCC2C3CCCC2)C2CCCCC2)cc1\t\n'
False 'c1ccc(P(COp2oc3ccc4c(cccc4)c3c3c(o2)ccc2c3cccc2)c2ccccc2)cc1' 'c1ccc(P(COP2OC3CCC4C(CCCC4)C3C3C(O2)CCC2C3CCCC2)C2CCCCC2)cc1\t\n'
False 'c1ccc(C2C(Op3oc4ccccc4c4ccccc4o3)CCCC2)cc1' 'c1ccc(C2C(OP3OC4CCCCC4C4CCCCC4O3)CCCC2)cc1\t\n'
False 'CC(C)(C)Np1oc2ccc3c(c2c2c(ccc4c2CCCC4)o1)CCCC3' 'CC(C)(C)NP1OC2CCC3C(C2C2C(CCC4C2CCCC4)O1)CCCC3\t\n'
False 'COCCNp1oc2c(C)cc3ccccc3c2c2c(c(C)cc3ccccc32)o1' 'COCCNP1OC2C(C)CC3CCCCC3C2C2C(C(C)CC3CCCCC23)O1\t\n'
False '[Li+].[W].Cc1c[p-]cc1C.[C-]#[O+].[O+]#[C-].[C-]#[O+].[O+]#[C-].[O+]#[C-]' '[Li+].[W].CC1C[PH-]CC1C.[C-]#[O+].[O+]#[C-].[C-]#[O+].[O+]#[C-].[O+]#[C-]\t\n'
False 'COCC1N(p2oc3c(C)cc4ccccc4c3c3c(c(C)cc4ccccc43)o2)CCC1' 'COCC1N(P2OC3C(C)CC4CCCCC4C3C3C(C(C)CC4CCCCC34)O2)CCC1\t\n'

Discussion