#1181 Different smiles for phosphate groups

cdk-1.4.x
closed
nobody
5
2013-09-23
2011-09-27
No

Mol files from KEGG and MetaCyc for phosphatidyl-myo-inositol 4,5-bisphosphate produce different SMILES, apparently because the =O in the phosphate group is written at a different position.

Using Marvin (with the attached files in the zip) I get the same SMILES for both files:

O[C@@H]1C@HC@@HC@HC@@H[C@@H]1OP(O)(=O)OCC@@HOC([*])=O
O[C@@H]1C@HC@@HC@HC@@H[C@@H]1OP(O)(=O)OCC@@HOC([*])=O

Using JChemPaint 3.0 I also get the same SMILES for both files, which is fine:

[*]C(=O)OCC(OC([*])=O)COP(=O)(O)OC1C(O)C(O)C(OP(=O)(O)O)C(OP(=O)(O)O)C1(O)
[*]C(=O)OCC(OC([*])=O)COP(=O)(O)OC1C(O)C(O)C(OP(=O)(O)O)C(OP(=O)(O)O)C1(O)

But using the SmilesGenerator directly on those mol files I get:

[*]C(=O)OCC(COP(O)(=O)OC1C(O)C(O)C(OP(O)(O)=O)C(OP(=O)(O)O)C1(O))OC([*])=O
[*]C(=O)OCC(COP(=O)(O)OC1C(O)C(O)C(OP(=O)(O)O)C(OP(=O)(O)O)C1(O))OC([*])=O

The following test code snippet shows the behaviour (using the files in the zip file):

@Test
public void testSmilesDirectly() {
    System.out.println("TestSmilesDirectly");
    IAtomContainer molBioCyc=null;
    IAtomContainer molKegg=null;
    try {
        MDLReader reader = new MDLReader(new FileReader("/Users/pmoreno/Documents/Sep2011/molUnifierTests/cdkSmilesBug/biocyc_version.mol"));
        molBioCyc = (IAtomContainer) reader.read(DefaultChemObjectBuilder.getInstance().newInstance(Molecule.class));
        AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(molBioCyc);
        reader.close();
        reader.setReader(new FileReader("/Users/pmoreno/Documents/Sep2011/molUnifierTests/cdkSmilesBug/Kegg_version.mol"));
        molKegg = (IAtomContainer) reader.read(DefaultChemObjectBuilder.getInstance().newInstance(Molecule.class));
        AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(molKegg);
    } catch (IOException e) {
        fail("Problems reading any of the files");
    } catch (CDKException e) {
        fail("CDK problems when reading mol files");
    }
    SmilesGenerator generator = new SmilesGenerator();
    String smilesBioCyc = generator.createSMILES(molBioCyc);
    System.out.println("BioCyc:\t"+smilesBioCyc);
    String smilesKegg = generator.createSMILES(molKegg);
    System.out.println("Kegg :\t"+smilesKegg);
    assertEquals(smilesBioCyc, smilesKegg);
}

I tried with and without the atom typing, but it doesn't make any difference. I wonder what JChemPaint is doing to get it right.

Discussion

  • Pablo Moreno

    Pablo Moreno - 2011-09-27

    Mol files from Kegg and BioCyc for the Phosphatidyl-myo-inositol 4,5-bisphosphate

     
  • Egon Willighagen

    I tried your unit test and can confirm that it gives different SMILES. I looked at the resulting SMILES with the AMBIT SMILES depicter, and they look identical to me. The files contain different information, which could be the cause, but at this moment I do not know which information would lead to the different canonicalization...

    I slightly changed your unit test, as found attached: it no longer catches exceptions and lets JUnit take care of them; the files are in the V2000 version of the MDL molfile format, so it uses the MDLV2000Reader; and it creates a new reader per file, because IChemObjectReaders should not be reused.

    But just to be clear, none of this changed the outcome...

     
  • Pablo Moreno

    Pablo Moreno - 2011-09-28

    Thanks for the answer, Egon. I did find a few other examples, not all of them with phosphates, but all with the same pattern of a displaced =O:

    Nacylneuraminate9p*.mol:
    BioCyc: [*]C(=O)NC1C(O)CC(O)(OC1(C(O)C(O)COP(O)(O)=O))C(=O)O
    Kegg : [*]C(=O)NC1C(O)CC(O)(OC1(C(O)C(O)COP(O)(O)=O))C(O)=O

    NacylLaspartate*.mol:
    BioCyc: [*]C(=O)NC(CC(=O)O)C(=O)O
    Kegg : [*]C(=O)NC(CC(O)=O)C(O)=O

    cerebroside3Sulfate*.mol
    BioCyc: [*]C(=O)NC(COC1OC(CO)C(O)C(OS(=O)(=O)O)C1(O))C(O)C=CCCCCCCCCCCCCC
    Kegg : [*]C(=O)NC(COC1OC(CO)C(O)C(OS(=O)(O)=O)C1(O))C(O)C=CCCCCCCCCCCCCC

    3OxoAcid*.mol
    BioCyc: [*]C(=O)CC(=O)O
    Kegg : [*]C(=O)CC(O)=O

    Stephan B. in my group said that he tried the first example I supplied on Windows (CDK 1.4.2) and got identical SMILES (?). I must say I have seen CDK-related Java code behave differently on different architectures (but I didn't dig deep enough to see whether it was really something in CDK or in my own code). By the way, I'm running on Mac OS X with CDK 1.3.8 (but I guess Egon already reproduced the error with CDK 1.4.2 on Linux). Mol files for these examples are in moreExamplesCDKSmilesBug.zip.

    I tried most of the example pairs with Marvin and got the same SMILES for both files of each pair.

     
  • Pablo Moreno

    Pablo Moreno - 2011-09-28

    A few more pairs of example mol files where different smiles are generated.

     
  • Till Schäfer

    Till Schäfer - 2012-12-03

    Bug 1237 is related to this bug

     
    Last edit: Till Schäfer 2012-12-03
  • John May

    John May - 2012-12-03

    Hi Till, thanks for linking to the related bug.

    I opened up the canonical labeller guts the other day for a bug in the comparators and when I saw you update the tracker I knew what the issue was. Sorry it's taken so long.

    To fix it, you need to add implicit hydrogens (which aren't added via atom typing).

    CDKHydrogenAdder.getInstance(biocyc.getBuilder()).addImplicitHydrogens(biocyc);
    CDKHydrogenAdder.getInstance(kegg.getBuilder()).addImplicitHydrogens(kegg);
    

    [*]C(=O)OCC(OC([*])=O)COP(=O)(O)OC1C(O)C(O)C(OP(=O)(O)O)C(OP(=O)(O)O)C1(O)
    [*]C(=O)OCC(OC([*])=O)COP(=O)(O)OC1C(O)C(O)C(OP(=O)(O)O)C(OP(=O)(O)O)C1(O)

    Tada!

    Okay, so what's going on? Well, the canonical labeller doesn't use the bond order in the initial invariants, so without implicit hydrogens it sees the oxygens around the phosphate as identical. We can print the initial invariants to show this. O4, O11 and O12 are the three terminal oxygens around one phosphate; I have marked them in the invariant lists.

    Here are initial invariants for the biocyc version without hydrogens:

    O1: 118000
    O2: 118000
    O3: 118000
    O4: 118000 // P=O
    O5: 118000
    O6: 118000
    O7: 118000
    O8: 118000
    O9: 118000
    O10: 118000
    O11: 118000 // P-O
    O12: 118000 // P-O
    O13: 118000
    R14: 110000
    R15: 110000
    C16: 226000
    C17: 226000
    O18: 228000
    O19: 228000
    O20: 228000
    O21: 228000
    O22: 228000
    O23: 228000
    C24: 336000
    C25: 336000
    C26: 336000
    C27: 336000
    C28: 336000
    C29: 336000
    C30: 336000
    C31: 336000
    C32: 336000
    P33: 4415000
    P34: 4415000
    P35: 4415000

    and with implicit hydrogens

    O1: 118000
    O2: 118000
    O3: 118000
    O4: 118000 // P=O
    O5: 118000
    O6: 218001
    O7: 218001
    O8: 218001
    O9: 218001
    O10: 218001
    O11: 218001 // P-O
    O12: 218001 // P-O
    O13: 218001
    R14: 110000
    R15: 110000
    C16: 426002
    C17: 426002
    O18: 228000
    O19: 228000
    O20: 228000
    O21: 228000
    O22: 228000
    O23: 228000
    C24: 336000
    C25: 336000
    C26: 436001
    C27: 436001
    C28: 436001
    C29: 436001
    C30: 436001
    C31: 436001
    C32: 436001
    P33: 4415000
    P34: 4415000
    P35: 4415000

    When there are no implicit hydrogens, the labeller assigns the same initial value to these oxygens. They are all attached to the same phosphate and thus can never be differentiated. As the labeller sees them as identical, the output order is random.
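
    The tie can be reproduced outside CDK with a minimal sketch. The digit layout below is inferred from the values printed in the lists above (total connections including hydrogens, heavy-atom neighbours, atomic number, two fields that are zero for every atom shown and are assumed here to be charge-related, hydrogen count); it is not taken from the CDK source. The class name and the invariant helper are hypothetical.

```java
public class InitialInvariantSketch {

    // Hypothetical helper, not a CDK method. Concatenates, without padding:
    // (heavy neighbours + hydrogens), heavy neighbours, atomic number,
    // "00" for the two fields that are zero for every atom in the lists
    // (assumed to be charge-related), and the hydrogen count.
    // Bond order does not appear anywhere in the value.
    static long invariant(int heavyNeighbours, int atomicNumber, int hydrogens) {
        String digits = "" + (heavyNeighbours + hydrogens) + heavyNeighbours
                + atomicNumber + "00" + hydrogens;
        return Long.parseLong(digits);
    }

    public static void main(String[] args) {
        // Without implicit hydrogens the P=O and P-OH oxygens tie.
        System.out.println("no H, P=O : " + invariant(1, 8, 0)); // 118000
        System.out.println("no H, P-OH: " + invariant(1, 8, 0)); // 118000
        // With implicit hydrogens the hydroxyl oxygen gets its own value.
        System.out.println("H,    P=O : " + invariant(1, 8, 0)); // 118000
        System.out.println("H,    P-OH: " + invariant(1, 8, 1)); // 218001
        // Two more values from the lists, as a check of the assumed layout.
        System.out.println("CH2 carbon: " + invariant(2, 6, 2)); // 426002
        System.out.println("phosphorus: " + invariant(4, 15, 0)); // 4415000
    }
}
```

    Since the bond order never enters the value, the hydrogen count is the only thing that separates the doubly bonded oxygen from the hydroxyl oxygens, which is why adding implicit hydrogens resolves the tie.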

    Hope it helps

     
  • Till Schäfer

    Till Schäfer - 2012-12-05

    It helped, thx a lot.

     
  • John May

    John May - 2013-09-23
    • status: open --> closed