#1013 Canonical SMILES not unique

cdk-1.2.x
closed
5
2012-11-03
2009-11-15
CharlieZhu
No

Samples from the Daylight theory page: http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
Converting from these ordinary SMILES:
OC(=O)C(Br)(Cl)N
ClC(Br)(N)C(=O)O
O=C(O)C(N)(Br)Cl

Two different canonical SMILES were yielded:
O=C(O)C(Cl)(Br)N
O=C(O)C(Br)(Cl)N

Charlie

Discussion

  • Rajarshi Guha
    2009-11-15

    Unit test added

     
  • Rajarshi Guha
    2009-11-15

    On investigation, the problem appears to stem from the initial labels of Cl and Br being identical. As a result, the initial sort and ranking give Cl and Br the same invariant pair, 11000. As the algorithm expands the neighborhood, the neighborhoods of both atoms are also identical, so the tie is never broken. The final invariant pair therefore depends on whether Cl or Br came first in the original SMILES.

     
  • Rajarshi Guha
    2009-11-15

    Actually, this is easily fixed: if the input molecule does not have its atomic numbers configured, the atomic number portion of the initial invariant label is 0, which is wrong. Instead, if it is not configured, we pull the atomic number from the PeriodicTable and then carry on. As a result, the canonical SMILES are identical. I'll upload a patch to fix this in a bit.

     
  • Rajarshi Guha
    2009-11-15

    The patch will take a while to work out due to issues not related to this bug. A simple workaround is to ensure that the molecule is appropriately configured; in this case, ensure that atomic numbers are configured by doing:

    // Assign atomic numbers (and other isotope data) to every atom before generating SMILES.
    IsotopeFactory fact = IsotopeFactory.getInstance(DefaultChemObjectBuilder.getInstance());
    fact.configureAtoms(molecule);

     
  • Rajarshi Guha
    2009-11-15

    As a follow-up, I am closing this bug and filing a more specific bug for the canonical labeler.
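
The tie described in the discussion can be illustrated with a few lines of plain Java. This is only a sketch under stated assumptions, not the CDK canonical labeler: the class InvariantTieSketch, the methods initialInvariant and resolveAtomicNumber, and the small PERIODIC_TABLE map are hypothetical names made up for illustration. The invariant is simplified to a neighbour count followed by a two-digit atomic number, and an unconfigured atomic number is modelled as 0.

```java
import java.util.HashMap;
import java.util.Map;

public class InvariantTieSketch {

    // Hypothetical stand-in for a periodic-table lookup (element symbol -> atomic number).
    static final Map<String, Integer> PERIODIC_TABLE = new HashMap<>();
    static {
        PERIODIC_TABLE.put("C", 6);
        PERIODIC_TABLE.put("N", 7);
        PERIODIC_TABLE.put("O", 8);
        PERIODIC_TABLE.put("Cl", 17);
        PERIODIC_TABLE.put("Br", 35);
    }

    // Simplified initial invariant: neighbour count followed by a two-digit atomic number.
    // A real labeler packs more fields; only these two matter for the tie shown here.
    static int initialInvariant(int neighbourCount, int atomicNumber) {
        return neighbourCount * 100 + atomicNumber;
    }

    // Use the configured atomic number if present; otherwise (0 in this sketch)
    // fall back to the lookup table keyed by element symbol.
    static int resolveAtomicNumber(String symbol, int configuredAtomicNumber) {
        if (configuredAtomicNumber > 0) {
            return configuredAtomicNumber;
        }
        Integer fromTable = PERIODIC_TABLE.get(symbol);
        if (fromTable == null) {
            throw new IllegalArgumentException("Unknown element symbol: " + symbol);
        }
        return fromTable;
    }

    public static void main(String[] args) {
        int unconfigured = 0; // assumption: an unconfigured atom contributes 0

        // Without the fallback, Cl and Br (one neighbour each) get the same label,
        // so their relative rank depends only on input order.
        int clTied = initialInvariant(1, unconfigured);
        int brTied = initialInvariant(1, unconfigured);
        System.out.println("unconfigured: Cl=" + clTied + " Br=" + brTied);

        // With the fallback, the labels differ and the tie is broken deterministically.
        int clFixed = initialInvariant(1, resolveAtomicNumber("Cl", unconfigured));
        int brFixed = initialInvariant(1, resolveAtomicNumber("Br", unconfigured));
        System.out.println("with lookup:  Cl=" + clFixed + " Br=" + brFixed);
    }
}
```

Once the two halogens carry distinct initial labels, their ranking no longer depends on the order in which they appear in the input SMILES. Configuring the atoms before generating the SMILES has the same effect.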