#1089 Problems with SMILES parsing

Sascha Toennies

I am trying to convert a list of SMILES codes generated by OSCAR 3 into Molecules and then into Fingerprints.
Surprisingly, the CDK seems to have problems parsing atom symbols with two characters, such as Se, Ce, and Eu. This happens for both organic and inorganic structures. You can find several examples (amongst other molecules with parsing errors) in the attached file.

I am using the SMILES parser like this:

    final SmilesParser sparser = new SmilesParser(DefaultChemObjectBuilder.getInstance());
    final Fingerprinter fingerprinter = new Fingerprinter();
    final IMolecule mol = sparser.parseSmiles(smiles);
    final BitSet fingerprint = fingerprinter.getFingerprint(mol);

I assume that I am using the API correctly, so my question is: is this a bug, or is it a mistake on my side, and if so, where is the usage error?


  • That code looks OK. Can you give example SMILES, please?

  • The attached log.html file contains several SMILES, each with its image and the corresponding Exception.
    One example would be: c1cc2cccnc2[se]1

  • OK, thanx. That's the kind of information we need! Several of the core developers will be at the ACS meeting next week, but I will fix at least some of these SMILES soon after.

  • Unit tests are available as patch:

    3054441 Unit tests for SMILES parsing bugs reported in #3048501

    Bug fixes are available as patches:

    3054454 Adds two char elem symbols missing in the SMILESParser
    3055418 Fix for SMILES parsing of 'aromatic' two-char elements

    Sascha, please feel free to try them out, and/or review the patch, and let me know if it solves your problems.
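
To illustrate the kind of problem these two patches address, here is a minimal sketch of reading the element symbol at the start of a bracket atom such as `[se]` or `[Eu]`. It is not the CDK's SmilesParser code: the class name `BracketSymbolSketch`, the method `readSymbol`, and the tiny element tables are all hypothetical and exist only for this example. The point it shows is that two-character symbols must be tried before one-character ones, and that a lowercase ("aromatic") spelling like `se` has to map back to the element `Se`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; not the CDK implementation. */
final class BracketSymbolSketch {

    // Deliberately tiny element table (assumption: a real parser covers the whole periodic table).
    private static final Set<String> ELEMENTS =
            new HashSet<>(Arrays.asList("C", "N", "O", "S", "Se", "Ce", "Eu"));

    // Small illustrative subset of lowercase symbols accepted as aromatic inside brackets.
    private static final Set<String> AROMATIC =
            new HashSet<>(Arrays.asList("c", "n", "o", "s", "se"));

    /**
     * Reads the element symbol at the start of the text found between '[' and ']'.
     * Returns the capitalized element symbol, or null if nothing in the tables matches.
     */
    static String readSymbol(String bracketContent) {
        // Longest match first, so "Se" is not read as "S" followed by a stray "e".
        for (int len = 2; len >= 1; len--) {
            if (bracketContent.length() < len) {
                continue;
            }
            String candidate = bracketContent.substring(0, len);
            if (ELEMENTS.contains(candidate)) {
                return candidate;
            }
            if (AROMATIC.contains(candidate)) {
                // Map the aromatic spelling back to the element symbol: "se" becomes "Se".
                return Character.toUpperCase(candidate.charAt(0)) + candidate.substring(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // "se" is the bracket content of the reported example c1cc2cccnc2[se]1.
        System.out.println(readSymbol("se"));
        System.out.println(readSymbol("Eu+3"));
    }
}
```

A parser that only checks one character at a time, or that only knows lowercase forms of one-character elements, would fail on exactly the inputs reported in this ticket.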

  • OK, I think all the bug fixes are now in the source code repository. Mind you, these only fix the SMILES parsing.

    What you will likely see next for several of these elements is that the atom typer knows nothing about them. Those are separate problems: the CDK knows nothing about the chemistry of those elements. So, if you see an error message like:

    Cannot percieve atom type for the 7th atom: Se

    or any of those other elements, you will need to provide the following information about those elements:

    1. element symbol (easy)
    2. formal charge
    3. number of bonded atoms
    4. number of lone pairs
    5. number of electrons available for pi-bonding
    6. hybridization type (e.g. sp3 or sp3d2)

    (and preferably a public database identifier (such as PubChem, ChemSpider, ChEBI, etc) for an example compound, so that I can write a unit test)

    These details are not easy to recover, but they are needed nevertheless. Many cheminformatics algorithms need two or more of these field values.
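
As a rough sketch of what such a report could carry, the six fields listed above fit into a small value class. `AtomTypeReport` and `toReportText` are hypothetical names invented for this example and are not part of the CDK API. The values in `main` describe selenium as in a simple dialkyl selenide, following the usual Lewis-structure picture; they are meant as an illustration and should be checked before being filed.

```java
/** Hypothetical holder for the six requested fields; not part of the CDK API. */
final class AtomTypeReport {
    final String symbol;         // 1. element symbol
    final int formalCharge;      // 2. formal charge
    final int bondedAtoms;       // 3. number of bonded atoms
    final int lonePairs;         // 4. number of lone pairs
    final int piElectrons;       // 5. number of electrons available for pi-bonding
    final String hybridization;  // 6. hybridization type, e.g. "sp3" or "sp3d2"

    AtomTypeReport(String symbol, int formalCharge, int bondedAtoms,
                   int lonePairs, int piElectrons, String hybridization) {
        if (symbol == null || symbol.isEmpty()) {
            throw new IllegalArgumentException("element symbol is required");
        }
        if (hybridization == null || hybridization.isEmpty()) {
            throw new IllegalArgumentException("hybridization is required");
        }
        if (bondedAtoms < 0 || lonePairs < 0 || piElectrons < 0) {
            throw new IllegalArgumentException("counts must not be negative");
        }
        this.symbol = symbol;
        this.formalCharge = formalCharge;
        this.bondedAtoms = bondedAtoms;
        this.lonePairs = lonePairs;
        this.piElectrons = piElectrons;
        this.hybridization = hybridization;
    }

    /** Formats the six fields as plain text that can be pasted into a bug report. */
    String toReportText() {
        return "element symbol: " + symbol + "\n"
                + "formal charge: " + formalCharge + "\n"
                + "bonded atoms: " + bondedAtoms + "\n"
                + "lone pairs: " + lonePairs + "\n"
                + "pi electrons: " + piElectrons + "\n"
                + "hybridization: " + hybridization;
    }

    public static void main(String[] args) {
        // Illustrative values only (assumption: neutral, two-coordinate selenium).
        AtomTypeReport se = new AtomTypeReport("Se", 0, 2, 2, 0, "sp3");
        System.out.println(se.toReportText());
    }
}
```

Rejecting missing or negative values in the constructor keeps an incomplete report from being produced, since all six fields are needed to define an atom type.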

    I'll close this bug. Please file a new bug report if you would like one of the above atom types defined, and include the values for those six fields.