#1089 Problems with SMILES parsing

Milestone: cdk-1.2.x
Status: closed
Owner: nobody
Priority: 5
Updated: 2012-11-03
Created: 2010-08-19
Private: No

I am trying to convert a list of SMILES codes generated by OSCAR 3 into Molecules and then into Fingerprints.
Surprisingly the CDK seems to have problems parsing atom symbols with two characters like Se, Ce, Eu etc. This happens both for organic and inorganic structures. You can find several examples (amongst other molecules with parsing errors) in the attached file.

I am using the SMILES parser like this:

    final SmilesParser sparser = new SmilesParser(DefaultChemObjectBuilder.getInstance());
    final Fingerprinter fingerprinter = new Fingerprinter();
    final IMolecule mol = sparser.parseSmiles(smiles);
    final BitSet fingerprint = fingerprinter.getFingerprint(mol);

I assume that I am using the API correctly, so my question is: is this a bug, or am I doing something wrong, and if so, where is the usage error?

Discussion

  • Sascha Toennies

    Sascha Toennies - 2010-08-19
     
  • Egon Willighagen

    That code looks OK. Can you give example SMILES, please?

     
  • Sascha Toennies

    Sascha Toennies - 2010-08-19

    There are several SMILES with the image and the corresponding Exception in the attached log.html file.
    One example would be: c1cc2cccnc2[se]1

     
  • Egon Willighagen

    OK, thanx. That's the kind of information we need! Several of the core developers will be at the ACS meeting next week, but I will fix at least some of these SMILES soon after.

     
  • Egon Willighagen

    Unit tests are available as patch:

    3054441 Unit tests for SMILES parsing bugs reported in #3048501

    Bug fixes are available as patches:

    3054454 Adds two char elem symbols missing in the SMILESParser
    3055418 Fix for SMILES parsing of 'aromatic' two-char elements

    Sascha, please feel free to try them out, and/or review the patch, and let me know if it solves your problems.
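For readers who want to see what "two-char" and "aromatic two-char" symbol handling involves, here is a minimal, self-contained sketch. It is not the code from the patches above and does not use the CDK; the class name `BracketSymbolSketch`, the method `readSymbol`, and the small element table are all hypothetical. It assumes the bracket-atom rules as described in the OpenSMILES grammar, where `se` and `as` are the only two-character aromatic symbols, and it tries the longest match first so that `[se]` is read as selenium rather than as sulfur followed by a stray `e`.

```java
import java.util.Set;

/** Sketch only: read the element symbol at the start of a SMILES bracket atom. */
final class BracketSymbolSketch {

    // Illustrative subset; a real parser needs the full periodic table.
    private static final Set<String> ELEMENTS =
            Set.of("H", "B", "C", "N", "O", "F", "P", "S", "I",
                   "Si", "Cl", "As", "Se", "Br", "Ce", "Eu");

    // Lowercase (aromatic) symbols assumed to be allowed inside brackets.
    private static final Set<String> AROMATIC =
            Set.of("b", "c", "n", "o", "p", "s", "se", "as");

    /**
     * Takes the text between '[' and ']' (for example "se", "nH" or "Eu+3")
     * and returns the capitalised element symbol.
     */
    static String readSymbol(String body) {
        int i = 0;
        // Skip an optional isotope number such as the "13" in "13CH4".
        while (i < body.length() && Character.isDigit(body.charAt(i))) {
            i++;
        }
        // Longest match first: try two characters, then fall back to one.
        for (int len = 2; len >= 1; len--) {
            int end = i + len;
            if (end > body.length()) {
                continue;
            }
            String candidate = body.substring(i, end);
            boolean known = ELEMENTS.contains(candidate) || AROMATIC.contains(candidate);
            if (!known) {
                continue;
            }
            // A lowercase letter directly after the match suggests a longer symbol
            // that is missing from the table, so fail instead of misreading it.
            if (end < body.length() && Character.isLowerCase(body.charAt(end))) {
                break;
            }
            return Character.toUpperCase(candidate.charAt(0)) + candidate.substring(1);
        }
        throw new IllegalArgumentException("Unrecognised element symbol in [" + body + "]");
    }

    public static void main(String[] args) {
        // The bracket atom from the example SMILES c1cc2cccnc2[se]1.
        System.out.println(readSymbol("se"));
    }
}
```

The guard against a trailing lowercase letter is a design choice of this sketch: with an incomplete table it is safer to reject `[Sc]` than to silently parse it as sulfur.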

     
  • Egon Willighagen

    OK, I think all the bug fixes are now in the source code repository. Mind you, these only fix the SMILES parsing.

    What you will likely see next for several of these elements is that the atom typer knows nothing about them. Those are separate problems: the CDK knows nothing about the chemistry of those elements. So, if you see an error message like:

    Cannot percieve atom type for the 7th atom: Se

    or any of those other elements, you will need to provide the following information about those elements:

    1. element symbol (easy)
    2. formal charge
    3. number of bonded atoms
    4. number of lone pairs
    5. number of electrons available for pi-bonding
    6. hybridization type (e.g. sp3 or sp3d2)

    (and preferably a public database identifier (such as PubChem, ChemSpider, ChEBI, etc) for an example compound, so that I can write a unit test)

    These details are not easy to recover, but they are needed nevertheless. Many cheminformatics algorithms need two or more of these field values.
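One way to collect the six fields listed above before filing a new report is a small value class. This is only a sketch: `AtomTypeReport`, its `Hybridization` enum, and `toReportText` are hypothetical names and not part of the CDK, and the selenium values in `main` are illustrative placeholders (modelled on a two-coordinate selenide) that should be checked against a real compound before being submitted.

```java
import java.util.Locale;

/** Sketch only: holds the six values requested for a new atom type. */
final class AtomTypeReport {

    // Hypothetical enum; extend it if another hybridization is needed.
    enum Hybridization { SP, SP2, SP3, SP3D, SP3D2 }

    final String symbol;
    final int formalCharge;
    final int bondedAtoms;
    final int lonePairs;
    final int piElectrons;
    final Hybridization hybridization;

    AtomTypeReport(String symbol, int formalCharge, int bondedAtoms,
                   int lonePairs, int piElectrons, Hybridization hybridization) {
        if (symbol == null || symbol.isEmpty()) {
            throw new IllegalArgumentException("element symbol is required");
        }
        if (bondedAtoms < 0 || lonePairs < 0 || piElectrons < 0) {
            throw new IllegalArgumentException("counts must not be negative");
        }
        if (hybridization == null) {
            throw new IllegalArgumentException("hybridization is required");
        }
        this.symbol = symbol;
        this.formalCharge = formalCharge;
        this.bondedAtoms = bondedAtoms;
        this.lonePairs = lonePairs;
        this.piElectrons = piElectrons;
        this.hybridization = hybridization;
    }

    /** Formats the values in the same order as the numbered list above. */
    String toReportText() {
        return String.join("\n",
                "1. element symbol: " + symbol,
                "2. formal charge: " + formalCharge,
                "3. number of bonded atoms: " + bondedAtoms,
                "4. number of lone pairs: " + lonePairs,
                "5. number of electrons available for pi-bonding: " + piElectrons,
                "6. hybridization type: " + hybridization.name().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only, not vetted chemistry.
        AtomTypeReport report =
                new AtomTypeReport("Se", 0, 2, 2, 0, Hybridization.SP3);
        System.out.println(report.toReportText());
    }
}
```

Validating in the constructor means an incomplete report fails early, which matches the point that all six values are needed.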

    I'll close this bug. Please file a new bug report if you would like one of the above atom types defined, and include the values for those six fields.

     