Open Babel / Bugs / #758 SMILES parser should fail with bad characters

#758 SMILES parser should fail with bad characters

Status: open

Owner: Noel O'Boyle

Labels: None

Priority: 5

Updated: 2012-10-23

Created: 2011-10-19

Creator: Andrew Dalke

Private: No

formats/smilesformat.cpp contains the two lines (starting at line 407)

if (*_ptr<0 || isspace(*_ptr))
          continue;

This tells it to ignore any whitespace in the SMILES string (so the non-SMILES "C P" gets interpreted as "CP").

It also tells the parser to ignore any bytes with the high bit set.

The latter is a problem if the input comes from a UTF-8 encoded source, where the non-ASCII characters are encoded with bytes which set the high bit. It means that an input like "CöP" will be interpreted as "CP".

The fix should be:

if (*_ptr<0 || isspace(*_ptr))
          return false;  // unsupported character

BTW, ideally "_ptr" should be a "signed char *". SGIs, for example, used an unsigned char as the default "char" type. I don't know if this is important these days.

Discussion

Noel O'Boyle - 2012-09-04

Sorry - could you clarify exactly what whitespace is currently ignored?

pybel.readstring("smi", "C P").molwt
16.042459999999998
pybel.readstring("smi", "CP").molwt
48.024160999999999

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andrew Dalke - 2012-09-05

Ahh, I see. Somewhere before then the string was truncated at the ' ' and the '\t'.. and the '\n' so those never get to ParseSmiles. However, "\v" passes through to that point and triggers the problem that my code review spotted. Here's a reproducible.

>>> pybel.readstring("smi", "C\vC").molwt 30.06904
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andrew Dalke - 2012-09-05

I think my "BTW" is wrong. "isprint" is a more portable way to handle what *_ptr<0 is I think trying to do.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Noel O'Boyle - 2012-09-05

Applied in r5005.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Noel O'Boyle - 2012-09-06

Reverted r5005 (in r5006). Causes inchiSteffen_PubChem.smi_Test to fail on Linux platforms (but not under MSVC). Don't have time now to sort it out, so just reverting.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

SMILES parser should fail with bad characters

Group

Searches

Help

#758 SMILES parser should fail with bad characters

Discussion