Menu

#758 SMILES parser should fail with bad characters

open
None
5
2012-10-23
2011-10-19
No

formats/smilesformat.cpp contains the two lines (starting at line 407)

if (*_ptr<0 || isspace(*_ptr))
          continue;

This tells it to ignore any whitespace in the SMILES string (so the non-SMILES "C P" gets interpreted as "CP").

It also tells the parser to ignore any bytes with the high bit set.

The latter is a problem if the input comes from a UTF-8 encoded source, where the non-ASCII characters are encoded with bytes which set the high bit. It means that an input like "CöP" will be interpreted as "CP".

The fix should be:

if (*_ptr<0 || isspace(*_ptr))
          return false;  // unsupported character

BTW, ideally "_ptr" should be a "signed char *". SGIs, for example, used an unsigned char as the default "char" type. I don't know if this is important these days.

Discussion

  • Noel O'Boyle

    Noel O'Boyle - 2012-09-04

    Sorry - could you clarify exactly what whitespace is currently ignored?

    pybel.readstring("smi", "C P").molwt
    16.042459999999998
    pybel.readstring("smi", "CP").molwt
    48.024160999999999

     
  • Andrew Dalke

    Andrew Dalke - 2012-09-05

    Ahh, I see. Somewhere before then the string was truncated at the ' ' and the '\t'.. and the '\n' so those never get to ParseSmiles. However, "\v" passes through to that point and triggers the problem that my code review spotted. Here's a reproducible.

    >>> pybel.readstring("smi", "C\vC").molwt
    30.06904
    
     
  • Andrew Dalke

    Andrew Dalke - 2012-09-05

    I think my "BTW" is wrong. "isprint" is a more portable way to handle what *_ptr<0 is I think trying to do.

     
  • Noel O'Boyle

    Noel O'Boyle - 2012-09-05

    Applied in r5005.

     
  • Noel O'Boyle

    Noel O'Boyle - 2012-09-06

    Reverted r5005 (in r5006). Causes inchiSteffen_PubChem.smi_Test to fail on Linux platforms (but not under MSVC). Don't have time now to sort it out, so just reverting.