formats/smilesformat.cpp contains the two lines (starting at line 407)
if (*_ptr<0 || isspace(*_ptr)) continue;
This tells it to ignore any whitespace in the SMILES string (so the non-SMILES "C P" gets interpreted as "CP").
It also tells the parser to ignore any bytes with the high bit set.
The latter is a problem if the input comes from a UTF-8 encoded source, where the non-ASCII characters are encoded with bytes which set the high bit. It means that an input like "CöP" will be interpreted as "CP".
The fix should be:
if (*_ptr<0 || isspace(*_ptr)) return false; // unsupported character
BTW, ideally "_ptr" should be a "signed char *". SGIs, for example, used an unsigned char as the default "char" type. I don't know if this is important these days.
Sorry - could you clarify exactly what whitespace is currently ignored?
Ahh, I see. Somewhere before then the string was truncated at the ' ' and the '\t'.. and the '\n' so those never get to ParseSmiles. However, "\v" passes through to that point and triggers the problem that my code review spotted. Here's a reproducible.
I think my "BTW" is wrong. "isprint" is a more portable way to handle what *_ptr<0 is I think trying to do.
Applied in r5005.
Reverted r5005 (in r5006). Causes inchiSteffen_PubChem.smi_Test to fail on Linux platforms (but not under MSVC). Don't have time now to sort it out, so just reverting.