Thread: Re: [Rdkit-devel] [ rdkit-Bugs-1932365 ]

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi Adrian,

Sorry for a CC to a list you're subscribed to, but I'm not sure if you
read the mailing list and this is a message/discussion that needs to
be on the mailing list.

Here's the text from your bug report:
------------------------
RDkit does neither return a molecule nor raise an error if an element in a
SMILES string is lowercase:

In [15]: Chem.MolFromSmiles('c1c[se]c2=NCC(=c21)CC(C(=O)O)N')

In [16]: Chem.MolFromSmiles('c1c[Se]c2=NCC(=c21)CC(C(=O)O)N')
Out[16]: <Chem.rdchem.Mol object at 0x8bc93e4>
As far as I know lowercase elements are not really SMILES standard
therefore I would propose to throw some kind of error which can be handled
nicely in Python.
------------------------

There are two things going on here:

1) Se is not recognized as an element that can be aromatic. The
consequence of this is that [se] is not recognized. In your line [16]
above, the molecule is processed correctly, but if you look at the
SMILES:
[2]>>> m = Chem.MolFromSmiles('c1c[Se]c2=NCC(=c21)CC(C(=O)O)N')

[3]>>> Chem.MolToSmiles(m)
Out[3] 'NC(CC1=C2C=C[Se]C2=NC1)C(=O)O'

you see that the ring is not recognized as aromatic. This is a very
straightforward fix. Within the proposed OpenSmiles spec
(http://opensmiles.org/spec/open-smiles-2-grammar.html), both Se and
As can be marked aromatic. Unless I hear otherwise from the list, I
will support this and modify the smiles parser and aromaticity
recognition code to accept Se and As as aromatic atoms.

2) When the smiles parser (or the other molecule parsers) fails, the
result is a "None" as a return value. The theory behind this was that
the code to catch bad molecules is then much simpler:
 for s in smiles:
   m = Chem.MolFromSmiles(s)
   if not m: continue
instead of:
 for s in smiles:
    try:
       m = Chem.MolFromSmiles(s)
    except:
       continue
That's the theory. The practice is that the try/except is still needed
because the sanitization code can throw exceptions:
[4]>>> Chem.MolFromSmiles('c1cccc1')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/home/glandrum/RDKit/Code/GraphMol/SmilesParse/<ipython console> in <module>()

ValueError: Sanitization error: Can't kekulize mol

This is inconsistent, confusing, and wrong. So I'll fix it. The
question is how. I see two options:
   A) Continue returning None for failed parsing, but catch the
kekulization exceptions on the C++ side so that they don't make it
through to Python (i.e. return None from line [4] above)
   B) Have all parse failures raise exceptions.

I'm leaning towards A) because I think the resulting client code is
much more readable. This really becomes important with suppliers,
which I think should also be consistent. Solution A) allows code like
this:

suppl = Chem.SDMolSupplier('file.sdf')
for mol in suppl:
  if not mol: continue
  <do something>

instead of this, which solution B would require (not tested):

suppl = Chem.SDMolSupplier('file.sdf')
while 1:
  try:
    m = suppl.next()
  except StopIteration:
    break
  else:
   continue
  < do something >

Any arguments against me implementing solution A?

Thanks again for the detailed and careful bug report,
-greg

Thread: Re: [Rdkit-devel] [ rdkit-Bugs-1932365 ]

Open-Source Cheminformatics and Machine Learning

rdkit-devel