Thread: Re: [Rdkit-devel] [ rdkit-Bugs-1932365 ]
Open-Source Cheminformatics and Machine Learning
Brought to you by:
glandrum
|
From: Greg L. <gre...@gm...> - 2008-04-02 16:40:55
|
Hi Adrian,
Sorry for a CC to a list you're subscribed to, but I'm not sure if you
read the mailing list and this is a message/discussion that needs to
be on the mailing list.
Here's the text from your bug report:
------------------------
RDkit does neither return a molecule nor raise an error if an element in a
SMILES string is lowercase:
In [15]: Chem.MolFromSmiles('c1c[se]c2=NCC(=c21)CC(C(=O)O)N')
In [16]: Chem.MolFromSmiles('c1c[Se]c2=NCC(=c21)CC(C(=O)O)N')
Out[16]: <Chem.rdchem.Mol object at 0x8bc93e4>
As far as I know lowercase elements are not really SMILES standard
therefore I would propose to throw some kind of error which can be handled
nicely in Python.
------------------------
There are two things going on here:
1) Se is not recognized as an element that can be aromatic. The
consequence of this is that [se] is not recognized. In your line [16]
above, the molecule is processed correctly, but if you look at the
SMILES:
[2]>>> m = Chem.MolFromSmiles('c1c[Se]c2=NCC(=c21)CC(C(=O)O)N')
[3]>>> Chem.MolToSmiles(m)
Out[3] 'NC(CC1=C2C=C[Se]C2=NC1)C(=O)O'
you see that the ring is not recognized as aromatic. This is a very
straightforward fix. Within the proposed OpenSmiles spec
(http://opensmiles.org/spec/open-smiles-2-grammar.html), both Se and
As can be marked aromatic. Unless I hear otherwise from the list, I
will support this and modify the smiles parser and aromaticity
recognition code to accept Se and As as aromatic atoms.
2) When the smiles parser (or the other molecule parsers) fails, the
result is a "None" as a return value. The theory behind this was that
the code to catch bad molecules is then much simpler:
for s in smiles:
m = Chem.MolFromSmiles(s)
if not m: continue
instead of:
for s in smiles:
try:
m = Chem.MolFromSmiles(s)
except:
continue
That's the theory. The practice is that the try/except is still needed
because the sanitization code can throw exceptions:
[4]>>> Chem.MolFromSmiles('c1cccc1')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/glandrum/RDKit/Code/GraphMol/SmilesParse/<ipython console> in <module>()
ValueError: Sanitization error: Can't kekulize mol
This is inconsistent, confusing, and wrong. So I'll fix it. The
question is how. I see two options:
A) Continue returning None for failed parsing, but catch the
kekulization exceptions on the C++ side so that they don't make it
through to Python (i.e. return None from line [4] above)
B) Have all parse failures raise exceptions.
I'm leaning towards A) because I think the resulting client code is
much more readable. This really becomes important with suppliers,
which I think should also be consistent. Solution A) allows code like
this:
suppl = Chem.SDMolSupplier('file.sdf')
for mol in suppl:
if not mol: continue
<do something>
instead of this, which solution B would require (not tested):
suppl = Chem.SDMolSupplier('file.sdf')
while 1:
try:
m = suppl.next()
except StopIteration:
break
else:
continue
< do something >
Any arguments against me implementing solution A?
Thanks again for the detailed and careful bug report,
-greg
|
|
From: Adrian S. <adr...@gm...> - 2008-04-02 17:29:07
|
Personally, I would appreciate a solution which indicates why
structure parsing has failed (or at least allows some kind of debug).
I do not know if this is feasible in wrapped C++ code but it is really
useful to know where the problem arises from, particularly in the
context of bug reports.
I have to agree solution B is neither very elegant nor consistent but
maybe there is a way to implement some kind of debug mode?
On Wed, Apr 2, 2008 at 5:40 PM, Greg Landrum <gre...@gm...> wrote:
> Hi Adrian,
>
> Sorry for a CC to a list you're subscribed to, but I'm not sure if you
> read the mailing list and this is a message/discussion that needs to
> be on the mailing list.
>
> Here's the text from your bug report:
> ------------------------
> RDkit does neither return a molecule nor raise an error if an element in a
> SMILES string is lowercase:
>
>
> In [15]: Chem.MolFromSmiles('c1c[se]c2=NCC(=c21)CC(C(=O)O)N')
>
> In [16]: Chem.MolFromSmiles('c1c[Se]c2=NCC(=c21)CC(C(=O)O)N')
> Out[16]: <Chem.rdchem.Mol object at 0x8bc93e4>
> As far as I know lowercase elements are not really SMILES standard
> therefore I would propose to throw some kind of error which can be handled
> nicely in Python.
> ------------------------
>
> There are two things going on here:
>
> 1) Se is not recognized as an element that can be aromatic. The
> consequence of this is that [se] is not recognized. In your line [16]
> above, the molecule is processed correctly, but if you look at the
> SMILES:
> [2]>>> m = Chem.MolFromSmiles('c1c[Se]c2=NCC(=c21)CC(C(=O)O)N')
>
> [3]>>> Chem.MolToSmiles(m)
> Out[3] 'NC(CC1=C2C=C[Se]C2=NC1)C(=O)O'
>
> you see that the ring is not recognized as aromatic. This is a very
> straightforward fix. Within the proposed OpenSmiles spec
> (http://opensmiles.org/spec/open-smiles-2-grammar.html), both Se and
> As can be marked aromatic. Unless I hear otherwise from the list, I
> will support this and modify the smiles parser and aromaticity
> recognition code to accept Se and As as aromatic atoms.
>
> 2) When the smiles parser (or the other molecule parsers) fails, the
> result is a "None" as a return value. The theory behind this was that
> the code to catch bad molecules is then much simpler:
> for s in smiles:
> m = Chem.MolFromSmiles(s)
> if not m: continue
> instead of:
> for s in smiles:
> try:
> m = Chem.MolFromSmiles(s)
> except:
> continue
> That's the theory. The practice is that the try/except is still needed
> because the sanitization code can throw exceptions:
> [4]>>> Chem.MolFromSmiles('c1cccc1')
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
>
> /home/glandrum/RDKit/Code/GraphMol/SmilesParse/<ipython console> in <module>()
>
> ValueError: Sanitization error: Can't kekulize mol
>
> This is inconsistent, confusing, and wrong. So I'll fix it. The
> question is how. I see two options:
> A) Continue returning None for failed parsing, but catch the
> kekulization exceptions on the C++ side so that they don't make it
> through to Python (i.e. return None from line [4] above)
> B) Have all parse failures raise exceptions.
>
> I'm leaning towards A) because I think the resulting client code is
> much more readable. This really becomes important with suppliers,
> which I think should also be consistent. Solution A) allows code like
> this:
>
> suppl = Chem.SDMolSupplier('file.sdf')
> for mol in suppl:
> if not mol: continue
> <do something>
>
> instead of this, which solution B would require (not tested):
>
> suppl = Chem.SDMolSupplier('file.sdf')
> while 1:
> try:
> m = suppl.next()
> except StopIteration:
> break
> else:
> continue
> < do something >
>
> Any arguments against me implementing solution A?
>
> Thanks again for the detailed and careful bug report,
> -greg
>
|
|
From: Greg L. <gre...@gm...> - 2008-04-02 18:15:14
|
On Wed, Apr 2, 2008 at 7:29 PM, Adrian Schreyer
<adr...@gm...> wrote:
> Personally, I would appreciate a solution which indicates why
> structure parsing has failed (or at least allows some kind of debug).
> I do not know if this is feasible in wrapped C++ code but it is really
> useful to know where the problem arises from, particularly in the
> context of bug reports.
>
> I have to agree solution B is neither very elegant nor consistent but
> maybe there is a way to implement some kind of debug mode?
What about solution A (returning None on parse failure), with messages
displayed to stderr using the logging mechanism? I.e. when you'd see
something like this:
>>> m = Chem.MolFromSmiles('c1cccc1')
[TIMESTAMP] Sanitization error: Can't kekulize mol
>>> m is None
True
-greg
|
|
From: Adrian S. <adr...@gm...> - 2008-04-02 18:19:54
|
Would be fine with me!
Adrian
On Wed, Apr 2, 2008 at 7:14 PM, Greg Landrum <gre...@gm...> wrote:
> On Wed, Apr 2, 2008 at 7:29 PM, Adrian Schreyer
> <adr...@gm...> wrote:
> > Personally, I would appreciate a solution which indicates why
> > structure parsing has failed (or at least allows some kind of debug).
> > I do not know if this is feasible in wrapped C++ code but it is really
> > useful to know where the problem arises from, particularly in the
> > context of bug reports.
> >
> > I have to agree solution B is neither very elegant nor consistent but
> > maybe there is a way to implement some kind of debug mode?
>
> What about solution A (returning None on parse failure), with messages
> displayed to stderr using the logging mechanism? I.e. when you'd see
> something like this:
> >>> m = Chem.MolFromSmiles('c1cccc1')
> [TIMESTAMP] Sanitization error: Can't kekulize mol
> >>> m is None
> True
>
>
> -greg
>
|
|
From: Greg L. <gre...@gm...> - 2008-04-09 19:24:51
|
Adrian,
I'm not sure if you get email when the bugs are updated or not, so
I'll go ahead and post. The proposed solution here has been
implemented and checked in:
> On Wed, Apr 2, 2008 at 7:14 PM, Greg Landrum <gre...@gm...> wrote:
> >
> > What about solution A (returning None on parse failure), with messages
> > displayed to stderr using the logging mechanism? I.e. when you'd see
> > something like this:
> > >>> m = Chem.MolFromSmiles('c1cccc1')
> > [TIMESTAMP] Sanitization error: Can't kekulize mol
> > >>> m is None
> > True
The various file parsers (including the suppliers) should no longer
throw exceptions when they fail. They should display error messages
and return None. The exception to this is when the file parsers (or
suppliers) fail to open the input file; in this case you'll get an
IOError, which is the Pythonic (IMO) way of doing things.
Let me know if you have a chance to try it and find any problems,
-greg
|
|
From: Adrian S. <adr...@gm...> - 2008-04-09 21:09:05
|
Hi Greg,
I get the emails - I will test it tomorrow with the MSDchem set and
send you the results.
Adrian
On Wed, Apr 9, 2008 at 8:24 PM, Greg Landrum <gre...@gm...> wrote:
> Adrian,
>
> I'm not sure if you get email when the bugs are updated or not, so
> I'll go ahead and post. The proposed solution here has been
> implemented and checked in:
>
>
> > On Wed, Apr 2, 2008 at 7:14 PM, Greg Landrum <gre...@gm...> wrote:
> > >
>
> > > What about solution A (returning None on parse failure), with messages
> > > displayed to stderr using the logging mechanism? I.e. when you'd see
> > > something like this:
> > > >>> m = Chem.MolFromSmiles('c1cccc1')
> > > [TIMESTAMP] Sanitization error: Can't kekulize mol
> > > >>> m is None
> > > True
>
> The various file parsers (including the suppliers) should no longer
> throw exceptions when they fail. They should display error messages
> and return None. The exception to this is when the file parsers (or
> suppliers) fail to open the input file; in this case you'll get an
> IOError, which is the Pythonic (IMO) way of doing things.
>
> Let me know if you have a chance to try it and find any problems,
> -greg
>
|