CDK 1.5.12 gives errors when parsing these three SMILES (all are ChEMBL compounds).
O(CC)C(\C(\C(C(=O)OCC)=O)=N(/=C(C)N1C)\c2c1cccc2)=O
Error: Multiple bonds specified. The reason seems to be the /= bond specification.
org.openscience.cdk.exception.InvalidSmilesException: could not parse OC(=C(\C\1=N\CCCN(CC)c2cccc(C)c2)O)\C\1=N/C(CCN3C(=O)OCC)CC3
, Ring closure bonds did not match. Ring was opened with '\' and closed with '/'. Note - directional bonds ('/','\') are relative.
org.openscience.cdk.exception.InvalidSmilesException: could not parse 'Fc1ccccc1N2CCN(CC2)CC\N=C/3\C(=C(/C/3=N\C(CCN4C(=O)OCC)CC4)O)O', Ring closure bonds did not match. Ring was opened with '/' and closed with '\'. Note - directional bonds ('/','\') are relative.
OpenBabel and Chemical Identifier Resolver successfully parse all three.
Any ideas?
Grrr, ChEMBL and their crappy SMILES. These are bugs in whichever program they use to write the SMILES (Pipeline Pilot, I think). I've told them about it numerous times but doubt they'll action it.
I'll address the second one first. For directional bonds (slashes) the meaning is relative. These are all the same structure:
C/1CCCCCC\C=C\1
C/1CCCCCC\C=C1
C1CCCCCC\C=C\1
C\1CCCCCC/C=C/1
C\1CCCCCC/C=C1
C1CCCCCC/C=C/1
Notice the slashes on ring closures reverse. So when they're the same it doesn't make sense:
C/1.C/C=C/1 (RDKit=trans, OE/CA=cis, OB=none)
C\1.C\C=C\1 (RDKit=trans, OE/CA=cis, OB=none)
C/1.C\C=C/1 (RDKit=cis, OE/CA=trans, OB=none)
C\1.C/C=C\1 (RDKit=cis, OE/CA=trans, OB=none)
Open Babel's behaviour is probably acceptable, but it does lose information on read, and normally if there's something wrong with the syntax it's likely to be duff input.
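The rule above (a slash on the opening and closing of the same ring bond must be reversed) can be sketched as a plain string pre-check. This is only an illustration in stdlib Java, not how CDK's parser does it; the class and method names are made up for the example, and it assumes the input is otherwise well-formed SMILES.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of CDK: flags ring closures whose opening and
// closing digits are both preceded by the SAME directional symbol.
final class RingClosureDirectionCheck {

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns the ring numbers written with identical '/' or '\' on both ends. */
    static List<Integer> inconsistentRingClosures(String smiles) {
        // ring number -> directional symbol seen at the opening ('\0' if none)
        Map<Integer, Character> open = new HashMap<>();
        List<Integer> bad = new ArrayList<>();
        char prev = '\0';
        for (int i = 0; i < smiles.length(); i++) {
            char c = smiles.charAt(i);
            if (c == '[') { // skip bracket atoms; digits inside are not ring closures
                int end = smiles.indexOf(']', i);
                if (end < 0) break;
                i = end;
                prev = ']';
                continue;
            }
            int ring = -1;
            if (isAsciiDigit(c)) {
                ring = c - '0';
            } else if (c == '%' && i + 2 < smiles.length()
                    && isAsciiDigit(smiles.charAt(i + 1))
                    && isAsciiDigit(smiles.charAt(i + 2))) {
                ring = Integer.parseInt(smiles.substring(i + 1, i + 3));
                i += 2;
            }
            if (ring >= 0) {
                char dir = (prev == '/' || prev == '\\') ? prev : '\0';
                if (open.containsKey(ring)) {
                    char openDir = open.remove(ring);
                    // the same symbol on both ends contradicts the relative meaning
                    if (dir != '\0' && dir == openDir) bad.add(ring);
                } else {
                    open.put(ring, dir);
                }
                prev = '0'; // a bond symbol only applies to the digit directly after it
                continue;
            }
            prev = c;
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println(inconsistentRingClosures("C/1CCCCCC\\C=C\\1"));
        System.out.println(inconsistentRingClosures("C/1.C/C=C/1"));
    }
}
```

A closure with a slash on only one end is treated as fine here, matching the six equivalent forms listed above.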
Regarding the first point, \= is not valid in SMILES. Roger pointed out to me that it's a possible extension, but we then found /#, which completely undermines any argument that its use was intended.
Curiously, these structures appear to be the result of InChI mangling (getting a connection table from the InChI is nearly always wrong!).
We actually used to accept '\=' and '/#' because the parser just used to ignore multiple bond specifications, but they are invalid and should not be accepted.
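If you want to flag such input before it reaches the parser, a rough pre-screen could look like the following. It is a sketch in stdlib Java only; the class and method names are hypothetical, and it checks nothing beyond a directional symbol sitting directly next to an explicit bond order symbol (=, # or $).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-screen, not part of CDK: reports two-character tokens such
// as "/=" or "/#" where a directional bond and a bond order are both given.
final class MixedBondSpecCheck {

    private static boolean isDirectional(char c) {
        return c == '/' || c == '\\';
    }

    private static boolean isBondOrder(char c) {
        return c == '=' || c == '#' || c == '$';
    }

    /** Returns every adjacent directional/bond-order pair found outside bracket atoms. */
    static List<String> mixedBondTokens(String smiles) {
        List<String> found = new ArrayList<>();
        boolean inBracket = false;
        for (int i = 0; i + 1 < smiles.length(); i++) {
            char a = smiles.charAt(i);
            char b = smiles.charAt(i + 1);
            if (a == '[') inBracket = true;
            else if (a == ']') inBracket = false;
            if (inBracket) continue; // bracket atom contents are not bond symbols
            if ((isDirectional(a) && isBondOrder(b)) || (isBondOrder(a) && isDirectional(b))) {
                found.add("" + a + b);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // first SMILES from the original report
        System.out.println(mixedBondTokens(
            "O(CC)C(\\C(\\C(C(=O)OCC)=O)=N(/=C(C)N1C)\\c2c1cccc2)=O"));
    }
}
```

Anything this returns a non-empty list for is worth setting aside for manual inspection rather than silently repairing, since the intended bond is not recoverable from the string alone.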
J
Last edit: John May 2016-02-19
John, is there any workaround you could suggest - or we should just forget about these 3 SMILES ?
24 in ChEMBL right?
Another case is a tetravalent aromatic carbon anion.
Again another bug where that carbon had bad valence and should not be aromatic.
You could pester ChEMBL to fix them. But I think in general these are bad molecules and would chuck them out.
John
What about writing a short note about this (e.g. for a preprint server)? Listing the key problems with the faulty SMILES? That way, people know about the issues. Do you see an automatic way of fixing these SMILES?
Hi John, you're right, we do use Pipeline Pilot for the canonical SMILES. These compounds were submitted to us via PubChem and they somehow passed through our normal curation checks even with the bad valence.
I will personally fix your example, CHEMBL3210799. We can't always control how the SMILES are handled through PP, but I can certainly fix the compounds which give bad SMILES due to a bad valence. Please can someone send me the ChEMBL IDs for the 3 which caused the issue above? I tried to use the SMILES provided to find them in our DB, but can't find a match. You can send them to chembl-help@ebi.ac.uk
Just as an aside - I did fix the issue you sent to us last March (CC(=O)O[C@@H]1C@@HC@(O)C@HC2=C1C3=C(C(=O)c4c(O)cccc4C3=O)/C/2=N/#N
CHEMBL1982727 and the others that were similar) but due to the fact we haven't done a release since last January, you wouldn't have had access to the updated compounds. We are due to release ChEMBL_21 by the end of this month, so hopefully that's one issue fixed for you.
Please feel free to pass any more issues back to us at chembl-help@ebi.ac.uk
kind regards
Louisa (Chemical Curator for ChEMBL)
John, All,
Just to clarify: while the three I reported are ChEMBL compounds, the SMILES themselves do not come directly from ChEMBL. My report says these SMILES can't be parsed by CDK 1.5.12, while they are parsed by e.g. OpenBabel. I see this as a more generic compatibility problem, which is only partially resolved by fixing SMILES in ChEMBL. We can encounter such SMILES from other sources as well.
Thanks,
Nina
Last edit: Nina Jeliazkova 2016-02-19
Thanks Louisa. I couldn't remember when I sent the help message but makes sense if they're not released yet.
Nina, garbage in, garbage out. There is a strict flag in the parser we could use, but then how do we tell the user that there is something very wrong with this input? There are so many bad connection tables out there that I've come round to the opinion that we shouldn't live and let live; just look at some PubChem entries! My latest favourite is CID 58150378 from attached image "DSOC". As for Open Babel accepting it, have you tried parsing eMolecules? There are literally thousands of broken molecules compared to this handful in ChEMBL.
17282564
I will also add that we (NextMove) chuck away the ChEMBL generated SMILES and regenerate them from the SDfile. This also fixes problems like MDL's triangle rule (inverted stereo if the center is drawn a certain way). Of course the molfile has its own problems.
John
John, personally I do prefer to work with SD files and regenerate SMILES from the connection table, but this is not everybody's preference :) Thanks for the hint about the strict flag.
While on this, I would also mention that most of the aromatic SMILES from the Open Melting Point dataset 10.6084/m9.figshare.1031637 are considered invalid by CDK 1.5.
Last edit: Nina Jeliazkova 2016-02-19
I don't think it's exposed, so it would need changes. By default it's relaxed; I considered these errors to be too bad to ignore.
If it's the aromatic nitrogen, they are definitely invalid. Alas, Daylight depict has ceased to exist, but poor implementations do not change the notion that a SMILES string has an exact formula.
You can fix these by turning off kekulize, making the change you see fit (add H, set charge), regenerating and then reading it in again. Daniel's almost talked me into special-casing some potentially unambiguous ones (e.g. n1cccc1, n1ccc2c1cccc2) but I'm still not convinced.
I should add those are probably ChemSketch's fault - there are loads in Wikipedia, as it's the recommended drawing tool to use. Sigh.
John
I've run a more exhaustive test on:
ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_20_chemreps.txt.gz
The results are as follows: 25 SMILES raise exceptions in CDK as previously indicated by John. All of them are also rejected by PubChem (tested through their REST API). Among those 25 SMILES there are 20 which OpenBabel parses. Among these 20 there are 5 which are also parsed by CIR (apparently running with different version/settings of Cactvs from those used by PubChem). There are 5 SMILES which are rejected by all (!) the tools we've tested (CDK, PubChem, OpenBabel and CIR).
I'm attaching a file with some more details on the 25 compounds (including ChEMBL IDs) and the testing outcomes (if some of the tools succeeded, this is explicitly noted).
A further report on ChEMBL SMILES failing in the parser; this time the SMILES are generated with CDK 1.5.12,
SmilesGenerator.absolute()
The test starts at Line 271
https://sourceforge.net/p/ambit/code/HEAD/tree/trunk/ambit2-all/ambit2-core/src/test/java/ambit2/core/test/SmilesTest.java
The structures are here (with ChEMBL ids)
https://sourceforge.net/p/ambit/code/HEAD/tree/trunk/ambit2-all/ambit2-core/src/test/resources/ambit2/core/chembl/roundtrip7.sdf
They all fail with
Multiple directional bonds on atom X
Do I understand right that there is currently no option to generate unique SMILES with stereo information besides absolute()?
Last edit: Nina Jeliazkova 2016-02-27
Correct, absolute is the only option at the moment and is backed by the InChI (Noel's Universal SMILES paper), so you sometimes see some odd results. However, this particular case is a limitation of SMILES. I'll double check the structures but see the OpenSMILES spec:
Just a quick point: there are a few old things in your tests:
going to deprecate the old one since I've seen this mistake so much, and since it also occurs from a CDK expert (yourself) it's just stupid to have the old bad one named better than the new one (note I didn't add the new one).
faster and also check stereo.
CDKHueckelAromaticityDetector is deprecated, use Aromaticity.
Regards,
John W May
john.wilkinsonmay@gmail.com
Related
Bugs: #1378
John,
Thanks. I understand the limitation of absolute(), just was not aware of its behaviour initially.
Re your points - this file was written years before CDK 1.5. The only new test is the roundtrip, which is the subject of this issue. The rest of the file is irrelevant for this issue. My bad, I did not separate old and new tests.
Yes, please do deprecate MDLReader. Or move it to a legacy package. There is a lot of legacy code out there (also in Ambit) which will continue to use it otherwise.
We usually use the AMBIT isomorphism tester, not the CDK one. I didn't know about 'Pattern.findIdentical', which means we are not yet aware of everything new in CDK 1.5.
I do need atom typing in order to be able to reproduce my use case, which has several following steps (not in the test though).
You might notice the file uses ambit2.core.helper.CDKHueckelAromaticityDetector, which is a wrapper around Aromaticity.
Last edit: Nina Jeliazkova 2016-03-01
Much better in ChEMBL 21.
Directional Bonds:
Failed to kekulize:
CHEMBL3188982 is new and quite fun! It's a perfect example of why you shouldn't add hydrogens. Notice that ChemAxon and Open Babel will add a hydrogen to the tetrazole when in fact it's a radical: https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL3188982
We've found only one ChEMBL 21 compound that does not survive the round-trip test
read-from-sdf --> generate-smiles --> parse-smiles
while using CDK 1.5.12, new SmilesGenerator().isomeric():
https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL2369356
The error is
cannot assign double bond configuration to non-double bond
Test added to the same test file
https://sourceforge.net/p/ambit/code/HEAD/tree/trunk/ambit2-all/ambit2-core/src/test/java/ambit2/core/test/SmilesTest.java#l247
Last edit: Nina Jeliazkova 2016-03-11
Thanks, that should be fixable.
Regards,
John W May
john.wilkinsonmay@gmail.com
I'll close this bug. If new bad ChEMBL SMILES pop up, we can file them as new bugs.
Reopening - was getting round to this this week.