From: Egon W. <ego...@gm...> - 2013-02-15 12:47:03
Hi all,

Just as a quick note...

On Fri, Feb 15, 2013 at 10:11 AM, cruttkie <cru...@ip...> wrote:
> The following SMILES was generated from ACD ChemSketch: c1cccc2nnnc12
> Clearly the unspecified position of the hydrogen for one of the nitrogens is
> not proper, but that's the output from ACD.

I do not think the position of that hydrogen is the problem; it is more that the explicit hydrogen is not defined at all...

It seems that most toolkits are fairly OK with parsing it:

http://apps.ideaconsult.net:8080/ambit2/depict?search=c1cccc2nnnc12&smarts=

But note that Daylight's Depict marks the string as an invalid SMILES, and treats it as a query instead.

Now the problem is that the input has two bits of information for all three nitrogens:

- it is trivalent (lowest valency is the default for the organic subset, with 3,5 as possible N valencies)
- it is aromatic/sp2 (lower case organic subset)

The latter is very unclearly defined, and some cleanup has been attempted in the OpenSMILES specification. Now, the original specification talks about sp2, and the only way to have an sp2, trivalent nitrogen is with a double bond, but that is not possible for all three nitrogens in the above structure. Hence the SMILES may come from ACD ChemSketch, but it is faulty.

Then, in practice many tools know about this common mistake in SMILES strings, deal with it nevertheless, and make a good guess at it, as visible from the AMBIT service.

Now, the CDK does not gamble on what was meant, and ends up with an internal data structure. In CDK 1.4 there is no handling of unknown bond order, but that has been dealt with in 1.5/1.6.

Regarding the aromaticity, SMILES becomes even trickier, and the original SMILES requires the toolkit to perceive the aromaticity itself. And that is hard with ambiguous input...

> 1) Why is the use of aromaticity not always enabled by default as it plays a key role for proper structure representation, mainly in case of SmilesGeneration and MOL file creation?
Because aromaticity is a really difficult concept: there is no single agreed definition of it, and everyone disagrees on what that definition should be.

> 2) The SMILES c1cccc2nnnc12 appears to be valid when parsed with the SmilesParser. What happens so that the nitrogens are suddenly treated as aliphatic although being aromatic?

Yes, there is a known unit test failure about that:

http://pele.farmbio.uu.se/nightly-1.4.x/test/result-smiles.html

See testPyrrole3()

> 3) How different are the SMILES implementations from CDK (which appears to rely on Daylight SMILES implementation) and ACD or ChemSpider? If you have any knowledge about that?

No clue. They are not Open Source.

> 4) Do you know a SMARTS pattern that might fix the problem with the explicit hydrogen position for the three nitrogens so the "invalid" SMILES could be replaced by a proper SMARTS?

That would be an interesting question for the Blue Obelisk eXchange :)

Grtz,

Egon

-- 
Dr E.L. Willighagen
Postdoctoral Researcher
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
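A footnote to the valence argument earlier in this mail (that not all three nitrogens in c1cccc2nnnc12 can carry a double bond): this can be checked with a few lines of Python. This is only a minimal sketch, and not how the CDK or any other toolkit does it. The atom numbering and bond list are written out by hand from the SMILES (atoms 0-4 and 8 are the carbons, 5-7 the nitrogens), and `can_kekulize` is a hypothetical helper name. It assumes that every lower-case atom needs exactly one double bond, so the question becomes whether the atoms can be paired up along the bonds.

```python
# Hand-coded skeleton of c1cccc2nnnc12; this is NOT a SMILES parser.
# Atoms 0-4 and 8 are carbons, atoms 5, 6 and 7 are the nitrogens.
BONDS = [
    (0, 1), (1, 2), (2, 3), (3, 4),   # benzene part of the chain
    (4, 5), (5, 6), (6, 7), (7, 8),   # triazole part of the chain
    (8, 0),                           # ring closure 1
    (8, 4),                           # ring closure 2
]
NITROGENS = (5, 6, 7)


def can_kekulize(atoms, bonds):
    """Return True if every atom in `atoms` can get exactly one double bond.

    Simple backtracking: take the lowest-numbered unpaired atom and try to
    pair it with each of its still unpaired neighbours.
    """
    atoms = set(atoms)
    if not atoms:
        return True
    a = min(atoms)
    for x, y in bonds:
        if a in (x, y):
            b = y if x == a else x
            if b in atoms and can_kekulize(atoms - {a, b}, bonds):
                return True
    return False


# All nine aromatic atoms at once: no assignment of double bonds exists.
print(can_kekulize(range(9), BONDS))

# Treat one nitrogen as [nH] (it no longer needs a double bond) and retry.
for n in NITROGENS:
    print(n, can_kekulize(set(range(9)) - {n}, BONDS))
```

With all nine atoms the check fails, simply because nine atoms cannot be paired up. As soon as any one of the three nitrogens is taken out (that is, written as [nH]), the remaining eight atoms can be paired. Which nitrogen that should be is exactly the information the input does not contain.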