From: John M. <joh...@gm...> - 2013-02-15 10:12:41
Hi Michael & Christoph,

Thanks for the info. I'll need some time to look into it, but it doesn't sound right :-). As far as I know, the SMILES parser and generator were written by different people and, as you say, they aren't really loyal to one particular representation. I need more time to look at the other issues, but I can try to answer some of the specific questions.

> 1) Why is the use of aromaticity not always enabled by default as it plays a key role for proper structure representation, mainly in case of SmilesGeneration and MOL file creation?

Bond type 4 in MDL is a query bond type, that is, it is meant to be used for querying molecules. In practice it is also used for general representation, but this is not correct. The SMILES generator should have aromaticity on by default, but there is definitely a bug with those nitrogens.

> 2) The SMILES c1cccc2nnnc12 appears to be valid when parsed with the SmilesParser. What happens so that the nitrogens are suddenly treated as aliphatic although being aromatic?

This might be because there is an aromatic flag on both atoms and bonds. Unfortunately the API isn't designed in a way that forces you to check both of these, and it's likely the SMILES generator relies on the aromatic flags on atoms whilst the MDL reader only sets the flags on bonds. I haven't looked at the code yet, but it should be fixable by changing the MDL reading to set aromatic flags on the atoms as well as the bonds.

> 3) How different are the SMILES implementations from CDK (which appears to rely on Daylight SMILES implementation) and ACD or ChemSpider? If you have any knowledge about that?

Not sure this is answerable. We really want to have a proper and complete OpenSMILES implementation. It's on my radar, but unfortunately so are many other things; any contributions are very welcome. There was an OpenSMILES parser written in Scala a couple of months ago, https://github.com/stefan-hoeck/chemf (GPL). I'm not sure if it does any interpretation, but it could be a good starting point.
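Regarding the atom and bond flags discussed under (2), the suggested fix amounts to copying the aromatic flag from each aromatic bond onto its two end atoms after reading the molfile. The following is only a minimal sketch of that idea, not CDK code: SketchAtom, SketchBond and propagateBondFlagsToAtoms are hypothetical stand-ins made up for illustration. In CDK itself the equivalent would presumably go through getFlag/setFlag with CDKConstants.ISAROMATIC on the real atom and bond objects, but that is an assumption about where the fix would live.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an atom; NOT the CDK IAtom interface.
final class SketchAtom {
    final String symbol;
    boolean aromatic; // false until a flag is propagated to it

    SketchAtom(String symbol) {
        this.symbol = symbol;
    }
}

// Hypothetical stand-in for a bond; NOT the CDK IBond interface.
final class SketchBond {
    final SketchAtom begin;
    final SketchAtom end;
    final boolean aromatic; // e.g. true when the molfile bond type was 4

    SketchBond(SketchAtom begin, SketchAtom end, boolean aromatic) {
        this.begin = begin;
        this.end = end;
        this.aromatic = aromatic;
    }
}

final class AromaticFlagSketch {

    // Marks both end atoms of every aromatic bond as aromatic.
    // Returns the number of atoms whose flag changed from false to true.
    static int propagateBondFlagsToAtoms(List<SketchBond> bonds) {
        int changed = 0;
        for (SketchBond bond : bonds) {
            if (!bond.aromatic) {
                continue; // non-aromatic bonds leave their atoms untouched
            }
            for (SketchAtom atom : new SketchAtom[] {bond.begin, bond.end}) {
                if (!atom.aromatic) {
                    atom.aromatic = true;
                    changed++;
                }
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Tiny fragment: an aromatic n-n bond plus a single bond to a carbon.
        SketchAtom n1 = new SketchAtom("N");
        SketchAtom n2 = new SketchAtom("N");
        SketchAtom c = new SketchAtom("C");
        List<SketchBond> bonds = new ArrayList<>();
        bonds.add(new SketchBond(n1, n2, true));
        bonds.add(new SketchBond(n2, c, false));

        int changed = propagateBondFlagsToAtoms(bonds);
        System.out.println(changed + " atoms flagged; N aromatic: " + n1.aromatic
                + ", C aromatic: " + c.aromatic);
    }
}
```

With the flags present on both atoms and bonds, code that only inspects atoms and code that only inspects bonds would see the same aromatic system, which is the consistency the SMILES generator appears to be missing here.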
> 4) Do you know a SMARTS pattern that might fix the problem with the explicit hydrogen position for the three nitrogens so the "invalid" SMILES could be replaced by a proper SMARTS?

I'm a bit confused by this. What's the query you are trying to do, and how are you doing the SMARTS matching?

SMARTSQueryTool -> SMARTS searching
FixBondOrdersTool -> aromatic to Kekulé representation

As I said, I can hopefully get back to you with more details soon.

Many thanks,
J

On 15 Feb 2013, at 09:11, cruttkie <cru...@ip...> wrote:

> Dear CDK developers,
>
> we're currently using the CDK a lot for SMILES generation, SMARTS matching and MOL/SDF import/export functionality.
> With the latest CDK version 1.4.15, and also prior ones, we did encounter some strange behavior related to the aromaticity handling, especially when using SMILES.
> The following SMILES was generated from ACD ChemSketch: c1cccc2nnnc12
> Clearly the unspecified position of the hydrogen for one of the nitrogens is not proper, but that's the output from ACD.
> Iterating the position of the hydrogen would require three separate SMILES like this c1cccc2nn[nH]c12, which makes it a little bit unfeasible for general substructure matching as we currently want to do with the help of CDK and ChemSpider.
>
> The following code should show you what we have done in order to generate a SMILES and/or IAtomContainer from the above string representation.
>
> String substrucPresent = "c1cccc2nnnc12";
> // convert input SMILES to MOL format for ChemSpider service
> SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());
> sp.setPreservingAromaticity(false);
> String mol = "";
> String s = "";
> try {
>     IMolecule temp = sp.parseSmiles(substrucPresent);
>     System.out.println("aromatic Hueckel? -> " + CDKHueckelAromaticityDetector.detectAromaticity(temp));
>     System.out.println("aromatic double bond? -> " + DoubleBondAcceptingAromaticityDetector.detectAromaticity(temp));
>     // create coordinates
>     StructureDiagramGenerator sdg = new StructureDiagramGenerator();
>     sdg.setMolecule(temp);
>     sdg.generateCoordinates();
>     IMolecule layedOutMol = sdg.getMolecule();
>
>     byte[] b = null;
>     ByteArrayOutputStream bos = new ByteArrayOutputStream();
>     MDLV2000Writer writer = new MDLV2000Writer(bos);
>     IOSetting[] ios = writer.getIOSettings();
>     for (int i = 0; i < ios.length; i++) {
>         System.out.println(ios[i].getName() + "\t" + ios[i].getSetting());
>     }
>     Properties customSettings = new Properties();
>     customSettings.setProperty("ForceWriteAs2DCoordinates", "true");
>     customSettings.setProperty("WriteAromaticBondTypes", "true");
>     PropertiesListener listener = new PropertiesListener(customSettings);
>     writer.addChemObjectIOListener(listener);
>
>     writer.write(layedOutMol);
>     writer.close();
>     b = bos.toByteArray();
>     mol = new String(b, "UTF-8");
>     System.out.println(mol);
>     IAtomContainer test2 = null;
>     InputStream is = new ByteArrayInputStream(mol.getBytes());
>     MDLV2000Reader reader = new MDLV2000Reader(is);
>     IChemFile chemFile = new ChemFile();
>     try {
>         chemFile = (IChemFile) reader.read(chemFile);
>         test2 = ChemFileManipulator.getAllAtomContainers(chemFile).get(0);
>     } catch (CDKException e) {
>         System.err.println("CDKException occured!");
>     }
>     System.out.println("aromatic? -> " + DoubleBondAcceptingAromaticityDetector.detectAromaticity(test2));
>
>     SmilesGenerator sg = new SmilesGenerator(true);
>     s = sg.createSMILES(layedOutMol);
>     System.out.println("old smiles -> " + substrucPresent);
>     System.out.println("smiles -> " + s);
> } catch (InvalidSmilesException e2) {
>     // TODO Auto-generated catch block
>     e2.printStackTrace();
> } catch (CDKException e) {
>     // TODO Auto-generated catch block
>     e.printStackTrace();
> } catch (IOException e) {
>     // TODO Auto-generated catch block
>     e.printStackTrace();
> }
>
> Depending on how both the SmilesParser and SmilesGenerator are set up regarding aromaticity, we get different results.
> As the original SMILES c1cccc2nnnc12 is parsed properly and can be converted to an IAtomContainer, the detection of its aromaticity works in both aromaticity detectors.
> When converting the IAtomContainer to mol format (explicitly requesting aromatic bond type 4), the newly loaded IAtomContainer test2 still shows aromaticity, but the SMILES is either N1NC2CCCCC2(N1) without using the aromaticityFlag or N1Nc2ccccc2(N1) with the flag. So only the carbons are treated as aromatic, but not the nitrogens.
>
> There are several questions on my mind:
> 1) Why is the use of aromaticity not always enabled by default as it plays a key role for proper structure representation, mainly in case of SmilesGeneration and MOL file creation?
>
> 2) The SMILES c1cccc2nnnc12 appears to be valid when parsed with the SmilesParser. What happens so that the nitrogens are suddenly treated as aliphatic although being aromatic?
>
> 3) How different are the SMILES implementations from CDK (which appears to rely on Daylight SMILES implementation) and ACD or ChemSpider? If you have any knowledge about that?
>
> 4) Do you know a SMARTS pattern that might fix the problem with the explicit hydrogen position for the three nitrogens so the "invalid" SMILES could be replaced by a proper SMARTS?
>
>
> Second, we worked with different bond annotations for representing aromaticity. We have two SDF files containing the same structures: the first with aromatic bonds marked explicitly with bond type 4 (examples_4.sdf), and the second with bonds marked as single and double, bond types 1 and 2 (examples_2.sdf). The second one was created with OpenBabel from the first one. Reading the files with CDK (1.4.2) and converting them to SMILES yields different strings, and even the molecular formulas differ in the number of hydrogens because of the wrong aromaticity detection. The result file is attached (smiles_examples_2_4.txt). We can send the code if necessary. While the conversion with CDK yields different SMILES, OpenBabel handles it properly.
>
> What is the reason CDK has problems with the aromaticity flag in this case, and is there a workaround which we haven't considered yet?
>
>
> Kind regards,
> Michael & Christoph
>
> <c1cccc2nnnc12_WriteAromaticBondTypes_false.mol><c1cccc2nnnc12_WriteAromaticBondTypes_true.mol><c1cccc2nn[nH]c12_WriteAromaticBondTypes_true.mol><openbabel_c1cccc2nnnc12.mol><examples_2_4.tgz>
> _______________________________________________
> Cdk-devel mailing list
> Cdk...@li...
> https://lists.sourceforge.net/lists/listinfo/cdk-devel