From: Brian K. <bk...@wi...> - 2002-01-10 17:31:01
Christoph Steinbeck wrote:

>Egon Willighagen wrote:
>
>>MDL Molformat is, like CML (and like PDB and XYZ), a file format in which
>>all the coordinates (2D or 3D) of the atoms are given, unlike
>>SMILES, where only the structural information is given. But since we
>>have a SMILES reader and writer, it is a 4 hour job to have the database
>>store SMILES instead of MDL or CML...
>>
>
>I believe the SMILES reader has not been ported to the CDK yet. And we
>never had a SMILES writer. Both should not be too much work, based on
>the existing code, but there are some tricky issues:
>
>- You need to be able to find the Smallest Set of Smallest Rings (we
>have a class for this, DONE).
>- A good SMILES writer should be able to detect aromaticity (There is
>some basic code for this, PARTLY DONE).
>

I have written a Python version of a SMILES system with the help of Andrew Dalke. He has an excellent, superb regular expression system that he donated to me for reading SMILES and SMARTS. I am in a bit of awe of this, by the way. As far as I can tell, the tokenizers support all of the SMILES and SMARTS languages. They use some functionality that Python adds to regular expressions, but should be readily portable to Java.

I also have canonicalization routines that form canonical graphs and do graph walking, if anyone is interested. The various algorithms are pretty easy to implement, and the canonicalization routines that find the symmetry orders for graph walking should also be portable without too much trouble.

All of this code is free, by the way. I'll make the beta available if anyone is interested. Documentation is a little sparse, but this would give me a kick in the pants, so to speak. I am planning on porting the CDK depiction code to Python if I ever get the time, btw.

Aromaticity detection is causing me the largest amount of trouble, but there is an easy solution if you actually trust the SMILES string being entered to have the aromaticity specified.
Essentially, if all atoms in an SSSR ring are specified to be aromatic and all the bonds in that ring are unspecified, then the ring is aromatic. Daylight treats unspecified bonds between aromatic atoms as aromatic (standing in for either a single or a double bond), so you will never see an explicit double or triple bond in a canonical aromatic ring. Later versions of Daylight explicitly write single bonds between aromatic atoms that are not themselves aromatic bonds, to avoid ambiguity.

Most other systems use substructure detection: the user specifies what types of rings can be aromatic, and the system performs these substructure searches to detect aromaticity.

If anyone is willing to help me define Daylight's version of aromaticity using the Hückel 4n+2 rule, this could be used in both projects. Basically, single-cycle, (almost) planar systems that contain 4n+2 pi electrons, where n is a non-negative integer, exhibit aromatic character; n is usually limited to the range 0 to 5. Daylight has a pretty good description of aromaticity at

http://www.daylight.com/dayhtml/smiles/smiles-convent.html

So if anyone wants to help me define the rules for how many pi electrons different atoms contribute in a ring, forming aromaticity rules based on Daylight's conventions should be doable.

>
>Then you need to walk the graph and output the stuff as SMILES. Really,
>it should not be too much of an effort to implement the few features
>needed for writing and reading 99% of all naturally occurring SMILES,
>but if you want to implement the full SMILES specification...
>
>Cheers,
>
>Chris
>
>--
>Dr. Christoph Steinbeck (http://www.ice.mpg.de/departments/ChemInf)
>MPI of Chemical Ecology, Carl-Zeiss-Promenade 10, 07745 Jena, Germany
>Tel: +49(0)3641 643644 - Fax: +49(0)3641 643665
>
>What is man but that lofty spirit - that sense of enterprise.
>... Kirk, "I, Mudd," stardate 4513.3..
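
For concreteness, the two aromaticity checks described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the code from my SMILES system: the function names, the ring representation (a list of atom flags and a list of bond symbols), and the `contributions` table are all hypothetical, and the per-atom pi-electron contributions are exactly the part that still needs to be worked out from Daylight's conventions.

```python
def ring_is_aromatic_as_written(atom_is_aromatic, bond_symbols):
    """Trust-the-input test for one SSSR ring.

    atom_is_aromatic: list of bools, True if the atom was written as
                      aromatic (lowercase) in the SMILES string.
    bond_symbols:     list of bond symbols for the ring bonds, with ''
                      meaning the bond was left unspecified.
    (Both representations are assumptions made for this sketch.)
    """
    return all(atom_is_aromatic) and all(b == '' for b in bond_symbols)


def pi_electron_count(ring_symbols, contributions):
    """Sum the pi electrons contributed by each ring atom.

    contributions is a caller-supplied lookup table; filling it in
    according to Daylight's conventions is the open problem.
    """
    return sum(contributions[symbol] for symbol in ring_symbols)


def satisfies_huckel(pi_electrons, max_n=5):
    """True if pi_electrons == 4n + 2 for some integer n in 0..max_n."""
    if pi_electrons < 2:
        return False
    n, remainder = divmod(pi_electrons - 2, 4)
    return remainder == 0 and n <= max_n


# Benzene written as c1ccccc1: six aromatic carbons, six unspecified bonds.
benzene_flags = [True] * 6
benzene_bonds = [''] * 6
placeholder_table = {'c': 1}  # placeholder only, not Daylight's full rule set

written_aromatic = ring_is_aromatic_as_written(benzene_flags, benzene_bonds)
benzene_pi = pi_electron_count(['c'] * 6, placeholder_table)
huckel_ok = satisfies_huckel(benzene_pi)
```

With these inputs the benzene ring passes both checks (6 pi electrons, n = 1), while a count of 4, as in cyclobutadiene, fails the 4n+2 test. A per-symbol table is a simplification: the same element can contribute differently depending on its environment, so a real rule set would need more context than the atom symbol alone.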