From: gilleain t. <gil...@gm...> - 2010-03-21 22:47:48
Hi,

When trying to read in the smiles of this molecule:

http://www.chemspider.com/RecordView.aspx?rid=0950d62d-f5b3-4aea-a993-76a3d78287ce

(the smiles is on that page, but I'll reproduce it here:

c1ccccc1P(c2ccccc2)(c3ccccc3)[Rh](P(c4ccccc4)(c5ccccc5)c6ccccc6)(P(c7ccccc7)(c8ccccc8)c9ccccc9)P(c%10ccccc%10)(c%11ccccc%11)c%12ccccc%12

) the cdk atom typer fails to type the Rhodium atom or the four phosphorus atoms.

My question is this - is this a bug, or a feature request? Are these considered to be missing types, or should the code be picking it up, and isn't?

gilleain
From: Egon W. <ego...@gm...> - 2010-03-21 22:54:54
Hi Gilleain,

On Sun, Mar 21, 2010 at 11:47 PM, gilleain torrance <gil...@gm...> wrote:
> When trying to read in the smiles of this molecule:
>
> http://www.chemspider.com/RecordView.aspx?rid=0950d62d-f5b3-4aea-a993-76a3d78287ce
>
> (the smiles is on that page, but I'll reproduce it here:
>
> c1ccccc1P(c2ccccc2)(c3ccccc3)[Rh](P(c4ccccc4)(c5ccccc5)c6ccccc6)(
> P(c7ccccc7)(c8ccccc8)c9ccccc9)P(c%10ccccc%10)(c%11ccccc%11)c%12cc
> ccc%12
>
> ) the cdk atom typer fails to type the Rhodium atom or the four phosphorus atoms.

Yeah, both might very well be missing. The Rhodium certainly.

> My question is this - is this a bug, or a feature request?

Bit of both, I guess... the CDK has never had information on them [0], but never really failed on them either... they were never supported. In that sense, a feature request. Then again, it simply should just support them, so I'll take it as a bug against [1].

> Are these considered to be missing types, or should the code be picking it up,
> and isn't?

Please provide appropriate atom type details, like in:

  <at:AtomType rdf:ID="C.sp3">
    <at:hasElement rdf:resource="&elem;C"/>
    <at:hybridization rdf:resource="&at;sp3"/>
    <at:formalCharge>0</at:formalCharge>
    <at:lonePairCount>0</at:lonePairCount>
    <at:formalNeighbourCount>4</at:formalNeighbourCount>
    <at:piBondCount>0</at:piBondCount>
  </at:AtomType>

Egon

0. http://chem-bla-ics.blogspot.com/2007/07/atom-typing-in-cdk.html
1. http://github.com/egonw/cdk/blob/master/src/main/org/openscience/cdk/dict/data/cdk-atom-types.owl

--
Post-doc @ Uppsala University
Proteochemometrics / Bioclipse Group of Prof. Jarl Wikberg
Homepage: http://egonw.github.com/
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
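As a rough illustration of what "appropriate atom type details" amounts to, the fields of the C.sp3 record above can be mirrored in a small data structure and checked for completeness. This is only a sketch: `AtomTypeRecord`, `missing_details`, and the `Rh.draft` identifier are hypothetical names made up for this example, not part of the CDK, and no real rhodium parameters are implied.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AtomTypeRecord:
    """Illustrative stand-in for one atom type entry; not a CDK class."""
    type_id: str
    element: Optional[str] = None
    hybridization: Optional[str] = None
    formal_charge: Optional[int] = None
    lone_pair_count: Optional[int] = None
    formal_neighbour_count: Optional[int] = None
    pi_bond_count: Optional[int] = None


def missing_details(record):
    """Return the names of the fields that have not been filled in yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Values copied from the C.sp3 example quoted in the message above.
c_sp3 = AtomTypeRecord("C.sp3", "C", "sp3", 0, 0, 4, 0)

# Hypothetical, deliberately incomplete entry: only the element is known.
rh_draft = AtomTypeRecord("Rh.draft", element="Rh")

print(missing_details(c_sp3))
print(missing_details(rh_draft))
```

A check like this would show at a glance which details still have to be supplied before a new type can be proposed for the atom type list.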
From: Daniel Z. <Zah...@ma...> - 2010-03-23 13:20:45
On Mar 21, 2010, at 6:54 PM, Egon Willighagen wrote:

> Hi Gilleain,
>
> On Sun, Mar 21, 2010 at 11:47 PM, gilleain torrance
> <gil...@gm...> wrote:
>> When trying to read in the smiles of this molecule:
>>
>> http://www.chemspider.com/RecordView.aspx?rid=0950d62d-f5b3-4aea-a993-76a3d78287ce
>>
>> (the smiles is on that page, but I'll reproduce it here:
>>
>> c1ccccc1P(c2ccccc2)(c3ccccc3)[Rh](P(c4ccccc4)(c5ccccc5)c6ccccc6)(
>> P(c7ccccc7)(c8ccccc8)c9ccccc9)P(c%10ccccc%10)(c%11ccccc%11)c%12cc
>> ccc%12
>>
>> ) the cdk atom typer fails to type the Rhodium atom or the four
>> phosphorus atoms.
>
> Yeah, both might very well be missing. The Rhodium certainly.
>
>> My question is this - is this a bug, or a feature request?
>
> Bit of both, I guess... the CDK has never had information on them [0],
> but never really failed on them either... they were never supported.
> In that sense, a feature request. Then again, it simply should just
> support them, so I'll take it as a bug against [1].

I think we should be a bit careful. I don't like the idea of things failing silently and if atom types are added, they should have some minimum amount of actual useful parameters, and not just be generic place holders that just prevent an exception from being thrown. Looking at atoms that can not be typed is really useful to us (see below); first because it helps us debug structures and second because it allows us to separate compounds for which calculations can be reasonably relied on from compounds where the calculations will be at best a rough estimate. There are of course a number of different strategies for providing this kind of information and I'm not very particular about the exact method, but I would be strongly against a situation where it is impossible to tell whether an atom has been given "real" parameters or some kind of place holding parameters.

>
>> Are these considered to be missing types, or should the code be
>> picking it up, and isn't?
>

I've just completed a major re-extraction of structures from the DTP internal system. I was going to wait until all the structures had been uploaded to our web site and PubChem before I posted a summary, but it seems appropriate to post the results in this discussion.

Background: DTP has been collecting compounds for over 50 years and the storage of structural information has used hand drawn pictures on 3X5 cards and practically the whole history of computer representation of chemical structures. For most of this history the molecular formula was entered independently of the structure and thus calculating the molecular formula from the structure and comparing it to the stored formula provides at least a first pass check on whether all conversions from various structure formats have resulted in something that is consistent with the expected structure.

Method:
1) extract structures (2D coordinates) from internal database in SD file format
2) read SD file with JUMBO code
3) convert CMLMolecule to CDK molecule
4) use CDK to separate into disconnected parts, if any
5) for each part, parameterize, add hydrogens, calculate formula, and compare to stored formula
6) write result to database.

Results:
Using CDK 1.2.5
Structures processed = 265195
Unparameterized Atom in structure = 6698 (2.5%)
Unparameterized Atom (atom or ligand atoms not in first or second row) = 1134 (0.4%)
Count of Unparameterized for some elements
  C = 24
  N = 237
  O = 68
  P = 346
  H = 7
  S = 89
  Cl = 36
  F = 36
  B = 291
Other "No comparison" = 269
failed comparison = 2747 (1.0%)
passed comparison = 255481 (96.3%)

I haven't looked real closely at what is left, but it seems to be mostly down to little corner cases that will need to be looked at one by one. A significant number of the N failures are just incorrect structures. Boron is first row, but complicated. The H failures I think are all in three center bonds with boron. Hexafluorophosphate and perchlorate account for a bunch and may be worth taking a look at.

Of course at some point I'll have to look closely at whether some of the failures and some of the successes are due to failures in parameterization, but overall I'm very satisfied with the results.

As I said, I still need to load this into PubChem and our website. I'll post here when that is finished. It will probably be a week or so. If anybody is raring to go and wants to get something sooner, let me know and I can post subsets as SD files.

DanZ

/********************************************
* Daniel Zaharevitz
* Chief, Information Technology Branch
* Developmental Therapeutics Program
* National Cancer Institute
* Zah...@ma...
*
********************************************/
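The comparison in step 5 and the percentages in the results can be sketched without any chemistry toolkit. The code below is a minimal illustration, not the JUMBO/CDK pipeline described above: `parse_formula` and `classify` are hypothetical helpers, the parser only handles plain formulas such as C6H6 (no charges, brackets, or isotopes), and reading "No comparison" as "no usable stored formula" is an assumption. The counts are copied from the results above; treating the four outcome categories as a partition of all structures is also an assumption, though it is consistent with their sum.

```python
import re
from collections import Counter

_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Turn a plain formula string like 'C2H5Cl' into element counts."""
    text = formula.replace(" ", "")
    matches = list(_TOKEN.finditer(text))
    # Reject anything the simple element/count pattern does not fully cover.
    if "".join(m.group(0) for m in matches) != text:
        raise ValueError(f"cannot parse formula: {formula!r}")
    counts = Counter()
    for m in matches:
        counts[m.group(1)] += int(m.group(2) or 1)
    return counts


def classify(stored, calculated, has_untyped_atom):
    """Assign one structure to an outcome category (labels from the post)."""
    if has_untyped_atom:
        return "unparameterized atom"
    if not stored:
        return "no comparison"  # assumed meaning: nothing to compare against
    same = parse_formula(stored) == parse_formula(calculated)
    return "passed comparison" if same else "failed comparison"


# Element order does not matter once formulas are reduced to counts.
print(classify("C6H6", "H6C6", False))

# Outcome counts as reported above for CDK 1.2.5.
total = 265195
outcomes = {
    "unparameterized atom": 6698,
    "no comparison": 269,
    "failed comparison": 2747,
    "passed comparison": 255481,
}
for label, n in outcomes.items():
    print(f"{label}: {n} ({100 * n / total:.1f}%)")
print(sum(outcomes.values()) == total)
```

Comparing element counts rather than raw strings avoids false failures caused purely by differences in how the stored and calculated formulas order their elements.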
From: Egon W. <ego...@gm...> - 2010-03-26 16:03:58
On Tue, Mar 23, 2010 at 2:06 PM, Daniel Zaharevitz <Zah...@ma...> wrote:
> On Mar 21, 2010, at 6:54 PM, Egon Willighagen wrote:
>>> the cdk atom typer fails to type the Rhodium atom or the four
>>> phosphorus atoms.
>>
>> Yeah, both might very well be missing. The Rhodium certainly.
>>
>>> My question is this - is this a bug, or a feature request?
>>
>> Bit of both, I guess... the CDK has never had information on them [0],
>> but never really failed on them either... they were never supported.
>> In that sense, a feature request. Then again, it simply should just
>> support them, so I'll take it as a bug against [1].
>
> I think we should be a bit careful. I don't like the idea of things
> failing silently

Agreed.

> and if atom types are added, they should have some
> minimum amount of actual useful parameters,

Agreed. The CDK atom type list defines such minimum information, and is separate from the source code. See:

http://github.com/egonw/cdk/blob/master/src/main/org/openscience/cdk/dict/data/cdk-atom-types.owl

> and not just be generic
> place holders that just prevent an exception from being thrown.

Indeed. Actually, I would encourage everyone to run the atom typing as one of the first things in your source code, and signal a warning if one or more atom types are not recognized. The set of CDK-Taverna nodes has such a filter.

> Looking at atoms that can not be typed is really useful to us (see
> below); first because it helps us debug structures and second because
> it allows us to separate compounds for which calculations can be
> reasonably relied on from compounds where the calculations will be at
> best a rough estimate.

That is really important to note indeed. When the CDK does not recognize the atom type, any calculation result is basically undefined.
> There are of course a number of different
> strategies for providing this kind of information and I'm not very
> particular about the exact method, but I would be strongly against a
> situation where it is impossible to tell whether an atom has been
> given "real" parameters or some kind of place holding parameters.
>>
>>> Are these considered to be missing types, or should the code be
>>> picking it up, and isn't?
>
> I've just completed a major re-extraction of structures from the DTP
> internal system. I was going to wait until all the structures had been
> uploaded to our web site and PubChem before I posted a summary, but it
> seems appropriate to post the results in this discussion.

Very much appreciated!

> Background: DTP has been collecting compounds for over 50 years and
> the storage of structural information has used hand drawn pictures on
> 3X5 cards and practically the whole history of computer
> representation of chemical structures. For most of this history the
> molecular formula was entered independently of the structure and thus
> calculating the molecular formula from the structure and comparing it
> to the stored formula provides at least a first pass check on whether
> all conversions from various structure formats have resulted in
> something that is consistent with the expected structure.
>
> Method:
> 1) extract structures (2D coordinates) from internal database in SD
> file format
> 2) read SD file with JUMBO code
> 3) convert CMLMolecule to CDK molecule

What was the reason not to use the CDK MDL reader?

> 4) use CDK to separate into disconnected parts, if any
> 5) for each part, parameterize, add hydrogens, calculate formula, and
> compare to stored formula
> 6) write result to database.
>
> Results:
> Using CDK 1.2.5
> Structures processed = 265195
> Unparameterized Atom in structure = 6698 (2.5%)
> Unparameterized Atom (atom or ligand atoms not in first or second row)
> = 1134 (0.4%)
> Count of Unparameterized for some elements
> C = 24
> N = 237

Yes, these are the trickiest, in particular detecting whether an in-ring nitrogen is N.sp3 or planar like N.sp2 or N.planar3... CDK 1.2.6 (soon to be released) already has a patch for one bug in that respect.

> O = 68
> P = 346
> H = 7
> S = 89
> Cl = 36
> F = 36
> B = 291
> Other "No comparison" = 269
> failed comparison = 2747 (1.0%)
> passed comparison = 255481 (96.3%)
>
> I haven't looked real closely at what is left, but it seems to be
> mostly down to little corner cases that will need to be looked at one
> by one. A significant number of the N failures are just incorrect
> structures. Boron is first row, but complicated.

Feel free to point me to typical boron compounds for which I can define proper atom types, including hybridization, charge, number of lone pairs, number of bonded neighbors, and the number of electrons it can contribute to a delocalized system (e.g. for aromaticity detection).

> The H failures I think are all in three center bonds with boron.

Indeed. I have not worked out yet how to deal with multi-center bonds... no 'convention' has yet been defined in the CDK.

> Hexafluorophosphate and
> perchlorate account for a bunch and may be worth taking a look at.

Here too, please point me to relevant structures. In the past, I have been looking at MolBase for sanity checking:

http://winter.group.shef.ac.uk/molbase/

What I would like to do even more is link each CDK atom type to a paper describing the crystal structure where the atom type is found (preferably the first record of that type), but I have not gotten around to that.
> Of
> course at some point I'll have to look closely at whether some of the
> failures and some of the successes are due to failures in
> parameterization, but overall I'm very satisfied with the results.

Really happy to hear that! Most recent bug reports have been for missing atom types, e.g. for platinum:

http://sourceforge.net/tracker/?func=detail&aid=2966490&group_id=20024&atid=320024

> As I said, I still need to load this into PubChem and our website. I'll
> post here when that is finished. It will probably be a week or so. If
> anybody is raring to go and wants to get something sooner, let me know
> and I can post subsets as SD files.

No rush from my side, but feel free to post the more frequently occurring failures that have simple and clear atom types; then I can try to get those fixed before 1.2.6...

Egon
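The advice in this reply (run atom typing first, and signal a warning when any atom is not recognized) can be sketched as a small guard function. This is a toy model only: `KNOWN_TYPES`, `find_type`, and `untyped_atoms` are hypothetical, the lookup key (element, formal charge, neighbour count) is a deliberate simplification of real atom type perception, and nothing here is CDK API. The only known type in the table is taken from the C.sp3 record quoted earlier in the thread.

```python
import warnings

# Hypothetical table keyed on (element, formal charge, neighbour count).
# Real perception also looks at hybridization, bond orders, and more.
KNOWN_TYPES = {("C", 0, 4): "C.sp3"}


def find_type(atom):
    """Return the type name for an atom dict, or None if it is unknown."""
    key = (atom["element"], atom["charge"], atom["neighbours"])
    return KNOWN_TYPES.get(key)


def untyped_atoms(atoms):
    """Return indices of atoms without a type, warning if there are any."""
    missing = [i for i, atom in enumerate(atoms) if find_type(atom) is None]
    if missing:
        warnings.warn(
            f"{len(missing)} atom(s) without a recognized atom type "
            f"at indices {missing}; downstream results are unreliable"
        )
    return missing


# A carbon with four neighbours, and a rhodium bonded to four phosphorus
# atoms as in the SMILES that started the thread.
atoms = [
    {"element": "C", "charge": 0, "neighbours": 4},
    {"element": "Rh", "charge": 0, "neighbours": 4},
]
flagged = untyped_atoms(atoms)
```

Returning the indices as well as warning keeps the two uses mentioned in the thread separate: the list can be stored to mark a compound's calculations as unreliable, while the warning helps with debugging the structure itself.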