From: richard a. <ric...@ya...> - 2007-10-01 21:50:30
|
--- Chris Morley <c.m...@ds...> wrote:
> richard apodaca wrote:
> > --- Craig James <cra...@em...> wrote:
> >
> >> richard apodaca wrote:
> >>> The same applies when discussing aromaticity in SMILES. The SMILES valence model is just too simplistic to be of much use other than as a notational shorthand.
> >> The SMILES definition of aromaticity is specifically to handle this problem, to make it so that there is a single canonical way to represent a ring system.
> >>
> >> The hydrogen business is just a side effect.
> >
> > Yes, it's clear how lower case atom labels simplify canonicalization.
> >
> > But this doesn't, on face value, exclude the interpretation I put forward, which is (from the perspective of a SMILES parser):
> >
> > "if you see a lower case atom symbol, subtract one from the implicit hydrogen count you would otherwise assign."
> >
> > or from the perspective of a SMILES writer:
> >
> > "use lower case atom symbols to indicate a reduction by one in the number of implicit hydrogen atoms that would otherwise be represented."
> >
> > These two simple definitions encompass all uses of the lower case SMILES atom notation I'm aware of. And they bypass the sticky problem of trying to work out what aromaticity really means. (We don't really want to go there, do we?)
> >
> > Or am I missing something?
> >
> > So, these would all be legal under this simplified definition:
> >
> > Ccc - propylene
> > c1ccc1 - cyclobutadiene
> > c1ccC#Cc1 - benzyne
> > co - formaldehyde
> > cccc - butadiene
>
> Of course, this reduced hydrogen count interpretation ties in nicely with the use of "c" as a radical center.
> Cc - ethyl radical
> ccc - allyl radical
>
> But maybe you want to represent furan as c1cocc1 where the o is not H-deficient (Daylight Depict accepts this, as well as c1cOcc1). But then you might write pyrrole as c1cncc1 (which it does not; it accepts c1c[nH]cc1 and c1c[NH]cc1). I think all these forms should be acceptable, since they are unambiguous and are the sort of structure likely to be written by people without decades of experience in cheminformatics. I agree with Geoff: read many variants, write a recommended form. Conversion tools like OpenBabel would then be the Tidy application previously mentioned.

I hadn't thought of that. Using the simplified rule for lower-case atom labels eliminates the need to write the brackets around pyrrole/indole nitrogen (c1cncc1 is acceptable). I like the idea of c1cocc1 and c1cOcc1 both being acceptable for furan as well (the underlying assumption being that reduction in implicit hydrogen count can never lead to a negative number). And the ability to simply represent radicals is an added bonus (e.g. TEMPO): CCo - ethoxide radical (assuming that Co will never be interpreted as cobalt because it's not in brackets).
___________________________________ Richard L. Apodaca http://depth-first.com Blog http://metamolecular.com Company |
From: Andrew D. <da...@da...> - 2007-10-01 21:20:56
|
Not really on ferrocenes but I was looking at some SMILES from the NCI data set, converted by OpenEye. Why? Because seeing a SMILES like [Fe]12345678(C9C1=C2C3=C89)C1C6=C5C4=C71 2033 is just cool. A bit further on down I found B123456B789%10(B%11%12(B%13%14%15B%12%12%16B2%10%11B3%122B%16%133B622B3%14([H]%15)[H]2)[H]8)B23(B68%10B33%11B192B431B%1162B511B28([H]%10)[H]1)[H]7 98976 Organo-metallic chemists are da' bomb! Andrew da...@da... |
From: Chris M. <c.m...@ds...> - 2007-10-01 20:59:53
|
richard apodaca wrote: > --- Craig James <cra...@em...> wrote: > >> richard apodaca wrote: >>> The same applies when discussing aromaticity in >>> SMILES. The SMILES valence model is just too >>> simplistic to be of much use other than as a >>> notational shorthand. >> The SMILES definition of aromaticity is specifically to >> handle this problem, to make it so that there is a >> single canonical way to represent a ring system. >> >> The hydrogen business is just a side effect. > > Yes, it's clear how lower case atom labels simplify > canonicalization. > > But this doesn't, on face value, exclude the > interpretation I put forward, which is (from the > perspective of a SMILES parser): > > "if you see a lower case atom > symbol, subtract one from the implicit hydrogen count > you would otherwise assign." > > or from the perspective of a SMILES writer: > > "use lower case atom symbols to indicate a reduction > by one in the number of implicit hydrogen atoms that > would otherwise be represented." > > These two simple definitions encompass all uses of the > lower case SMILES atom notation I'm aware of. And they > bypass the sticky problem of trying to work out what > aromaticity really means. (We don't really want to go > there, do we?) > > Or am I missing something? > > So, these would all be legal under this simplified > definition: > > Ccc - propylene > c1ccc1 - cyclobutadiene > c1ccC#Cc1 - benzyne > co - formaldehyde > cccc - butadiene Of course, this reduced hydrogen count interpretation ties in nicely with the use of "c" as a radical center. Cc - ethyl radical ccc - allyl radical But maybe you want to represent furan as c1cocc1 where the o is not H-deficient (Daylight Depict accepts this, as well as c1cOcc1). But then you might write pyrrole as c1cncc1 (which it does not; it accepts c1c[nH]cc1 and c1c[NH]cc1). 
I think all these forms should be acceptable, since they are unambiguous and are the sort of structure likely to be written by people without decades of experience in cheminformatics. I agree with Geoff: read many variants, write a recommended form. Conversion tools like OpenBabel would then be the Tidy application previously mentioned. As well as being a compact unambiguous representation, SMILES should also strive to convey information to the human chemist. (An advantage over InChI, which makes less effort.) So it would be a pity if forms like -C(C)(C)(C) for the t-butyl group, which emphasises the equivalence of the methyl groups, were outlawed for what appear to be rather purist reasons. On the input side at least I think pragmatism is more important. Chris |
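For what it's worth, the "read many variants, write a recommended form" idea can be sketched as a toy lookup. This is only an illustration, not how OpenBabel or any other toolkit works: the table `RECOMMENDED_FORM`, the function `tidy`, and the choice of output strings are all made up here, and a real tool would parse the SMILES into a graph rather than match strings.

```python
# Hypothetical table: every accepted spelling maps to one recommended output.
# The "recommended" strings are placeholders chosen for this sketch; they are
# not canonical SMILES produced by any real toolkit.
RECOMMENDED_FORM = {
    # furan spellings mentioned in the thread
    "c1cocc1": "c1ccoc1",
    "c1cOcc1": "c1ccoc1",
    # pyrrole spellings mentioned in the thread
    "c1cncc1": "c1cc[nH]c1",
    "c1c[nH]cc1": "c1cc[nH]c1",
    "c1c[NH]cc1": "c1cc[nH]c1",
}


def tidy(smiles: str) -> str:
    """Accept any known variant, always emit the single recommended form."""
    try:
        return RECOMMENDED_FORM[smiles.strip()]
    except KeyError:
        # Unknown input is rejected rather than guessed at.
        raise ValueError(f"unrecognised SMILES variant: {smiles!r}") from None
```

The point of the design is only that the input side is permissive while the output side has exactly one spelling per structure.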
From: Andrew D. <da...@da...> - 2007-10-01 20:55:23
|
On Oct 1, 2007, at 5:44 PM, richard apodaca wrote: > I remember something quoted, and in turn quoted by > Peter Murray-Rust, to the effect that all there are is > atoms - everything else is imagination. Dave Weininger's definition was "aromatic means that it has a smell." My reinterpretation is that "aromaticity stinks." :) Andrew da...@da... |
From: Craig J. <cra...@em...> - 2007-10-01 20:36:31
|
john van drie wrote: > Years ago, when we were starting BioCAD and implementing one of the first > comm'l non-Daylight SMILES parsers... Just for fun: Who wrote the first one? Who wrote it in the most obscure language? I may be able to claim both: I wrote a SMILES parser in Common LISP in 1984 while at HP Labs. I think it was based on the Daylight manuals for their VMS system. Dave and Art Weininger installed their system at our site, and 'tho we didn't use it much, we really liked SMILES. But this "claim to fame" never saw the commercial light of day, so nobody's heard of it. Craig |
From: Craig J. <cra...@em...> - 2007-10-01 20:25:33
|
Greg Landrum wrote: > I would humbly suggest that there is no way that we're going to be > able to code many organometallic complexes in a way that reasonably > reflects the actual bonding and chemistry without making substantial > changes to SMILES. I mentioned this in passing before, but you really > need in many cases to be able to include some sense of directionality > in the bonds, i.e. there are bonds that affect the valence of the atom > at the end of the bond, but have no impact on the valence of atoms at > the beginning of the bond. > > I think we *could* come up with a form of SMILES to represent > organometallic species (and I have thought about this in the past), > but it's definitely going to require extending the language; my > understanding is that we wanted to avoid that at this point. Then we should start this out as a "best practices" suggestion. If, as time goes by, a clear set of rules emerges that can be stated precisely, and implemented readily, we can turn it into part of the specification at that point. So, what is the best way to represent ferrocenes? What about other metal compounds? Craig |
From: Greg L. <gre...@gm...> - 2007-10-01 18:31:35
|
On 10/1/07, Craig James <cra...@em...> wrote: > john van drie wrote: > > For example, how would you encode ferrocene? > > This is a really good question. One of the biggest messes in cheminformatics is the encoding of organo-metallic complexes. Ferrocene is commonly represented three different ways: with 10 bonds to the Fe, as an ionic (disconnected) structure, and as bivalent c1cccc1-Fe-c1cccc1 (with some charges thrown around to make the cycles aromatic -- forgive my chemistry). I'm sure there are other ways, too. You'll often find the same compound three times in a single database, even from a single source. > > The same problem applies to all sorts of metal complexes: copper, magnesium, zinc, gold, you name it, it's a mess in cheminformatics systems. This goes back to the inadequacy of the valence model itself, but that doesn't mean we can't make some progress. > > One solution would be to have a "best practices" that suggests certain ways to do things. Is there a more formal way we could handle these cases? > I would humbly suggest that there is no way that we're going to be able to code many organometallic complexes in a way that reasonably reflects the actual bonding and chemistry without making substantial changes to SMILES. I mentioned this in passing before, but you really need in many cases to be able to include some sense of directionality in the bonds, i.e. there are bonds that affect the valence of the atom at the end of the bond, but have no impact on the valence of atoms at the beginning of the bond. I think we *could* come up with a form of SMILES to represent organometallic species (and I have thought about this in the past), but it's definitely going to require extending the language; my understanding is that we wanted to avoid that at this point. -greg |
From: richard a. <ric...@ya...> - 2007-10-01 18:14:24
|
--- Egon Willighagen <ego...@gm...> wrote:
> On 10/1/07, richard apodaca <ric...@ya...> wrote:
> > SMILES is what it is. As Craig points out, if that's not enough for the problem at hand, then the developer should choose a more expressive notational system.
>
> FlexMol I presume? :)

Hah! You read my mind. Just joking. Actually, it would be extremely difficult to come up with a line notation that's as expressive as FlexMol. Well, at least difficult for me - I've tried.

> Seriously, this would ask the OpenSMILES specification to clarify which chemistry it can and cannot be used for. That requires accurate descriptions.

I don't quite follow. I'm just saying that SMILES is a simple notational system and that we shouldn't interpret it too literally. If a developer needs literal interpretation of 'aromaticity', then SMILES probably isn't the right tool for the job.

> For example, I like this suggested definition:
>
> - lower case elements can only occur in rings
> - they indicate SMILES aromaticity
> - where SMILES aromaticity is an artificial concept
>
> But all the examples passed around show that converting such SMILES into something with chemical meaning is difficult (e.g. it was shown that cc as 1.5 valence bonds does not work, something I discovered a few years ago, which is why CDK dropped the use of that).

I fully agree. Just look at pyrrole as an example where bond order 1.5 doesn't work.

> Like John observes, a lot of chemistry can be done with SMILES, but certainly not all. That's a shame, because only too many chemoinformatics studies (e.g. in QSAR areas) are expecting SMILES to accurately describe chemistry.
>
> It is therefore of utmost importance that programs can decide that a SMILES is wrong or not. Suggesting to move away from SMILES, suggests a ban on SMILES in many chemoinformatics areas.

Very good point. Whatever comes out of these discussions needs to explicitly tell developers what's acceptable and what isn't. And the only way we can do this is to keep the rules as simple as possible.
___________________________________ Richard L. Apodaca http://depth-first.com Blog http://metamolecular.com Company |
From: richard a. <ric...@ya...> - 2007-10-01 18:00:37
|
--- Craig James <cra...@em...> wrote:
> richard apodaca wrote:
> > The same applies when discussing aromaticity in SMILES. The SMILES valence model is just too simplistic to be of much use other than as a notational shorthand.
> >
> > As a notational shorthand, lower case atom labels are really saying more about implicit hydrogens than they are about electronics.
>
> That's not exactly true. Don't forget that SMILES was invented specifically as a cheminformatics tool, with canonical SMILES its primary purpose. For canonical SMILES, it's necessary to have a single representation for a given molecule, and the existence of the Kekulé form was a problem. The SMILES definition of aromaticity is specifically to handle this problem, to make it so that there is a single canonical way to represent a ring system.
>
> The hydrogen business is just a side effect.

Yes, it's clear how lower case atom labels simplify canonicalization.

But this doesn't, on face value, exclude the interpretation I put forward, which is (from the perspective of a SMILES parser):

"if you see a lower case atom symbol, subtract one from the implicit hydrogen count you would otherwise assign."

or from the perspective of a SMILES writer:

"use lower case atom symbols to indicate a reduction by one in the number of implicit hydrogen atoms that would otherwise be represented."

These two simple definitions encompass all uses of the lower case SMILES atom notation I'm aware of. And they bypass the sticky problem of trying to work out what aromaticity really means. (We don't really want to go there, do we?)

Or am I missing something?

So, these would all be legal under this simplified definition:

Ccc - propylene
c1ccc1 - cyclobutadiene
c1ccC#Cc1 - benzyne
co - formaldehyde
cccc - butadiene

Cheers, Rich
___________________________________ Richard L. Apodaca http://depth-first.com Blog http://metamolecular.com Company |
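The simplified lower-case rule quoted above is easy to try out in code. What follows is only a sketch of that proposal, not of Daylight's or any toolkit's actual behaviour. The table `DEFAULT_VALENCE` and the function `implicit_hydrogens` are names invented for this example, and the valence table covers just C, N and O with a single valence each.

```python
# Assumed default valences for the few unbracketed atoms used in this sketch.
# Real SMILES allows more elements and more than one valence for some of them.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def implicit_hydrogens(symbol: str, bond_order_sum: int) -> int:
    """Implicit H count under the proposed lower-case rule.

    symbol         -- atom symbol exactly as written, e.g. "C" or "c"
    bond_order_sum -- sum of the bond orders written in the SMILES for this
                      atom, counting an unmarked bond as 1
    """
    count = DEFAULT_VALENCE[symbol.upper()] - bond_order_sum
    if symbol.islower():
        count -= 1        # the proposed rule: lower case removes one hydrogen
    return max(count, 0)  # the count is never allowed to go negative


# "Ccc" read atom by atom: CH3, CH, CH2, i.e. propylene.
propylene = [implicit_hydrogens("C", 1),
             implicit_hydrogens("c", 2),
             implicit_hydrogens("c", 1)]
```

Under the same assumptions "co" comes out as CH2 plus a bare O (formaldehyde), and the ring oxygen in c1cocc1 is clamped to zero hydrogens instead of going to minus one.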
From: Geoffrey H. <ge...@ge...> - 2007-10-01 17:55:08
|
On Oct 1, 2007, at 1:49 PM, Craig James wrote: > john van drie wrote: >> For example, how would you encode ferrocene? ... > One solution would be to have a "best practices" that suggests > certain ways to do things. Is there a more formal way we could > handle these cases? Well, as you've said before, if we have a canonical representation, that certainly produces a "best practices" result. A parser can accept multiple inputs, but produces only one "best practices" output. Cheers, -Geoff |
From: Craig J. <cra...@em...> - 2007-10-01 17:47:01
|
john van drie wrote: > For example, how would you encode ferrocene? This is a really good question. One of the biggest messes in cheminformatics is the encoding of organo-metallic complexes. Ferrocene is commonly represented three different ways: with 10 bonds to the Fe, as an ionic (disconnected) structure, and as bivalent c1cccc1-Fe-c1cccc1 (with some charges thrown around to make the cycles aromatic -- forgive my chemistry). I'm sure there are other ways, too. You'll often find the same compound three times in a single database, even from a single source. The same problem applies to all sorts of metal complexes: copper, magnesium, zinc, gold, you name it, it's a mess in cheminformatics systems. This goes back to the inadequacy of the valence model itself, but that doesn't mean we can't make some progress. One solution would be to have a "best practices" that suggests certain ways to do things. Is there a more formal way we could handle these cases? Craig |
From: Craig J. <cra...@em...> - 2007-10-01 17:46:15
|
john van drie wrote:
> Egon W suggested I chime in on this discussion

It's great to have you on board, and thanks for your great analysis of the "sp2 vs. just-plain-aromatic" debate.

> I apologize if I don't follow the right style for such things, and apologize doubly if I sound too dogmatic. I am eager to see the right things happen here.

Quite the contrary, your remarks are on the mark and as far as dogmatism, not at all. I'm pleased to see that this is a remarkably civilized group considering how these things can go.

> In my opinion, it makes no sense to assert that lower-case implies sp2-hybridization. If that were taken literally, you'd be marking all carbonyls, etc. with lower case. ...
>
> I'd assert that 'c' means an atom is regarded as aromatic, and that 'c' applied to exocyclic atoms is nonsensical.

Several others (most notably Andrew) are on your side on this. I'm leaning strongly in this direction, 'tho I started in the "sp2 camp". My guess is that Daylight (Dave) put the "c means sp2" interpretation in to handle c1ccc1, and it was just an amusing side effect that the parser would also parse aliphatic systems.

So the definition of a lowercase letter would be, "This atom is aromatic", and the same for a ':' bond. After the SMILES is parsed, the system must count electrons, determine which default (unspecified) bonds are aromatic and which are single (there have been some great examples here, which I'll probably incorporate as examples in the spec), and if assignment is impossible, reject the SMILES as invalid.

Now the trick is to come up with a formal specification for Hueckel's rule, and a formal specification for which elements are allowed to be aromatic.

> One of the confusions I sense inherent in this discussion is 'what is the intended range of applicability of SMILES?'. If you want to extend it to all of inorganic chemistry, you're in trouble (Dave W went down this road in the late '80's, and was shocked to discover how complex inorg. chem. is).

You guys will have to help me out if we want this sort of text included. What IS the intended range of applicability of SMILES? Andrew points out that this sort of explanatory text should probably be in a separate document from the formal spec itself. But I think it is important. One of the problems with SMILES has been exactly this: No formal statement of the underlying applicability and intended use. Because of this, it's been extended in inappropriate ways, or in conflicting ways, by different parties. Such a statement would also guide this discussion.

> It's interesting to go back to the very first publication on SMILES (Anderson, Weininger and Veith, ~1985). The emphasis is on *simplicity*, and chemist-friendliness. I think it's important to adhere to that original inspiration, even as one tries to codify it rigorously in an open-standard.

Right on! Craig |
From: Egon W. <ego...@gm...> - 2007-10-01 17:05:31
|
On 10/1/07, richard apodaca <ric...@ya...> wrote:
> SMILES is what it is. As Craig points out, if that's not enough for the problem at hand, then the developer should choose a more expressive notational system.

FlexMol I presume? :)

Seriously, this would ask the OpenSMILES specification to clarify which chemistry it can and cannot be used for. That requires accurate descriptions. For example, I like this suggested definition:

- lower case elements can only occur in rings
- they indicate SMILES aromaticity
- where SMILES aromaticity is an artificial concept

But all the examples passed around show that converting such SMILES into something with chemical meaning is difficult (e.g. it was shown that cc as 1.5 valence bonds does not work, something I discovered a few years ago, which is why CDK dropped the use of that).

Like John observes, a lot of chemistry can be done with SMILES, but certainly not all. That's a shame, because only too many chemoinformatics studies (e.g. in QSAR areas) are expecting SMILES to accurately describe chemistry.

It is therefore of utmost importance that programs can decide that a SMILES is wrong or not. Suggesting to move away from SMILES, suggests a ban on SMILES in many chemoinformatics areas.

Egon -- ---- http://chem-bla-ics.blogspot.com/ |
From: Craig J. <cra...@em...> - 2007-10-01 16:54:13
|
> I remember something quoted, and in turn quoted by > Peter Murray-Rust, to the effect that all there are is > atoms - everything else is imagination. Actually that was something I put into the Daylight SMILES Theory Manual a long time ago: There are atoms and space. Everything else is opinion. -- Democritus Which is remarkable when you consider that Democritus lived 460 BC - ca 370 BC, and the Greek Rationalist Philosophers were only beginning to formulate the idea that there was some indivisible unit of matter. It expresses a profound truth, which is also captured by the great philosopher and mathematician Descartes: Don't confuse the map and the terrain. --- René Descartes In other words: When you're using a model to represent something, never forget that it's just a model of reality, and it's not perfect. Craig |
From: john v. d. <joh...@mi...> - 2007-10-01 16:40:08
|
Egon W suggested I chime in on this discussion around whether 'c' means aromatic or sp2-hybridization (or other things...). To me, this is a simple matter, and I was surprised first to see the website at the uni-koeln that we've all seen, and I'm doubly surprised to see how much attention has already been paid on this discussion group.

First, I applaud this effort by Craig James to develop an open standard for SMILES. This is a great idea, and I'm sure it'll take some cycling around to get everything right. I think this was Dave W's original intention 20 years ago, and somehow we all got sidetracked. This is definitely something whose time has come.

In my opinion, it makes no sense to assert that lower-case implies sp2-hybridization. If that were taken literally, you'd be marking all carbonyls, etc. with lower case. How many times have you seen aspirin written oc(=o)c1ccccc1Oc(=o)C?? Never, I'd bet. This discussion group just touched the tip of the iceberg in terms of the problems inherent in the idea that 'c' ==> sp2-hybridization.

I'd assert that 'c' means an atom is regarded as aromatic, and that 'c' applied to exocyclic atoms is nonsensical. (If someone wants to add additional meanings, like it being a radical, that's a totally new kettle of fish). It should be kept in mind that aromaticity is not a binary thing in chemistry - traditionally, it's a measure of the reactivity of a center; with NMR, one can also define aromaticity in terms of certain ranges of downfield shifts indicating aromaticity. All of these are inherently continuously-variable, and one is approximating it in this binary way by using lower-case.

Years ago, when we were starting BioCAD and implementing one of the first comm'l non-Daylight SMILES parsers, I had to dig into this, and I discovered that the chemists' notion of aromaticity is imprecise. As this discussion group observed, different packages will define aromaticity in different ways. But, in the spirit of getting the most-frequently encountered things right, it's straightforward to do this. If faced with a cycle of sp2-hybridized atoms, and the count of pi-electrons in that cycle obeys the Hückel 4n+2 rule, then it's aromatic. One can define the 95% of the cases in case logic, to minimize the amount of 'chemistry' inherent in a parser. (I'd bet the Openeye people and the Tripos people are doing something like that, with slightly different lists in the case logic). You'll always be able to find unusual cases that defeat any such case logic (e.g. the mol's who ought to be aromatic by any topological analysis, but whose 3D structure breaks the planarity and hence the aromaticity). I wish I still had my case logic from Catalyst's SMILES parser, but do not . . .

One of the confusions I sense inherent in this discussion is 'what is the intended range of applicability of SMILES?'. If you want to extend it to all of inorganic chemistry, you're in trouble (Dave W went down this road in the late '80's, and was shocked to discover how complex inorg. chem. is). For example, how would you encode ferrocene? Each carbon atom there is aromatic, each carbon-Fe bond is equivalent, Fe makes a total of 10 bonds, etc. You don't want to go there, IMHO. I'd assert that you want SMILES to cover the space of organic molecules that appear in the course of drug discovery. (Admittedly, there is ONE drug that is an inorganic complex, cis-platin, but I think you can set those aside).

It looks to me like the main 'evidence' in support of the 'c' ==> sp2 idea is that typing 'cc' into DEPICT gives ethylene. This is pretty thin. DEPICT has always been pretty promiscuous and loose. Something I think this discussion group needs to clarify a bit more is the notion that there are (at least) three levels of SMILES. Taking as an example, pyridine,

- what one is allowed to type in: n1ccccc1, N1C=CC=C=C=1, etc.
- how things are, in effect, stored internally, all equivalent: n1ccccc1, c1ncccc1, c1cnccc1, etc.
- canonical SMILES, unique encoding in db: n1ccccc1

My guess is that if you think this thru carefully, you'll find the "rule" 'c' ==> sp2 is very difficult to make into a fully consistent system. I think the rules 'c' ==> aromatic and exocyclic 'c' ==> verboten are straightforward to turn into a comprehensive system.

It's interesting to go back to the very first publication on SMILES (Anderson, Weininger and Veith, ~1985). The emphasis is on *simplicity*, and chemist-friendliness. I think it's important to adhere to that original inspiration, even as one tries to codify it rigorously in an open-standard.

John Van Drie

p.s. This is my first post ever to a discussion group. I apologize if I don't follow the right style for such things, and apologize doubly if I sound too dogmatic. I am eager to see the right things happen here. |
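The "case logic plus 4n+2" approach described in this message might look roughly like the sketch below. It is emphatically not the lost Catalyst case logic. The table `PI_CONTRIBUTION` and both function names are assumptions made for illustration, covering only a handful of common ring atoms, and the code takes for granted that every atom passed in is already known to be a planar sp2 ring member.

```python
# Assumed pi-electron contribution of each ring atom, keyed by how the atom
# is written. A real parser would need a much longer list of cases.
PI_CONTRIBUTION = {
    "c": 1,     # ring carbon contributing one p electron
    "n": 1,     # pyridine-type nitrogen
    "[nH]": 2,  # pyrrole-type nitrogen donating its lone pair
    "o": 2,     # furan-type oxygen
    "s": 2,     # thiophene-type sulfur
}


def obeys_huckel(pi_electrons: int) -> bool:
    """True when the count has the form 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


def ring_is_aromatic(ring_atoms) -> bool:
    """Sum the tabulated contributions around one ring and apply 4n + 2."""
    return obeys_huckel(sum(PI_CONTRIBUTION[atom] for atom in ring_atoms))


benzene = ["c"] * 6
cyclobutadiene = ["c"] * 4
pyrrole = ["c", "c", "[nH]", "c", "c"]
```

With these assumptions benzene and pyrrole both total six pi electrons and pass, while cyclobutadiene totals four and fails. Anything non-planar or outside the table defeats the sketch, which is exactly the limitation the message points out.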
From: richard a. <ric...@ya...> - 2007-10-01 16:38:18
|
--- Andrew Dalke <da...@da...> wrote:
> On Oct 1, 2007, at 6:52 AM, Craig James wrote:
> > It's hard to respond to all of the points re: aromaticity. So let me propose this: As I see it, there are two (and only two) current ideas under consideration.
> >
> > 1. 'c' (or any lowercase) means sp2, wherever it occurs.
> >
> > 2. 'c' (or any lowercase) means aromatic, period.
> ..
> > I think I wrote a decent definition of #1 above in the current draft of the specification. The right way to approach #2 is for someone to write an equally clear definition.

As a card-carrying synthetic organic chemist, I'll take a swing at this one.

I remember something quoted, and in turn quoted by Peter Murray-Rust, to the effect that all there are is atoms - everything else is imagination. Aromaticity is one of those more or less imaginary concepts that gets more debatable as the definition gets more general.

Synthetic chemists have one model of aromaticity relating mainly to reactivity. Inorganic chemists have their ideas of aromaticity. Theoreticians have a few other ideas with active debate still taking place. Computer scientists have their models. All are valid to the extent that they solve practical problems in specific problem domains. But no definition would satisfy all groups.

The same applies when discussing aromaticity in SMILES. The SMILES valence model is just too simplistic to be of much use other than as a notational shorthand. As a notational shorthand, lower case atom labels are really saying more about implicit hydrogens than they are about electronics. The rule basically says:

"if you see a lower case atom symbol, subtract one from the implicit hydrogen count you would otherwise assign."

Counterexamples? I can't think of any, but they may exist. If this rule causes problems for authors of SMILES parsing software who want to expound on electronic structure, it's not something that SMILES itself can do much about. SMILES is what it is. As Craig points out, if that's not enough for the problem at hand, then the developer should choose a more expressive notational system.

Cheers, Rich
___________________________________ Richard L. Apodaca http://depth-first.com Blog http://metamolecular.com Company |
From: Greg L. <gre...@gm...> - 2007-10-01 16:29:37
|
On 10/1/07, Egon Willighagen <ego...@gm...> wrote: > On 10/1/07, Greg Landrum <gre...@gm...> wrote: > > > > Since I seem to be the squeakiest wheel, I will take a swing at coming > > up with a definition and post it here for critic^H^H^H^H^H^Hcomments. > > I will try to have something posted either tonight or tomorrow > > morning. > > Maybe just wait a couple of days... Recently I had a discussion on > this issue with someone who might have written one of the first > implementations of a SMILES parser, who has been talking things > through with the designers of SMILES... he should be registered to > this list by now, and said to comment on the sp2/arom issue very soon. I will happily wait and gather more data as the discussion continues. :-) > BTW, unlike what I might have suggested earlier, this SMILES > specification can best be a solid formulation of the current de facto > SMILES standard. A 'better' would be very useful, but outside the > scope of what we should be doing now. It would be good to clean up > what we currently have; and I also think that backward compatibility > is of strong importance. I agree it's important, but I think that there are so many extensions out there masquerading as SMILES that it's going to be nigh-upon impossible to be backwards compatible with everyone. Particularly since most of those extensions probably aren't formally defined. ick! -greg |
From: Greg L. <gre...@gm...> - 2007-10-01 16:21:26
|
On 10/1/07, Geoffrey Hutchison <ge...@ge...> wrote: > > I'd generally agree, but with one reminder. InChI formally defines > that the InChI=1 is part of the specifier itself. That's how they did > it, and it's going to stay that way. > > OTOH, the general consensus on this list seems to be "let's document > and standardize how SMILES are used." Clearly the cat's out of the > bag -- SMILES aren't used with SMILES=C1CCC1. Phrased this way, I agree. I still think it would be a useful property for a parser to support, and something that should be a component of any new format. -greg |
From: Geoffrey H. <ge...@ge...> - 2007-10-01 15:09:24
|
On Oct 1, 2007, at 12:41 AM, Craig James wrote: > But I still say this is a "higher layer" wrapper around SMILES. > The "micro formats" (which are being discussed elsewhere in Blue > Obelisk) should specify in general how to say the string type. We > shouldn't have SMILES use "SMILESv1=c1ccccc1" and InChI use "InChI: > 1/C6H6/c1-2-4-6-5-3-1/h1-6H". I'd generally agree, but with one reminder. InChI formally defines that the InChI=1 is part of the specifier itself. That's how they did it, and it's going to stay that way. OTOH, the general consensus on this list seems to be "let's document and standardize how SMILES are used." Clearly the cat's out of the bag -- SMILES aren't used with SMILES=C1CCC1. So for SMILES, this is clearly a "higher layer" around SMILES or any similar format. Even with InChI, it's not always clear how to incorporate it into a webpage without wrapping, etc. See the various discussions in the Blue Obelisk and InChI archives. Cheers, -Geoff |
From: Craig J. <cra...@em...> - 2007-10-01 14:58:37
|
Andrew Dalke wrote: >>> In the HTML world, cleanup is done with various tools, the >>> most common of which is HTML Tidy. Why not do the same here; >>> provide a cleanup-your-SMILES tool, as a library and command-line >>> program, which others can use. >> Good idea. But that's not really part of the SMILES specification, >> it's >> something that CDK, OB, etc. could provide. > > I agree. But I also think that the "relaxed rules" themselves > are outside of the specification. I tend to agree. At the same time, part of what I want to achieve is to put together "everything that's known about SMILES" in one place. Maybe it should be an "Open SMILES home" web site with all sorts of articles, including this chapter but as a separate article. Craig |
From: Craig J. <cra...@em...> - 2007-10-01 14:49:17
|
Andrew Dalke wrote: > On Oct 1, 2007, at 6:59 AM, Craig James wrote: > >> Greg Landrum wrote: >>> I do not believe so. I think lowercase letters should indicate >>> conjugated pi systems that are either aromatic or anti-aromatic >>> (though I would not canonicalize an anti-aromatic system to use >>> lower-case letters, but that's a different story). >> The reason anti-aromatic systems use lowercase letters is because >> otherwise there are two valid canonical SMILES for the same >> structure, C1=CC=C1 and C=1C=CC=1. Or to make it more obvious, put >> a few substituents on it. > ... > But in this case the structure is > > C = C > | | > C = C > > All four atoms are in the same equivalency graph so pick one > arbitrarily. There are two bonds, to atoms in the same equivalency > graph so pick the double bond over the single as the atom to > be next in the output, producing a canonical structure of: > > C1=CC=C1 > > There's no need for lowercase here to produce a canonical output. I didn't state the case properly. Since all four atoms and bonds are equivalent, a normalized SMILES that uses single/double bonds won't match a SMARTS that chooses the alternate form. Forget about whether it's "aromatic" or not. The idea was to create a datamodel and language that recognized the equivalence of certain atoms and bonds so that canonicalization and SMARTS matching work in a sensible way. Craig |
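The matching problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any toolkit works: the ring is modelled as a bare tuple of bond orders in a fixed atom order, and the helper names `kekule_forms` and `delocalized` are hypothetical.

```python
# Toy model (hypothetical, not from any toolkit): a ring is a tuple of bond
# orders, listed in a fixed atom order around the ring.

def kekule_forms(ring_size):
    """Return the two alternating single/double assignments for an even ring."""
    form_a = tuple(2 if i % 2 == 0 else 1 for i in range(ring_size))
    form_b = tuple(1 if i % 2 == 0 else 2 for i in range(ring_size))
    return form_a, form_b


def delocalized(bond_orders):
    """Replace every bond of a fully alternating even ring with the label 'a'.

    Rings that do not alternate single/double all the way round are
    returned unchanged.
    """
    n = len(bond_orders)
    alternating = n % 2 == 0 and all(
        {bond_orders[i], bond_orders[(i + 1) % n]} == {1, 2} for i in range(n)
    )
    return tuple("a" for _ in bond_orders) if alternating else tuple(bond_orders)


# Cyclobutadiene written as C1=CC=C1 versus C=1C=CC=1.
form_a, form_b = kekule_forms(4)
print(form_a == form_b)                            # bond-by-bond comparison
print(delocalized(form_a) == delocalized(form_b))  # comparison after collapsing
```

A bond-by-bond comparison treats the two Kekule forms as different, while the collapsed labels compare equal, which is the equivalence the lowercase notation is meant to capture regardless of whether the ring is called aromatic.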
From: Andrew D. <da...@da...> - 2007-10-01 13:43:07
|
BTW: http://www.eyesopen.com/docs/release_notes/oechem-1.3.3_releasenotes.txt > 2. Improved support from aromatic boron and aromatic silicon in > OEKekulize. The OEChem toolkit currently doesn't perceive > either > boron or silicon to be aromatic (with any aromaticity > model), but > this enhancement allows us to Kekulize structures so specified. > 3. Added improved support of parsing SMILES containing aromatic > boron > and aromatic silicon, allowing the OEChem toolkit to parse > ``b1ccccc1" (borinine). They don't document it on their SMILES page but they do (apparently) support it. Andrew da...@da... |
From: Andrew D. <da...@da...> - 2007-10-01 13:34:06
|
On Oct 1, 2007, at 9:20 AM, Craig James wrote: > No, that's not a normalized SMILES. The normalized SMILES is the > aromatic form, which does match itself. That's why the word > "normalized" is important. Yes, I see that now in the spec. I still don't think that the concept "normalized" is generally useful. If I want the result in a special form I've got code to do it. > Are you saying that with OpenEye, you can have the same > SMILES mean different things? That doesn't seem like a > good idea. If the OpenSMILES definition is clear, then > the datamodel shouldn't matter. I'm saying that different chemistry models for the same electronic structure give different SMILES. Different chemistry systems, for example, have different ways of defining aromatic. http://www.eyesopen.com/docs/html/pyprog/TriposAtomTyping.html > This function sets the integer atom type field of each atom in a > molecule to its Tripos atom type. This function typically requires > OEAssignAromaticFlags has been called to perceive aromaticity using > the Tripos aromaticity model, i.e. OEAssignAromaticFlags > (mol,OEAroModelTripos). This is required as Tripos considers > compounds such as pyrrole, to be aliphatic and so the carbon atoms > of pyrrole should be correctly typed as "C.2" not "C.ar". Using the > Tripos aromaticity model, however, is not a strong requirement and > other aromaticity models can be used (for example if C.ar is > desired in pyrrole). http://www.eyesopen.com/docs/html/api/OEAssignAromaticFlags.html > void OEAssignAromaticFlags(OEMolBase &mol, > const OEAroModel model = OEAroModelOpenEye, > bool clearflags = true, > unsigned int maxpath = 0, > bool prune = false) > Determine the aromatic atoms and bonds of a molecule. The > aromaticity model to be used is specified by the ``model'' > parameter, which defaults to the OpenEye model of aromaticity.
> Other predefined aromaticity models provided by OEChem include > OEAroModelDaylight, OEAroModelTripos, OEAroModelMMFF and > OEAroModelMDL that represent the Daylight, Tripos, MMFF and MDL > definitions respectively. The ``clearflags'' parameter is used to > specify whether this call needs to clear the aromaticity flags > first, using OEClearAromaticFlags. Newly created molecules that > have not had their aromaticity assigned yet can specify false, for > a very small performance advantage. > > The ``maxpath'' parameter allows the user to specify the maximum > path length to consider an aromatic cycle, or zero (the default) to > specify no upper bound on aromatic cycle length. Some formal models > of aromaticity use the value six, limiting aromaticity to six > membered rings like benzene or pyridine. > > The ``prune'' parameter is used to specify whether or not to run a > post-processing step to consider rings with exo-double bonds as not > aromatic. This is also required by some formal models of aromaticity. >> I'm going to be a puritan here and say that aromatic is >> a boolean property of atoms and bonds. It has no deeper >> meaning. If a SMILES parser parses "ccC" then it has read >> 3 atoms, with two bonds. The atoms are two aromatic carbons >> and an aliphatic carbon. And that's all it means. > > But it's not a valid SMILES by anyone's definition. Including mine. I wasn't thinking right. The valences don't work right. > I just can't see the usefulness of this. The goal is to > communicate chemical structure unambiguously. Aren't you > throwing that out? If you let the chemistry model be defined > by the application, how can you communicate information? Apparently OpenEye does a good job of it. > The whole point of the original SMILES definition was to pick a > particular model of chemistry, and define a line notation that > represented that model.
If the SMILES model of chemistry doesn't > solve a particular problem, then use a different notation that does. The xyz file format, for QM, is simple. Element, x, y, z. Everything else is deducible from that, and the valence bond model is simply an approximation. We want the SMILES structure to be close enough to the real system that people can reasonably infer the real system from the SMILES. But there are other SMILES structures which are close to that same real system. They are perhaps not graph-theoretically identical, and they make different canonical structures, but a chemist would say they are close, as would nature. So we just need to define the chemistry enough that a chemist can agree that it's right enough, and enough that software can convert to the preferred representations used by other valence-model-based chemistry perception tools. In other words, not right, just right enough. That's what I think OpenEye does, and they do it very right indeed. That's also probably why they put in the extra "-" bonds that you don't think are needed. This makes it easier for the other perception codes to figure out if a bond is supposed to be aromatic or not. Perhaps think of it more as a style. You represent this substructure as aromatic, but I don't. As long as I know how to get from the SMILES representation (which encodes both configuration and style) you gave into one I like, we're both happy. I think the way OpenEye has done it, which separates data structure from chemistry model, is right. Yes, the data structure represents a valence model, but such details as the details of aromaticity are not part of that data structure. Here's a data point: http://www.eyesopen.com/docs/cplusprog_1_2/node90.html > The ``aromatic'' property of a bond is a boolean used to > denote whether the bond has been determined to be a member > of an aromatic ring/cycle. The default value is false. That's along the lines of what I want to do ('cause I got the idea from them).
I just don't know how it's implemented when parsing certain structures with mixed aromatic/non-aromatic bonds. But OpenEye knows. > I think you're throwing out a key feature of SMILES: Aromaticity > is deduced from the electronic configuration, and is independent > of whether one of various Kekulé or aromatic forms were used in > the SMILES string. In addition, the SMILES parser will reject > SMILES that can't possibly be real molecules. Given how many times I've made [1234C] and [C+100] and C(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)C and other highly non-real molecules, I know this isn't correct. "***" is also a non-real-molecule SMILES. Kekulé is an approximation. Aromatic is an approximation. I can go from Kekulé to aromatic and back. It's just some bit of graph algorithm. In parsing a SMILES string I just want to know if you thought something was aromatic. If I think you have the wrong aromatic model I figure I can always reperceive the structure - convert it to Kekulé and use my code to go back again. Andrew da...@da... |
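The "aromatic is just a boolean" reading argued for above can be made concrete with a toy reader. This is a minimal sketch under loud assumptions: it handles only unbranched chains of one-letter organic-subset symbols (no rings, branches, brackets, or bond symbols), and the names `Atom` and `parse_chain` are hypothetical. It performs no valence or aromaticity checking at all, which is exactly the point being debated.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    element: str    # element symbol, always stored in upper case
    aromatic: bool  # True only because the symbol was written in lower case


def parse_chain(smiles):
    """Read an unbranched chain of one-letter atoms (hypothetical toy parser)."""
    allowed = set("BCNOPSFI") | set("bcnops")
    atoms = []
    for ch in smiles:
        if ch not in allowed:
            raise ValueError(f"unsupported symbol: {ch!r}")
        # The flag records what the writer claimed; nothing is perceived here.
        atoms.append(Atom(element=ch.upper(), aromatic=ch.islower()))
    # Adjacent atoms in the string are bonded; bond type is not modelled.
    bonds = [(i, i + 1) for i in range(len(atoms) - 1)]
    return atoms, bonds


atoms, bonds = parse_chain("ccC")
print([(a.element, a.aromatic) for a in atoms])
print(bonds)
```

For "ccC" this yields three atoms and two bonds, with the first two carbons flagged aromatic. Whether the string is chemically valid is left to a later perception step, which could re-Kekulize and reassign the flags under a different aromaticity model.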
From: Egon W. <ego...@gm...> - 2007-10-01 13:03:42
|
On 10/1/07, Andrew Dalke <da...@da...> wrote: > > I put this in as another "placeholder" to point out an ambiguity in > > the interpretation of SMILES: What does it mean if an atom doesn't > > have an isotopic spec? Does it mean: > > > > - The most abundant isotope (e.g. "C" means "[12CH4]")? > > - The naturally occurring isotopic ratios as measured on Earth? > > - Unspecified isotopes? > > > > My vote is for #2, which I think is what most chemists would > > expect. So we would change the document accordingly. > > While I think it means "undefined", which is #3. For these bits of the OpenSMILES specification it would be good to use the Blue Obelisk Data Repository [1]. Egon 1. DOI:10.1021/ci050400b -- ---- http://chem-bla-ics.blogspot.com/ |
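The practical difference between the three readings of an unlabelled atom shows up as soon as a mass is computed. The sketch below is only an illustration: the function name `carbon_mass` and the interpretation labels are hypothetical, and the two constants are hard-coded here as assumptions (the mass of carbon-12 is 12 u by definition; 12.011 is the abridged standard atomic weight of carbon). A real implementation would read such values from a data source such as the repository suggested above.

```python
# Assumed constants, hard-coded for illustration only.
C12_MASS = 12.0             # mass of carbon-12 in u, exact by definition
C_STANDARD_WEIGHT = 12.011  # abridged standard atomic weight of carbon


def carbon_mass(interpretation):
    """Mass assigned to an unlabelled 'C' under each reading (hypothetical)."""
    if interpretation == "most_abundant":      # reading 1: "C" means 12C
        return C12_MASS
    if interpretation == "natural_abundance":  # reading 2: terrestrial average
        return C_STANDARD_WEIGHT
    if interpretation == "unspecified":        # reading 3: isotope undefined
        return None
    raise ValueError(f"unknown interpretation: {interpretation!r}")


for name in ("most_abundant", "natural_abundance", "unspecified"):
    print(name, carbon_mass(name))
```

Under the third reading there is no single mass to report, so the sketch returns `None` and leaves the choice to the caller.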
From: Andrew D. <da...@da...> - 2007-10-01 12:58:32
|
On Oct 1, 2007, at 9:04 AM, Craig James wrote: > The point of this section is to make it clear that strict SMILES > should > be the default. A lot of bad SMILES are out there because of bad > SMILES parsers. But it's not the job of the spec to say how the spec should be broken, is it? You're going to get, and you've gotten, a lot of input from people who think various bits of almost-SMILES should be considered as valid. And without good grammar rules you'll have different implementations of the relaxed rules interpret the relaxed rules in different ways. >> In the HTML world, cleanup is done with various tools, the >> most common of which is HTML Tidy. Why not do the same here; >> provide a cleanup-your-SMILES tool, as a library and command-line >> program, which others can use. > > Good idea. But that's not really part of the SMILES specification, > it's > something that CDK, OB, etc. could provide. I agree. But I also think that the "relaxed rules" themselves are outside of the specification. >>> 6.2 Quadruple Bonds >> I would put this under the section "bonds". > > It's not part of the current, widely accepted definition of > SMILES. Shouldn't it be a proposed extension until everyone agrees > that it's a good idea? Like C.1CCCC.1 ? :) I mean that I would place the proposed comment in the section about bonds, rather than later on in the text. But that's an organization point that I don't care strongly about. >>> 6.2 Polymers and Crystals >> You have a numbering problem - there are two "6.2"s. >> Regarding this one, I ask that it not be accepted. >> Handling SMARTS matching against such constructs, >> doing visualization, etc. will be very hard. > > Ok, that's why it's in there, to get comments. > > Not only that, but how do you figure out that c&1&2&3 is the same > thing as "c1&1c&2c&3c&4c&5c&6" -- which is the correct canonical > SMILES? But it's been proposed, so I thought I'd throw it in for > completeness.
> > Representing polymers and crystals is important, and currently > SMILES has no way to do it. If we reject this proposal, is there > another proposal that serves the need? mmCIF? That's what the xtal people use. CML? Something else? From my memory of dealing with crystal structures from the PDB, I would rather simply not worry about it. SMILES doesn't deal with 3D. SMILES doesn't deal with infinite structures. (What's the molecular weight of c&1&2&3 ? Assuming all C12 ;) SMILES also doesn't deal with percentages, and polymers would be "25% X, 38%Y ... ". Andrew da...@da... |