From: Rajarshi G. <rx...@ps...> - 2005-07-12 20:39:10
Hi,

I was trying to calculate the AtomHybridizationDescriptor for the following molecule:

dan002.sdf
  MOE2004           3D

  5  4  0  0  0  0  0  0  0  0999 V2000
   -1.7000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
    0.0500    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.5500    0.4580    1.1740 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.5500   -1.2450   -0.1900 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.5500    0.7880   -0.9830 F   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  2  4  1  0  0  0  0
  2  5  1  0  0  0  0
M  END
$$$$

In the code for the descriptor there are two lines:

    atm = new HybridizationStateATMatcher();
    matched = atm.findMatchingAtomType(container, atom);

The problem is that for the Cl atom, matched is null. The CDK debug log contains:

    My ATOM TYPE Cl 1.0 1.0 1
    0 ATOM TYPE 1.0 1.0 0
    1 ATOM TYPE 0.0 0.0 0

(formatted for ease of reading).

The first line indicates the symbol, bond order sum, max bond order and connected atom count for the atom in question, and the next two lines indicate the possible atom types from the config file.

The data file for hybridization atom types contains two possible types for Cl: one for neutral Cl (connected by a single bond to some other atom) and one for an anionic Cl. Clearly, the Cl in the above molecule matches the first type. However, the code in HybridizationStateATMatcher returns a successful match only when the bond order sum, max bond order *and* the neighbor count all match. The neighbor count is obtained by calling getFormalNeighborCount() of the AtomType object returned by AtomTypeFactory.getInstance().

My questions are:

1) Why do we need the neighbor count if the bond order sum and max bond order match?

2) The data file, hybridization_atomtypes.xml, does not contain any neighbor count information (and hence the last value is 0 in the debug output for stored atom types for Cl). Where would this be set for these atom types? Or is it calculated from the max bond order and bond order sum? In which case, why require matching neighbor counts?

Thanks,

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
Eureka!
-- Archimedes
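The matching rule described in this message can be reproduced in isolation. What follows is a minimal sketch under stated assumptions, not the CDK source: `StrictMatchSketch`, `StoredType`, `findStrict`, `findIgnoringNeighbors` and the labels `Cl.neutral` and `Cl.anion` are hypothetical names chosen for illustration. The numeric values are the ones from the debug log, with the stored neighbor count left at 0 because the data file does not set it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the matching rule discussed above; not the CDK API.
class StrictMatchSketch {

    /** One candidate atom type as it would be read from the data file. */
    static final class StoredType {
        final String name;
        final double bondOrderSum;
        final double maxBondOrder;
        final int neighborCount; // stays 0 when the file does not specify it

        StoredType(String name, double bondOrderSum, double maxBondOrder, int neighborCount) {
            this.name = name;
            this.bondOrderSum = bondOrderSum;
            this.maxBondOrder = maxBondOrder;
            this.neighborCount = neighborCount;
        }
    }

    /** The two stored Cl types, using the values shown in the debug log. */
    static List<StoredType> chlorineTypes() {
        return Arrays.asList(
                new StoredType("Cl.neutral", 1.0, 1.0, 0),
                new StoredType("Cl.anion", 0.0, 0.0, 0));
    }

    /** Requires bond order sum, max bond order and neighbor count to agree. */
    static StoredType findStrict(double bondOrderSum, double maxBondOrder,
                                 int neighborCount, List<StoredType> types) {
        for (StoredType type : types) {
            if (type.bondOrderSum == bondOrderSum
                    && type.maxBondOrder == maxBondOrder
                    && type.neighborCount == neighborCount) {
                return type;
            }
        }
        return null; // no stored type agrees on all three properties
    }

    /** Same comparison with the neighbor count check removed. */
    static StoredType findIgnoringNeighbors(double bondOrderSum, double maxBondOrder,
                                            List<StoredType> types) {
        for (StoredType type : types) {
            if (type.bondOrderSum == bondOrderSum && type.maxBondOrder == maxBondOrder) {
                return type;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<StoredType> types = chlorineTypes();
        // The Cl atom from the log: bond order sum 1.0, max bond order 1.0, one neighbor.
        StoredType strict = findStrict(1.0, 1.0, 1, types);
        StoredType relaxed = findIgnoringNeighbors(1.0, 1.0, types);
        System.out.println("strict:  " + (strict == null ? "null" : strict.name));
        System.out.println("relaxed: " + (relaxed == null ? "null" : relaxed.name));
    }
}
```

With these inputs the strict lookup finds nothing, because the atom has one neighbor while both stored types carry the default 0, whereas the relaxed lookup returns the neutral type.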
From: Rajarshi G. <rx...@ps...> - 2005-07-13 14:55:21
On Wed, 2005-07-13 at 09:00 +0200, chr...@un... wrote:
> > 1) Why do we need the neighbor count, if the bond order sum and max bond
> > orders match
> In atomtype matching I am not quite sure, but I think it is not necessary. In
> general you need this to attach hydrogens to a molecule, when only heavy atoms
> are drawn.
>
> > 2) The data file, hybridization_atomtypes.xml does not contain any
> > neighbor count information (and hence the last value is 0 in the debug
> > output for stored atom types for Cl). Where would this be set for these
> > atom types? Or is it calculated from the max bond order and bond order
> > sum. In which case, why require matching neighbor counts?
> No, only the values coded in hybridization_atomtypes are set in the atom. So
> the 0 is the default if you like. At this moment I would suggest to remove the
> neighbour match.

Thanks for the info - I'll go ahead and update CVS with this change.

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
667: The neighbor of the beast.
From: Rajarshi G. <rx...@ps...> - 2005-07-14 22:21:25
I've been looking some more at the hybridization_atomtypes.xml file and I see some inconsistency.

The carbon atom types have their formal neighbor counts specified. This is understandable, since otherwise we could not differentiate between Cplus.sp2 and Cminus.sp2. Furthermore, the carbon atom types have a line containing:

    <scalar dataType="xsd:string" dictRef="cdk:hybridization">sp3</scalar>

However, apart from carbon, I don't think the types for other atoms have neighbor count or hybridization type information added.

Now, in my previous mail and the reply from Christian it was suggested that we could do away with the check on formal neighbor count. From the above observation regarding carbon atom types, ignoring the formal neighbor count will not allow us to differentiate certain atom types for carbon.

So what is the current situation with the hybridization data file and the associated matcher class? Is this still work in progress? Is there some other strategy behind the lack of neighbor count info and hybridization state in the config file?

From the code of the matcher it seems that neighbor information should be in the file. The Javadocs also indicate this:

"AtomType matcher that deduces the hybridization state of an atom based on the max bond order, bond order sum and neighbor count properties of the Atom."

<rant>
If work on this class and the associated data is not finished, it'd be nice to have some notice of this somewhere in the docs.

From a QC point of view, this situation is a little frustrating, as rather than write an application with the CDK, I'm having to dig into code that, from all appearances, should be working. I have no problem with non-working code - I am very happy to dig into code and fix it if required - but it would be nice if it were noted as such.

In addition, since other code depends on the hybridization atom type data and matcher, I think it's all the more important that such code be marked non-working/incomplete if it is such.
</rant>

Apologies for the rant and I hope nobody takes this personally.

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
The Heineken Uncertainty Principle:
You can never be sure how many beers you had last night.
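One possible way to reconcile the two observations in this message (carbon types rely on the neighbor count, while the other types do not record one) would be to compare the neighbor count only when a type actually specifies it. The sketch below is an assumption made for illustration and not what the CDK does. `OptionalNeighborSketch`, `StoredType`, `find` and all type labels are hypothetical, and the values are not taken from hybridization_atomtypes.xml. Instead of the Cplus.sp2/Cminus.sp2 pair, whose stored values are not reproduced in the thread, the example uses a carbon with two double bonds (two neighbors, sp) and a carbon with one double and two single bonds (three neighbors, sp2), which share the same bond order sum and max bond order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical illustration only; names and values are not taken from the CDK.
class OptionalNeighborSketch {

    static final class StoredType {
        final String name;
        final double bondOrderSum;
        final double maxBondOrder;
        final OptionalInt neighborCount; // empty means "not specified in the file"

        StoredType(String name, double bondOrderSum, double maxBondOrder,
                   OptionalInt neighborCount) {
            this.name = name;
            this.bondOrderSum = bondOrderSum;
            this.maxBondOrder = maxBondOrder;
            this.neighborCount = neighborCount;
        }
    }

    /** Made-up example types: two carbons that differ only in neighbor count. */
    static List<StoredType> exampleTypes() {
        return Arrays.asList(
                new StoredType("C.sp", 4.0, 2.0, OptionalInt.of(2)),
                new StoredType("C.sp2", 4.0, 2.0, OptionalInt.of(3)),
                new StoredType("Cl.neutral", 1.0, 1.0, OptionalInt.empty()));
    }

    /** Bond orders must always agree; neighbor count only if the type specifies one. */
    static StoredType find(double bondOrderSum, double maxBondOrder,
                           int neighborCount, List<StoredType> types) {
        for (StoredType type : types) {
            boolean ordersAgree = type.bondOrderSum == bondOrderSum
                    && type.maxBondOrder == maxBondOrder;
            boolean neighborsAgree = !type.neighborCount.isPresent()
                    || type.neighborCount.getAsInt() == neighborCount;
            if (ordersAgree && neighborsAgree) {
                return type;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<StoredType> types = exampleTypes();
        System.out.println(find(4.0, 2.0, 2, types).name);
        System.out.println(find(4.0, 2.0, 3, types).name);
        System.out.println(find(1.0, 1.0, 1, types).name);
    }
}
```

The sketch keeps "unspecified" separate from 0 on purpose: an isolated anion really does have zero neighbors, so treating a stored 0 as a wildcard would be ambiguous.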
From: Uli F. <u.f...@ch...> - 2005-07-14 23:55:05
> If work on this class and the associated data is not finished, it'd be
> nice to have some notice of this somewhere in the docs.
>
> From a QC point of view, this situation is a little frustrating, as
> rather than write an application with the CDK, I'm having to dig into
> code that, from all appearances should be working. I have no problem
> with non-working code - I am very happy to dig into code and fix it if
> required - but it would be nice if it were noted as such.

Yeah, I totally agree with that. I fully support the release-soon philosophy, but if something is work in progress or is not fully tested, a simple hint in the javadoc makes life a lot easier :)

The important point here is "from all appearances should be working".

Uli
From: Egon W. <eg...@us...> - 2005-08-06 14:46:55
On Friday 15 July 2005 01:57, Uli Fechner wrote:
> Yeah, I totally agree with that. I fully support the release-soon
> philosophy but if something is work in progress or is not fully tested a
> simple hint in the javadoc makes life a lot easier :)

I do want to remind everyone that most of the CDK *is* unstable! And it is marked as such [1], just not explicitly in the source code. The fact that many assume otherwise (that things are stable) is merely a consequence of our *unstable* code working very well in most cases.

Actually, 'unstable' does not really mean unstable, but not tested to match up with CDK standards. One such test is proper JavaDoc, so when a class is unstable, expect JavaDoc to be incomplete.

It is good that someone points to problems like this once more, but it is nothing new. I've said it many times, and will say it once more:

Only code in cdk-core and cdk-data is *stable*. All other code should be used with care:
- read JavaDoc, *and* source code of class and JUnit tests

Egon

1. http://almost.cubic.uni-koeln.de/cdk/cdk_top/devel/modules/

--
eg...@us...
GPG: 1024D/D6336BA6
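A stability hint of the kind discussed in this exchange could also live in the source itself. The following is only a sketch of one way to do that with a plain Java annotation. `WorkInProgress`, `ExampleMatcher` and `statusOf` are hypothetical names, and nothing here is an existing CDK convention.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical illustration; not an existing CDK convention.
class StabilityHintSketch {

    /** Marks a class as unfinished; @Documented makes the marker show up in Javadoc. */
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface WorkInProgress {
        String value();
    }

    /**
     * Example class carrying the hint both in prose and as an annotation.
     * Work in progress: the backing data file lacks neighbor counts for most elements.
     */
    @WorkInProgress("data file lacks neighbor counts for most elements")
    static final class ExampleMatcher {
    }

    /** Reads the hint back via reflection, e.g. for a report over all classes. */
    static String statusOf(Class<?> type) {
        WorkInProgress hint = type.getAnnotation(WorkInProgress.class);
        return hint == null ? "no hint" : "work in progress: " + hint.value();
    }

    public static void main(String[] args) {
        System.out.println(statusOf(ExampleMatcher.class));
        System.out.println(statusOf(String.class));
    }
}
```

A plain sentence in the class Javadoc would already serve readers; the annotation only adds the option of listing flagged classes automatically.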
From: <chr...@un...> - 2005-07-15 08:00:00
Hi,

As I remember, the hybridisation_atomtypes was implemented by Egon and Matteo for some QSAR descriptors (no general approach, only to fit their needs), *before* the AtomHybridisationVSEPR. I have also seen the problems with the atom typing in CDK and started a discussion with Egon, which is on hold during his well-deserved holiday.

I would suggest taking the *normal* atomtypes and calculating the hybridisations with the corresponding descriptor. So in my opinion the hybridisation_atomtypes is not needed anymore, but I am still not sure about the idea behind it.

Currently I am trying to implement the more detailed mmff94 descriptors in the *normal* CDK atom typing way, like the hybridisation_atomtypes. But this can take some time, so when you need detailed atomtypes use ModelBuilder3d to assign them (mm2 or mmff94), and for hybridisation use the AtomHybridisationVSEPRDescriptor.

I would not suggest using hybridisation_atomtypes anymore. This job can be done better by the AtomHybridisationVSEPRDescriptor.

There is a paper on the xlogp:
Wang, R., Fu, Y., & Lai, L., J. Chem. Inf. Comput. Sci., 37:615-621, 1997.
As I remember, they have a fragment-like approach and the para H fragment is not taken into account in the CDK implementation. But I am not quite sure, I have to read it again myself ;).

best regards
Christian
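The idea behind a VSEPR-style hybridisation estimate, as suggested in this message, can be sketched without atom type tables. The code below is a textbook simplification written for illustration and is not the AtomHybridisationVSEPRDescriptor implementation. `VseprSketch` and `hybridization` are hypothetical names, and the caller is assumed to supply the number of sigma-bonded neighbors (including hydrogens) and the number of lone pairs.

```java
// Textbook VSEPR simplification; not the CDK descriptor implementation.
class VseprSketch {

    /**
     * Estimates hybridisation from the steric number, i.e. the number of
     * sigma-bonded neighbors plus the number of lone pairs on the atom.
     */
    static String hybridization(int sigmaNeighbors, int lonePairs) {
        int stericNumber = sigmaNeighbors + lonePairs;
        switch (stericNumber) {
            case 2:
                return "sp";
            case 3:
                return "sp2";
            case 4:
                return "sp3";
            default:
                return "unknown"; // outside the range this sketch covers
        }
    }

    public static void main(String[] args) {
        // Central carbon of the molecule from the first message: four neighbors, no lone pairs.
        System.out.println("C:  " + hybridization(4, 0));
        // The chlorine atom: one neighbor, three lone pairs.
        System.out.println("Cl: " + hybridization(1, 3));
    }
}
```

This ignores resonance and other effects that can change the geometry at an atom, so it should be read as a rough rule rather than a replacement for a tested descriptor.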
From: Matteo F. <flo...@ya...> - 2005-07-18 08:29:34
Hi all,

> There is a paper on the xlogp:
> Wang, R., Fu, Y., & Lai, L.,
> As I remember, they have a fragment-like approach and the para H fragment is
> not taken into account in the CDK implementation. But I am not quite sure,
> I have to read it again myself ;).

The documentation is not complete... I mean there is only one example. I'm still waiting for an email from Dr Wang with the original source code for a complete validation.

Regards,
Matteo.

_____________
"" L'ana mortu sena piedade sos aguzzinos de su capitale ma non morit sa sua ereditade ""
anonimo paulese
_____________
From: Christoph S. <c.s...@un...> - 2005-07-15 22:07:33
I perfectly agree with both of you, Rajarshi and Uli.

Please understand that CDK was developed by a quite small number of people for quite some time and thus, these kinds of troubles never really bothered us. But right now, as the library takes off, the problem is severe.

I would actually dare to state that the atomtype problem is the most severe in CDK. And this is due to the fact that fixing it for working on 99% of the cases would involve a lot of work, but getting it to run for my current problem is easy. That is kind of the fundamental problem of Open Source in small communities.

But anyway, I think your message was received, and it was more about documentation than about non-working code. This is very much appreciated.

Cheers,

Chris

--
Priv. Doz. Dr. Christoph Steinbeck (c.s...@un...)
Head of the Research Group for Molecular Informatics
Cologne University BioInformatics Center (http://almost.cubic.uni-koeln.de)
Zülpicher Str. 47, 50674 Cologne
Tel: +49(0)221-470-7426
Fax: +49 (0) 221-470-7786

What is man but that lofty spirit - that sense of enterprise.
... Kirk, "I, Mudd," stardate 4513.3..
From: Rajarshi G. <rx...@ps...> - 2005-07-15 22:34:30
On Sat, 2005-07-16 at 00:07 +0200, Christoph Steinbeck wrote:
> I would actually dare to state that the atomtype problem is the most severe in
> CDK. And this is due to the fact that fixing it for working on 99% of the cases,
> would involve a lot of work, but getting it to run for my current problem is
> easy. That is kind of the fundamental problem of Open Source in small communities.
>
> But anyway, I think your message was received, and it was more about
> documentation than about non-working code. This is very much appreciated.

That's correct - as I said, I have no problem with trying to fix code that doesn't work.

I realize that documentation is boring (and many times I've really had to force myself to write up Javadocs!). However, the problem is not so much for regular developers on the list, who have a general idea of what's going on. For a developer who needs cheminformatics functionality and turns to the CDK, undocumented features, limitations, todos and the like all detract from the quality of the code.

I'm in line with Joerg's view (mentioned before on this and other lists) that we need to consider cheminformatics development as a software engineering situation. And hence, we need some rigor. I think Egon's proposal of a QA team and the limitations module are the beginnings of this type of approach.

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
All science is either physics or stamp collecting.
-- Ernest Rutherford
From: Egon W. <eg...@us...> - 2005-08-06 14:50:55
On Saturday 16 July 2005 00:34, Rajarshi Guha wrote:
> I'm in line with Joerg's view (mentioned before on this and other lists)
> that we need to consider cheminformatics development as a software
> engineering situation. And hence, we need some rigor. I think Egon's
> proposal of a QA team and the limitations module are the beginnings of
> this type of approach.

I would point again to all my efforts over the past 1.5 years or so on getting the quality of the CDK to meet some standards. This has been well worked out for quite some time. If interested, please read up in the email archives on how CDK ensures its library quality. And yes, I can use some more people to work with me to cover more than just the data and core modules.

My QA team proposal has more to do with practical testing of classes, and has little to do with JavaDoc and source code quality... At least to start with; if that team wants to pick up my previous work, I would be the last to object...

Egon

--
eg...@us...
GPG: 1024D/D6336BA6