Thread: [Cdk-devel] question about HybridizationStateATMatcher

[Cdk-devel] question about HybridizationStateATMatcher

From: Rajarshi G. <rx...@ps...> - 2005-07-12 20:39:10

Hi, I was trying to calculate the AtomHybridizationDescriptor for the
following molecule:

dan002.sdf
  MOE2004           3D

  5  4  0  0  0  0  0  0  0  0999 V2000
   -1.7000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
    0.0500    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.5500    0.4580    1.1740 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.5500   -1.2450   -0.1900 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.5500    0.7880   -0.9830 F   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  2  4  1  0  0  0  0
  2  5  1  0  0  0  0
M  END
$$$$

In the code for the descriptor there are two lines:

atm = new HybridizationStateATMatcher();
matched = atm.findMatchingAtomType(container, atom);

The problem is that for the Cl atom, matched is null.

The CDK debug log contains:

My ATOM TYPE Cl 1.0 1.0 1
0ATOM TYPE 1.0 1.0 0
1ATOM TYPE 0.0 0.0 0

(formatted for ease of reading).

Now, the first line indicates the 

symbol, bond order sum, max bond order and connected atom count

for the atom in question and the next 2 lines indicate the possible atom
types from the config file. 

Now, the data file for hybridization atom types contains two possible
types for Cl - one for neutral Cl (connected by a single bond to some
other atom) and one for an anionic Cl.

Clearly, the Cl in the above molecule matches the first type.

However the code in HybridizationStateATMatcher returns a successful
match only when the

bond order sum
max bond order
*and* the neighbor count all match.

The neighbor count is obtained by calling getFormalNeighborCount() of
the AtomType object returned by AtomTypeFactory.getInstance().
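
A minimal sketch of the matching rule described above, written with hypothetical stand-in classes (StoredType, NeighborCountMatchSketch) rather than the actual CDK API. It takes the three sets of values from the debug log and shows why a comparison that includes the neighbor count finds nothing for the Cl atom, while a comparison on bond order sum and max bond order alone finds the neutral Cl type. The assumption that a missing neighbor count is stored as 0 comes from the debug output; everything else is illustrative.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one stored atom type; NOT the CDK AtomType class. */
final class StoredType {
    final String symbol;
    final double bondOrderSum;
    final double maxBondOrder;
    final int formalNeighborCount; // assumed to be 0 when the data file gives no value

    StoredType(String symbol, double bondOrderSum, double maxBondOrder, int formalNeighborCount) {
        this.symbol = symbol;
        this.bondOrderSum = bondOrderSum;
        this.maxBondOrder = maxBondOrder;
        this.formalNeighborCount = formalNeighborCount;
    }
}

/** Illustrative sketch only; names and structure are assumptions, not CDK code. */
final class NeighborCountMatchSketch {

    /** The two stored Cl types, with the values shown in the debug log. */
    static List<StoredType> chlorineTypes() {
        return Arrays.asList(
                new StoredType("Cl", 1.0, 1.0, 0),   // neutral Cl, one single bond
                new StoredType("Cl", 0.0, 0.0, 0));  // anionic Cl, no bonds
    }

    /** Requires bond order sum, max bond order and neighbor count to all agree. */
    static StoredType strictMatch(List<StoredType> candidates,
                                  double bondOrderSum, double maxBondOrder, int neighborCount) {
        for (StoredType t : candidates) {
            // Exact comparison is acceptable here because bond orders are small exact values.
            if (t.bondOrderSum == bondOrderSum
                    && t.maxBondOrder == maxBondOrder
                    && t.formalNeighborCount == neighborCount) {
                return t;
            }
        }
        return null; // no stored type agrees on all three properties
    }

    /** Same as strictMatch, but ignores the neighbor count. */
    static StoredType relaxedMatch(List<StoredType> candidates,
                                   double bondOrderSum, double maxBondOrder) {
        for (StoredType t : candidates) {
            if (t.bondOrderSum == bondOrderSum && t.maxBondOrder == maxBondOrder) {
                return t;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<StoredType> cl = chlorineTypes();
        // The Cl atom in the molecule: bond order sum 1.0, max bond order 1.0, 1 neighbor.
        StoredType strict = strictMatch(cl, 1.0, 1.0, 1);
        StoredType relaxed = relaxedMatch(cl, 1.0, 1.0);
        System.out.println("strict match found:  " + (strict != null));
        System.out.println("relaxed match found: " + (relaxed != null));
    }
}
```

In this sketch the strict comparison fails only because the stored neighbor count defaults to 0 while the atom has one neighbor; the other two properties agree with the first stored type.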

My questions are:

1) Why do we need the neighbor count, if the bond order sum and max bond
orders match?

2) The data file, hybridization_atomtypes.xml, does not contain any
neighbor count information (and hence the last value is 0 in the debug
output for stored atom types for Cl). Where would this be set for these
atom types? Or is it calculated from the max bond order and bond order
sum? In which case, why require matching neighbor counts?

Thanks,

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
Eureka!
-- Archimedes

Re: [Cdk-devel] question about HybridizationStateATMatcher

From: Rajarshi G. <rx...@ps...> - 2005-07-13 14:55:21

On Wed, 2005-07-13 at 09:00 +0200, chr...@un... wrote:
> >
> > 1) Why do we need the neighbor count, if the bond order sum and max bond
> > orders match
> In atomtype matching I am not quite sure, but I think it is not necessary. In
> general you need this to attach hydrogens to a molecule, when only heavy atoms
> are drawn.
> 
> > 2) The data file, hybridization_atomtypes.xml does not contain any
> > neighbor count information (and hence the last value is 0 in the debug
> > output for stored atom types for Cl). Where would this be set for these
> > atom types? Or is it calculated from the max bond order and bond order
> > sum. In which case, why require matching neighbor counts?
> No, only the values coded in hybridization_atomtypes, are set in the atom. So
> the 0 is the default if you like. At this moment I would suggest to remove the
> neighbour match.

Thanks for the info - I'll go ahead and update CVS with this change

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
667:
The neighbor of the beast.

Re: [Cdk-devel] question about HybridizationStateATMatcher

From: Rajarshi G. <rx...@ps...> - 2005-07-14 22:21:25

On Wed, 2005-07-13 at 10:55 -0400, Rajarshi Guha wrote:
> On Wed, 2005-07-13 at 09:00 +0200, chr...@un... wrote:
> > > 1) Why do we need the neighbor count, if the bond order sum and
> > > max bond orders match
> > In atomtype matching I am not quite sure, but I think it is not
> > necessary. In general you need this to attach hydrogens to a
> > molecule, when only heavy atoms are drawn.
> >
> > > 2) The data file, hybridization_atomtypes.xml does not contain any
> > > neighbor count information (and hence the last value is 0 in the
> > > debug output for stored atom types for Cl). Where would this be set
> > > for these atom types? Or is it calculated from the max bond order
> > > and bond order sum. In which case, why require matching neighbor
> > > counts?
> > No, only the values coded in hybridization_atomtypes, are set in the
> > atom. So the 0 is the default if you like. At this moment I would
> > suggest to remove the neighbour match.

I've been looking some more at the hybridization_atomtypes.xml file and
I see some inconsistency.

The Carbon atom types have their formal neighbor counts specified. This
is understandable, since otherwise we could not differentiate between
Cplus.sp2 and Cminus.sp2. Furthermore, the carbon atom types have a line
containing:

<scalar dataType="xsd:string" dictRef="cdk:hybridization">sp3</scalar>

However apart from carbon, I don't think the types for other atoms have
neighbor count or hybridization type information added.

Now in my previous mail and reply from Christian it was suggested that
we could do away with the check on formal neighbor count. From the above
observation regarding carbon atom types, ignoring formal neighbor count
will not allow us to differentiate certain atom types for carbon.

So what is the current situation with the hybridization data file and
the associated matcher class? Is this still work in progress? Is there
some other strategy behind the lack of neighbor count info and
hybridization state in the config file? 

From the code of the matcher it seems that neighbor information should
be in the file. The Javadocs also indicate this:

"AtomType matcher that deduces the hybridization state of an atom based
on the max bond order, bond order sum and neighbor count properties of
the Atom."

<rant>
If work on this class and the associated data is not finished, it'd be
nice to have some notice of this somewhere in the docs. 

From a QC point of view, this situation is a little frustrating, as
rather than write an application with the CDK, I'm having to dig into
code that, from all appearances should be working. I have no problem
with non-working code - I am very happy to dig into code and fix it if
required - but it would be nice if it were noted as such.

In addition, since other code depends on the hybridization atom type data
and matcher, I think it's all the more important that such code be marked
non-working/incomplete if it is such.
</rant>

Apologies for the rant and I hope nobody takes this personally.

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
The Heineken Uncertainty Principle:
You can never be sure how many beers you had last night.

Re: [Cdk-devel] question about HybridizationStateATMatcher

From: Uli F. <u.f...@ch...> - 2005-07-14 23:55:05

> If work on this class and the associated data is not finished, it'd be
> nice to have some notice of this somewhere in the docs. 
> 
> From a QC point of view, this situation is a little frustrating, as
> rather than write an application with the CDK, I'm having to dig into
> code that, from all appearances should be working. I have no problem
> with non-working code - I am very happy to dig into code and fix it if
> required - but it would be nice if it were noted as such.

Yeah, I totally agree with that. I fully support the release-soon 
philosophy but if something is work in progress or is not fully tested a 
simple hint in the javadoc makes life a lot easier :)

The important point here is "from all appearances should be working".

Uli

[Cdk-devel] Stable vs. Unstable (was: question about HybridizationStateATMatcher)

From: Egon W. <eg...@us...> - 2005-08-06 14:46:55

On Friday 15 July 2005 01:57, Uli Fechner wrote:
> Yeah, I totally agree with that. I fully support the release-soon
> philosophy but if something is work in progress or is not fully tested a
> simple hint in the javadoc makes life a lot easier :)

I do want to remind everyone that most code in CDK *is* unstable! And it is
marked as such [1], just not explicitly in the source code.

The fact that many assume things are stable instead is a mere consequence of
our *unstable* code working very well in most cases. Actually, 'unstable' does
not really mean unstable, but not tested to match up with CDK standards. One
such test is proper JavaDoc, so when a class is unstable, expect JavaDoc to be
incomplete.

It is good that someone points to problems like this once more, but it is
nothing new. I've said it many times, and will once more:

Only code in cdk-core and cdk-data is *stable*.

All other code should be used with care:
- read JavaDoc, *and* source code of class and JUnit tests

Egon

1. http://almost.cubic.uni-koeln.de/cdk/cdk_top/devel/modules/
-- 
eg...@us...
GPG: 1024D/D6336BA6

Re: [Cdk-devel] question about HybridizationStateATMatcher

From: <chr...@un...> - 2005-07-15 08:00:00

hi,

As I remember, hybridisation_atomtypes was implemented by Egon and Matteo for
some QSAR descriptors (not as a general approach, only to fit their needs),
-before- the AtomHybridisationVSEPR. I have also seen the problems with atom
typing in CDK and started a discussion with Egon, which was interrupted by his
well-deserved holiday. I would suggest taking the *normal* atomtypes and
calculating the hybridisations with the corresponding descriptor. So in my
opinion hybridisation_atomtypes is not needed anymore, but I am still not sure
about the idea behind it. Currently I am trying to implement the more detailed
mmff94 descriptors in the *normal* CDK atom typing way, like the
hybridisation_atomtypes. But this can take some time, so when you need detailed
atomtypes use ModelBuilder3d to assign them (mm2 or mmff94) and for
hybridisation the AtomHybridisationVSEPRDescriptor. I would not suggest using
hybridisation_atomtypes anymore. This job can be done better by the
AtomHybridisationVSEPRDescriptor.

There is a paper on XLogP: Wang, R., Fu, Y., & Lai, L., J. Chem. Inf.
Comput. Sci., 37:615-621, 1997.
As I remember they have a fragment-like approach, and the para H fragment is
not taken into account in the CDK implementation. But I am not quite sure, I
have to read it again myself ;).


best regards
Christian



Re: [Cdk-devel] question about HybridizationStateATMatcher

From: Matteo F. <flo...@ya...> - 2005-07-18 08:29:34

Hi all




> There is a paper on XLogP: Wang, R, Ying, Fu, & Lai, Luhua, As I remember they
> have a fragment-like approach and the para H fragment is not taken into
> account in the CDK implementation. But I am not quite sure, have to read it
> again myself ;).

The documentation is not complete... I mean there is only one example. I'm still waiting for an email from Dr Wang with the original source code for a complete validation.

Regards, 

Matteo.



_____________

"" L'ana mortu sena piedade
sos aguzzinos de su capitale
ma non morit
sa sua ereditade ""

      anonimo paulese
_____________
		

Re: [Cdk-devel] question about HybridizationStateATMatcher

From: Christoph S. <c.s...@un...> - 2005-07-15 22:07:33

I perfectly agree with both of you, Rajarshi and Uli.
Please understand that CDK was developed by a quite small number of people for
quite some time and thus, these kinds of troubles never really bothered us.
But right now, as the library takes off, the problem is severe.

I would actually dare to state that the atomtype problem is the most severe in
CDK. And this is due to the fact that fixing it for working on 99% of the cases,
would involve a lot of work, but getting it to run for my current problem is
easy. That is kind of the fundamental problem of Open Source in small communities.

But anyway, I think your message was received, and it was more about
documentation than about non-working code. This is very much appreciated.

Cheers,

Chris

--
Priv. Doz. Dr. Christoph Steinbeck (c.s...@un...)
Head of the Research Group for Molecular Informatics
Cologne University BioInformatics Center (http://almost.cubic.uni-koeln.de)
Zülpicher Str. 47, 50674 Cologne
Tel: +49(0)221-470-7426   Fax: +49 (0) 221-470-7786

What is man but that lofty spirit - that sense of enterprise.
... Kirk, "I, Mudd," stardate 4513.3..

Uli Fechner wrote:
>> If work on this class and the associated data is not finished, it'd be
>> nice to have some notice of this somewhere in the docs.
>>
>> From a QC point of view, this situation is a little frustrating, as
>> rather than write an application with the CDK, I'm having to dig into
>> code that, from all appearances should be working. I have no problem
>> with non-working code - I am very happy to dig into code and fix it if
>> required - but it would be nice if it were noted as such.
>
> Yeah, I totally agree with that. I fully support the release-soon
> philosophy but if something is work in progress or is not fully tested a
> simple hint in the javadoc makes life a lot easier :)
>
> The important point here is "from all appearances should be working".
>
> Uli

Re: [Cdk-devel] question about HybridizationStateATMatcher

From: Rajarshi G. <rx...@ps...> - 2005-07-15 22:34:30

On Sat, 2005-07-16 at 00:07 +0200, Christoph Steinbeck wrote:

> I would actually dare to state that the atomtype problem is the most severe in 
> CDK. And this is due to the fact that fixing it for working on 99% of the cases, 
> would involve a lot of work, but getting it to run for my current problem is 
> easy. That is kind of the fundamental problem of Open Source in small communities.
> 
> But anyway, I think your message was received, and it was more about 
> documentation than about non-working code. This is very much appreciated.

That's correct - as I said, I have no problem with trying to fix code
that doesn't work.

I realize that documentation is boring (and many times I've really had
to force myself to write up Javadocs!), however the problem is not so
much for regular developers on the list, who have a general idea of
what's going on.

For a developer who needs cheminformatics functionality and turns to the
CDK, undocumented features/limitations/todo's etc all detract from the
quality of the code.

I'm in line with Joerg's view (mentioned before on this and other lists)
that we need to consider cheminformatics development as a software
engineering situation. And hence, we need some rigor. I think Egon's
proposal of a QA team and the limitations module are the beginnings of
this type of approach.

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
All science is either physics or stamp collecting.
-- Ernest Rutherford

[Cdk-devel] Re: question about HybridizationStateATMatcher

From: Egon W. <eg...@us...> - 2005-08-06 14:50:55

On Saturday 16 July 2005 00:34, Rajarshi Guha wrote:
> I'm in line with Joerg's view (mentioned before on this and other lists)
> that we need to consider cheminformatics development as a software
> engineering situation. And hence, we need some rigor. I think Egon's
> proposal of a QA team and the limitations module are the beginnings of
> this type of approach.

I would point again to all my efforts over the past 1.5 years or so on
getting the quality of the CDK to meet some standards. This has been well
worked out for quite some time.

If interested, please read up in the email archives on how CDK ensures its 
library quality. And yes, I can use some more people to work with me to cover 
more than just the data and core modules.

My QA team proposal has more to do with practical testing of classes, and has 
little to do with JavaDoc and source code quality... At least to start with; 
if that team wants to pick up my previous work, I would be the last to 
object...

Egon

-- 
eg...@us...
GPG: 1024D/D6336BA6