From: Egon W. <e.w...@sc...> - 2006-01-05 13:25:14
|
Moved discussion to the cdk...@li... ML.

On Thursday 05 January 2006 14:15, Uli Fechner wrote:
> I would like to propose a protocol for descriptor QA. Please feel free
> to comment on this. As the deadline for the upcoming issue of CDKNews is
> getting close (January 15th), I would like to come to an agreement on a
> descriptor QA protocol until the end of this week.
>
> Dataset:
> - Take complete Set #3 of the ZINC dataset
> (http://blaster.docking.org/zinc/bysubset.shtml)
> - run a MaxMin selection to yield a diverse subset of 100 000 compounds

How large should the subset be? I think 1000 is large enough.

> We need to choose a descriptor that is used for the maxmin selection.
> This descriptor is not related to the descriptor that is subject to QA!
> I propose to take
> - EITHER the fingerprinter of CDK
> - OR the CATS descriptor of our group here in Frankfurt
>
> Personally, I prefer the CATS descriptor but even though its computation
> is published in detail there is no public implementation available.
> Please feel free to comment on this.

I prefer an open descriptor. What's the publication? What's your estimate of
the time required to make an open source implementation for CATS?

> The ZINC website states that it is not allowed to re-distribute data
> that is downloaded from their website. In other words, we cannot put our
> CDK descriptor QA dataset on the CDK website! Does anyone know one of
> the ZINC guys to ask for their permission?

We should contact them. I'll send an email right now to the list.

> Descriptor validation:
> - Calculate descriptor X using the CDK implementation
> - Calculate descriptor X using "reference" implementation: MOE, DRAGON;
> any suggestions for another "reference" program?

Sounds good.

> - Detailed comparison of the descriptor values: mean difference, max/min
> difference, % compounds w/ <= 10% difference; anything else here?

- median difference
- ten most different compounds
- list of possible causes of differences, e.g. not using the BO data :)

> I am very much looking forward to your comments!

Me too.

E.

--
Egon Willighagen
http://chem-bla-ics.blogspot.com/
|
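The comparison statistics listed in this message (mean, median, min and max difference, percentage of compounds within 10%, and the most different compounds) could be computed roughly as follows. This is only a sketch: the function name `compare_descriptor`, the dict-based input, and the handling of zero reference values are assumptions made for illustration and are not part of the proposed protocol.

```python
from statistics import mean, median


def compare_descriptor(cdk, ref, tolerance=0.10, n_worst=10):
    """Summarize differences between two implementations of one descriptor.

    cdk, ref: dicts mapping a compound ID to the descriptor value.
    Only compounds present in both dicts are compared.
    """
    ids = sorted(set(cdk) & set(ref))
    if not ids:
        raise ValueError("no compounds in common")

    # Signed difference per compound (CDK minus reference).
    diff = {i: cdk[i] - ref[i] for i in ids}
    abs_diff = {i: abs(d) for i, d in diff.items()}

    def within(i):
        # Relative criterion. Assumption: a zero reference value only
        # counts as agreeing when the CDK value is zero as well.
        if ref[i] == 0:
            return cdk[i] == 0
        return abs_diff[i] <= tolerance * abs(ref[i])

    # Compounds with the largest absolute difference come first.
    worst = sorted(ids, key=lambda i: abs_diff[i], reverse=True)[:n_worst]

    return {
        "n": len(ids),
        "mean_diff": mean(diff.values()),
        "median_diff": median(diff.values()),
        "min_diff": min(diff.values()),
        "max_diff": max(diff.values()),
        "pct_within_tolerance": 100.0 * sum(within(i) for i in ids) / len(ids),
        "most_different": worst,
    }
```

The signed difference is kept for the mean and median so that a systematic over- or underestimate stays visible, while the ranking of the most different compounds uses the absolute difference.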
From: Uli F. <u.f...@ch...> - 2006-01-07 02:48:58
|
>>- Take the complete set #3 of the ZINC database
>>- calculate the CATS descriptor for all compounds in set #3
>
> I think Egon raised this previously - is CATS going to be open source?

Most likely yes, even though I cannot say when. This is the only
non-open-source program, and it does not make me really happy either. But I
do not have a better idea; so even though this might not be the best way
to go, it is the best I came up with.

>>- run a MaxMin selection to yield a diverse subset of 10k compounds
>>- compute "reference" descriptor with MOE and/or DRAGON
>>- compute CDK descriptor
>>- compare the "reference" descriptor values with the one of CDK:
>>mean/median difference, max/min difference, %compounds w/ <= 10%
>>difference, show ten most different compounds and try to find the
>>reason, possible causes of different descriptor values
>
> All the above sounds OK. However, I'm still not entirely sure we need
> 10K structures as opposed to 1k structures, as I'm going to assume that
> in 10K structures many of them will have similar features leading to
> similar descriptor values (for a given descriptor)
>
> But if the consensus is for 10K that's OK with me.

I agree with you. Let's just take 1k structures. That makes calculations
faster and plotting feasible.

>>@Rajarshi: You mentioned "Also plots should be included, see examples in
>>the QA repo for the gravitational index and MI descriptors". Sorry, but
>>I did not get to what you are referring..?
>
> I had uploaded some validation results for the MI and grav. index
> descriptors (compared to the ADAPT implementation) to the cdk-qa CVS
> repository. I had included the RMSE values as well as the plots of CDK
> value vs ADAPT value.

Ah, just had a look at that.

> The stats you mention above certainly summarize differences between
> descriptor implementations, but having plots would quickly allow
> readers/users to identify distinct trends (for example the CDK SA
> routine underestimates SA for larger molecules etc)

Having just looked at the plots I fully agree; hence the reduction to 1k
structures (see above).

> Another aspect for a validation writeup is factors that can affect
> descriptor calculations (as this is where the difference between the CDK
> and other implementations will arise, assuming that the actual
> algorithms are correctly implemented).
>
> Thus for example:
>
> CPSA descriptors require charges & SA
> BCUT/WHIM etc require eigenvalues
> BCUT/WHIM can use electronegativities
>
> and so on.
>
> Thus I think that in addition to explaining the 10 most extreme
> deviations, we also need to identify aspects of descriptor calculations
> that will lead to general differences. At a first go these will be:
>
> charge calculation
> surface areas
> electronegativities
> atomic radii
> eigenvalue decomposition (probably not too significant)

Ack. I will consider this.

Uli
|
From: Rajarshi G. <rx...@ps...> - 2006-01-07 06:05:04
|
On Sat, 2006-01-07 at 03:52 +0100, Uli Fechner wrote:
> >>- Take the complete set #3 of the ZINC database
> >>- calculate the CATS descriptor for all compounds in set #3
> >
> > I think Egon raised this previously - is CATS going to be open source?
>
> Most likely yes, even though I cannot say when. This is the only
> non-open-source program and it does not make me real happy either. But I
> do not have a better idea; so even though this might not be the best way
> to go it is the best I came up with.

Given that we will be using this descriptor just for selection of a
dataset, which will then be publicly available, I think it won't be too
much of a problem.

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
The Heineken Uncertainty Principle:
You can never be sure how many beers you had last night.
|
From: Uli F. <u.f...@ch...> - 2006-01-05 13:51:48
|
>>I would like to propose a protocol for descriptor QA. Please feel free
>>to comment on this. As the deadline for the upcoming issue of CDKNews is
>>getting close (January 15th), I would like to come to an agreement on a
>>descriptor QA protocol until the end of this week.
>>
>>Dataset:
>>- Take complete Set #3 of the ZINC dataset
>>(http://blaster.docking.org/zinc/bysubset.shtml)
>>- run a MaxMin selection to yield a diverse subset of 100 000 compounds
>
> How large should the subset be? I think 1000 is large enough.

Maybe we can agree on something in between: 10 000 compounds should be
large enough to yield meaningful results and small enough to allow for
fast computation times.

>>We need to choose a descriptor that is used for the maxmin selection.
>>This descriptor is not related to the descriptor that is subject to QA!
>>I propose to take
>>- EITHER the fingerprinter of CDK
>>- OR the CATS descriptor of our group here in Frankfurt
>>
>>Personally, I prefer the CATS descriptor but even though its computation
>>is published in detail there is no public implementation available.
>>Please feel free to comment on this.
>
> I prefer an open descriptor. What's the publication? What's your estimate of
> the time required to make an open source implementation for CATS?

I prefer an open descriptor, too. Most likely, CATS will go open source
at some point (even though I cannot say anything about the timeframe). I
already have a CDK-dependent implementation. If there are reasonable
suggestions other than CATS I would like to follow them.

>>The ZINC website states that it is not allowed to re-distribute data
>>that is downloaded from their website. In other words, we cannot put our
>>CDK descriptor QA dataset on the CDK website! Does anyone know one of
>>the ZINC guys to ask for their permission?
>
> We should contact them. I'll send an email right now to the list.

Good.

>>Descriptor validation:
>>- Calculate descriptor X using the CDK implementation
>>- Calculate descriptor X using "reference" implementation: MOE, DRAGON;
>>any suggestions for another "reference" program?
>
> Sounds good.
>
>>- Detailed comparison of the descriptor values: mean difference, max/min
>>difference, % compounds w/ <= 10% difference; anything else here?
>
> - median difference
> - ten most different compounds
> - list of possible causes of differences, e.g. not using the BO data :)

Fine!

Uli
|
From: Rajarshi G. <rx...@ps...> - 2006-01-05 16:27:57
|
On Thu, 2006-01-05 at 14:55 +0100, Uli Fechner wrote:
> >>I would like to propose a protocol for descriptor QA. Please feel free
> >>to comment on this. As the deadline for the upcoming issue of CDKNews is
> >>getting close (January 15th), I would like to come to an agreement on a
> >>descriptor QA protocol until the end of this week.
> >>
> >>Dataset:
> >>- Take complete Set #3 of the ZINC dataset
> >>(http://blaster.docking.org/zinc/bysubset.shtml)
> >>- run a MaxMin selection to yield a diverse subset of 100 000 compounds
> >
> > How large should the subset be? I think 1000 is large enough.
>
> Maybe we can agree on something in between: 10 000 compounds should be
> large enough to yield meaningful results and small enough to allow for
> fast computation times.

I would go with a smaller number. My reasoning is that most descriptors
have edge cases which need to be tested. Clearly increasing the size of
the molecule pool will increase the probability that all cases are
tested - but how much more diverse will the 10K pool be compared to the
1K pool?

Either way 10K or 1K is not a significant problem since it's just a
matter of computation time (though plotting 10K points is not a great
idea!)

> > I prefer an open descriptor. What's the publication? What's your estimate of
> > the time required to make an open source implementation for CATS?
>
> I prefer an open descriptor, too. Most likely, CATS will go open source
> at some point (even though I cannot say anything about the timeframe). I
> already have a CDK-dependent implementation. If there are reasonable
> suggestions other than CATS I would like to follow them.

I agree on an open descriptor. What about BCUTs? If these were used,
they would have to be validated first, leading to a chicken and egg
problem :(

Alternatively fingerprints would be useful (but I seem to recall reading
that they can be biased towards smaller compounds when doing library
selection using Tanimoto similarity)

> >>The ZINC website states that it is not allowed to re-distribute data
> >>that is downloaded from their website. In other words, we cannot put our
> >>CDK descriptor QA dataset on the CDK website! Does anyone know one of
> >>the ZINC guys to ask for their permission?
> >
> > We should contact them. I'll send an email right now to the list.

Even if distributing structures is not allowed, we are allowed to
distribute the ZINC codes that we used(?) Given the codes, it's not too
much of a problem to get the structures.

> >>- Detailed comparison of the descriptor values: mean difference, max/min
> >>difference, % compounds w/ <= 10% difference; anything else here?
> >
> > - median difference
> > - ten most different compounds
> > - list of possible causes of differences, e.g. not using the BO data :)
>
> Fine!

Agreed. (Also plots should be included, see examples in the QA repo for
the gravitational index and MI descriptors)

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
Eureka!
-- Archimedes
|
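For readers unfamiliar with the MaxMin selection discussed in this thread, a minimal sketch using Tanimoto distance on fingerprints might look like the following. Everything here is an assumption made for illustration: fingerprints are represented as plain Python sets of on-bit positions, the first compound is used as the seed, and the names `tanimoto_distance` and `maxmin_select` are hypothetical. It is not the CDK fingerprinter or the CATS implementation.

```python
def tanimoto_distance(a, b):
    """Return 1 - Tanimoto similarity for two sets of on-bit positions."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as identical
    return 1.0 - len(a & b) / union


def maxmin_select(fingerprints, k, seed_index=0):
    """Pick k diverse items by index.

    Each new pick is the item whose distance to its nearest
    already-picked item is largest.
    """
    n = len(fingerprints)
    k = min(k, n)
    if k == 0:
        return []

    picked = [seed_index]
    # min_dist[i] holds the distance from item i to its nearest picked item.
    min_dist = [tanimoto_distance(fp, fingerprints[seed_index])
                for fp in fingerprints]
    min_dist[seed_index] = -1.0  # marks an item as already picked

    while len(picked) < k:
        nxt = max(range(n), key=lambda i: min_dist[i])
        picked.append(nxt)
        min_dist[nxt] = -1.0
        # Only the newly picked item can shrink the nearest-pick distances.
        for i in range(n):
            if min_dist[i] >= 0.0:
                d = tanimoto_distance(fingerprints[i], fingerprints[nxt])
                if d < min_dist[i]:
                    min_dist[i] = d
    return picked
```

Each round costs one pass over the pool, so selecting k compounds from n takes on the order of n times k distance evaluations. The caveat raised above about Tanimoto-based selection favouring smaller compounds would apply to a sketch like this as well.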
From: Uli F. <u.f...@ch...> - 2006-01-06 23:43:12
|
Hello all,

as no further emails showed up on the mailing list regarding our
descriptor QA discussion, I would like to summarize the opinions and
propose a final protocol that - hopefully - is to the satisfaction of
all who participated in the discussion.

- Take the complete set #3 of the ZINC database
- calculate the CATS descriptor for all compounds in set #3
- run a MaxMin selection to yield a diverse subset of 10k compounds
- compute "reference" descriptor with MOE and/or DRAGON
- compute CDK descriptor
- compare the "reference" descriptor values with the one of CDK:
mean/median difference, max/min difference, %compounds w/ <= 10%
difference, show ten most different compounds and try to find the
reason, possible causes of different descriptor values

@Rajarshi: You mentioned "Also plots should be included, see examples in
the QA repo for the gravitational index and MI descriptors". Sorry, but
I did not get to what you are referring..?

Please give short feedback on whether the above is acceptable to you, or
state what you would like to see instead.

Uli

Rajarshi Guha wrote:
> On Thu, 2006-01-05 at 14:55 +0100, Uli Fechner wrote:
>
>>>>I would like to propose a protocol for descriptor QA. Please feel free
>>>>to comment on this. As the deadline for the upcoming issue of CDKNews is
>>>>getting close (January 15th), I would like to come to an agreement on a
>>>>descriptor QA protocol until the end of this week.
>>>>
>>>>Dataset:
>>>>- Take complete Set #3 of the ZINC dataset
>>>>(http://blaster.docking.org/zinc/bysubset.shtml)
>>>>- run a MaxMin selection to yield a diverse subset of 100 000 compounds
>>>
>>>How large should the subset be? I think 1000 is large enough.
>>
>>Maybe we can agree on something in between: 10 000 compounds should be
>>large enough to yield meaningful results and small enough to allow for
>>fast computation times.
>
> I would go with a smaller number. My reasoning is that most descriptors
> have edge cases which need to be tested. Clearly increasing the size of
> the molecule pool will increase the probability that all cases are
> tested - but how much more diverse will the 10K pool be compared to the
> 1K pool?
>
> Either way 10K or 1K is not a significant problem since it's just a
> matter of computation time (though plotting 10K points is not a great
> idea!)
>
>>>I prefer an open descriptor. What's the publication? What's your estimate of
>>>the time required to make an open source implementation for CATS?
>>
>>I prefer an open descriptor, too. Most likely, CATS will go open source
>>at some point (even though I cannot say anything about the timeframe). I
>>already have a CDK-dependent implementation. If there are reasonable
>>suggestions other than CATS I would like to follow them.
>
> I agree on an open descriptor. What about BCUTs? If these were used,
> they would have to be validated first, leading to a chicken and egg
> problem :(
>
> Alternatively fingerprints would be useful (but I seem to recall reading
> that they can be biased towards smaller compounds when doing library
> selection using Tanimoto similarity)
>
>>>>The ZINC website states that it is not allowed to re-distribute data
>>>>that is downloaded from their website. In other words, we cannot put our
>>>>CDK descriptor QA dataset on the CDK website! Does anyone know one of
>>>>the ZINC guys to ask for their permission?
>>>
>>>We should contact them. I'll send an email right now to the list.
>
> Even if distributing structures is not allowed, we are allowed to
> distribute the ZINC codes that we used(?) Given the codes, it's not too
> much of a problem to get the structures.
>
>>>>- Detailed comparison of the descriptor values: mean difference, max/min
>>>>difference, % compounds w/ <= 10% difference; anything else here?
>>>
>>>- median difference
>>>- ten most different compounds
>>>- list of possible causes of differences, e.g. not using the BO data :)
>>
>>Fine!
>
> Agreed. (Also plots should be included, see examples in the QA repo for
> the gravitational index and MI descriptors)
>
> -------------------------------------------------------------------
> Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
> -------------------------------------------------------------------
> Eureka!
> -- Archimedes
>
> _______________________________________________
> Cdk-qassurance mailing list
> Cdk...@li...
> https://lists.sourceforge.net/lists/listinfo/cdk-qassurance
|
From: Egon W. <e.w...@sc...> - 2006-01-06 23:48:45
|
On Saturday 07 January 2006 00:46, Uli Fechner wrote:
> @Rajarshi: You mentioned "Also plots should be included, see examples in
> the QA repo for the gravitational index and MI descriptors". Sorry, but
> I did not get to what you are referring..?

He was referring to the CDK CVS module 'cdk-qa'... That's where we put
scripts and results...

Egon

--
e.w...@sc...
PhD student on Molecular Representation in Chemometrics
Radboud University Nijmegen
Blog: http://chem-bla-ics.blogspot.com/
http://www.cac.science.ru.nl/people/egonw/
GPG: 1024D/D6336BA6
|
From: Rajarshi G. <rx...@ps...> - 2006-01-06 23:54:37
|
On Sat, 2006-01-07 at 00:46 +0100, Uli Fechner wrote:
> as there did not show up any further emails on the mailing list
> regarding our descriptor QA discussion, I would like to summarize the
> opinions and try to propose a final protocol that - hopefully - is to
> the satisfaction of all who participated in the discussion.

Sorry for the silence - just got back from vacation!

>
> - Take the complete set #3 of the ZINC database
> - calculate the CATS descriptor for all compounds in set #3

I think Egon raised this previously - is CATS going to be open source?

> - run a MaxMin selection to yield a diverse subset of 10k compounds
> - compute "reference" descriptor with MOE and/or DRAGON
> - compute CDK descriptor
> - compare the "reference" descriptor values with the one of CDK:
> mean/median difference, max/min difference, %compounds w/ <= 10%
> difference, show ten most different compounds and try to find the
> reason, possible causes of different descriptor values

All the above sounds OK. However, I'm still not entirely sure we need
10K structures as opposed to 1k structures, as I'm going to assume that
in 10K structures many of them will have similar features leading to
similar descriptor values (for a given descriptor)

But if the consensus is for 10K that's OK with me.

> @Rajarshi: You mentioned "Also plots should be included, see examples in
> the QA repo for the gravitational index and MI descriptors". Sorry, but
> I did not get to what you are referring..?

I had uploaded some validation results for the MI and grav. index
descriptors (compared to the ADAPT implementation) to the cdk-qa CVS
repository. I had included the RMSE values as well as the plots of CDK
value vs ADAPT value.

The stats you mention above certainly summarize differences between
descriptor implementations, but having plots would quickly allow
readers/users to identify distinct trends (for example the CDK SA
routine underestimates SA for larger molecules etc)

Another aspect for a validation writeup is factors that can affect
descriptor calculations (as this is where the difference between the CDK
and other implementations will arise, assuming that the actual
algorithms are correctly implemented).

Thus for example:

CPSA descriptors require charges & SA
BCUT/WHIM etc require eigenvalues
BCUT/WHIM can use electronegativities

and so on.

Thus I think that in addition to explaining the 10 most extreme
deviations, we also need to identify aspects of descriptor calculations
that will lead to general differences. At a first go these will be:

charge calculation
surface areas
electronegativities
atomic radii
eigenvalue decomposition (probably not too significant)

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
C Code. C Code Run. Run, Code, RUN! PLEASE!!!!
|
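The RMSE mentioned in this message, and the kind of size-dependent trend a plot would reveal, can also be checked numerically. The sketch below is illustrative only: `rmse` and `residual_slope` are hypothetical helper names, and using a size measure such as heavy-atom count on the x axis is an assumption, not something taken from the cdk-qa repository.

```python
from math import sqrt


def rmse(cdk, ref):
    """Root-mean-square error between paired descriptor values."""
    if len(cdk) != len(ref) or not cdk:
        raise ValueError("need two non-empty sequences of equal length")
    return sqrt(sum((c - r) ** 2 for c, r in zip(cdk, ref)) / len(cdk))


def residual_slope(size, cdk, ref):
    """Least-squares slope of (cdk - ref) against a size measure.

    A clearly negative slope suggests that the CDK value falls further
    below the reference value as molecules get larger.
    """
    if not (len(size) == len(cdk) == len(ref)) or not size:
        raise ValueError("need three non-empty sequences of equal length")
    resid = [c - r for c, r in zip(cdk, ref)]
    n = len(size)
    mean_x = sum(size) / n
    mean_y = sum(resid) / n
    sxx = sum((x - mean_x) ** 2 for x in size)
    if sxx == 0:
        raise ValueError("all size values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(size, resid))
    return sxy / sxx
```

A slope near zero together with a large RMSE would point to scattered, compound-specific disagreements, whereas a pronounced slope points to a systematic cause such as different atomic radii or charge models.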