From: Rajarshi G. <rx...@ps...> - 2006-06-23 11:26:21
|
On Fri, 2006-06-23 at 11:08 +0200, Egon Willighagen wrote: > Hi all, > > Tobias and I are working on a test suite for CDK's IO classes, and I just > finished a test for the XYZReader/Writer combo. Like with all our QA > projects, it's using the ZINC db test files. > > For now, I've put the BeanShell script in cdk-qa/projects/060623-0001/. > Rajarshi, if I need to change the project number, please let me know. The number is fine ------------------------------------------------------------------- Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE ------------------------------------------------------------------- "whois awk?", sed Grep. |
From: Egon W. <ewi...@un...> - 2006-06-23 09:07:53
|
Hi all, Tobias and I are working on a test suite for CDK's IO classes, and I just finished a test for the XYZReader/Writer combo. Like with all our QA projects, it's using the ZINC db test files. For now, I've put the BeanShell script in cdk-qa/projects/060623-0001/. Rajarshi, if I need to change the project number, please let me know. Egon -- CUBIC blog: http://chem-bla-ics.blogspot.com/ |
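Editorial sketch: the round-trip idea behind a reader/writer test can be illustrated in a few lines. This is only a Python illustration of the principle, not the BeanShell script in cdk-qa and not CDK's XYZReader/XYZWriter; the helper names `read_xyz`, `write_xyz` and `roundtrip_ok` are hypothetical, and the XYZ layout assumed here is the common one (atom count, comment line, then one `element x y z` row per atom).

```python
def read_xyz(text):
    """Parse one XYZ frame into (comment, [(element, x, y, z), ...])."""
    lines = text.strip("\n").split("\n")
    count = int(lines[0].strip())
    comment = lines[1]
    atoms = []
    for line in lines[2:2 + count]:
        element, x, y, z = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    if len(atoms) != count:
        raise ValueError("atom count does not match header")
    return comment, atoms


def write_xyz(comment, atoms):
    """Serialize a frame back to XYZ text with six decimals per coordinate."""
    rows = [str(len(atoms)), comment]
    rows += [f"{el} {x:.6f} {y:.6f} {z:.6f}" for el, x, y, z in atoms]
    return "\n".join(rows) + "\n"


def roundtrip_ok(text, tol=1e-6):
    """Read, write, read again, and compare both parsed frames."""
    comment, atoms = read_xyz(text)
    comment2, atoms2 = read_xyz(write_xyz(comment, atoms))
    if comment != comment2 or len(atoms) != len(atoms2):
        return False
    for (e1, *c1), (e2, *c2) in zip(atoms, atoms2):
        if e1 != e2 or any(abs(a - b) > tol for a, b in zip(c1, c2)):
            return False
    return True
```

Running such a check over every structure in a test file would flag records that do not survive a write/read cycle; the tolerance is needed because the writer rounds coordinates.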
From: peter murray-r. <pm...@ca...> - 2006-01-23 15:49:03
|
At 09:07 23/01/2006, Christoph Steinbeck wrote: >>This looks very good. >>A general observation. In difficult cases I >>would prefer a diagram to be uninterpretable >>(or only interpretable with difficulty) than to >>be misinterpretable. Two common examples are >>(a) when two atoms are superimposed and (b) >>when a carbon has two ligands at 180 degrees. >>In the first instance can the atoms be slightly >>shifted even if the rings are messy? and in the >>second a bent bond, or a "C" or dot at the atom position. > >Peter, > >about a year or two ago, I've built in the function to resolve overlap. >Not sure if it was switched on in this case >(guess it should be on by default.) >Did you see any examples in the PDF (ZINC number?) where instance 1 happens? I thought I saw several. Mainly strange oxygen coordinations P. >Cheers, > >Chris > >-- >Priv. Doz. Dr. Christoph Steinbeck (c.s...@un...) >Head of the Research Group for Molecular Informatics >Cologne University BioInformatics Center (http://almost.cubic.uni-koeln.de) >Zülpicher Str. 47, 50674 Cologne >Tel: +49(0)221-470-7426 Fax: +49 (0) 221-470-7786 > >What is man but that lofty spirit - that sense of enterprise. >... Kirk, "I, Mudd," stardate 4513.3.. Peter Murray-Rust Unilever Centre for Molecular Sciences Informatics University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK +44-1223-763069 |
From: Christoph S. <c.s...@un...> - 2006-01-23 09:07:32
|
> This looks very good. > > A general observation. In difficult cases I would prefer a diagram to be > uninterpretable (or only interpretable with difficulty) than to be > misinterpretable. Two common examples are (a) when two atoms are > superimposed and (b) when a carbon has two ligands at 180 degrees. In > the first instance can the atoms be slightly shifted even if the rings > are messy? and in the second a bent bond, or a "C" or dot at the atom > position. Peter, about a year or two ago, I've built in the function to resolve overlap. Not sure if it was switched on in this case (guess it should be on by default.) Did you see any examples in the PDF (ZINC number?) where instance 1 happens? Cheers, Chris -- Priv. Doz. Dr. Christoph Steinbeck (c.s...@un...) Head of the Research Group for Molecular Informatics Cologne University BioInformatics Center (http://almost.cubic.uni-koeln.de) Zülpicher Str. 47, 50674 Cologne Tel: +49(0)221-470-7426 Fax: +49 (0) 221-470-7786 What is man but that lofty spirit - that sense of enterprise. ... Kirk, "I, Mudd," stardate 4513.3.. |
From: Rajarshi G. <rx...@ps...> - 2006-01-22 20:08:48
|
On Sun, 2006-01-22 at 17:39 +0100, Christoph Steinbeck wrote: > Egon, Rajarshi, > > this PDF is fantastic! > It really shows where the problems are. > I love this pretty new world with all this relevant structures (ZINC, Pubchem) > available for testing things! > As soon as I can, I'll go into fixing these things. Thanks for the fix. I've updated the code so that you can specify how many molecule columns should be generated. I also put up some examples. I have to say, the SDG is very neat! The output (3 structure column) looks very professional! ------------------------------------------------------------------- Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE ------------------------------------------------------------------- "I'd love to go out with you, but my favorite commercial is on TV." |
From: peter murray-r. <pm...@ca...> - 2006-01-22 17:35:14
|
At 16:39 22/01/2006, Christoph Steinbeck wrote: >Egon, Rajarshi, > >this PDF is fantastic! >It really shows where the problems are. >I love this pretty new world with all this >relevant structures (ZINC, Pubchem) available for testing things! >As soon as I can, I'll go into fixing these things. > >And kudos to Rajarshi for writing such a helpful piece of code. > >Cheers, > >Chris > >Egon Willighagen wrote: >>On Saturday 21 January 2006 20:04, Egon Willighagen wrote: >> >>>On Saturday 21 January 2006 16:40, Christoph Steinbeck wrote: >>>JCP 2.1.7 (in jcp21 branch) has a bug fix for this. It has not been ported >>>to HEAD yet. I'll try to port it right now, as I actually need 2D layout >>>working again in HEAD... >> >>I just applied the two line fix, and just >>committed this to CVS. I've converted the ZINC >>SDF test file with 1000 compounds to a PDF with >>table, using Rajarshi's draw2.java [1], and most compounds look rather good. >>One annoyance are the sulphates and phosphates >>which one would expect to be drawn with 90 >>degree angles, and the two double bonded oxygens to the sides. >>BTW, another good use of the ZINC test data set, see the cdk-qa ML. >>The result: >> http://www.woc.science.ru.nl/devel/egonw/zinc.pdf >>Egon >>1.http://blue.chem.psu.edu/~rajarshi/code/java/#draw2d This looks very good. A general observation. In difficult cases I would prefer a diagram to be uninterpretable (or only interpretable with difficulty) than to be misinterpretable. Two common examples are (a) when two atoms are superimposed and (b) when a carbon has two ligands at 180 degrees. In the first instance can the atoms be slightly shifted even if the rings are messy? and in the second a bent bond, or a "C" or dot at the atom position. P. >-- >Priv. Doz. Dr. Christoph Steinbeck (c.s...@un...) >Head of the Research Group for Molecular Informatics >Cologne University BioInformatics Center (http://almost.cubic.uni-koeln.de) >Zülpicher Str. 47, 50674 Cologne >Tel: +49(0)221-470-7426 Fax: +49 (0) 221-470-7786 > >What is man but that lofty spirit - that sense of enterprise. >... Kirk, "I, Mudd," stardate 4513.3.. > > >------------------------------------------------------- >This SF.net email is sponsored by: Splunk Inc. Do you grep through log files >for problems? Stop! Download the new AJAX search engine that makes >searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! >http://sel.as-us.falkag.net/sel?cmdk&kid3432&bid#0486&dat1642 >_______________________________________________ >Cdk-devel mailing list >Cdk...@li... >https://lists.sourceforge.net/lists/listinfo/cdk-devel Peter Murray-Rust Unilever Centre for Molecular Sciences Informatics University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK +44-1223-763069 |
From: Christoph S. <c.s...@un...> - 2006-01-22 16:39:39
|
Egon, Rajarshi, this PDF is fantastic! It really shows where the problems are. I love this pretty new world with all this relevant structures (ZINC, Pubchem) available for testing things! As soon as I can, I'll go into fixing these things. And kudos to Rajarshi for writing such a helpful piece of code. Cheers, Chris Egon Willighagen wrote: > On Saturday 21 January 2006 20:04, Egon Willighagen wrote: > >>On Saturday 21 January 2006 16:40, Christoph Steinbeck wrote: >>JCP 2.1.7 (in jcp21 branch) has a bug fix for this. It has not been ported >>to HEAD yet. I'll try to port it right now, as I actually need 2D layout >>working again in HEAD... > > > I just applied the two line fix, and just committed this to CVS. I've converted > the ZINC SDF test file with 1000 compounds to a PDF with table, using > Rajarshi's draw2.java [1], and most compounds look rather good. > > One annoyance are the sulphates and phosphates which one would expect to be > drawn with 90 degree angles, and the two double bonded oxygens to the sides. > BTW, another good use of the ZINC test data set, see the cdk-qa ML. > > The result: > > http://www.woc.science.ru.nl/devel/egonw/zinc.pdf > > Egon > > 1.http://blue.chem.psu.edu/~rajarshi/code/java/#draw2d > -- Priv. Doz. Dr. Christoph Steinbeck (c.s...@un...) Head of the Research Group for Molecular Informatics Cologne University BioInformatics Center (http://almost.cubic.uni-koeln.de) Zülpicher Str. 47, 50674 Cologne Tel: +49(0)221-470-7426 Fax: +49 (0) 221-470-7786 What is man but that lofty spirit - that sense of enterprise. ... Kirk, "I, Mudd," stardate 4513.3.. |
From: Egon W. <eg...@us...> - 2006-01-22 15:33:48
|
On Saturday 21 January 2006 20:04, Egon Willighagen wrote: > On Saturday 21 January 2006 16:40, Christoph Steinbeck wrote: > JCP 2.1.7 (in jcp21 branch) has a bug fix for this. It has not been ported > to HEAD yet. I'll try to port it right now, as I actually need 2D layout > working again in HEAD... I just applied the two line fix, and just committed this to CVS. I've converted the ZINC SDF test file with 1000 compounds to a PDF with table, using Rajarshi's draw2.java [1], and most compounds look rather good. One annoyance are the sulphates and phosphates which one would expect to be drawn with 90 degree angles, and the two double bonded oxygens to the sides. BTW, another good use of the ZINC test data set, see the cdk-qa ML. The result: http://www.woc.science.ru.nl/devel/egonw/zinc.pdf Egon 1.http://blue.chem.psu.edu/~rajarshi/code/java/#draw2d -- eg...@us... Blog: http://chem-bla-ics.blogspot.com/ GPG: 1024D/D6336BA6 |
From: Rajarshi G. <rx...@ps...> - 2006-01-19 21:46:54
|
On Thu, 2006-01-19 at 21:05 +0100, Uli Fechner wrote: > Rajarshi, could you please tell me the ZINC identifiers of the molecules > corina was not able to process? ZINC00033713 ZINC00644893 ZINC03138819 ZINC03618623 ------------------------------------------------------------------- Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE ------------------------------------------------------------------- Q: Why did the mathematician name his dog "Cauchy"? A: Because he left a residue at every pole. |
From: Uli F. <u.f...@ch...> - 2006-01-19 20:02:53
|
I will provide four datasets that contains 1000 structures (with hydrogens and with charges, w/o hydrogens and w/ charges, w/ hydrogens and w/o charges, w/o hydrogens and w/o charges). Unfortunately, this will take some days, because I am away from Friday to Monday. This will delay the writing of the two CDKNews articles that rely on the descriptor QA dataset. I hope that I can make it until the end of next week. Rajarshi, could you please tell me the ZINC identifiers of the molecules corina was not able to process? Uli Rajarshi Guha wrote: > On Thu, 2006-01-19 at 20:12 +0100, Egon Willighagen wrote: >> On Thursday 19 January 2006 20:08, Rajarshi Guha wrote: >>> On Thu, 2006-01-19 at 19:55 +0100, Egon Willighagen wrote: >>>> I'm glad to see that the test set has been added to CVS, and that new >>>> results get uploaded. >>>> >>>> I saw that Rajarshi removed a few molecules, that could not be converted >>>> into 3D models with Corina. So, I guess these (how many are there?) >>> 4 got dropped (serials 9, 253, 879, 963 from the original set that Uli >>> sent me) >>> >>>> should be >>>> replaced by new molecules. Uli, can you create a list of say, 100 back >>>> ups? >>>> >>>> BTW, I promised to report the list of ZINC ids to the ZINC-developers, so >>>> that they know which molecules we used. >>> projects/050501-0001/zinc_ids.txt in CVS contains the ID's of the 996 >>> molecules that remain >>> >>> BTW, looks like I can't (reliably) use ADAPT to generate comparison data >>> for some descriptors because it won't handle charged species. I recall >>> that DRAGON calculated a number of ADAPT descriptors but I don't have >>> access to it. I've placed the CDK generated data for 3 descriptors in >>> the CVS >> Are there many charged species? > > 607 have a M CHG entry > >> Or we should make a third alternative, one >> with hydrogens, but without charges. > > Might be a good idea - though if DRAGON can handle charged species I > don't think we need bother. 
I only faced this problem because ADAPT is > *old* and nobody has fiddled with the internal data structures for 10-15 > years > >> BTW, two weeks ago a learned that Dragon has trouble with molecules with more >> than 300 atoms :) How does this work with our test data set? > > We're good on atom count, max value is 81. I've attached histograms of > MW and atom count > > ------------------------------------------------------------------- > Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net> > GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE > ------------------------------------------------------------------- > All theoretical chemistry is really physics; and all theoretical > chemists > know it. > -- Richard P. Feynman > > > > ------------------------------------------------------------------------ > > > ------------------------------------------------------------------------ > |
From: Rajarshi G. <rx...@ps...> - 2006-01-19 19:20:28
|
On Thu, 2006-01-19 at 20:12 +0100, Egon Willighagen wrote: > On Thursday 19 January 2006 20:08, Rajarshi Guha wrote: > > On Thu, 2006-01-19 at 19:55 +0100, Egon Willighagen wrote: > > > I'm glad to see that the test set has been added to CVS, and that new > > > results get uploaded. > > > > > > I saw that Rajarshi removed a few molecules, that could not be converted > > > into 3D models with Corina. So, I guess these (how many are there?) > > > > 4 got dropped (serials 9, 253, 879, 963 from the original set that Uli > > sent me) > > > > > should be > > > replaced by new molecules. Uli, can you create a list of say, 100 back > > > ups? > > > > > > BTW, I promised to report the list of ZINC ids to the ZINC-developers, so > > > that they know which molecules we used. > > > > projects/050501-0001/zinc_ids.txt in CVS contains the ID's of the 996 > > molecules that remain > > > > BTW, looks like I can't (reliably) use ADAPT to generate comparison data > > for some descriptors because it won't handle charged species. I recall > > that DRAGON calculated a number of ADAPT descriptors but I don't have > > access to it. I've placed the CDK generated data for 3 descriptors in > > the CVS > > Are there many charged species? 607 have a M CHG entry > Or we should make a third alternative, one > with hydrogens, but without charges. Might be a good idea - though if DRAGON can handle charged species I don't think we need bother. I only faced this problem because ADAPT is *old* and nobody has fiddled with the internal data structures for 10-15 years > BTW, two weeks ago a learned that Dragon has trouble with molecules with more > than 300 atoms :) How does this work with our test data set? We're good on atom count, max value is 81. 
I've attached histograms of MW and atom count ------------------------------------------------------------------- Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE ------------------------------------------------------------------- All theoretical chemistry is really physics; and all theoretical chemists know it. -- Richard P. Feynman |
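Editorial sketch: numbers such as "607 have a M CHG entry" and "max value is 81" can be checked with a short scan of the SD file. The Python below is an illustration only, not the script used in the thread. It assumes V2000 molfile records, where the counts line is the fourth line of each record with the atom count in its first three columns, and where charges are listed on `M  CHG` property lines; the function names `sdf_records` and `summarize` are hypothetical.

```python
def sdf_records(text):
    """Yield each molecule record of an SD file as a list of lines."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record:
                yield record
            record = []
        else:
            record.append(line)
    if any(l.strip() for l in record):
        yield record  # tolerate a missing final $$$$ delimiter


def summarize(text):
    """Return (number of records, records with M  CHG, largest atom count)."""
    n_records = n_charged = max_atoms = 0
    for record in sdf_records(text):
        n_records += 1
        # V2000 counts line: 4th line of the record, atom count in columns 1-3
        max_atoms = max(max_atoms, int(record[3][:3]))
        if any(line.startswith("M  CHG") for line in record):
            n_charged += 1
    return n_records, n_charged, max_atoms
```

The atom count read this way includes only atoms that are explicitly present in the file, so the result depends on whether hydrogens were added. Charges stored solely in the atom-block charge column, without an `M  CHG` line, would not be counted by this sketch.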
From: Egon W. <e.w...@sc...> - 2006-01-19 19:12:16
|
On Thursday 19 January 2006 20:08, Rajarshi Guha wrote: > On Thu, 2006-01-19 at 19:55 +0100, Egon Willighagen wrote: > > I'm glad to see that the test set has been added to CVS, and that new > > results get uploaded. > > > > I saw that Rajarshi removed a few molecules, that could not be converted > > into 3D models with Corina. So, I guess these (how many are there?) > > 4 got dropped (serials 9, 253, 879, 963 from the original set that Uli > sent me) > > > should be > > replaced by new molecules. Uli, can you create a list of say, 100 back > > ups? > > > > BTW, I promised to report the list of ZINC ids to the ZINC-developers, so > > that they know which molecules we used. > > projects/050501-0001/zinc_ids.txt in CVS contains the ID's of the 996 > molecules that remain > > BTW, looks like I can't (reliably) use ADAPT to generate comparison data > for some descriptors because it won't handle charged species. I recall > that DRAGON calculated a number of ADAPT descriptors but I don't have > access to it. I've placed the CDK generated data for 3 descriptors in > the CVS Are there many charged species? Or we should make a third alternative, one with hydrogens, but without charges. BTW, two weeks ago I learned that Dragon has trouble with molecules with more than 300 atoms :) How does this work with our test data set? Egon -- e.w...@sc... PhD student on Molecular Representation in Chemometrics Radboud University Nijmegen Blog: http://chem-bla-ics.blogspot.com/ http://www.cac.science.ru.nl/people/egonw/ GPG: 1024D/D6336BA6 |
From: Rajarshi G. <rx...@ps...> - 2006-01-19 19:07:52
|
On Thu, 2006-01-19 at 19:55 +0100, Egon Willighagen wrote: > Hi Uli/Rajarshi, > > I'm glad to see that the test set has been added to CVS, and that new results > get uploaded. > > I saw that Rajarshi removed a few molecules, that could not be converted into > 3D models with Corina. So, I guess these (how many are there?) 4 got dropped (serials 9, 253, 879, 963 from the original set that Uli sent me) > should be > replaced by new molecules. Uli, can you create a list of say, 100 back ups? > > BTW, I promised to report the list of ZINC ids to the ZINC-developers, so that > they know which molecules we used. projects/050501-0001/zinc_ids.txt in CVS contains the ID's of the 996 molecules that remain BTW, looks like I can't (reliably) use ADAPT to generate comparison data for some descriptors because it won't handle charged species. I recall that DRAGON calculated a number of ADAPT descriptors but I don't have access to it. I've placed the CDK generated data for 3 descriptors in the CVS ------------------------------------------------------------------- Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE ------------------------------------------------------------------- A sine curve goes off to infinity, or at least the end of the blackboard. -- Prof. Steiner |
From: Egon W. <e.w...@sc...> - 2006-01-19 18:56:33
|
Hi Uli/Rajarshi, I'm glad to see that the test set has been added to CVS, and that new results get uploaded. I saw that Rajarshi removed a few molecules, that could not be converted into 3D models with Corina. So, I guess these (how many are there?) should be replaced by new molecules. Uli, can you create a list of say, 100 back ups? BTW, I promised to report the list of ZINC ids to the ZINC-developers, so that they know which molecules we used. Egon -- e.w...@sc... PhD student on Molecular Representation in Chemometrics Radboud University Nijmegen Blog: http://chem-bla-ics.blogspot.com/ http://www.cac.science.ru.nl/people/egonw/ GPG: 1024D/D6336BA6 |
From: Rajarshi G. <rx...@ps...> - 2006-01-07 06:05:04
|
On Sat, 2006-01-07 at 03:52 +0100, Uli Fechner wrote: > >>- Take the complete set #3 of the ZINC database > >>- calculate the CATS descriptor for all compounds in set #3 > > > > > > I think Egon raised this previously - is CATS going to be open source? > > Most likely yes, even though I cannot say when. This is the only > non-open-source program and it does not make me real happy either. But I > do not have a better idea; so even though this might not be the best way > to go it is the best I came up with. Given that we will be using this descriptor just for selection of a dataset, which will then be publicly available, I think it won't be too much of a problem ------------------------------------------------------------------- Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE ------------------------------------------------------------------- The Heineken Uncertainty Principle: You can never be sure how many beers you had last night. |
From: Uli F. <u.f...@ch...> - 2006-01-07 02:48:58
|
>>- Take the complete set #3 of the ZINC database >>- calculate the CATS descriptor for all compounds in set #3 > > > I think Egon raised this previously - is CATS going to be open source? Most likely yes, even though I cannot say when. This is the only non-open-source program and it does not make me real happy either. But I do not have a better idea; so even though this might not be the best way to go it is the best I came up with. >>- run a MaxMin selection to yield a diverse subset of 10k compounds >>- compute "reference" descriptor with MOE and/or DRAGON >>- compute CDK descriptor >>- compare the "reference" descriptor values with the one of CDK: >>mean/median difference, max/min difference, %compounds w/ <= 10% >>difference, show ten most different compounds and try to find the >>reason, possible causes of different descriptor values > > > All the above sounds OK. However, I'm still not entirely sure we need > 10K structures as opposed to 1k structures, as I'm going to assume that > in 10K structures many of them will have similar features leading to > similar descriptor values (for a given descriptor) > > But if the consensus is for 10K thats OK with me. I agree with you. Let's just take 1k structures. Makes calculations faster and plotting feasible. >>@Rajarhsi: You mentioned "Also plots should be included, see examples in >>the QA repo for the gravitational index and MI descriptors". Sorry, but >>I did not get to what you are referring..? > > > I had uploaded some validation results for the MI and grav. index > descriptors (compared to the ADAPT implementation) to the cdk-qa CVS > repository. I had included the RMSE values as well as the plots of CDK > value vs ADAPT value. Ah, just had a look at that. 
> The stats you mention above certainly summarize differences between > descriptor implementations, but having plots would quickly allow > readers/users to identify distinct trends (for example the CDK SA > routine underestimates SA for larger molecules etc) Having just looked at the plots I fully agree; thus the reduction to 1k structures (see above). > Another aspect for a validation writeup is factors that can affect > descriptor calculations (as this is where the difference between the CDK > and other implementations will arise, assuming that the actual > algorithms are correctly implemented). > > Thus for example: > > CPSA descriptors require charges & SA > BCUT/WHIM etc require eigenvalues > BCUT/WHIM can use electronegativities > > and so on. > > Thus I think that in addition to explaining the 10 most extreme > deviations, we also need to identify aspects of decsriptor calculations > that will lead to general differences. At a first go these will be: > > charge calculation > surface areas > electronegativities > atomic radii > eigenvalue decomposition (probably not too significant) Ack. I will consider this. Uli |
From: Rajarshi G. <rx...@ps...> - 2006-01-06 23:54:37
|
On Sat, 2006-01-07 at 00:46 +0100, Uli Fechner wrote: > as there did not show up any further emails on the mailing list > regarding our descriptor QA discussion, I would like to summarize the > opinions and try to propose a final protocol that - hopefully - is to > the satisfaction of all who participated in the discussion. Sorry for the silence - just got back from vacation! > > - Take the complete set #3 of the ZINC database > - calculate the CATS descriptor for all compounds in set #3 I think Egon raised this previously - is CATS going to be open source? > - run a MaxMin selection to yield a diverse subset of 10k compounds > - compute "reference" descriptor with MOE and/or DRAGON > - compute CDK descriptor > - compare the "reference" descriptor values with the one of CDK: > mean/median difference, max/min difference, %compounds w/ <= 10% > difference, show ten most different compounds and try to find the > reason, possible causes of different descriptor values All the above sounds OK. However, I'm still not entirely sure we need 10K structures as opposed to 1k structures, as I'm going to assume that in 10K structures many of them will have similar features leading to similar descriptor values (for a given descriptor) But if the consensus is for 10K thats OK with me. > @Rajarhsi: You mentioned "Also plots should be included, see examples in > the QA repo for the gravitational index and MI descriptors". Sorry, but > I did not get to what you are referring..? I had uploaded some validation results for the MI and grav. index descriptors (compared to the ADAPT implementation) to the cdk-qa CVS repository. I had included the RMSE values as well as the plots of CDK value vs ADAPT value. 
The stats you mention above certainly summarize differences between descriptor implementations, but having plots would quickly allow readers/users to identify distinct trends (for example the CDK SA routine underestimates SA for larger molecules etc) Another aspect for a validation writeup is factors that can affect descriptor calculations (as this is where the difference between the CDK and other implementations will arise, assuming that the actual algorithms are correctly implemented). Thus for example: CPSA descriptors require charges & SA; BCUT/WHIM etc require eigenvalues; BCUT/WHIM can use electronegativities; and so on. Thus I think that in addition to explaining the 10 most extreme deviations, we also need to identify aspects of descriptor calculations that will lead to general differences. At a first go these will be: charge calculation, surface areas, electronegativities, atomic radii, eigenvalue decomposition (probably not too significant) ------------------------------------------------------------------- Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE ------------------------------------------------------------------- C Code. C Code Run. Run, Code, RUN! PLEASE!!!! |
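Editorial sketch: the RMSE mentioned above, and the kind of size-dependent trend a plot would reveal, can also be expressed numerically. The Python below is a minimal illustration and not the code in the cdk-qa repository; `rmse` and `signed_bias_by_size` are hypothetical names, and binning by a size measure such as atom count is an assumption made here for illustration.

```python
import math


def rmse(cdk, ref):
    """Root mean square error between CDK values and reference values."""
    return math.sqrt(sum((c - r) ** 2 for c, r in zip(cdk, ref)) / len(cdk))


def signed_bias_by_size(cdk, ref, sizes, edges):
    """Mean of (cdk - ref) per size bin; bins are [edges[i], edges[i+1])."""
    bias = {}
    for lo, hi in zip(edges, edges[1:]):
        diffs = [c - r for c, r, s in zip(cdk, ref, sizes) if lo <= s < hi]
        # None marks an empty bin rather than pretending the bias is zero
        bias[(lo, hi)] = sum(diffs) / len(diffs) if diffs else None
    return bias
```

A bias that grows more negative from bin to bin would correspond to the "underestimates for larger molecules" pattern described in the message, which a single RMSE value cannot show.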
From: Egon W. <e.w...@sc...> - 2006-01-06 23:48:45
|
On Saturday 07 January 2006 00:46, Uli Fechner wrote: > @Rajarshi: You mentioned "Also plots should be included, see examples in > the QA repo for the gravitational index and MI descriptors". Sorry, but > I did not get to what you are referring..? He was referring to the CDK CVS module 'cdk-qa'... That's where we put scripts and results... Egon -- e.w...@sc... PhD student on Molecular Representation in Chemometrics Radboud University Nijmegen Blog: http://chem-bla-ics.blogspot.com/ http://www.cac.science.ru.nl/people/egonw/ GPG: 1024D/D6336BA6 |
From: Uli F. <u.f...@ch...> - 2006-01-06 23:43:12
|
Hello together, as there did not show up any further emails on the mailing list regarding our descriptor QA discussion, I would like to summarize the opinions and try to propose a final protocol that - hopefully - is to the satisfaction of all who participated in the discussion. - Take the complete set #3 of the ZINC database - calculate the CATS descriptor for all compounds in set #3 - run a MaxMin selection to yield a diverse subset of 10k compounds - compute "reference" descriptor with MOE and/or DRAGON - compute CDK descriptor - compare the "reference" descriptor values with the one of CDK: mean/median difference, max/min difference, %compounds w/ <= 10% difference, show ten most different compounds and try to find the reason, possible causes of different descriptor values @Rajarhsi: You mentioned "Also plots should be included, see examples in the QA repo for the gravitational index and MI descriptors". Sorry, but I did not get to what you are referring..? Please give a short feedback that the above is acceptable for you or state what you would like to see instead. Uli Rajarshi Guha wrote: > On Thu, 2006-01-05 at 14:55 +0100, Uli Fechner wrote: > >>>>I would like to propose a protocol for descriptor QA. Please feel free >>>>to comment on this. As the deadline for the upcoming issue of CDKNews is >>>>getting close (January 15th), I would like to come to an agreement on a >>>>descriptor QA protocol until the end of this week. >>>> >>>>Dataset: >>>>- Take complete Set #3 of the ZINC dataset >>>>(http://blaster.docking.org/zinc/bysubset.shtml) >>>>- run a MaxMin selection to yield a diverse subset of 100 000 compounds >>> >>> >>>How large should the subset be? I think 1000 is large enough. >> >>Maybe we can agree on something in between: 10 000 compounds should be >>large enough to yield meaningful results and small enough to allow for >>fast computation times. > > > I would go with a smaller number. 
My reasoning is that most descriptors > have edge cases which need to be tested. Clearly increasing the size of > the molecule pool will increase the probability that all cases are > tested - but how much more diverse will the 10K pool be compared to the > 1K pool? > > Either way 10K or 1K is not a significant problem since its just a > matter of computation time (though plotting 10K points is not a great > idea!) > > >>>I prefer an open descriptor. What's the publication? What's your estimate of >>>the time required to make an open source implementation for CATS? >> >>I prefer on open descriptor, too. Most likely, CATS will go open source >>somewhen (even though I cannot tell anyhting about the timeframe). I >>already have a CDK-dependent implementation. If there are reasonable >>suggestions other than CATS I would like to follow them. > > > I agree on an open descriptor. What about BCUT's? If this were used > then, these would have to be validated prior to using it - leading to a > chicken and egg problem :( > > Alternatively fingerprints would be useful (but I seem to recall reading > that they can be biased towards smaller compounds when doing library > selection using Tanimoto similarity) > > >>>>The ZINC website states that it is not allowed to re-distribute data >>>>that is downloaded from their website. In other words, we cannot put our >>>>CDK descriptor QA dataset on the CDK website! Does anyone know one of >>>>the ZINC guys to ask for their permission? >>> >>> >>>We should contact them. I'll send an email right now to the list. > > > Even if distributing structures is not allowed, we are allowed to > distribute the ZINC codes that we used(?) Given the codes, its not too > much of a problem to get the structures > > >>>>- Detailed comparison of the descriptor values: mean difference, max/min >>>>difference, % compounds w/ <= 10% difference; anything else here? 
>>> >>> >>>- median difference >>>- ten most different compounds >>>- list of possible causes of differences, e.g. not using the BO data :) >> >>Fine! > > > Agreed. (Also plots should be included, see examples in the QA repo for > the gravitational index and MI descriptors) > > ------------------------------------------------------------------- > Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net> > GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE > ------------------------------------------------------------------- > Eureka! > -- Archimedes > > > > > ------------------------------------------------------- > This SF.net email is sponsored by: Splunk Inc. Do you grep through log files > for problems? Stop! Download the new AJAX search engine that makes > searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! > http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click > _______________________________________________ > Cdk-qassurance mailing list > Cdk...@li... > https://lists.sourceforge.net/lists/listinfo/cdk-qassurance > |
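Editorial sketch: the comparison step of the proposed protocol (mean/median difference, max/min difference, percentage of compounds within 10%, ten most different compounds) could be computed roughly as follows. This Python is an illustration under stated assumptions, not an agreed implementation: the function name `compare` is hypothetical, and "within 10%" is interpreted here as a relative difference measured against the reference value, which the thread does not spell out.

```python
from statistics import mean, median


def compare(ids, cdk, ref, rel_tol=0.10, top_n=10):
    """Summarize differences between CDK and reference descriptor values."""
    diffs = [c - r for c, r in zip(cdk, ref)]
    # Assumption: "within 10%" means |cdk - ref| <= 10% of |ref|
    within = sum(1 for c, r in zip(cdk, ref) if abs(c - r) <= rel_tol * abs(r))
    # Rank compounds by absolute difference, largest first
    ranked = sorted(zip(ids, diffs), key=lambda p: abs(p[1]), reverse=True)
    return {
        "mean_diff": mean(diffs),
        "median_diff": median(diffs),
        "max_diff": max(diffs),
        "min_diff": min(diffs),
        "pct_within": 100.0 * within / len(diffs),
        "most_different": [i for i, _ in ranked[:top_n]],
    }
```

The identifiers returned under `most_different` are the starting point for the manual step in the protocol, namely looking at those structures to find the cause of the deviation.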
From: Rajarshi G. <rx...@ps...> - 2006-01-05 16:27:57
|
On Thu, 2006-01-05 at 14:55 +0100, Uli Fechner wrote:
> >>I would like to propose a protocol for descriptor QA. Please feel free
> >>to comment on this. As the deadline for the upcoming issue of CDKNews is
> >>getting close (January 15th), I would like to come to an agreement on a
> >>descriptor QA protocol by the end of this week.
> >>
> >>Dataset:
> >>- Take complete Set #3 of the ZINC dataset
> >>(http://blaster.docking.org/zinc/bysubset.shtml)
> >>- run a MaxMin selection to yield a diverse subset of 100 000 compounds
> >
> > How large should the subset be? I think 1000 is large enough.
>
> Maybe we can agree on something in between: 10 000 compounds should be
> large enough to yield meaningful results and small enough to allow for
> fast computation times.

I would go with a smaller number. My reasoning is that most descriptors
have edge cases which need to be tested. Clearly, increasing the size of
the molecule pool will increase the probability that all cases are
tested, but how much more diverse will the 10K pool be compared to the
1K pool?

Either way, 10K or 1K is not a significant problem since it's just a
matter of computation time (though plotting 10K points is not a great
idea!)

> > I prefer an open descriptor. What's the publication? What's your estimate of
> > the time required to make an open source implementation for CATS?
>
> I prefer an open descriptor, too. Most likely, CATS will go open source
> at some point (even though I cannot say anything about the timeframe). I
> already have a CDK-dependent implementation. If there are reasonable
> suggestions other than CATS I would like to follow them.

I agree on an open descriptor. What about BCUTs? If these were used,
they would have to be validated first, leading to a chicken-and-egg
problem :(

Alternatively, fingerprints would be useful (but I seem to recall reading
that they can be biased towards smaller compounds when doing library
selection using Tanimoto similarity).

> >>The ZINC website states that it is not allowed to re-distribute data
> >>that is downloaded from their website. In other words, we cannot put our
> >>CDK descriptor QA dataset on the CDK website! Does anyone know one of
> >>the ZINC guys to ask for their permission?
> >
> > We should contact them. I'll send an email right now to the list.

Even if distributing structures is not allowed, we are allowed to
distribute the ZINC codes that we used(?) Given the codes, it's not too
much of a problem to get the structures.

> >>- Detailed comparison of the descriptor values: mean difference, max/min
> >>difference, % compounds w/ <= 10% difference; anything else here?
> >
> > - median difference
> > - ten most different compounds
> > - list of possible causes of differences, e.g. not using the BO data :)
>
> Fine!

Agreed. (Also, plots should be included; see the examples in the QA repo
for the gravitational index and MI descriptors.)

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
Eureka!
-- Archimedes
|
From: Uli F. <u.f...@ch...> - 2006-01-05 13:51:48
|
>>I would like to propose a protocol for descriptor QA. Please feel free
>>to comment on this. As the deadline for the upcoming issue of CDKNews is
>>getting close (January 15th), I would like to come to an agreement on a
>>descriptor QA protocol by the end of this week.
>>
>>Dataset:
>>- Take complete Set #3 of the ZINC dataset
>>(http://blaster.docking.org/zinc/bysubset.shtml)
>>- run a MaxMin selection to yield a diverse subset of 100 000 compounds
>
> How large should the subset be? I think 1000 is large enough.

Maybe we can agree on something in between: 10 000 compounds should be
large enough to yield meaningful results and small enough to allow for
fast computation times.

>>We need to choose a descriptor that is used for the maxmin selection.
>>This descriptor is not related to the descriptor that is subject to QA!
>>I propose to take
>>- EITHER the fingerprinter of CDK
>>- OR the CATS descriptor of our group here in Frankfurt
>>
>>Personally, I prefer the CATS descriptor but even though its computation
>>is published in detail there is no public implementation available.
>>Please feel free to comment on this.
>
> I prefer an open descriptor. What's the publication? What's your estimate of
> the time required to make an open source implementation for CATS?

I prefer an open descriptor, too. Most likely, CATS will go open source
at some point (even though I cannot say anything about the timeframe). I
already have a CDK-dependent implementation. If there are reasonable
suggestions other than CATS I would like to follow them.

>>The ZINC website states that it is not allowed to re-distribute data
>>that is downloaded from their website. In other words, we cannot put our
>>CDK descriptor QA dataset on the CDK website! Does anyone know one of
>>the ZINC guys to ask for their permission?
>
> We should contact them. I'll send an email right now to the list.

Good.

>>Descriptor validation:
>>- Calculate descriptor X using the CDK implementation
>>- Calculate descriptor X using "reference" implementation: MOE, DRAGON;
>>any suggestions for another "reference" program?
>
> Sounds good.
>
>>- Detailed comparison of the descriptor values: mean difference, max/min
>>difference, % compounds w/ <= 10% difference; anything else here?
>
> - median difference
> - ten most different compounds
> - list of possible causes of differences, e.g. not using the BO data :)

Fine!

Uli
|
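The MaxMin selection proposed for the dataset step can be sketched roughly as follows. This is only a minimal illustration, not the CDK fingerprinter or the CATS descriptor: it assumes each compound is already represented as a set of "on" bit positions, and it assumes Tanimoto distance as the dissimilarity measure, which the protocol itself does not fix. The names `tanimoto_distance` and `maxmin_select` are made up for this sketch.

```python
from typing import FrozenSet, List, Sequence


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Return 1 - Tanimoto similarity for fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as identical
    return 1.0 - len(a & b) / union


def maxmin_select(fps: Sequence[FrozenSet[int]], k: int, seed_index: int = 0) -> List[int]:
    """Greedy MaxMin: repeatedly pick the candidate whose nearest already
    selected compound is farthest away. Returns indices into fps."""
    if not 0 < k <= len(fps):
        raise ValueError("k must be between 1 and len(fps)")
    selected = [seed_index]
    # min_dist[i] holds the distance from candidate i to its closest selected compound
    min_dist = [tanimoto_distance(fp, fps[seed_index]) for fp in fps]
    min_dist[seed_index] = -1.0  # negative value marks "already selected"
    while len(selected) < k:
        nxt = max(range(len(fps)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, fp in enumerate(fps):
            if min_dist[i] >= 0.0:
                min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[nxt]))
        min_dist[nxt] = -1.0
    return selected


# Toy fingerprints, invented purely for illustration
toy_fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 3, 4}),
    frozenset({7, 8, 9}),
    frozenset({1, 2}),
]
picked = maxmin_select(toy_fps, 3)
```

This naive version costs on the order of k times n distance evaluations, so the choice between a 1000 and a 10 000 compound subset mostly affects run time, as noted in the thread.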
From: Egon W. <e.w...@sc...> - 2006-01-05 13:25:14
|
Moved discussion to the cdk...@li... ML.

On Thursday 05 January 2006 14:15, Uli Fechner wrote:
> I would like to propose a protocol for descriptor QA. Please feel free
> to comment on this. As the deadline for the upcoming issue of CDKNews is
> getting close (January 15th), I would like to come to an agreement on a
> descriptor QA protocol by the end of this week.
>
> Dataset:
> - Take complete Set #3 of the ZINC dataset
> (http://blaster.docking.org/zinc/bysubset.shtml)
> - run a MaxMin selection to yield a diverse subset of 100 000 compounds

How large should the subset be? I think 1000 is large enough.

> We need to choose a descriptor that is used for the maxmin selection.
> This descriptor is not related to the descriptor that is subject to QA!
> I propose to take
> - EITHER the fingerprinter of CDK
> - OR the CATS descriptor of our group here in Frankfurt
>
> Personally, I prefer the CATS descriptor but even though its computation
> is published in detail there is no public implementation available.
> Please feel free to comment on this.

I prefer an open descriptor. What's the publication? What's your estimate of
the time required to make an open source implementation for CATS?

> The ZINC website states that it is not allowed to re-distribute data
> that is downloaded from their website. In other words, we cannot put our
> CDK descriptor QA dataset on the CDK website! Does anyone know one of
> the ZINC guys to ask for their permission?

We should contact them. I'll send an email right now to the list.

> Descriptor validation:
> - Calculate descriptor X using the CDK implementation
> - Calculate descriptor X using "reference" implementation: MOE, DRAGON;
> any suggestions for another "reference" program?

Sounds good.

> - Detailed comparison of the descriptor values: mean difference, max/min
> difference, % compounds w/ <= 10% difference; anything else here?

- median difference
- ten most different compounds
- list of possible causes of differences, e.g. not using the BO data :)

> I am very much looking forward to your comments!

Me too.

E.

--
Egon Willighagen
http://chem-bla-ics.blogspot.com/
|
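The comparison statistics listed in this message (mean, median, max/min difference, percentage of compounds within 10%, and the most different compounds) could be computed along these lines. This is a hedged sketch only: it assumes descriptor values are available as dictionaries keyed by compound identifier, that "difference" means absolute difference, and that the 10% criterion is taken relative to the reference value. None of these choices is settled in the thread, and `compare_descriptor` is a hypothetical name.

```python
from statistics import mean, median
from typing import Dict


def compare_descriptor(cdk: Dict[str, float], ref: Dict[str, float],
                       tolerance: float = 0.10, top_n: int = 10) -> Dict[str, object]:
    """Summarise how CDK values differ from a reference implementation."""
    ids = sorted(set(cdk) & set(ref))  # only compounds present in both result sets
    if not ids:
        raise ValueError("no compounds in common")
    abs_diff = {i: abs(cdk[i] - ref[i]) for i in ids}

    within = 0
    for i in ids:
        if ref[i] != 0:
            within += abs_diff[i] / abs(ref[i]) <= tolerance
        else:
            # assumption: with a zero reference value only an exact match counts
            within += abs_diff[i] == 0

    ranked = sorted(ids, key=lambda i: abs_diff[i], reverse=True)
    values = list(abs_diff.values())
    return {
        "mean_abs_diff": mean(values),
        "median_abs_diff": median(values),
        "max_abs_diff": max(values),
        "min_abs_diff": min(values),
        "pct_within_tolerance": 100.0 * within / len(ids),
        "most_different": ranked[:top_n],
    }


# Invented values and identifiers, for illustration only
cdk_values = {"cmpd_a": 1.0, "cmpd_b": 2.0, "cmpd_c": 3.0, "cmpd_d": 5.0}
ref_values = {"cmpd_a": 1.0, "cmpd_b": 2.1, "cmpd_c": 4.0, "cmpd_d": 5.0}
summary = compare_descriptor(cdk_values, ref_values, top_n=2)
```

Listing the most different compounds by identifier makes it easier to track down causes such as the bond-order issue mentioned above.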
From: Rajarshi G. <rx...@ps...> - 2005-12-16 18:33:28
|
On Fri, 2005-12-16 at 16:51 +0100, Egon Willighagen wrote:
> On Friday 16 December 2005 16:30, Rajarshi Guha wrote:
> > I have validated the moment of inertia and gravitational index
> > descriptors against ADAPT.
>
> Great, thanx!
>
> > Would it be OK to just send you the table of results (RMSE, and RMSE
> > normalized by range) or should I write a separate article?
>
> Sure. Make a subdir in CVS, and please add a PDF/EPS/vector graphics of the
> y_pred/y_real plot too. Give them clear file names, or use an index file
> otherwise.

Added the following for the grav index and MI descriptors:

* data files
* result summary
* individual PDF plots

Also updated the calcDescriptor.jy script to be more useful and easier
to handle. Similarly for the calcSA.jy script (which has actually become
a general purpose script to eval SA if anybody needs it).

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
"355/113 -- Not the famous irrational number PI, but an incredible
simulation!"
|
From: Egon W. <e.w...@sc...> - 2005-12-16 15:51:50
|
On Friday 16 December 2005 16:30, Rajarshi Guha wrote:
> I have validated the moment of inertia and gravitational index
> descriptors against ADAPT.

Great, thanx!

> Would it be OK to just send you the table of results (RMSE, and RMSE
> normalized by range) or should I write a separate article?

Sure. Make a subdir in CVS, and please add a PDF/EPS/vector graphics of the
y_pred/y_real plot too. Give them clear file names, or use an index file
otherwise.

> Also for the SA validation I'm thinking that a separate article would be
> good.

The deadline for the next issue is Jan 16.

Egon

--
Egon Willighagen
http://chem-bla-ics.blogspot.com/
|
From: Rajarshi G. <rx...@ps...> - 2005-12-16 15:29:48
|
On Fri, 2005-12-16 at 08:10 +0100, ten...@gm... wrote:
> hi,
> I have validated the CDK XlogP descriptor against the program XlogP and MOE,
> looks fine, article for next CDKNews is written. Uli wanted to validate
> the TPSA descriptor, I don't know if he could do something. Otherwise I will
> also do this, I will need this descriptor and I can compare it to the
> Pipeline Pilot one (still no Java running in PipelinePilot because one needs
> a server version for that, which I will get next year (2 weeks)).

I have validated the moment of inertia and gravitational index
descriptors against ADAPT.

Would it be OK to just send you the table of results (RMSE, and RMSE
normalized by range) or should I write a separate article?

Also for the SA validation I'm thinking that a separate article would be
good.

Comments?

-------------------------------------------------------------------
Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
All power corrupts, but we need electricity.
|
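The two summary figures mentioned in this message, RMSE and RMSE normalized by range, can be illustrated with a short sketch. It assumes the range is taken over the reference values (the message does not say which range was used), and `rmse_and_normalized` is a hypothetical helper, not part of the calcDescriptor.jy script.

```python
import math
from typing import Sequence, Tuple


def rmse_and_normalized(cdk_vals: Sequence[float],
                        ref_vals: Sequence[float]) -> Tuple[float, float]:
    """Return (RMSE, RMSE divided by the range of the reference values)."""
    if len(cdk_vals) != len(ref_vals) or not cdk_vals:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(c - r) ** 2 for c, r in zip(cdk_vals, ref_vals)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    value_range = max(ref_vals) - min(ref_vals)  # assumption: range of the reference values
    if value_range == 0:
        raise ValueError("reference values are constant, range is zero")
    return rmse, rmse / value_range


# Invented numbers, for illustration only
rmse, nrmse = rmse_and_normalized([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

Dividing by the range makes the error comparable across descriptors whose values live on very different scales.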