Thread: [Cdk-qassurance] Re: Descriptor QA

I have moved this discussion to the cdk...@li... mailing list.

On Thursday 05 January 2006 14:15, Uli Fechner wrote:
> I would like to propose a protocol for descriptor QA. Please feel free
> to comment on this. As the deadline for the upcoming issue of CDKNews is
> getting close (January 15th), I would like to come to an agreement on a
> descriptor QA protocol by the end of this week.
>
> Dataset:
> - Take complete Set #3 of the ZINC dataset
> (http://blaster.docking.org/zinc/bysubset.shtml)
> - Run a MaxMin selection to yield a diverse subset of 100 000 compounds

How large should the subset be? I think 1000 is large enough.
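
For readers unfamiliar with MaxMin selection, here is a minimal Python sketch
of the greedy variant: start from one compound, then repeatedly add the
compound whose nearest already-selected neighbour is farthest away. It is an
illustration under stated assumptions, not CDK code. The names maxmin_select
and tanimoto_distance are hypothetical, fingerprints are assumed to be sets of
on-bit indices, and the random seed compound is only one possible starting
rule.

```python
import random


def tanimoto_distance(a, b):
    """Return 1 - Tanimoto similarity for two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # assumption: two empty fingerprints count as identical
    shared = len(a & b)
    return 1.0 - shared / (len(a) + len(b) - shared)


def maxmin_select(items, k, distance, seed=0):
    """Greedy MaxMin: return the indices of k mutually diverse items."""
    if k <= 0 or not items:
        return []
    k = min(k, len(items))
    # Assumption: the first compound is picked at random (fixed seed).
    first = random.Random(seed).randrange(len(items))
    selected = [first]
    # min_dist[i] is the distance from item i to its nearest selected item.
    min_dist = [distance(items[first], x) for x in items]
    min_dist[first] = -1.0  # -1 marks an item as already selected
    while len(selected) < k:
        # Pick the item that is farthest from everything selected so far.
        nxt = max(range(len(items)), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist[nxt] = -1.0
        for i, x in enumerate(items):
            if min_dist[i] >= 0.0:
                min_dist[i] = min(min_dist[i], distance(items[nxt], x))
    return selected


if __name__ == "__main__":
    fingerprints = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
    print(maxmin_select(fingerprints, 2, tanimoto_distance))
```

The cost is roughly k times N distance evaluations, so the subset size
matters far less for run time than the size of the starting set.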

> We need to choose a descriptor that is used for the maxmin selection.
> This descriptor is not related to the descriptor that is subject to QA!
> I propose to take
> - EITHER the fingerprinter of CDK
> - OR the CATS descriptor of our group here in Frankfurt
>
> Personally, I prefer the CATS descriptor, but even though its computation
> is published in detail, there is no public implementation available.
> Please feel free to comment on this.

I prefer an open descriptor. What's the publication? What's your estimate of
the time required to make an open source implementation for CATS?

> The ZINC website states that it is not allowed to re-distribute data
> that is downloaded from their website. In other words, we cannot put our
> CDK descriptor QA dataset on the CDK website! Does anyone know one of
> the ZINC guys, so that we can ask for their permission?

We should contact them. I'll send an email to the list right now.

> Descriptor validation:
> - Calculate descriptor X using the CDK implementation
> - Calculate descriptor X using "reference" implementation: MOE, DRAGON;
> any suggestions for another "reference" program?

Sounds good.

> - Detailed comparison of the descriptor values: mean difference, max/min
> difference, % compounds with <= 10% difference; anything else here?

- median difference
- ten most different compounds
- list of possible causes of differences, e.g. not using the BO data :)
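
The statistics listed above could be collected along these lines. This is a
rough Python sketch under stated assumptions rather than an agreed protocol:
the function name compare_descriptor is hypothetical, differences are taken
as CDK value minus reference value, the 10% threshold is measured relative to
the reference value, and a reference value of zero only counts as matching
when the CDK value is zero as well.

```python
from statistics import mean, median


def compare_descriptor(cdk, ref, names, tolerance=0.10, top_n=10):
    """Summarise how CDK descriptor values deviate from reference values."""
    if not (len(cdk) == len(ref) == len(names)):
        raise ValueError("cdk, ref and names must have the same length")
    if not cdk:
        raise ValueError("need at least one compound")

    # Signed difference per compound (assumption: CDK minus reference).
    diffs = [c - r for c, r in zip(cdk, ref)]

    def within(c, r):
        # Relative deviation, measured against the reference value.
        if r == 0:
            return c == 0
        return abs(c - r) / abs(r) <= tolerance

    n_within = sum(1 for c, r in zip(cdk, ref) if within(c, r))

    # Rank compounds by absolute difference, largest first.
    ranked = sorted(zip(names, diffs), key=lambda p: abs(p[1]), reverse=True)

    return {
        "mean_diff": mean(diffs),
        "median_diff": median(diffs),
        "min_diff": min(diffs),
        "max_diff": max(diffs),
        "pct_within_tolerance": 100.0 * n_within / len(cdk),
        "most_different": ranked[:top_n],
    }


if __name__ == "__main__":
    report = compare_descriptor(
        cdk=[1.0, 2.0, 3.0, 0.0],
        ref=[1.0, 2.5, 3.1, 0.0],
        names=["a", "b", "c", "d"],
    )
    for key, value in report.items():
        print(key, value)
```

The list of possible causes still has to be written by hand; the ranked
compounds are only the starting point for that.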

> I am very much looking forward to your comments!

Me too.

E.

-- 
Egon Willighagen
http://chem-bla-ics.blogspot.com/
