Re: [Cdk-devel] QSAR


At 09:52 17/04/2004 +0200, Joerg Wegner wrote:
>Dear Nina Nikolova,
>Dear All,
>
>please reply also to the JOELib mailing list and ... i've already
>published three papers about QSAR and our group has it's main focus on
>data mining and optimization algorithms, so i think i've some experience
>in this area, too.
>http://www-ra.informatik.uni-tuebingen.de/

It seems there is general agreement that an SF project in this area is 
valuable and I'll make a few comments which I hope are helpful. Please 
ignore if they aren't.

A. Current QSAR practice has severe problems. They include:
- almost all codes are closed. Many are not free.
- it is impossible to repeat any experiment. Therefore QSAR ceases to be 
scientific but relies on reputation, trust and power
- the objects used are badly designed, irreproducible and have variable 
interpretation
- data selection is arbitrary. There are few (no?) standard test sets. It 
is impossible to verify whether data have been modified consciously or 
unconsciously to increase apparent success
- algorithms are closed, even if the data are well defined.

B. The mainstream QSAR community is not taking effective steps to remedy 
the errors. Our current group believes that through an OpenSource approach 
we can catalyse a change in thinking and practice. We do this by creating a 
system and practice that demonstrates the increased **quality** available 
through OpenSource. IMO quality is the most important factor - more so than 
platform, language, ease of use, performance, etc. If it is easier and 
faster to create more garbage on every platform, what have we achieved?

C. The OpenSource community has made some small, useful steps in this 
direction. They now wish to pool their efforts and produce a single point 
of contact for their own development and to show to the world. This does 
NOT necessarily mean a single program. IMO it is much more likely to mean 
an infrastructure on which a variety of operations can be carried out 
("glueware"?). They wish to create a project at SF which leads to:
- active constructive discussion
- agreed representation of objects
   * molecules, atoms, fragments, etc.
   * descriptors
   * properties
- creation, cataloguing and annotation of high-quality information objects:
   * dictionaries
   * properties (e.g. of atoms)
   * datasets
- creation, cataloguing and annotation of algorithms related to QSAR
   * chemical perception
   * statistics, optimisation, etc
- creation of software:
   * as toolkit components
   * as demonstrators of the *quality* of the system
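
As a rough illustration of what an "agreed representation" of descriptors and
datasets might look like, here is a minimal Python sketch. All class and field
names (DescriptorDefinition, DescriptorValue, Dataset, fingerprint) are
hypothetical and are not taken from CDK, JOELib, OpenBabel or Weka. The idea it
tries to show is only that every value points back to a dictionary entry and
that a dataset can be checksummed, so that others can check they are repeating
an experiment on the same data.

```python
from dataclasses import dataclass, field, asdict
import hashlib
import json


@dataclass(frozen=True)
class DescriptorDefinition:
    """Hypothetical dictionary entry: what a descriptor means and how it is computed."""
    id: str          # dictionary key, e.g. "heavyAtomCount" (illustrative only)
    definition: str  # human-readable meaning
    algorithm: str   # name and version of the implementation used
    units: str = "none"


@dataclass(frozen=True)
class DescriptorValue:
    """A computed value that always refers back to its dictionary entry."""
    molecule_id: str
    descriptor_id: str
    value: float


@dataclass
class Dataset:
    """Values together with the definitions needed to interpret them."""
    definitions: dict = field(default_factory=dict)
    values: list = field(default_factory=list)

    def add_definition(self, definition):
        self.definitions[definition.id] = definition

    def add_value(self, value):
        # Refuse values whose meaning has not been declared in the dictionary.
        if value.descriptor_id not in self.definitions:
            raise KeyError("undefined descriptor: " + value.descriptor_id)
        self.values.append(value)

    def fingerprint(self):
        """SHA-256 over an order-independent serialisation of the dataset."""
        payload = {
            "definitions": [asdict(self.definitions[k]) for k in sorted(self.definitions)],
            "values": sorted(
                [v.molecule_id, v.descriptor_id, v.value] for v in self.values
            ),
        }
        text = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Two people who publish the same fingerprint alongside a model can at least be
confident that they started from the same definitions and values, whatever
toolkit they then use.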

That is as far as I have got...

I think it's important to be inclusive and I would therefore suggest that 
we review the current OpenSource efforts in this area. My knowledge extends to:
- CDK, etc.
- JOELib
- OpenBabel
- Weka
- Nina's work (does this have a label?)

In projects of this sort everyone has something to contribute and also 
something to give up. For example I did a lot of work on visual display of 
CML (JUMBO3) - and some of this functionality is not provided by other 
sources. Nevertheless I decided to give up JUMBO3 and use JCP and Jmol for 
display. JUMBO4.3 has now developed in a more structured form as a flexible 
XML DOM and Tools library which can be reconfigured easily and rapidly. It 
is component based rather than application based.

I suggest starting not with deciding what program to write but with what 
the components of a QSAR system are, and then deciding who wants to be 
involved and what we have got, and setting some realistic scope for what is 
achievable.

Best

P.

Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069