Re: [Joelib-devel] Re: [Cdk-devel] QSAR

On Saturday 17 April 2004 19:25, Joerg Wegner wrote:
> > > I suggest starting not with deciding what program to write but with
> > > what the components of a QSAR system are and then deciding who
> > > wants to be involved, what we have got, and setting some realistic
> > > scope to what is achievable
>
> Of course, i like QSAR .. but time is rare and who will implement things
> ... you know that's my default comment ...
>
> Egon i've read your mail ... and yes i'm still on holiday ... and i do
> check e-mails and i have worked on QSAR for 3 years ... so holiday means i
> can read fantasy books and can do things i like, e.g. read some QSAR
> papers! :-)
> Holiday and spare-time are some curious things ... aren't they? :-)

:)

> > It seems there is general agreement that an SF project in this area is
> > valuable and I'll make a few comments which I hope are helpful. Please
> > ignore if they aren't.
>
> I do not agree with opening a separate project; there is much code out there:
> Weka, YALE (includes Weka interface) and XML, Commercial stuff with Weka
> interface (Xalopy or what was the correct name ?)

A new project does not mean that available pieces cannot be used...

> I think we do not want to invent a new data mining standard, such
> discussions are more useful for the Weka mailing list and all
> available Matlab algorithm providers (toolboxes !!!) ...

Not everyone prefers to work with Matlab... Matlab is not free, neither is the
PLS Toolbox... What's the URL for Weka?

> ... and such discussions are not new (see Weka mailing-list) !!!
> I think we are interested to provide the best usable approach
> with implemented algorithms available, so let's use the already
> available ones and extend them !!!

Absolutely. If that has not been clear so far, I prefer to use existing stuff
as much as possible, but I do prefer some tools over others, as people do in
general, so we need to develop wrappers using a unified interface...
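
A minimal Java sketch of what such a wrapper could look like. Everything here is hypothetical: `QsarModel` and `MeanBaselineModel` are illustrative names, not interfaces from CDK, JOELib, or Weka, and the baseline model is only a stand-in for a wrapped library.

```java
import java.util.Arrays;

// Hypothetical unified interface; not taken from CDK, JOELib, or Weka.
interface QsarModel {
    void train(double[][] descriptors, double[] activities);
    double predict(double[] descriptors);
}

// Stand-in backend: always predicts the mean training activity.
// A real wrapper would delegate these two calls to an existing library.
final class MeanBaselineModel implements QsarModel {
    private double mean = Double.NaN;

    @Override
    public void train(double[][] descriptors, double[] activities) {
        if (activities.length == 0) {
            throw new IllegalArgumentException("no training data");
        }
        mean = Arrays.stream(activities).average().getAsDouble();
    }

    @Override
    public double predict(double[] descriptors) {
        return mean;
    }
}

final class WrapperDemo {
    public static void main(String[] args) {
        // Calling code only sees the interface, so the backend can be swapped.
        QsarModel model = new MeanBaselineModel();
        model.train(new double[][] {{1.0}, {2.0}, {3.0}},
                    new double[] {0.5, 1.5, 2.5});
        System.out.println(model.predict(new double[] {4.0})); // prints 1.5
    }
}
```

The point of the sketch is only that code written against the interface does not need to know which tool sits behind it.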

> IMHO:
> !!! The problem is not the missing 'data mining'-standard. The problem
> is the misuse of
> 1. a general molecular-structure-coding with these standard algorithms !!!
> 2. applying these algorithms correctly
> So let's focus on this problem first !!!

Not everyone agrees on how methods/algorithms should be applied... but I agree
that there is plenty of weird use of methods in some QSAR research... 

I think that providing people with an easy to use, clear and well defined 
program will make it much easier to teach others what things should be taken 
into account when making models...

> This is a problem of CDK and JOELib
> and only if we have solved this, we can solve the next one.
> Furthermore i will publish in the next time:
> - the extended Weka interface

Looking forward to reading that...

> - the MaximumCommonSubstructure (MCS) algorithms

Is this an improved algorithm, or similar to that in CDK?

> - The Metric-Interface is still available and is used by the AtomPair-
>   descriptor
>   Weka-Clusterers with Molecular-Metrics are planned and will be
>   implemented next. The Cluster-Matlab-Molecule connection is too difficult
>   at the moment, because the similarity metric must be coded under Matlab
>   or we use indices ...

Not sure what you mean here...

> So again, i'm using a lot of interfaces and i do not like another one !!!

Fine. I don't think we will need to reinvent what you did. I am, and I guess
others are too, fine with using interfaces similar or identical to yours...

> Will it not be easier to add CDK- and JOELib-PlugIns?
> Do not make the algorithms too easy for chemists, probably they think
> hypothesis-testing is an easy task and the molecular structure is the
> most important thing ... IMHO ... that's badly wrong !!!

Mmm... not sure I agree here... chemists are our target... likely even 
biologists (no offense... :)

> So force them 
> to read the data mining/interface manual carefully. 

Ok, can you explain to me what the goal is here? I.e. what should they learn
from understanding the interface?

> Descriptor dependencies
> are NOT all linear 2D dependencies as already excellently mentioned by
> Nikolova/Jaworska. So where is the advantage to show them in 2D or 3D ?
> That's mainly irrelevant and misleading ! A 2D plot is only one
> possibility for the model quality, and not always the best one !!!

What kind of 2D are you talking about here?

> > A. Current QSAR practice has severe problems. They include:
> > - almost all codes are closed. Many are not free.
>
> Exact:
> Descriptors: Dragon, MolConnZ, ...
> Algorithms: Often unpublished code hiding most of the parameters,
>             also important ones
>
> > - it is impossible to repeat any experiment. Therefore QSAR ceases to be
> > scientific but relies on reputation, trust and power
> > - the objects used are badly designed, irreproducible and have variable
> > interpretation
> > - data selection is arbitrary. There are few (no?) standard test sets.
> > It is impossible to verify whether data have been modified consciously or
> > unconsciously to increase apparent success
> > - algorithms are closed, even if the data are well defined.
>
> Agree fully, four times !
> Oh, i've some nice slides i can present for these points ... :-)
>
> > B. The mainstream QSAR community is not taking effective steps to remedy
> > the errors. Our current group believes that through an OpenSource approach
> > we can catalyse a change in thinking and practice. We do this by creating a
> > system and practice that demonstrates the increased **quality** available
> > through OpenSource. IMO quality is the most important - more so than
> > platform, language, ease of use, performance, etc. If it is easier and
> > faster to create more garbage on every platform what have we achieved?
>
> 1. Correct, but surely you know the No-Free-Lunch-Theorem ... i know that not
> everybody likes this theorem (still a priori) ... BUT ... now we have a huge
> amount of algorithms ... which one to pick ? It's 'easy' to find one
> algorithm and one feature set to explain one data set perfectly !
>
> 2. And we are not all algorithm developers, so use the existing libraries
> which the main-stream user can use. There is still enough room to make
> errors, even if we do not have to reimplement algorithms !!!
>
> 3. A QSAR framework is not easy, because there are a lot of different
> opinions:

Correct. Hence the proposed new SF project to discuss and implement these
things...

> 3.1. how to present structures, e.g. CDK<->JOELib
> 3.2. models (hypothesis building algorithms) are really abstract and do not
>      forget the nested and highly interesting meta algorithms with recursive
>      character, so let's forget the C++ libraries and concentrate on the
>      Java and Matlab (Java GUI) libraries (R?) with their flexible reflection
>      mechanism!
> 3.3. results ... uhhh ... cross-validation, feature selection, data set
>      splitting ?
>      Do not forget that we talk about molecular structures, so ...
> 3.4. Big descriptor files with normalized descriptors, missing values,
>      numerically unstable descriptors, or descriptors that depend on
>      molecule size, ...
> 3.5. Are we working in memory or on files ??? For hypothesis building we
>      are hopefully working in memory, but the preprocessing steps are
>      not subject to this restriction.

Much of this has already been discussed in the thread. True nevertheless.
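
On the data set splitting point in 3.3: one way a framework could keep experiments repeatable is to derive every split from a recorded seed. A minimal Java sketch, assuming plain row indices rather than any CDK or JOELib molecule type; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper; not part of any existing QSAR library.
final class KFoldSplit {
    // Returns k lists of row indices; each index 0..n-1 lands in exactly one fold.
    // The same (n, k, seed) always yields the same folds.
    static List<List<Integer>> split(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // Seeded shuffle, so the split can be reported and reproduced.
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        // Deal the shuffled indices round-robin into the folds.
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = split(10, 3, 42L);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("fold " + f + " (test set): " + folds.get(f));
        }
    }
}
```

Publishing the seed together with the data set would let others rebuild exactly the same training and test sets.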

> Sorry, CDK for descriptors is not obvious to me, please explain. As you
> can mention, i do not agree for several reasons, as already discussed
> previously, e.g. missing atom typer and missing substructure search !

CDK *has* substructure search, implemented in a rather flexible way.

> (molecular-structure-coding ... is restricted to applied expert systems)
>
> Why do we need again a new project,

As said above, a new project does not equal starting from scratch.

> do we not have enough interface 
> maintenance 'problems' with the actual projects !?
> 1. I think the standard should be a file format or CML, but this does not
>    help at all, this can only save time by using more space !
>    You-Know: Time-Space-Complexity

I have not seen the Heisenberg relation for this yet...

> 2. Often on-the-fly calculations are required, so this will require
>    JOELib or CDK or
>    external JOELib module (which exists already: Corina, Petra, XLogP,...)
>    So we need a molecule data structure, so which one to use ?
>    Again implement a new interface ? Why ? I can't see the advantage ?

See thread.

> 2.1. Interface to Molecules:
>      - JOELib (available)
>      - CDK (available)
>      - Ghemical/Mopac (available in JOELib)
>      - OpenBabel (JNI, same object structure as JOELib, but is this
>        useful ?)
>      - Tinker
>
> 2.2. Interface to data mining packages
>     - Weka (available in JOELib/JCompChem)
>     - JavaNNS (SNNS successor, available in JOELib/JCompChem)
>     - LibSVM (available in JOELib/JCompChem)
>     - Matlab and its 1001 free-packages (available in JOELib/JCompChem)

Too bad Matlab itself is not...

>     - Yale uses Weka
>     - Data mining API

Let's use a chemometrics API. :)

I have no idea what a data mining API is... data mining is a rather vague 
term... like chemometrics API.

>     - ... too much such stuff ... all mostly incompatible ... let's use
>       Weka, that's the most seriously used OpenSource approach.
>       Data Miners will implement their algorithms for it, we can use them !
>     - let's use Matlab and/or R

Let's have that pluggable. So that anyone can choose whatever program they
like.
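
A minimal Java sketch of how a pluggable backend choice could work, assuming a simple name-to-factory registry. `StatsBackend`, `PureJavaBackend`, and `BackendRegistry` are hypothetical names; real entries would wrap Weka, R, Matlab, or whatever the user prefers.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

// Hypothetical plug-in contract; the names are illustrative only.
interface StatsBackend {
    String name();
    double mean(double[] values);
}

// Stand-in plug-in implemented in plain Java.
final class PureJavaBackend implements StatsBackend {
    @Override
    public String name() {
        return "pure-java";
    }

    @Override
    public double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }
}

// Maps a backend name (e.g. from a config file) to a factory.
final class BackendRegistry {
    private final Map<String, Supplier<StatsBackend>> factories = new TreeMap<>();

    void register(String key, Supplier<StatsBackend> factory) {
        factories.put(key, factory);
    }

    StatsBackend create(String key) {
        Supplier<StatsBackend> factory = factories.get(key);
        if (factory == null) {
            throw new IllegalArgumentException(
                "unknown backend '" + key + "', available: " + factories.keySet());
        }
        return factory.get();
    }
}

final class RegistryDemo {
    public static void main(String[] args) {
        BackendRegistry registry = new BackendRegistry();
        registry.register("pure-java", PureJavaBackend::new);
        StatsBackend backend = registry.create("pure-java");
        System.out.println(backend.name() + ": " + backend.mean(new double[] {1.0, 2.0, 3.0}));
    }
}
```

Asking for a backend that was never registered fails with a message listing the available ones, so a missing commercial tool does not break the rest of the program.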

> 3. Visualization:
> 3.1. Molecules: Can be done with CDK and with JOELib also highlighted
>      SMARTS substructures:
>      2D layout CDK
>      3D layout JOELib (Corina, Ghemical, orYourInterface)
> 3.2. Data: what, histograms, plots, 3D plots , ...
>      no interest to implement such things, that's boring and does not
>      help at all, because Weka, Matlab, R have all their own tools
>      and which one do you prefer ?
>      What's with independent packages, like libSVM, our JavaNNS
>      (SNNS successor), ...
>      So we need an interface for all, that's nearly impossible in a short
>      time period.
>      I use most often the Java->Matlab interface, this is nothing special
>      only the adapted JMatLink connection.
>
> ... and another advantage of holiday and weekend ... i can write really
> long e-mails :-)

Thanx for this analysis. And don't spend too much of your holiday on these
kinds of emails... though it is difficult not to respond. :)

> Kind regards, Joerg

Have a nice continuation of your holiday!

Egon