[Joelib-help] Re: [Joelib-devel] Re: [Cdk-devel] QSAR

Hi all,

> > I suggest starting not with deciding what program to write but with what
> > the components of a QSAR system are and then deciding what we have got,
> > who wants to be involved, and setting some realistic scope to what is
> > achievable
Of course I like QSAR, but time is scarce, and who will implement things?
You know that's my default comment ...

Egon, I've read your mail ... and yes, I'm still on holiday ... and I do
check e-mails. I have worked on QSAR for 3 years, so holiday means I can
read fantasy books and do things I like, e.g. read some QSAR papers! :-)
Holiday and spare time are curious things, aren't they? :-)

> It seems there is general agreement that an SF project in this area is
> valuable and I'll make a few comments which I hope are helpful. Please
> ignore if they aren't.
I do not agree with opening a separate project; there is a lot of code
out there: Weka, YALE (includes a Weka interface) and XML, commercial
stuff with a Weka interface (Xalopy, or what was the correct name?)

I think we do not want to invent a new data mining standard; such
discussions are more useful on the Weka mailing list and for all
available Matlab algorithm providers (toolboxes !!!) ...
... and such discussions are not new (see the Weka mailing list) !!!
I think we are interested in providing the most usable approach
with the implemented algorithms available, so let's use the already
available ones and extend them !!!

IMHO:
!!! The problem is not the missing 'data mining' standard. The problem
is:
1. the misuse of a general molecular structure coding with these
   standard algorithms !!!
2. applying these algorithms correctly
So let's focus on this problem first !!! This is a problem of CDK and
JOELib, and only once we have solved it can we solve the next one.
Furthermore, I will publish in the near future:
- the extended Weka interface
- the MaximumCommonSubstructure (MCS) algorithms
- The Metric interface is still available and is used by the AtomPair
  descriptor.
  Weka clusterers with molecular metrics are planned and will be
  implemented next. The Cluster-Matlab-Molecule connection is too
  difficult at the moment, because the similarity metric must be coded
  in Matlab or we use indices ...
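
To illustrate what a pluggable molecular metric for clustering could look
like, here is a minimal Java sketch. The interface name FingerprintMetric
and the class TanimotoDistance are hypothetical and are NOT the actual
JOELib Metric interface; the sketch also assumes molecules are already
encoded as binary fingerprints (java.util.BitSet), which is only one of
several possible codings.

```java
import java.util.BitSet;

// Hypothetical interface for illustration; not the JOELib Metric interface.
interface FingerprintMetric {
    // Returns a dissimilarity in [0, 1] between two binary fingerprints.
    double distance(BitSet a, BitSet b);
}

// Tanimoto (Jaccard) distance: 1 - |A AND B| / |A OR B|.
class TanimotoDistance implements FingerprintMetric {
    @Override
    public double distance(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b);                 // bits set in both fingerprints
        BitSet union = (BitSet) a.clone();
        union.or(b);                   // bits set in at least one fingerprint
        int unionCount = union.cardinality();
        if (unionCount == 0) {
            return 0.0;                // two empty fingerprints: treat as identical
        }
        return 1.0 - (double) common.cardinality() / unionCount;
    }
}
```

A clusterer that accepts such an interface never needs to know how the
molecule was coded, which is the point of keeping the metric separate
from the algorithm.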

So again, I'm using a lot of interfaces and I do not want another one !!!
Would it not be easier to add CDK and JOELib plug-ins?
Do not make the algorithms too easy for chemists; they will probably think
that hypothesis testing is an easy task and that the molecular structure
is the most important thing ... IMHO ... that's badly wrong !!! So force
them to read the data mining/interface manual carefully. Descriptor
dependencies are NOT all linear 2D dependencies, as already excellently
mentioned by Nikolova/Jaworska. So where is the advantage in showing them
in 2D or 3D? That's mainly irrelevant and misleading! A 2D plot is only
one possible view of model quality, and not always the best one !!!

> A. Current QSAR practice has severe problems. They include:
> - almost all codes are closed. Many are not free.
Exactly:
Descriptors: Dragon, MolConnZ, ...
Algorithms: Often unpublished code that hides most of the parameters,
            including important ones

> - it is impossible to repeat any experiment. Therefore QSAR ceases to be
> scientific but relies on reputation, trust and power
> - the objects used are badly designed, irreproducible and have variable
> interpretation
> - data selection is arbitrary. There are few (no?) standard test sets.
> It is impossible to verify whether data have been modified consciously
> or unconsciously to increase apparent success
> - algorithms are closed, even if the data are well defined.
Agree fully, four times!
Oh, I have some nice slides I can present on these points ... :-)

> B. The mainstream QSAR community is not taking effective steps to remedy
> the errors. Our current group believes that through an OpenSource
> approach we can catalyse a change in thinking and practice. We do this
> by creating a system and practice that demonstrates the increased
> **quality** available through OpenSource. IMO quality is the most
> important - more so than platform, language, ease of use, performance,
> etc. If it is easier and faster to create more garbage on every platform
> what have we achieved?

1. Correct, but surely you know the No-Free-Lunch theorem ... I know that
not everybody likes this theorem (still a priori) ... BUT ... now we have
a huge number of algorithms ... which one to pick? It's 'easy' to find one
algorithm and one feature set that explain one data set perfectly!

2. And we are not all algorithm developers, so let's use the existing
libraries that mainstream users can use. There is still enough room to
make errors, even if we do not have to reimplement algorithms !!!

3. A QSAR framework is not easy, because there are a lot of different
opinions on:
3.1. how to represent structures, e.g. CDK<->JOELib
3.2. models (hypothesis building algorithms), which are really abstract,
     and do not forget the nested and highly interesting meta algorithms
     with recursive character. So let's forget the C++ libraries and
     concentrate on the Java and Matlab (Java GUI) libraries (R?) with
     their flexible reflection mechanism!
3.3. results ... uhhh ... cross-validation, feature selection, data set
     splitting?
     Do not forget that we are talking about molecular structures, so ...
3.4. Big descriptor files with normalized descriptors, missing values,
     numerically unstable descriptors, descriptors that depend on
     molecule size, ...
3.5. Are we working in memory or on files ??? For hypothesis building we
     are hopefully working in memory, but the preprocessing steps are
     not subject to this restriction.
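
As a small illustration of the preprocessing issues in 3.4, here is a
minimal Java sketch that z-score normalizes one descriptor column while
tolerating missing values. The class name DescriptorScaler, the choice of
NaN as the missing-value marker, and the use of the population standard
deviation are all assumptions made for this example; they are not taken
from JOELib, CDK or Weka.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only.
class DescriptorScaler {
    // Z-score one descriptor column. NaN marks a missing value and is
    // passed through unchanged; a constant column is mapped to 0.0.
    static double[] zScore(double[] column) {
        int n = 0;
        double sum = 0.0;
        for (double v : column) {
            if (!Double.isNaN(v)) { sum += v; n++; }
        }
        double[] out = new double[column.length];
        if (n == 0) {                      // nothing to scale
            Arrays.fill(out, Double.NaN);
            return out;
        }
        double mean = sum / n;
        double squares = 0.0;
        for (double v : column) {
            if (!Double.isNaN(v)) squares += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(squares / n); // population standard deviation
        for (int i = 0; i < column.length; i++) {
            double v = column[i];
            if (Double.isNaN(v)) out[i] = Double.NaN;
            else if (sd == 0.0) out[i] = 0.0;
            else out[i] = (v - mean) / sd;
        }
        return out;
    }
}
```

In a cross-validation setting the mean and standard deviation would have
to be estimated on the training folds only and then applied to the test
fold; this sketch does not do that.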

Sorry, using CDK for descriptors is not obvious to me, please explain. As
you can tell, I do not agree, for several reasons already discussed
previously, e.g. the missing atom typer and missing substructure search!
(molecular-structure-coding ... is restricted to applied expert systems)

Why do we need yet another new project? Do we not have enough interface
maintenance 'problems' with the current projects !?
1. I think the standard should be a file format or CML, but this does not
   help at all; it can only save time by using more space!
   You know: time-space complexity
2. Often on-the-fly calculations are required, so this will require
   JOELib or CDK or an
   external JOELib module (these already exist: Corina, Petra, XLogP, ...)
   So we need a molecule data structure, so which one to use?
   Implement a new interface again? Why? I can't see the advantage.

2.1. Interface to Molecules:
     - JOELib (available)
     - CDK (available)
     - Ghemical/Mopac (available in JOELib)
     - OpenBabel (JNI, same object structure as JOELib, but is this
       useful ?)
     - Tinker 

2.2. Interface to data mining packages
    - Weka (available in JOELib/JCompChem)
    - JavaNNS (SNNS successor, available in JOELib/JCompChem)
    - LibSVM (available in JOELib/JCompChem)
    - Matlab and its 1001 free packages (available in JOELib/JCompChem)
    - Yale uses Weka
    - Data mining API
    - ... too much of this stuff ... mostly all incompatible ... let's use
      Weka, that's the most seriously used OpenSource approach.
      Data miners will implement their algorithms for it, and we can use
      them!
    - let's use Matlab and/or R

3. Visualization:
3.1. Molecules: Can be done with CDK, and with JOELib also with
     highlighted SMARTS substructures:
     2D layout CDK
     3D layout JOELib (Corina, Ghemical, orYourInterface)
3.2. Data: what, histograms, plots, 3D plots, ...
     I have no interest in implementing such things; that's boring and
     does not help at all, because Weka, Matlab, R all have their own
     tools, and which one do you prefer?
     What about independent packages, like libSVM, our JavaNNS
     (SNNS successor), ...
     So we need an interface for all of them; that's nearly impossible in
     a short time period.
     I most often use the Java->Matlab interface; this is nothing special,
     only the adapted JMatLink connection.

... and another advantage of holiday and weekend ... I can write really
long e-mails :-)

Kind regards, Joerg

> C. The OpenSource community has made some small, useful steps in this
> direction. They now wish to pool their efforts and produce a single
> point of contact for their own development and to show to the world.
> This does NOT necessarily mean a single program. IMO it is much more
> likely to mean an infrastructure on which a variety of operations can be
> carried out ("glueware"?). They wish to create a project at SF which
> leads to:
> - active constructive discussion
> - agreed representation of objects
>    * molecules, atoms, fragments, etc.
>    * descriptors
>    * properties
> - creation, cataloguing, annotating, high-quality information objects:
>    * dictionaries
>    * properties (e.g. of atoms)
>    * datasets
> - creation, cataloguing, annotation of algorithms related to QSAR
>    * chemical perception
>    * statistics, optimisation, etc
> - creation of software:
>    * as toolkit components
>    * as demonstrators of the *quality* of the system
>
> That is as far as I have got...
>
> I think it's important to be inclusive and I would therefore suggest
> that we review the current OpenSource efforts in this area. My knowledge
> extends to:
> - CDK, etc.
> - JOELib
> - OpenBabel
> - Weka
> - Nina's work (does this have a label?)
>
> In projects of this sort everyone has something to contribute and also
> something to give up. For example I did a lot of work on visual display
> of CML (Jumbo3) - and some of this functionality is not provided by
> other sources. Nevertheless I decided to give up JUMBO3 and use JCP and
> Jmol for display. JUMBO4.3 has now developed in a more structured form
> as a flexible XML DOM and Tools library which can be reconfigured easily
> and rapidly. It is component based rather than application based.
>
> I suggest starting not with deciding what program to write but with what
> the components of a QSAR system are and then deciding what we have got,
> who wants to be involved, and setting some realistic scope to what is
> achievable.
>
> Best
>
> P.
>

Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW:    http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action.
                                    (E. Hemingway)

Never mistake action for meaningful action.
                               (Hugo Kubinyi,2004)