[Joelib-help] Re: [Joelib-devel] Re: [Cdk-devel] QSAR

Hi all,

> > I do not agree with opening a separate project; there is a lot of code
> > out there: Weka, YALE (includes a Weka interface) and XML, commercial
> > stuff with a Weka interface (Xalopy or what was the correct name ?)
> A new project does not mean that available pieces cannot be used...
So who decides which classes are the best main classes?
Do you know a critical mass of the Weka and JOELib classes you can use?
Do I know all the CDK classes I can use?
And what about R, Yale, JavaNNS (SNNS successor), JavaEVA (EVA successor),
libSVM, 'feature extraction', clustering ...

> > I think we do not want to invent a new data mining standard; such
> > discussions are more useful on the Weka mailing list and for all
> > available Matlab algorithm providers (toolboxes !!!) ...
> Not everyone prefers to work with Matlab... Matlab is not free, neither is
> the PLS Toolbox... What's the URL for Weka?
Google: Weka, Java, Data Mining
That's irrelevant; I have plenty of 'feature extraction' methods. You do
not have to buy commercial toolboxes, there is a lot of free stuff, or use
R ...
... the problem is mixing it all together ... I use these things and I am
far from feeling experienced enough to define a common interface!
I think this is more of an evolutionary process: use it, and then you find
ways to facilitate the usage. But facilitated usage makes the interface
more complex, so ... every new API requires time to understand its
approach ... and can save development time ...

> > - the MaximumCommonSubstructure (MCS) algorithms
> Is this an improved algorithm, or similar to that in CDK?
1. I can assign different chemical graph labels
1.1. basic atom types
1.2. general PATTY
1.3. atom properties threshold
1.4. atom properties difference

2. MCS by clique detection
2.1. Bron-Kerbosch (exact)
2.2. DFMax (fast heuristic, non-exact)

3. multiple MCS
3.1. HSCS (Sheridan approach)
3.2. stochastic version

4. feature reduction step available for 1.1.-1.3.

Besides these, there is also the incremental graph isomorphism algorithm
for SMARTS matching (Ullmann variant with backtracking).
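
For readers who have not seen item 2.1 before, here is a minimal sketch of
MCS by clique detection. It is not the JOELib code: the function names, the
dictionary-based graph format and the plain element labels are assumptions
made for this illustration. It builds the compatibility (modular product)
graph of two labelled graphs and enumerates its maximal cliques with
Bron-Kerbosch (with pivoting). The largest clique is a maximum common
induced subgraph, which may be disconnected.

```python
from itertools import combinations


def modular_product(labels_a, edges_a, labels_b, edges_b):
    """Build the compatibility graph of two labelled graphs.

    Nodes are pairs (a, b) of atoms with equal labels. Two pairs are
    adjacent when they use distinct atoms on both sides and agree on
    whether those atoms are bonded.
    """
    ea = {frozenset(e) for e in edges_a}
    eb = {frozenset(e) for e in edges_b}
    nodes = [(a, b) for a in labels_a for b in labels_b
             if labels_a[a] == labels_b[b]]
    adj = {n: set() for n in nodes}
    for (a1, b1), (a2, b2) in combinations(nodes, 2):
        if a1 == a2 or b1 == b2:
            continue
        if (frozenset((a1, a2)) in ea) == (frozenset((b1, b2)) in eb):
            adj[(a1, b1)].add((a2, b2))
            adj[(a2, b2)].add((a1, b1))
    return adj


def bron_kerbosch(adj):
    """Yield every maximal clique of the graph {node: set_of_neighbours}."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())


# Toy example (assumed data): heavy atoms of C-C-O versus C-O.
labels_a = {0: "C", 1: "C", 2: "O"}
edges_a = [(0, 1), (1, 2)]
labels_b = {0: "C", 1: "O"}
edges_b = [(0, 1)]

cliques = list(bron_kerbosch(modular_product(labels_a, edges_a,
                                             labels_b, edges_b)))
best = max(cliques, key=len)  # pairs (atom in A, atom in B)
```

Swapping the label dictionaries for another labelling scheme (items
1.1.-1.4.) changes which pairs enter the compatibility graph, and so
changes the MCS.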

> > Sorry, CDK for descriptors is not obvious to me, please explain. As you
> > can mention, i do not agree for several reasons, as already discussed
> > previously, e.g. missing atom typer and missing substructure search !
> CDK *has* substructure search, implemented in a rather flexible way.
Graph isomorphism is not the same as substructure search! (See the
definition of subgraph/substructure by Rücker/Rücker.)
Or which expert systems do you use to assign the graph labels of the
'attributed graph'? (In general: things I criticize in my submitted paper!)
In fact, nearly every software package uses its own 'labelling', so which
one is correct? Standard?
The isomorphism is not the problem, because we are talking about exact
matching. Of course there are other kinds of matching, like ... here you
will need an optimization algorithm, like our JavaEVA library ...
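
To make the labelling point concrete, here is a small sketch of exact
labelled subgraph matching by plain backtracking. It is neither the CDK nor
the JOELib implementation; the function name and the label strings
('O.ether', 'O.hydroxyl', 'C.sp3') are hypothetical and chosen only for
this example. The same pattern and target match under one labelling scheme
and fail under another, although the graphs themselves never change.

```python
def find_match(p_labels, p_edges, t_labels, t_edges):
    """Return one mapping pattern node -> target node, or None.

    Exact matching: labels must be equal and every pattern bond must
    also exist between the mapped target atoms.
    """
    p_adj = {n: set() for n in p_labels}
    for u, v in p_edges:
        p_adj[u].add(v)
        p_adj[v].add(u)
    t_adj = {n: set() for n in t_labels}
    for u, v in t_edges:
        t_adj[u].add(v)
        t_adj[v].add(u)
    order = list(p_labels)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        p = order[len(mapping)]
        for t in t_labels:
            if t in mapping.values() or p_labels[p] != t_labels[t]:
                continue
            # Already-mapped pattern neighbours must be target neighbours.
            if all(mapping[q] in t_adj[t] for q in p_adj[p] if q in mapping):
                mapping[p] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[p]  # backtrack
        return None

    return extend({})


# Target: heavy atoms of dimethyl ether, C-O-C. Pattern: C-O.
t_edges = [(0, 1), (1, 2)]
p_edges = [(0, 1)]

# Scheme A: plain element symbols.
match_a = find_match({0: "C", 1: "O"}, p_edges,
                     {0: "C", 1: "O", 2: "C"}, t_edges)

# Scheme B: finer, hypothetical atom types; the pattern asks for a
# hydroxyl oxygen, the target only has an ether oxygen.
match_b = find_match({0: "C.sp3", 1: "O.hydroxyl"}, p_edges,
                     {0: "C.sp3", 1: "O.ether", 2: "C.sp3"}, t_edges)
```

Here match_a is a valid mapping and match_b is None, so the answer to "does
the substructure occur?" depends on who assigned the labels.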

> > Descriptor dependencies
> > are NOT all linear 2D dependencies as already excellently mentioned by
> > Nikolova/Jaworska. So where is the advantage to show them in 2D or 3D ?
> > That's mainly irrelevant and misleading ! A 2D plot is only one
> > possibility for the model quality, and not always the best one !!!
> What kind of 2D are you talking about here?
E.g. plain correlation plots between descriptorXYZ and predictedVALUE.
Such things can be helpful, but such an approach is similar to visual
'feature selection' on one feature, and it is well known that important
features are not necessarily the best ones from the standpoint of
generalization ability (see Eibe/Witten or my submitted paper, if
accepted :-)
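
A tiny numerical sketch of why a single linear correlation can mislead. The
data are made up for illustration, and the helper is an ordinary Pearson
coefficient, not code from any of the libraries discussed. A descriptor
that fully determines the target through a quadratic relation gets a
correlation of zero, while a linearly related one gets one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up descriptor values, symmetric around zero.
descriptor = [-3, -2, -1, 0, 1, 2, 3]
linear_target = [2 * x + 1 for x in descriptor]   # linear dependency
quadratic_target = [x * x for x in descriptor]    # non-linear dependency

r_linear = pearson(descriptor, linear_target)
r_quadratic = pearson(descriptor, quadratic_target)
```

Judged by a 2D correlation plot alone, the second descriptor would be
discarded although it carries all the information about the target.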

> I have no idea what a data mining API is... data mining is a rather vague
> term... like chemometrics API.
That's the point !!!
I'm more interested in implementing all required methods and extensions in
JOELib/CDK, because the hypothetical interface will access these methods
anyway!
Furthermore, I'm more interested in implementing access/algorithm
speed-ups. That's what I call 'maintenance' problems. The libraries are
still complex, so I'm more interested in writing more examples and more
tutorials, including more literature references, ...

Kind regards, Joerg

Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW:    http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action.
                                    (E. Hemingway)

Never mistake action for meaningful action.
                               (Hugo Kubinyi, 2004)