From: Joerg W. <we...@in...> - 2004-04-18 10:45:56
Hi all,

> > I do not agree to open an own project, there is much code out there:
> > Weka, YALE (includes Weka interface) and XML, Commercial stuff with Weka
> > interface (Xalopy or what was the correct name ?)
> A new project does not mean that available pieces cannot be used...

So who can decide which classes are the best main classes? Do you know
a critical mass of Weka and JOELib classes you can use? Do I know all
CDK classes I can use? What about R, Yale, JavaNNS (SNNS successor),
JavaEVA (EVA successor), libSVM, 'feature extraction', clustering ...

> > I think, we do not want to invent a new data mining standard, such
> > discussions are more useful for the Weka mailing list and all
> > available Matlab algorithm providers (toolboxes !!!) ...
> Not everyone prefer to work with Matlab... Matlab is not free, neither is the
> PLS Toolbox... What's the URL for Weka?

Google: Weka, Java, Data Mining

That's irrelevant. I have plenty of 'feature extraction' methods, and you
do not have to buy commercial toolboxes: there is a lot of free stuff,
or use R ...
... the problem is mixing it all together ... I use these things and I'm
far from feeling experienced enough to define a common interface!
I think this is more of an evolutionary process: use it, and then you
find ways to facilitate the usage. But facilitated usage causes a more
complex interface, so ...
every new API requires time to understand its approach ... and can
save development time ...

> > - the MaximumCommonSubstructure (MCS) algorithms
> Is this an improved algorithm, or similar to that in CDK?

1. I can assign different chemical graph labels
   1.1. basic atom types
   1.2. general PATTY
   1.3. atom properties threshold
   1.4. atom properties difference
2. MCS by clique detection
   2.1. Bron-Kerbosch (exact)
   2.2. DFMax (fast heuristic, non-exact)
3. multiple MCS
   3.1. HSCS (Sheridan approach)
   3.2. stochastic version
4. feature reduction step available for 1.1.-1.3.
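
As a rough illustration of item 2.1 above (MCS by clique detection with
Bron-Kerbosch), here is a minimal Python sketch. It is not the JOELib or
CDK code. All function names are hypothetical, atom labels are plain
strings (standing in for the "basic atom types" of 1.1), and bond orders
are ignored. It builds the compatibility graph (modular product) of two
labeled graphs and takes the largest maximal clique, which corresponds to
a maximum common induced subgraph that may be disconnected.

```python
from itertools import combinations


def modular_product(labels_a, edges_a, labels_b, edges_b):
    """Build the compatibility graph of two labeled graphs.

    Nodes are atom pairs (i, j) with equal labels. Two pairs are adjacent
    when they use distinct atoms and agree on 'bonded' vs 'not bonded'.
    """
    ea = {frozenset(e) for e in edges_a}
    eb = {frozenset(e) for e in edges_b}
    nodes = [(i, j) for i in labels_a for j in labels_b
             if labels_a[i] == labels_b[j]]
    adj = {n: set() for n in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # one atom cannot be mapped twice
        if (frozenset((i, k)) in ea) == (frozenset((j, l)) in eb):
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj


def bron_kerbosch(adj):
    """Yield every maximal clique (Bron-Kerbosch with pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield r
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())


def max_common_subgraph(labels_a, edges_a, labels_b, edges_b):
    """Return the largest atom mapping as a set of (atom_a, atom_b) pairs."""
    adj = modular_product(labels_a, edges_a, labels_b, edges_b)
    return max(bron_kerbosch(adj), key=len, default=set())


# Toy input (hypothetical): heavy atoms of ethanol C-C-O vs. methanol C-O.
ethanol = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
methanol = ({0: "C", 1: "O"}, [(0, 1)])
print(sorted(max_common_subgraph(*ethanol, *methanol)))
```

On this toy input the mapping pairs the C-O fragment of both molecules.
Changing how atoms are labeled (items 1.1 to 1.4) changes which pairs
become nodes of the compatibility graph, so the result depends on the
labeling as much as on the clique search.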

Besides these, there is also the incremental graph isomorphism
algorithm for SMARTS matching (Ullmann variant with backtracking).

> > Sorry, CDK for descriptors is not obvious to me, please explain. As you
> > can mention, i do not agree for several reasons, as already discussed
> > previously, e.g. missing atom typer and missing substructure search !
> CDK *has* substructure search, implemented in a rather flexible way.

Graph isomorphism is not the same as substructure search! (See the
definition of Subgraph/Substructure by Rücker/Rücker.)
Or which expert systems do you use to assign the graph labels of the
'attributed graph'? (In general: things I criticize in my submitted
paper!) In fact, nearly every software package uses its own 'labelling',
so which one is correct? Standard?
The isomorphism is not the problem, because we are talking about exact
matching. Of course there are other kinds of matching, like ... here
you will need an optimization algorithm, like our JavaEVA library ...

> > Descriptor dependencies
> > are NOT all linear 2D dependencies as already excellently mentioned by
> > Nikolova/Jaworska. So where is the advantage to show them in 2D or 3D ?
> > That's mainly irrelevant and misleading ! A 2D plot is only one
> > possibility for the model quality, and not always the best one !!!
> What kind of 2D are you talking about here?

E.g. plain correlation plots between descriptorXYZ and predictedVALUE.
Such things can be helpful, but such an approach is similar to visual
'feature selection' on one feature, and it is well known that important
features are not the best ones from the standpoint of generalization
ability (see Eibe/Witten or my submitted paper, if accepted :-)

> I have no idea what a data mining API is... data mining is a rather vague
> term... like chemometrics API.

That's the point !!!
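
To make the remark about plain correlation plots concrete, here is a
small Python sketch on invented toy data (not from any real descriptor
set). The names d1, d2 and y are hypothetical. It only illustrates one
way a one-descriptor-at-a-time view can mislead: each descriptor alone
shows zero linear correlation with the target, although both together
determine it completely.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Invented toy data: two binary descriptors and an XOR-type target.
d1 = [0, 0, 1, 1]
d2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(d1, d2)]

r1 = pearson(d1, y)  # descriptor 1 alone vs. target
r2 = pearson(d2, y)  # descriptor 2 alone vs. target

# Both descriptors together map every sample to exactly one target value.
joint = {(a, b): t for a, b, t in zip(d1, d2, y)}

print(r1, r2, joint)
```

A correlation plot of either descriptor against the target would show
nothing here, so a visual selection on one feature would discard both.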

I'm more interested in implementing all required methods and extensions
in JOELib/CDK, because the hypothetical interface will access these
methods anyway! Furthermore, I'm more interested in implementing
access/algorithm speed-ups. That's what I call 'maintenance' problems.
The libraries are still complex, so I'm more interested in writing more
examples, more tutorials, including more literature references, ...

Kind regards, Joerg

Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action. (E. Hemingway)
Never mistake action for meaningful action. (Hugo Kubinyi, 2004)