From: Joerg W. <we...@in...> - 2004-04-17 17:25:49
Hi all,

> > I suggest starting not with deciding what program to write but with what
> > the components of a QSAR system are and then deciding what who wants to be
> > involved, we have got and setting some realistic scope to what is
> > achievable

Of course, I like QSAR ... but time is scarce, and who will implement things? You know that's my default comment ...

Egon, I've read your mail ... and yes, I'm still on holiday ... and I do check e-mails, and I have been working on QSAR for 3 years ... so holiday means I can read fantasy books and do things I like, e.g. read some QSAR papers! :-)
Holiday and spare time are curious things, aren't they? :-)

> It seems there is general agreement that an SF project in this area is
> valuable and I'll make a few comments which I hope are helpful. Please
> ignore if they aren't.

I do not agree with opening a separate project; there is a lot of code out there: Weka, YALE (includes a Weka interface) and XML, and commercial stuff with a Weka interface (Xalopy, or what was the correct name?).

I think we do not want to invent a new data mining standard. Such discussions are more useful on the Weka mailing list and with all the available Matlab algorithm providers (toolboxes!!!) ...
... and such discussions are not new (see the Weka mailing list)!!!

I think we are interested in providing the most usable approach with the implemented algorithms that are available, so let's use the existing ones and extend them!!!

IMHO: !!!
The problem is not the missing 'data mining' standard. The problem is:
1. the misuse of a general molecular-structure coding with these standard algorithms!!!
2. applying these algorithms correctly

So let's focus on this problem first!!! This is a problem of CDK and JOELib, and only once we have solved it can we solve the next one.
Furthermore, I will publish in the near future:
- the extended Weka interface
- the MaximumCommonSubstructure (MCS) algorithms
- The Metric-Interface is still available and is used by the AtomPair-descriptor

Weka-Clusterers with Molecular-Metrics are planned and will be implemented next. The Cluster-Matlab-Molecule connection is too difficult at the moment, because the similarity metric must be coded under Matlab or we use indices ...

So again, I'm using a lot of interfaces and I do not want another one!!! Would it not be easier to add CDK and JOELib plug-ins?

Do not make the algorithms too easy for chemists; they will probably think that hypothesis testing is an easy task and that the molecular structure is the most important thing ... IMHO ... that's badly wrong!!! So force them to read the data mining/interface manual carefully.

Descriptor dependencies are NOT all linear 2D dependencies, as already excellently mentioned by Nikolova/Jaworska. So where is the advantage in showing them in 2D or 3D? That's mainly irrelevant and misleading! A 2D plot is only one way to show the model quality, and not always the best one!!!

> A. Current QSAR practice has severe problems. They include:
> - almost all codes are closed. Many are not free.

Exactly:
Descriptors: Dragon, MolConnZ, ...
Algorithms: often unpublished code that hides most of the parameters, including important ones

> - it is impossible to repeat any experiment. Therefore QSAR ceases to be
> scientific but relies on reputation, trust and power
> - the objects used are badly designed, irreproducible and have variable
> interpretation
> - data selection is arbitrary. There are few (no?) standard test sets. It
> is impossible to verify whether data have been modified consciously or
> unconsciously to increase apparent success
> - algorithms are closed, even if the data are well defined.

Agree fully, four times! Oh, I have some nice slides I can present for these points ... :-)

> B.
> The mainstream QSAR community is not taking effective steps to remedy
> the errors. Our current group believes that through an OpenSource approach
> we can catalyse a change in thinking and practice. We do this by creating a
> system and practice that demonstrates the increased **quality** available
> through OpenSource. IMO quality is the most important - more so than
> platform, language, ease of use, performance, etc. If it is easier and
> faster to create more garbage on every platform what have we achieved?

1. Correct, but surely you know the No-Free-Lunch theorem ... I know that not everybody likes this theorem (still a priori) ... BUT ... now we have a huge number of algorithms ... which one to pick? It's 'easy' to find one algorithm and one feature set that explain one data set perfectly!
2. And we are not all algorithm developers, so use the existing libraries, which the mainstream user can use. There is still enough room to make errors, even if we do not have to reimplement algorithms!!!
3. A QSAR framework is not easy, because there are a lot of different opinions:
3.1. how to represent structures, e.g. CDK<->JOELib
3.2. models (hypothesis-building algorithms) are really abstract, and do not forget the nested and highly interesting meta-algorithms with recursive character. So let's forget the C++ libraries and concentrate on the Java and Matlab (Java GUI) libraries (R?) with their flexible reflection mechanism!
3.3. results ... uhhh ... cross-validation, feature selection, data set splitting? Do not forget that we are talking about molecular structures, so ...
3.4. Big descriptor files with normalized descriptors, missing values, numerically unstable descriptors or descriptors that depend on molecule size, ...
3.5. Are we working in memory or on files??? For hypothesis building we are hopefully working in memory, but the preprocessing steps are not subject to this restriction.

Sorry, CDK for descriptors is not obvious to me, please explain.
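Point 3.4 above can be made a bit more concrete with a small sketch. It is plain Python with no dependency on Weka, JOELib or CDK; the function name `zscore_columns` and the convention that a missing value is written as `None` or NaN are assumptions made only for this illustration, not part of any of the packages discussed here.

```python
import math


def zscore_columns(rows):
    """Scale each descriptor column to zero mean and unit variance.

    rows is a list of equally long lists (one list per molecule).
    Missing values (None or NaN, an assumed convention) stay missing.
    Columns with fewer than two known values or with zero variance
    are not scaled; their indices are returned as 'unusable'.
    """
    def missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    n_cols = len(rows[0]) if rows else 0
    scaled = [[None] * n_cols for _ in rows]
    unusable = []

    for j in range(n_cols):
        known = [row[j] for row in rows if not missing(row[j])]
        if len(known) < 2:
            unusable.append(j)
            continue
        mean = sum(known) / len(known)
        # Sample variance (n - 1 in the denominator).
        var = sum((v - mean) ** 2 for v in known) / (len(known) - 1)
        if var == 0.0:
            unusable.append(j)  # constant descriptor carries no information
            continue
        std = math.sqrt(var)
        for i, row in enumerate(rows):
            if not missing(row[j]):
                scaled[i][j] = (row[j] - mean) / std

    return scaled, unusable


if __name__ == "__main__":
    descriptors = [
        [1.0, 5.0, None],
        [2.0, 5.0, 3.0],
        [3.0, 5.0, float("nan")],
    ]
    scaled, unusable = zscore_columns(descriptors)
    print(scaled)
    print("unusable columns:", unusable)
```

Note that inside a cross-validation the mean and standard deviation would have to be computed on the training folds only and then applied to the held-out fold; scaling the whole data set first leaks information into the validation.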
As you may have noticed, I do not agree, for several reasons already discussed previously, e.g. the missing atom typer and the missing substructure search! (molecular-structure-coding ... is restricted to applied expert systems)

Why do we need yet another new project? Do we not have enough interface maintenance 'problems' with the current projects!?

1. I think the standard should be a file format or CML, but this does not help at all; it can only save time by using more space! You know: time-space complexity.
2. Often on-the-fly calculations are required, so this will require JOELib or CDK or an external JOELib module (which exists already: Corina, Petra, XLogP, ...). So we need a molecule data structure, so which one to use? Implement a new interface again? Why? I can't see the advantage.
2.1. Interface to molecules:
- JOELib (available)
- CDK (available)
- Ghemical/Mopac (available in JOELib)
- OpenBabel (JNI, same object structure as JOELib, but is this useful?)
- Tinker
2.2. Interface to data mining packages:
- Weka (available in JOELib/JCompChem)
- JavaNNS (SNNS successor, available in JOELib/JCompChem)
- LibSVM (available in JOELib/JCompChem)
- Matlab and its 1001 free packages (available in JOELib/JCompChem)
- Yale uses Weka
- Data mining API
- ... too much such stuff ... mostly all incompatible ... let's use Weka, that's the most seriously used OpenSource approach. Data miners will implement their algorithms for it, and we can use them!
- let's use Matlab and/or R
3. Visualization:
3.1. Molecules: can be done with CDK and with JOELib, also with highlighted SMARTS substructures:
2D layout: CDK
3D layout: JOELib (Corina, Ghemical, orYourInterface)
3.2. Data: what, histograms, plots, 3D plots, ...? No interest in implementing such things; that's boring and does not help at all, because Weka, Matlab and R all have their own tools, and which one do you prefer? What about independent packages, like libSVM, our JavaNNS (SNNS successor), ...
So we need an interface for all of them; that's nearly impossible in a short time period. I most often use the Java->Matlab interface; this is nothing special, only the adapted JMatLink connection.

... and another advantage of holiday and weekend ... I can write really long e-mails :-)

Kind regards, Joerg

> C. The OpenSource community has made some small, useful steps in this
> direction. They now wish to pool their efforts and produce a single point
> of contact for their own development and to show to the world. This does
> NOT necessarily mean a single program. IMO it is much more likely to mean
> an infrastructure on which a variety of operations can be carried out
> ("glueware"?). They wish to create a project at SF which leads to:
> - active constructive discussion
> - agreed representation of objects
>   * molecules, atoms, fragments, etc.
>   * descriptors
>   * properties
> - creation, cataloguing, annotating, high-quality information objects:
>   * dictionaries
>   * properties (e.g. of atoms)
>   * datasets
> - creation, cataloguing, annotation of algorithms related to QSAR
>   * chemical perception
>   * statistics, optimisation, etc
> - creation of software:
>   * as toolkit components
>   * as demonstrators of the *quality* of the system
>
> That is as far as I have got...
>
> I think it's important to be inclusive and I would therefore suggest that
> we review the current OpenSource efforts in this area. My knowledge extends to:
> - CDK, etc.
> - JOELib
> - OpenBabel
> - Weka
> - Nina's work (does this have a label?)
>
> In projects of this sort everyone has something to contribute and also
> something to give up. For example I did a lot of work on visual display of
> CML (Jumbo3) - and some of this functionality is not provided by other
> sources. Nevertheless I decided to give up JUMBO3 and use JCP and Jmol for
> display. JUMBO4.3 has now developed in a more structured form as a flexible
> XML DOM and Tools library which can be reconfigured easily and rapidly.
> It is component based rather than application based.
>
> I suggest starting not with deciding what program to write but with what
> the components of a QSAR system are and then deciding what who wants to be
> involved, we have got and setting some realistic scope to what is achievable.
>
> Best
>
> P.
>

Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action. (E. Hemingway)
Never mistake action for meaningful action. (Hugo Kubinyi, 2004)