From: Peter Murray-R. <pm...@ca...> - 2004-04-17 13:05:50
At 09:52 17/04/2004 +0200, Joerg Wegner wrote:
>Dear Nina Nikolova,
>Dear All,
>
>please reply also to the JOELib mailing list and ... i've already published three papers about QSAR and our group has its main focus on data mining and optimization algorithms, so i think i've some experience in this area, too.
>http://www-ra.informatik.uni-tuebingen.de/

It seems there is general agreement that an SF project in this area is valuable and I'll make a few comments which I hope are helpful. Please ignore if they aren't.

A. Current QSAR practice has severe problems. They include:
- almost all codes are closed. Many are not free.
- it is impossible to repeat any experiment. Therefore QSAR ceases to be scientific but relies on reputation, trust and power
- the objects used are badly designed, irreproducible and have variable interpretation
- data selection is arbitrary. There are few (no?) standard test sets. It is impossible to verify whether data have been modified consciously or unconsciously to increase apparent success
- algorithms are closed, even if the data are well defined.

B. The mainstream QSAR community is not taking effective steps to remedy the errors. Our current group believes that through an OpenSource approach we can catalyse a change in thinking and practice. We do this by creating a system and practice that demonstrates the increased **quality** available through OpenSource. IMO quality is the most important - more so than platform, language, ease of use, performance, etc. If it is easier and faster to create more garbage on every platform what have we achieved?

C. The OpenSource community has made some small, useful steps in this direction. They now wish to pool their efforts and produce a single point of contact for their own development and to show to the world. This does NOT necessarily mean a single program. IMO it is much more likely to mean an infrastructure on which a variety of operations can be carried out ("glueware"?). They wish to create a project at SF which leads to:
- active constructive discussion
- agreed representation of objects
  * molecules, atoms, fragments, etc.
  * descriptors
  * properties
- creation, cataloguing and annotation of high-quality information objects:
  * dictionaries
  * properties (e.g. of atoms)
  * datasets
- creation, cataloguing, annotation of algorithms related to QSAR
  * chemical perception
  * statistics, optimisation, etc
- creation of software:
  * as toolkit components
  * as demonstrators of the *quality* of the system

That is as far as I have got...

I think it's important to be inclusive and I would therefore suggest that we review the current OpenSource efforts in this area. My knowledge extends to:
- CDK, etc.
- JOELib
- OpenBabel
- Weka
- Nina's work (does this have a label?)

In projects of this sort everyone has something to contribute and also something to give up. For example I did a lot of work on visual display of CML (Jumbo3) - and some of this functionality is not provided by other sources. Nevertheless I decided to give up JUMBO3 and use JCP and Jmol for display. JUMBO4.3 has now developed in a more structured form as a flexible XML DOM and Tools library which can be reconfigured easily and rapidly. It is component based rather than application based.

I suggest starting not with deciding what program to write but with what the components of a QSAR system are, and then deciding who wants to be involved, what we have got, and setting some realistic scope to what is achievable.

Best

P.

Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069
From: Joerg W. <we...@in...> - 2004-04-19 16:54:07
Greetings,

> well, then the first question - what about Weka performance ?
> (It eats a lot of memory when working with large data sets)
>
> > R is similar and a long time ago i've used the interface under Java very shortly ... we're matlab based !
>
> i like using matlab and it is quite useful;
> but matlab itself is not open source, it could be an obstacle

Same as for representation of molecules. WHY?

1. Weka splits everything into attributes and instances, and also into nominal and numeric attributes. This costs memory, but is quite useful, because it is not clear from a series: 1,NaN,3,4,2,1 whether this is a nominal classification or a numeric regression problem ! I understand your point, in fact i've implemented a DescriptorMatrix class for JOELib (joelib.desc.data) which holds only the matrix with descriptor names and molecules, but this causes a lot of problems for algorithm development, because the interface can not distinguish the above series by default. I simply used a matrix2weka mapping tool. That's why a student of mine developed a second interface, to have both possibilities, which also holds the molecules directly in a Weka-related context. For my current problem i need a wild mix of nominal and numeric attributes, and it is clearer if the attributes already hold this information, so i need not always implement helper classes for both cases.

2. In general it is useful to cache data sets (already available as DescriptorMatrixCache) to avoid multiple entries in memory. The cross-validation sets can be taken from the cached versions. Furthermore optimization algorithms need a common DB-like interface or caching mechanism to load the required data sets only once (singleton class interface).

3. It is not possible to compete with fast matrix operations; there R or Matlab should be used, because specifically optimized code is needed. Java has Jama and COLT, and some Weka add-ons use them, but this can never be compared to assembler-optimized code.

Kind regards, Joerg

Joerg Kurt Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action. (E. Hemingway)
Never mistake action for meaningful action. (Hugo Kubinyi,2004)
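The nominal-versus-numeric ambiguity raised in point 1 of the message above can be illustrated with a small sketch. This is not the Weka or JOELib API: the class `TypedColumnSketch`, the `AttributeType` enum and the `Column` class are hypothetical names introduced only for illustration. It shows the same raw series 1,NaN,3,4,2,1 being read either as class labels or as numbers, depending on a type that the attribute carries with it.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical illustration only; this is not the Weka or JOELib API. */
public class TypedColumnSketch {

    enum AttributeType { NOMINAL, NUMERIC }

    /** A column whose declared type tells an algorithm how to read the raw values. */
    static final class Column {
        final String name;
        final AttributeType type;
        final String[] raw;

        Column(String name, AttributeType type, String[] raw) {
            this.name = name;
            this.type = type;
            this.raw = raw.clone();
        }

        /** Distinct class labels in order of appearance; missing values ("NaN") are skipped. */
        Set<String> labels() {
            if (type != AttributeType.NOMINAL) {
                throw new IllegalStateException(name + " is not nominal");
            }
            Set<String> out = new LinkedHashSet<>();
            for (String v : raw) {
                if (!v.equals("NaN")) {
                    out.add(v);
                }
            }
            return out;
        }

        /** Mean over the non-missing values of a numeric column. */
        double mean() {
            if (type != AttributeType.NUMERIC) {
                throw new IllegalStateException(name + " is not numeric");
            }
            double sum = 0.0;
            int n = 0;
            for (String v : raw) {
                double d = Double.parseDouble(v); // the string "NaN" parses to Double.NaN
                if (!Double.isNaN(d)) {
                    sum += d;
                    n++;
                }
            }
            return n == 0 ? Double.NaN : sum / n;
        }
    }

    public static void main(String[] args) {
        String[] series = {"1", "NaN", "3", "4", "2", "1"};
        Column asClass = new Column("activityClass", AttributeType.NOMINAL, series);
        Column asValue = new Column("activity", AttributeType.NUMERIC, series);
        System.out.println("labels: " + asClass.labels());
        System.out.println("mean:   " + asValue.mean());
    }
}
```

A bare descriptor matrix cannot make this distinction on its own, which is presumably why carrying the type with the attribute saves the per-case helper classes mentioned in the message.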
From: Egon W. <eg...@sc...> - 2004-04-17 15:54:20
On Saturday 17 April 2004 15:02, Peter Murray-Rust wrote:
> I suggest starting not with deciding what program to write but with what the components of a QSAR system are, and then deciding who wants to be involved, what we have got, and setting some realistic scope to what is achievable.

Agreed. This is why we need to set up a SF project where we can write these things down. Here's a list:

- building a molecule database
  - read from file/internet
  - draw yourself one by one, or insert from SMILES
  - browsing the database with 2D and possibly 3D structures
  - associate activities/properties with those molecules
- preprocessing
  - get mathematical (or other) descriptions of the molecules in the database
  - selection of wanted descriptions
  - ability to use external programs for this
  - descriptor value preprocessing
  - statistical analysis of the database (outliers, diversity, etc)
- model building
  - choosing method, and method parameters
- model validation
  - visual validation -> plots
  - statistical validation

I requested a new SF project ('qsar') yesterday after getting positive reactions to my earlier proposal.

Joerg, I did not contact you personally yet, because I vaguely remembered you stating that you were on holiday (?), but I might very well be confused here... I see JOELib as an important part of the new program: it has many descriptors implemented, already uses CML2 for storing results, and has an interface to Weka.

I also see an important part for CDK: 2D editing/display is a very important feature here. And, I expect, some descriptors will be implemented in CDK later this year, though this will likely not conflict with those in JOELib. The reason why I propose CDK's core classes must be obvious.

Hopefully, the QSAR SF project will be approved early next week, and then I will start adding requirements, analyses, etc to the documentation, hopefully together with the others interested. Then we will see how the available OS parts fit together.

Egon
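The component list in the message above (database, descriptor calculation, model building, validation) could be expressed as a handful of small interfaces. The following is only a minimal sketch under loud assumptions: every name in it (`QsarPipelineSketch`, `Entry`, `DescriptorCalculator`, `ModelBuilder`, `Model`) is hypothetical, the "descriptor" is a placeholder (the length of the SMILES string), the "model" just predicts the mean, and the activity values are invented toy numbers, not data.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the stages listed above; none of these names come from CDK or JOELib. */
public class QsarPipelineSketch {

    /** One database entry: a structure (here just a SMILES string) and a measured activity. */
    static final class Entry {
        final String smiles;
        final double activity;

        Entry(String smiles, double activity) {
            this.smiles = smiles;
            this.activity = activity;
        }
    }

    /** Preprocessing stage: turns a structure into descriptor values. */
    interface DescriptorCalculator {
        double[] calculate(String smiles);
    }

    /** A fitted model that predicts an activity from descriptor values. */
    interface Model {
        double predict(double[] descriptors);
    }

    /** Model building stage. */
    interface ModelBuilder {
        Model build(List<double[]> x, List<Double> y);
    }

    /** Placeholder descriptor: number of characters in the SMILES string. */
    static final DescriptorCalculator SMILES_LENGTH = s -> new double[] { s.length() };

    /** Placeholder builder: the resulting model always predicts the mean training activity. */
    static final ModelBuilder MEAN_BUILDER = (x, y) -> {
        double sum = 0.0;
        for (double v : y) {
            sum += v;
        }
        final double mean = y.isEmpty() ? 0.0 : sum / y.size();
        return d -> mean;
    };

    /** Runs descriptor calculation and model building over the whole database. */
    static Model fit(List<Entry> db, DescriptorCalculator calc, ModelBuilder builder) {
        List<double[]> x = new ArrayList<>();
        List<Double> y = new ArrayList<>();
        for (Entry e : db) {
            x.add(calc.calculate(e.smiles));
            y.add(e.activity);
        }
        return builder.build(x, y);
    }

    /** Statistical validation stage: mean absolute error of the model on a data set. */
    static double meanAbsoluteError(Model m, DescriptorCalculator calc, List<Entry> data) {
        double err = 0.0;
        for (Entry e : data) {
            err += Math.abs(m.predict(calc.calculate(e.smiles)) - e.activity);
        }
        return data.isEmpty() ? 0.0 : err / data.size();
    }

    public static void main(String[] args) {
        List<Entry> db = new ArrayList<>();
        db.add(new Entry("CCO", 1.0));       // invented toy activity
        db.add(new Entry("c1ccccc1", 3.0));  // invented toy activity
        Model model = fit(db, SMILES_LENGTH, MEAN_BUILDER);
        System.out.println("MAE on training data: " + meanAbsoluteError(model, SMILES_LENGTH, db));
    }
}
```

Because each stage is an interface, a descriptor calculator backed by JOELib or CDK, or a model builder backed by an external program, could be swapped in without touching the surrounding code.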
From: Joerg W. <we...@in...> - 2004-04-17 17:25:49
Hi all,

> > I suggest starting not with deciding what program to write but with what the components of a QSAR system are, and then deciding who wants to be involved, what we have got, and setting some realistic scope to what is achievable

Of course, i like QSAR .. but time is scarce and who will implement things ... you know that's my default comment ...

Egon i've read your mail ... and yes i'm still on holiday ... and i do check e-mails and i have worked on QSAR for 3 years ... so holiday means i can read fantasy books and can do things i like, e.g. read some QSAR papers !:-) Holiday and spare-time are some curious things .. aren't they :-)

> It seems there is general agreement that an SF project in this area is valuable and I'll make a few comments which I hope are helpful. Please ignore if they aren't.

I do not agree with opening a separate project, there is much code out there: Weka, YALE (includes Weka interface) and XML, commercial stuff with Weka interface (Xalopy or what was the correct name ?)

I think we do not want to invent a new data mining standard, such discussions are more useful for the Weka mailing list and all available Matlab algorithm providers (toolboxes !!!) ...
... and such discussions are not new (see Weka mailing-list) !!!
I think we are interested in providing the best usable approach with the implemented algorithms available, so let's use the already available ones and extend them !!!

IMHO:
!!! The problem is not the missing 'data mining'-standard. The problem is the misuse of
1. a general molecular-structure-coding with these standard algorithms !!!
2. applying these algorithms correctly
So let's focus on this problem first !!! This is a problem of CDK and JOELib and only if we have solved this, we can solve the next one.

Furthermore i will publish soon:
- the extended Weka interface
- the MaximumCommonSubstructure (MCS) algorithms
- The Metric-Interface is still available and is used by the AtomPair-descriptor

Weka-Clusterers with Molecular-Metrics are planned and will be implemented next. The Cluster-Matlab-Molecule connection is too difficult at the moment, because the similarity metric must be coded under Matlab or we use indices ...

So again, i'm using a lot of interfaces and i do not like another one !!! Would it not be easier to add CDK- and JOELib-PlugIns ?

Do not make the algorithms too easy for chemists, probably they think hypothesis-testing is an easy task and the molecular structure is the most important thing ... IMHO ... that's badly wrong !!! So force them to read the data mining/interface manual carefully. Descriptor dependencies are NOT all linear 2D dependencies as already excellently mentioned by Nikolova/Jaworska. So where is the advantage of showing them in 2D or 3D ? That's mainly irrelevant and misleading ! A 2D plot is only one possibility for the model quality, and not always the best one !!!

> A. Current QSAR practice has severe problems. They include:
> - almost all codes are closed. Many are not free.

Exactly:
Descriptors: Dragon, MolConnZ, ...
Algorithms: Often unpublished code hiding most of the parameters, including important ones

> - it is impossible to repeat any experiment. Therefore QSAR ceases to be scientific but relies on reputation, trust and power
> - the objects used are badly designed, irreproducible and have variable interpretation
> - data selection is arbitrary. There are few (no?) standard test sets. It is impossible to verify whether data have been modified consciously or unconsciously to increase apparent success
> - algorithms are closed, even if the data are well defined.

Agree fully, four times ! Oh, i've some nice slides i can present for these points ... :-)

> B. The mainstream QSAR community is not taking effective steps to remedy the errors. Our current group believes that through an OpenSource approach we can catalyse a change in thinking and practice. We do this by creating a system and practice that demonstrates the increased **quality** available through OpenSource. IMO quality is the most important - more so than platform, language, ease of use, performance, etc. If it is easier and faster to create more garbage on every platform what have we achieved?

1. Correct, but surely you know the No-Free-Lunch-Theorem ... i know that not everybody likes this theorem (still apriori) ... BUT ... now we have a huge amount of algorithms ... which one to pick ? It's 'easy' to find one algorithm and one feature set to explain one data set perfectly !
2. And we are not all algorithm developers, so use the existing libraries which the main-stream user can use. There is still enough room to make errors, even if we need not reimplement algorithms !!!
3. A QSAR framework is not easy, because there are a lot of different opinions:
3.1. how to represent structures, e.g. CDK<->JOELib
3.2. models (hypothesis building algorithms) are really abstract and do not forget the nested and highly interesting meta algorithms with recursive character, so let's forget the C++ libraries and concentrate on the Java and Matlab (Java GUI) libraries (R?) with their flexible reflection mechanism!
3.3. results ... uhhh ... cross-validation, feature selection, data set splitting ? Do not forget that we talk about molecular structures, so ...
3.4. Big descriptor files with normalized descriptors, missing values, numerically unstable descriptors, descriptors that depend on molecule size, ...
3.5. Are we working in memory or on files ??? For hypothesis building we are hopefully working in memory, but the preprocessing steps are not subject to this restriction.

Sorry, CDK for descriptors is not obvious to me, please explain. As you can tell, i do not agree for several reasons, as already discussed previously, e.g. missing atom typer and missing substructure search ! (molecular-structure-coding ... is restricted to applied expert systems)

Why do we need a new project again, do we not have enough interface maintenance 'problems' with the current projects !?

1. I think the standard should be a file format or CML, but this does not help at all, this can only save time by using more space ! You-Know: Time-Space-Complexity
2. Often on-the-fly calculations are required, so this will require JOELib or CDK or an external JOELib module (which exists already: Corina, Petra, XLogP,...) So we need a molecule data structure, so which one to use ? Again implement a new interface ? Why ? I can't see the advantage ?
2.1. Interface to Molecules:
   - JOELib (available)
   - CDK (available)
   - Ghemical/Mopac (available in JOELib)
   - OpenBabel (JNI, same object structure as JOELib, but is this useful ?)
   - Tinker
2.2. Interface to data mining packages
   - Weka (available in JOELib/JCompChem)
   - JavaNNS (SNNS successor, available in JOELib/JCompChem)
   - LibSVM (available in JOELib/JCompChem)
   - Matlab and its 1001 free-packages (available in JOELib/JCompChem)
   - Yale uses Weka
   - Data mining API
   - ... too much such stuff ... all mostly incompatible ... let's use Weka, that's the most seriously used OpenSource approach. Data Miners will implement their algorithms for it, we can use them !
   - let's use Matlab and/or R
3. Visualization:
3.1. Molecules: Can be done with CDK and with JOELib, also highlighted SMARTS substructures:
   2D layout CDK
   3D layout JOELib (Corina, Ghemical, orYourInterface)
3.2. Data: what, histograms, plots, 3D plots , ... no interest in implementing such things, that's boring and does not help at all, because Weka, Matlab, R all have their own tools and which one do you prefer ? What about independent packages, like libSVM, our JavaNNS (SNNS successor), ... So we need an interface for all, that's nearly impossible in a short time period. I use most often the Java->Matlab interface, this is nothing special, only the adapted JMatLink connection.

... and another advantage of holiday and weekend ... i can write really long e-mails :-)

Kind regards, Joerg

> C. The OpenSource community has made some small, useful steps in this direction. They now wish to pool their efforts and produce a single point of contact for their own development and to show to the world. This does NOT necessarily mean a single program. IMO it is much more likely to mean an infrastructure on which a variety of operations can be carried out ("glueware"?). They wish to create a project at SF which leads to:
> - active constructive discussion
> - agreed representation of objects
>   * molecules, atoms, fragments, etc.
>   * descriptors
>   * properties
> - creation, cataloguing and annotation of high-quality information objects:
>   * dictionaries
>   * properties (e.g. of atoms)
>   * datasets
> - creation, cataloguing, annotation of algorithms related to QSAR
>   * chemical perception
>   * statistics, optimisation, etc
> - creation of software:
>   * as toolkit components
>   * as demonstrators of the *quality* of the system
>
> That is as far as I have got...
>
> I think it's important to be inclusive and I would therefore suggest that we review the current OpenSource efforts in this area. My knowledge extends to:
> - CDK, etc.
> - JOELib
> - OpenBabel
> - Weka
> - Nina's work (does this have a label?)
>
> In projects of this sort everyone has something to contribute and also something to give up. For example I did a lot of work on visual display of CML (Jumbo3) - and some of this functionality is not provided by other sources. Nevertheless I decided to give up JUMBO3 and use JCP and Jmol for display. JUMBO4.3 has now developed in a more structured form as a flexible XML DOM and Tools library which can be reconfigured easily and rapidly. It is component based rather than application based.
>
> I suggest starting not with deciding what program to write but with what the components of a QSAR system are, and then deciding who wants to be involved, what we have got, and setting some realistic scope to what is achievable.
>
> Best
>
> P.

Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action. (E. Hemingway)
Never mistake action for meaningful action. (Hugo Kubinyi,2004)
From: Egon W. <eg...@sc...> - 2004-04-17 19:46:18
On Saturday 17 April 2004 19:25, Joerg Wegner wrote:
> > > I suggest starting not with deciding what program to write but with what the components of a QSAR system are, and then deciding who wants to be involved, what we have got, and setting some realistic scope to what is achievable
>
> Of course, i like QSAR .. but time is scarce and who will implement things ... you know that's my default comment ...
>
> Egon i've read your mail ... and yes i'm still on holiday ... and i do check e-mails and i have worked on QSAR for 3 years ... so holiday means i can read fantasy books and can do things i like, e.g. read some QSAR papers !:-) Holiday and spare-time are some curious things .. aren't they :-)

:)

> > It seems there is general agreement that an SF project in this area is valuable and I'll make a few comments which I hope are helpful. Please ignore if they aren't.
>
> I do not agree with opening a separate project, there is much code out there: Weka, YALE (includes Weka interface) and XML, commercial stuff with Weka interface (Xalopy or what was the correct name ?)

A new project does not mean that available pieces cannot be used...

> I think we do not want to invent a new data mining standard, such discussions are more useful for the Weka mailing list and all available Matlab algorithm providers (toolboxes !!!) ...

Not everyone prefers to work with Matlab... Matlab is not free, neither is the PLS Toolbox... What's the URL for Weka?

> ... and such discussions are not new (see Weka mailing-list) !!!
> I think we are interested in providing the best usable approach with the implemented algorithms available, so let's use the already available ones and extend them !!!

Absolutely. If that has not been clear so far, I prefer to use existing stuff as much as possible, but I do prefer some tools over others, as most people do, so we need to develop wrappers using a unified interface...

> IMHO:
> !!! The problem is not the missing 'data mining'-standard. The problem is the misuse of
> 1. a general molecular-structure-coding with these standard algorithms !!!
> 2. applying these algorithms correctly
> So let's focus on this problem first !!!

Not everyone agrees on how methods/algorithms should be applied... but I agree that there is plenty of weird use of methods in some QSAR research... I think that providing people with an easy to use, clear and well defined program will make it much easier to teach others what things should be taken into account when making models...

> This is a problem of CDK and JOELib and only if we have solved this, we can solve the next one.
> Furthermore i will publish soon:
> - the extended Weka interface

Looking forward to reading that...

> - the MaximumCommonSubstructure (MCS) algorithms

Is this an improved algorithm, or similar to that in CDK?

> - The Metric-Interface is still available and is used by the AtomPair-descriptor
> Weka-Clusterers with Molecular-Metrics are planned and will be implemented next. The Cluster-Matlab-Molecule connection is too difficult at the moment, because the similarity metric must be coded under Matlab or we use indices ...

Not sure what you mean here...

> So again, i'm using a lot of interfaces and i do not like another one !!!

Fine. I don't think we will need to reinvent what you did. I am, and I guess others are too, fine with using interfaces similar or identical to yours...

> Would it not be easier to add CDK- and JOELib-PlugIns ?
> Do not make the algorithms too easy for chemists, probably they think hypothesis-testing is an easy task and the molecular structure is the most important thing ... IMHO ... that's badly wrong !!!

Mmm... not sure I agree here... chemists are our target... likely even biologists (no offense... :)

> So force them to read the data mining/interface manual carefully.

Ok, can you explain to me what the goal is here? I.e. what should they learn from understanding the interface?

> Descriptor dependencies are NOT all linear 2D dependencies as already excellently mentioned by Nikolova/Jaworska. So where is the advantage of showing them in 2D or 3D ? That's mainly irrelevant and misleading ! A 2D plot is only one possibility for the model quality, and not always the best one !!!

What kind of 2D are you talking about here?

> > A. Current QSAR practice has severe problems. They include:
> > - almost all codes are closed. Many are not free.
>
> Exactly:
> Descriptors: Dragon, MolConnZ, ...
> Algorithms: Often unpublished code hiding most of the parameters, including important ones
>
> > - it is impossible to repeat any experiment. Therefore QSAR ceases to be scientific but relies on reputation, trust and power
> > - the objects used are badly designed, irreproducible and have variable interpretation
> > - data selection is arbitrary. There are few (no?) standard test sets. It is impossible to verify whether data have been modified consciously or unconsciously to increase apparent success
> > - algorithms are closed, even if the data are well defined.
>
> Agree fully, four times !
> Oh, i've some nice slides i can present for these points ... :-)
>
> > B. The mainstream QSAR community is not taking effective steps to remedy the errors. Our current group believes that through an OpenSource approach we can catalyse a change in thinking and practice. We do this by creating a system and practice that demonstrates the increased **quality** available through OpenSource. IMO quality is the most important - more so than platform, language, ease of use, performance, etc. If it is easier and faster to create more garbage on every platform what have we achieved?
>
> 1. Correct, but surely you know the No-Free-Lunch-Theorem ... i know that not everybody likes this theorem (still apriori) ... BUT ... now we have a huge amount of algorithms ... which one to pick ? It's 'easy' to find one algorithm and one feature set to explain one data set perfectly !
>
> 2. And we are not all algorithm developers, so use the existing libraries which the main-stream user can use. There is still enough room to make errors, even if we need not reimplement algorithms !!!
>
> 3. A QSAR framework is not easy, because there are a lot of different opinions:

Correct. Hence the proposed new SF project to discuss and implement these things...

> 3.1. how to represent structures, e.g. CDK<->JOELib
> 3.2. models (hypothesis building algorithms) are really abstract and do not forget the nested and highly interesting meta algorithms with recursive character, so let's forget the C++ libraries and concentrate on the Java and Matlab (Java GUI) libraries (R?) with their flexible reflection mechanism!
> 3.3. results ... uhhh ... cross-validation, feature selection, data set splitting ? Do not forget that we talk about molecular structures, so ...
> 3.4. Big descriptor files with normalized descriptors, missing values, numerically unstable descriptors, descriptors that depend on molecule size, ...
> 3.5. Are we working in memory or on files ??? For hypothesis building we are hopefully working in memory, but the preprocessing steps are not subject to this restriction.

Much of this has already been discussed in the thread. True nevertheless.

> Sorry, CDK for descriptors is not obvious to me, please explain. As you can tell, i do not agree for several reasons, as already discussed previously, e.g. missing atom typer and missing substructure search !

CDK *has* substructure search, implemented in a rather flexible way.

> (molecular-structure-coding ... is restricted to applied expert systems)
>
> Why do we need a new project again,

As said above, a new project does not equal starting from scratch.

> do we not have enough interface maintenance 'problems' with the current projects !?
> 1. I think the standard should be a file format or CML, but this does not help at all, this can only save time by using more space !
> You-Know: Time-Space-Complexity

I have not seen the Heisenberg relation for this yet...

> 2. Often on-the-fly calculations are required, so this will require JOELib or CDK or an external JOELib module (which exists already: Corina, Petra, XLogP,...)
> So we need a molecule data structure, so which one to use ?
> Again implement a new interface ? Why ? I can't see the advantage ?

See thread.

> 2.1. Interface to Molecules:
>    - JOELib (available)
>    - CDK (available)
>    - Ghemical/Mopac (available in JOELib)
>    - OpenBabel (JNI, same object structure as JOELib, but is this useful ?)
>    - Tinker
>
> 2.2. Interface to data mining packages
>    - Weka (available in JOELib/JCompChem)
>    - JavaNNS (SNNS successor, available in JOELib/JCompChem)
>    - LibSVM (available in JOELib/JCompChem)
>    - Matlab and its 1001 free-packages (available in JOELib/JCompChem)

Too bad Matlab itself is not...

>    - Yale uses Weka
>    - Data mining API

Let's use a chemometrics API. :) I have no idea what a data mining API is... data mining is a rather vague term... like chemometrics API.

>    - ... too much such stuff ... all mostly incompatible ... let's use Weka, that's the most seriously used OpenSource approach. Data Miners will implement their algorithms for it, we can use them !
>    - let's use Matlab and/or R

Let's have that pluggable, so that anyone can choose whatever program they like.

> 3. Visualization:
> 3.1. Molecules: Can be done with CDK and with JOELib, also highlighted SMARTS substructures:
>    2D layout CDK
>    3D layout JOELib (Corina, Ghemical, orYourInterface)
> 3.2. Data: what, histograms, plots, 3D plots , ... no interest in implementing such things, that's boring and does not help at all, because Weka, Matlab, R all have their own tools and which one do you prefer ?
> What about independent packages, like libSVM, our JavaNNS (SNNS successor), ...
> So we need an interface for all, that's nearly impossible in a short time period.
> I use most often the Java->Matlab interface, this is nothing special, only the adapted JMatLink connection.
>
> ... and another advantage of holiday and weekend ... i can write really long e-mails :-)

Thanx for this analysis. And don't spend too much of your holiday on these kinds of emails... though it is difficult not to respond. :)

> Kind regards, Joerg

Have a nice continuation of your holiday!

Egon
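The "wrappers using a unified interface" and "pluggable" ideas from the reply above might look roughly like the sketch below. It is only an illustration under stated assumptions: `RegressionBackend`, `Registry` and `MeanBackend` are hypothetical names, and the single stand-in backend does no real learning. An actual wrapper would delegate to Weka, R, Matlab or another package behind the same interface.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of pluggable back ends; not an existing API of any package named above. */
public class BackendRegistrySketch {

    /** Unified interface that every wrapped package would implement. */
    interface RegressionBackend {
        String name();

        /** Fits on rows x with targets y and returns one prediction per row of x. */
        double[] fitAndPredict(double[][] x, double[] y);
    }

    /** Stand-in back end that predicts the mean target; a real wrapper would call an external tool. */
    static final class MeanBackend implements RegressionBackend {
        @Override
        public String name() {
            return "mean";
        }

        @Override
        public double[] fitAndPredict(double[][] x, double[] y) {
            double sum = 0.0;
            for (double v : y) {
                sum += v;
            }
            double mean = y.length == 0 ? 0.0 : sum / y.length;
            double[] out = new double[x.length];
            java.util.Arrays.fill(out, mean);
            return out;
        }
    }

    /** Lets users pick a back end by name at run time. */
    static final class Registry {
        private final Map<String, RegressionBackend> backends = new LinkedHashMap<>();

        void register(RegressionBackend backend) {
            backends.put(backend.name(), backend);
        }

        RegressionBackend get(String name) {
            RegressionBackend backend = backends.get(name);
            if (backend == null) {
                throw new IllegalArgumentException(
                        "No back end registered as '" + name + "'; known: " + backends.keySet());
            }
            return backend;
        }

        Set<String> names() {
            return backends.keySet();
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register(new MeanBackend());
        double[][] x = { { 1.0 }, { 2.0 }, { 3.0 } };
        double[] y = { 2.0, 4.0, 6.0 };
        double[] predictions = registry.get("mean").fitAndPredict(x, y);
        System.out.println(registry.names() + " -> " + java.util.Arrays.toString(predictions));
    }
}
```

The calling code never names a concrete package, so adding another back end would only mean registering one more implementation.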
From: rich a. <che...@ya...> - 2004-04-21 15:03:32
I agree that a common method for the representation of molecular objects is critical for the development of portable and verifiable cheminformatics protocols.

A core principle of object-oriented design is that designs are most reusable when you program to interfaces, not implementations.

I would propose that any discussion of a QSAR framework should take into consideration the need to first define Java interfaces for core objects such as Atom and Molecule. The QSAR framework would be useful to the greatest number of developers if each developer is free to provide their own implementation of the core interfaces that will work without modification in the QSAR framework. Defining these interfaces means that the irreducible core functionality of Molecule, Atom, etc. with which the framework will need to work must be decided on.

The advantage of this approach is true design reuse. Because the QSAR framework only knows about Java interfaces, all a developer needs to do to use all of the functionality of the framework is to provide an implementation of those interfaces. Of course, reference implementations should be provided by the framework as well.

I've taken this approach in a cheminformatics framework called "Octet" (http://octet.sourceforge.net) and in a 2-D molecular visualization framework called "Structure" (http://structure.sourceforge.net). The approach in these frameworks differs significantly from both JOELib and CDK in that a developer is never required to use my reference implementations of Molecule or Atom.

For example, it is possible to provide performance-optimized implementations of these interfaces that would be suitable for large numbers of molecules, or the rapid construction of molecules. The framework only knows about interfaces, and this is the key to code reuse.

I would be willing to provide any code and/or experiences from these projects to the development of a QSAR framework.

cheers,
rich

Peter Murray-Rust <pm...@ca...> wrote:
> C. The OpenSource community has made some small, useful steps in this direction. They now wish to pool their efforts and produce a single point of contact for their own development and to show to the world. This does NOT necessarily mean a single program. IMO it is much more likely to mean an infrastructure on which a variety of operations can be carried out ("glueware"?). They wish to create a project at SF which leads to:
> - active constructive discussion
> - agreed representation of objects
>   * molecules, atoms, fragments, etc.
>   * descriptors
>   * properties
> - creation, cataloguing and annotation of high-quality information objects:
>   * dictionaries
>   * properties (e.g. of atoms)
>   * datasets
> - creation, cataloguing, annotation of algorithms related to QSAR
>   * chemical perception
>   * statistics, optimisation, etc
> - creation of software:
>   * as toolkit components
>   * as demonstrators of the *quality* of the system
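The "program to interfaces" proposal in the message above can be sketched as follows. This is not the Octet, CDK or JOELib API: the interfaces `Atom` and `Molecule`, their methods, and both implementations are hypothetical and deliberately minimal. The point is only that the framework method `countHeavyAtoms` works unchanged with a list-backed reference implementation and with a more compact array-backed one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal core interfaces; not taken from Octet, CDK or JOELib. */
public class CoreInterfacesSketch {

    interface Atom {
        String getSymbol();
    }

    interface Molecule {
        int getAtomCount();

        Atom getAtom(int index);
    }

    /** Framework code: depends only on the interfaces, never on an implementation. */
    static int countHeavyAtoms(Molecule molecule) {
        int count = 0;
        for (int i = 0; i < molecule.getAtomCount(); i++) {
            if (!"H".equals(molecule.getAtom(i).getSymbol())) {
                count++;
            }
        }
        return count;
    }

    /** Reference implementation: one object per atom, convenient for building molecules step by step. */
    static final class ListMolecule implements Molecule {
        private final List<Atom> atoms = new ArrayList<>();

        void addAtom(final String symbol) {
            atoms.add(() -> symbol); // Atom has a single method, so a lambda suffices here
        }

        @Override
        public int getAtomCount() {
            return atoms.size();
        }

        @Override
        public Atom getAtom(int index) {
            return atoms.get(index);
        }
    }

    /** Compact alternative: keeps symbols in one array and creates Atom views on demand. */
    static final class ArrayMolecule implements Molecule {
        private final String[] symbols;

        ArrayMolecule(String... symbols) {
            this.symbols = symbols.clone();
        }

        @Override
        public int getAtomCount() {
            return symbols.length;
        }

        @Override
        public Atom getAtom(int index) {
            final String symbol = symbols[index];
            return () -> symbol;
        }
    }

    public static void main(String[] args) {
        ListMolecule water = new ListMolecule();
        water.addAtom("O");
        water.addAtom("H");
        water.addAtom("H");
        Molecule methane = new ArrayMolecule("C", "H", "H", "H", "H");
        System.out.println("heavy atoms in water:   " + countHeavyAtoms(water));
        System.out.println("heavy atoms in methane: " + countHeavyAtoms(methane));
    }
}
```

Deciding which methods belong in such interfaces is exactly the "irreducible core functionality" question the message raises; the two methods here are only the smallest set that makes the example work.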
From: Christoph S. <c.s...@un...> - 2004-04-22 08:10:20
|
Rich,

it was interesting to learn about your projects, and of course your point about interfaces is a valid one.

We had this discussion for CDK again and again and we are likely to move to core class (Atom, Bond, Molecule, etc.) interfaces in the future.

With now three Chemoinformatics Java frameworks on Sourceforge and this interesting discussion about the QSAR project, we have a decent chance to agree on a well-defined interface for those core classes.

Cheers,
Chris

--
Dr. rer. nat. habil. Christoph Steinbeck (c.s...@un...)
Groupleader Junior Research Group for Applied Bioinformatics
Cologne University BioInformatics Center (http://www.cubic.uni-koeln.de)
Zülpicher Str. 47, 50674 Cologne
Tel: +49(0)221-470-7426
Fax: +49 (0) 221-470-7786

What is man but that lofty spirit - that sense of enterprise. ... Kirk, "I, Mudd," stardate 4513.3..
|
From: Joerg K. W. <we...@in...> - 2004-04-24 09:06:45
|
Hi all,

I agree. A common molecule interface would be fantastic. I will have a look at your code.

Kind regards,
Joerg

--
Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action. (E. Hemingway)
Never mistake action for meaningful action. (Hugo Kubinyi, 2004)
|
From: Joerg K. W. <we...@in...> - 2004-04-26 07:21:30
|
Hi,

I've had a short look and I'm missing some functionality in Octet:

- I would prefer Node and Edge objects as the base for Atom and Bond.
- I would prefer general NodeKey, EdgeKey, MoleculeKey and RingKey objects for labelling the attributed molecular graph.

Both things are required for general graph algorithms. For the keys a factory pattern could/should be used, especially for assigning default labels. This avoids calculating e.g. a ring search twice by using:

    if (!mol.hasKey(myRingSearchKey)) mol.calculateRingSearch();

- The AtomPair is ambiguous: there is a descriptor with an additional distance parameter, whereas here you always use one. Hashing is important here.
- Enforce copy/clone/hash methods.
- The reader should provide readAsString and readToMoleculeObject, so we can catch corrupted file entries. Don't ask me why there are so many corrupted entries, but they exist.
- Add a MoleculeIOException to read/write to catch these corrupted entries; this will enable us to write skip files.
- A general SubstructureSearch object would be fine, and also a UniqueSubstructureSearch object or a transformer object.
- General descriptor objects are missing completely. They can be handled by the hashed MoleculeKey objects, but we may need to distinguish between keys that can handle only one object (hashed) and keys that can handle multiple objects, so we need a GeneralPropertyHandler that accepts single and multiple entries by key.
- For descriptors, IO helper classes are required that have read(IOType) and write(IOType).

Kind regards,
Joerg
|
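Joerg's idea of labelling a molecule with keys so that an expensive calculation such as a ring search runs only once can be sketched roughly as below. Only the names MoleculeKey and hasKey come from the message; KeyedMolecule, getOrCompute, the ring-bond counter, and the placeholder "ring search" are assumptions made for illustration, and the key factory Joerg mentions is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical typed key labelling one computed result on a molecule.
// Keys are compared by identity, so each key is meant to be a shared constant.
final class MoleculeKey<T> {
    private final String name;

    MoleculeKey(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return name;
    }
}

// Hypothetical molecule that stores computed results by key.
final class KeyedMolecule {
    private final int ringBondCount; // stand-in for real structure data
    private final Map<MoleculeKey<?>, Object> values = new HashMap<>();

    KeyedMolecule(int ringBondCount) {
        this.ringBondCount = ringBondCount;
    }

    int getRingBondCount() {
        return ringBondCount;
    }

    boolean hasKey(MoleculeKey<?> key) {
        return values.containsKey(key);
    }

    // Runs the calculator only if no value is stored under the key yet.
    @SuppressWarnings("unchecked")
    <T> T getOrCompute(MoleculeKey<T> key, Function<KeyedMolecule, T> calculator) {
        if (!hasKey(key)) {
            values.put(key, calculator.apply(this));
        }
        return (T) values.get(key);
    }
}

class KeyDemo {
    static final MoleculeKey<Integer> RING_SEARCH = new MoleculeKey<>("ringSearch");
    static int calls = 0; // counts how often the "expensive" step really runs

    // Placeholder for a real ring search; it just reads a stored number.
    static Integer expensiveRingSearch(KeyedMolecule mol) {
        calls++;
        return mol.getRingBondCount();
    }

    public static void main(String[] args) {
        KeyedMolecule benzene = new KeyedMolecule(6);
        benzene.getOrCompute(RING_SEARCH, KeyDemo::expensiveRingSearch);
        benzene.getOrCompute(RING_SEARCH, KeyDemo::expensiveRingSearch);
        System.out.println(calls); // prints 1
    }
}
```

The same map could hold descriptor values under their own keys, which is roughly the role the message assigns to hashed MoleculeKey objects; keys that hold multiple entries would need a different container and are not covered here.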