[octet-devel] Re: [Cdk-devel] QSAR
From: rich a. <che...@ya...> - 2004-04-28 02:03:41
Thanks for your comments, Peter. I'm especially interested in your comments on CML. I've been watching CML at a distance for some time, but I didn't realize you had defined interfaces for molecule and atom behavior. Could you point me more precisely to where these interfaces are? I visited the link you sent but wasn't able to find them.
I'm not very familiar with XML, but if I understand correctly, a DOM is used to produce an in-memory representation of the structure of an XML document. Minimally, it provides an exact representation of the content of the XML document. If I'm correct so far, then I imagine that a CML DOM provides an exact representation of the structure of a CML document.
In addition to providing an interface to access the data, what behaviors do the CML interfaces define for model-level objects like Atom and Molecule? To me, an example of pure Atom data would be an atom label property, whereas an example of Atom behavior is the capability to report what bonding systems an Atom belongs to and what Atoms it is a neighbor of. The choice of behavior is critical: too much functionality and the interface becomes bloated and hard to understand - too little and developers are frustrated at how much work it takes to do simple things. I'm very interested in knowing what the right balance is.
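To make the data/behavior distinction concrete, here is a minimal sketch. The names Atom, SimpleAtom, and connect are hypothetical and chosen only for illustration; they are not the Octet or CML interfaces.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical interface: one data accessor plus two behaviors.
interface Atom {
    String getLabel();                 // pure data: the atom label property
    List<Atom> getNeighbors();         // behavior: which atoms this atom is bonded to
    boolean isNeighborOf(Atom other);  // behavior: adjacency query
}

// Hypothetical reference implementation; any other implementation of Atom would do.
final class SimpleAtom implements Atom {
    private final String label;
    private final List<Atom> neighbors = new ArrayList<>();

    SimpleAtom(String label) {
        this.label = label;
    }

    // Records a connection between two atoms in both directions.
    static void connect(SimpleAtom a, SimpleAtom b) {
        a.neighbors.add(b);
        b.neighbors.add(a);
    }

    @Override
    public String getLabel() {
        return label;
    }

    @Override
    public List<Atom> getNeighbors() {
        // Read-only view, so callers cannot bypass connect().
        return Collections.unmodifiableList(neighbors);
    }

    @Override
    public boolean isNeighborOf(Atom other) {
        return neighbors.contains(other);
    }
}
```

Even in this small case the balance question shows up: isNeighborOf() is a convenience that could be derived from getNeighbors(), so it is exactly the kind of method an interface could either include or leave out.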
It sounds like the approach you've taken in using interfaces is similar to mine. Like you, I am keenly interested in taking advantage of the rich functionality of CDK and JOELib. As a first pass, I've been working on a two-way adapter class for CDK. Its definition looks something like this:
    public class CDKMolecule extends org.openscience.cdk.Molecule
            implements net.sourceforge.octet.molecule.Molecule
    {
        // override org.openscience.cdk.Molecule methods where appropriate
        // implement net.sourceforge.octet.molecule.Molecule interface
    }
The advantage here is that a CDKMolecule can be used from within either CDK or Octet without the need for a conversion step. I plan to do the same thing for joelib.molecule.JOEMol.
In particular, it would be helpful to use the file format read/write capabilities of CDK. The problem I'm currently facing is that IO classes such as org.openscience.cdk.io.MDLReader provide their own instance of org.openscience.cdk.Molecule that is created during a call to read(). If this method used an instance of org.openscience.cdk.Molecule passed into the read() method instead, then I could just pass in my CDKMolecule, and the reader would not be the wiser. What would be the consequences of modifying the IO classes to allow for this?
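A rough sketch of the kind of read() I have in mind follows. Everything in it is a hypothetical stand-in: MoleculeSink, ListMolecule, SymbolLineReader, and the toy one-symbol-per-line format are assumptions for illustration, not the CDK API or the MDL format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a molecule type that a reader can populate.
interface MoleculeSink {
    void addAtom(String symbol);
    int getAtomCount();
}

// Hypothetical caller-side implementation, playing the role of CDKMolecule.
class ListMolecule implements MoleculeSink {
    private final List<String> symbols = new ArrayList<>();

    @Override
    public void addAtom(String symbol) {
        symbols.add(symbol);
    }

    @Override
    public int getAtomCount() {
        return symbols.size();
    }
}

// Hypothetical reader for a toy format: one element symbol per line.
class SymbolLineReader {
    private final BufferedReader in;

    SymbolLineReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    // Fills the instance supplied by the caller instead of constructing
    // its own, and returns that same instance.
    <M extends MoleculeSink> M read(M target) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String symbol = line.trim();
                if (!symbol.isEmpty()) {   // skip blank lines
                    target.addAtom(symbol);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }
}
```

A call such as new SymbolLineReader(new StringReader("C\nO\nH\n")).read(new ListMolecule()) hands back the very object that was passed in, so the caller decides which implementation gets populated.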
With regard to directly supporting CML, I'm interested in trying my hand at it with Octet. The Octet model for bonding is somewhat different from the other Java cheminformatics packages I've seen in that it directly supports multicenter, multielectron bonding arrangements. So, the bonding arrangements of ferrocene, benzyne, borane clusters, and the homotropylium cation are handled in exactly the same way as those of hexane. This implementation is based on a paper by Dietz (JCICS 1995, 35, 787). What are your thoughts on CML providing the syntax necessary to represent these "non-traditional" kinds of bonding arrangements?
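As a rough illustration of the idea, a bonding arrangement can be modeled as a set of atoms sharing a number of electrons, with an ordinary two-atom bond as just one case. BondingSystem and its methods are hypothetical names for this sketch only; this is neither Octet's actual API nor the data structure from the Dietz paper.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical type for illustration: n atoms sharing m electrons.
final class BondingSystem {
    private final List<String> atomLabels;
    private final int electronCount;

    BondingSystem(int electronCount, String... atomLabels) {
        if (atomLabels.length < 2) {
            throw new IllegalArgumentException("a bonding system needs at least two atoms");
        }
        if (electronCount < 0) {
            throw new IllegalArgumentException("electron count cannot be negative");
        }
        this.electronCount = electronCount;
        // Defensive copy so later changes to the caller's array have no effect.
        this.atomLabels = Collections.unmodifiableList(Arrays.asList(atomLabels.clone()));
    }

    // Number of atoms (centers) taking part in this bonding system.
    int getCenterCount() {
        return atomLabels.size();
    }

    // Number of electrons shared across those centers.
    int getElectronCount() {
        return electronCount;
    }

    // True if the atom with the given label takes part in this system.
    boolean contains(String atomLabel) {
        return atomLabels.contains(atomLabel);
    }
}
```

Under this model a C-C single bond in hexane is new BondingSystem(2, "C1", "C2"), and a B-H-B bridge in diborane is new BondingSystem(2, "B1", "H1", "B2"): the same class with a different number of centers.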
cheers,
rich
Peter Murray-Rust <pm...@ca...> wrote:
At 08:03 21/04/2004 -0700, rich apodaca wrote:
>I agree that a common method for the representation of molecular objects
>is critical for the development of portable and verifiable cheminformatics
>protocols.
>
First - I welcome new contributors in the OpenSource molecular sciences domain!
>A core principle of object-oriented design is that designs are most
>reusable when you program to interfaces, not implementations.
I agree fully. In practice this is difficult to achieve. The areas where I
have found it work best are SUN's Java libraries, SAX (which we developed
as an interface) and DOM.
>
>I would propose that any discussion of a QSAR framework should take into
>consideration the need to first define Java interfaces for core objects
>such as Atom and Molecule. The QSAR framework would be useful to the
>greatest number of developers if each developer is free to provide their
>own implementation of the core interfaces that will work without
>modification in the QSAR framework. Defining these interfaces means that
>the irreducible core functionality of Molecule, Atom, etc. with which the
>framework will need to work must be decided on.
I agree.
May I suggest XML as the approach to define the functionality. We now have
opensource tools (JUMBO4.3, http://wwmm.ch.cam.ac.uk/moin) which
automatically generate DOM interfaces and implementations for Java, C++,
python and F90 for any XML schema. We have done this for CML and can
automatically do this for any sub or superset of CML within minutes. We
have a pseudocode language for adding non-DOM functionality to DOM objects
so that the whole of the code can be represented in XML. An advantage of
doing this is that documentation, examples, rendering and behaviour are
much easier to maintain and that multiple target languages can be used. The
advantage of XML over UML is that it is much more widely used and tools are
free.
>
>The advantage of this approach is true design reuse. Because the QSAR
>framework only knows about Java interfaces, all a developer needs to do to
>use all of the functionality of the framework is to provide an
>implementation of those interfaces. Of course, reference implementations
>should be provided by the framework as well.
Agreed. We do this for CML and for the additional non-DOM functionality.
Thus we have:
CMLMolecule (interface)
MoleculeImpl (implementation - can be provided by anyone)
These have automatically generated methods such as:
void Molecule.setTitle()
CMLAtom AtomArray.getAtomChild(int serial)
There are also factory methods for generation so that object construction
can be provided by different developers.
To provide additional functionality we provide wrappers such as:
MoleculeTool (interface)
MoleculeToolImpl (impl).
MoleculeTool MoleculeToolImpl.getMoleculeTool(CMLMolecule)
double MoleculeTool.getMolecularWeight()
This has great reusability - we currently use CDK methods within the Tools
(rather than write our own). However we could easily add or replace JOELib
methods without changing the user code. Libraries can be linked at runtime.
Indeed the code could even poll the classlibraries to see which can be
resolved.
>
>I've taken this approach in a cheminformatics framework called "Octet"
>(http://octet.sourceforge.net) and in a 2-D
>molecular visualization framework called "Structure"
>(http://structure.sourceforge.net). The
>approach in these frameworks differs significantly from both JOELib and
>CDK in that a developer is never required to use my reference
>implementations of Molecule or Atom.
Thanks - I have had a look at the site and agree with the design. Please
take the following comments as constructive.
- If you are intending to write your own code there will be a huge
amount. I did essentially this for CML1.0 and submission to the OMG. It
involved over 1000 method interfaces. You will soon find you have a great
many to maintain.
- you will need to provide a reference implementation for each method to
prove that the system is self-consistent. You may be able to borrow some
functionality from CDK or JOELib - that's what I do.
- you will need to convince collaborators of the value of your interface
over other available ones. I'm neutral on this, but I would urge that any
emerging interfaces support CML.
>
>For example, it is possible to provide performance-optimized
>implementations of these interfaces that would be suitable for large
>numbers of molecules, or the rapid construction of molecules. The framework
>only knows about interfaces, and this is the key to code reuse.
>
>I would be willing to provide any code and/or experiences from these
>projects to the development of a QSAR framework.
>
I suspect this message is therefore on the wrong list and should be sent to
qsar-devel.
P.
Note I have not replied to the crossposted lists in the original mail
Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069