[Joelib-devel] Suggested modified conversion framework

Message: 1
Reply-To: "Chris Morley" <c.m...@ga...>
From: "Chris Morley" <c.m...@ga...>
To: <ope...@li...>
Date: Sun, 4 Jan 2004 15:09:00 -0000
Subject: [Open Babel] Suggested modified conversion framework

"Geoff Hutchison" wrote (some time back)

 >> So as a self-proclaimed "maintainer" of the project, I have to keep
 >> some idea in the back of my head how we'll get to 2.0, 2.1, 2.2, 3.0?,
 >> etc. releases.

....

 >>    (What new formats are needed? What new features? What architecture
 >> changes are needed?)
 >> 3) Discussions on plans/roadmaps for things that don't go into the 2.0
 >> release.
 >>

Here are some suggestions for longer-term mods to the conversion process
in OpenBabel. I feel that, although they may not be backward compatible,
these features would be desirable to provide flexibility and maintainability
for the future. Most have been previously discussed.

- A clearer separation of functions - the chemistry needs to be more
separated from the conversion. (e.g. OBMol should not be where the
input and output file formats are stored). The FileFormat class could
be beefed up to do this.

- Each format needs to be self contained. A new format should not
require any changes in old code. (This is what abstract base classes
are for.) In a Windows system you might want to have each as a
precompiled DLL - a plugin - which would ease upgrading and allow
installation only of relevant formats.  I guess something similar is
possible in Unix. Even without this feature the formats would be
dynamic - information about them and their options would be retrieved
at run time.

- The user interface for conversion would have a clear interface to the
conversion process itself to allow alternatives (GUIs etc). This would
include handling of formats (file extensions, options) in a dynamic way.

- The input and output routines need to be more aware of the conversion
process so that they can adjust. Examples are the previously discussed
need to deal on-the-fly with generated molecules during CML input, and
the need for a conditional <cml>...</cml> wrapper during output.

- The conversion framework needs to handle more than just OBMol.
I'm sure use would be made of the facility to convert different types of
molecule, sets of molecules(conformers?), reactions, sets of reactions, etc.

- There should be more support for non-expert users. There is a big
activation barrier to using the program, which is not appropriate if it is
to be used to just convert the format of some files. I realise precompiled
code is not the Unix way, but Windows (and Mac?) users expect it.
It would be nice to support it on appropriate platforms, while keeping
a single version of the source code.
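
A minimal sketch of the "self contained format" point above, in which
adding a format needs no change to existing code. All names here
(Format, FormatRegistry, RegisterFormat, RxnFormat) are hypothetical
illustrations, not the classes of the framework described in this
message; the sketch only assumes that formats derive from an abstract
base class and announce themselves at run time.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical abstract base class: the framework sees only this interface.
class Format {
public:
    virtual ~Format() = default;
    virtual std::string description() const = 0;  // text shown to the user
};

// Hypothetical registry keyed by file extension and filled at run time.
class FormatRegistry {
public:
    static FormatRegistry& instance() {
        static FormatRegistry registry;  // constructed on first use
        return registry;
    }
    void add(const std::string& ext, std::unique_ptr<Format> fmt) {
        formats_[ext] = std::move(fmt);
    }
    const Format* find(const std::string& ext) const {
        auto it = formats_.find(ext);
        return it == formats_.end() ? nullptr : it->second.get();
    }
    std::vector<std::string> extensions() const {
        std::vector<std::string> out;
        for (const auto& entry : formats_) out.push_back(entry.first);
        return out;
    }
private:
    std::map<std::string, std::unique_ptr<Format>> formats_;
};

// Helper whose constructor registers a format of type T.
template <typename T>
struct RegisterFormat {
    explicit RegisterFormat(const std::string& ext) {
        FormatRegistry::instance().add(ext, std::unique_ptr<Format>(new T()));
    }
};

// Everything below would live in the format's own source file (or plugin).
class RxnFormat : public Format {
public:
    std::string description() const override { return "Reaction file"; }
};
static RegisterFormat<RxnFormat> registerRxn("rxn");
```

A user interface could then call FormatRegistry::instance().extensions()
to list whatever formats happen to be linked in or loaded, without
knowing any of them by name.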

I've put together some working code which implements all of these
features, described in more detail at
http://www.arcl02.dsl.pipex.com/OB/obframework.html

It is written so that it can be deployed as separate DLLs containing:
    the main chemistry (the code is almost as at present);
    the conversion process (a new class);
    one or more formats (can be the existing code with a small wrapper);

A user interface exe file makes use of these DLLs. The console interface
feels much the same as at present and there is a Windows GUI interface
which is a drop-in replacement.

Alternatively, the code can be compiled together, as at present, without
changing the source code. I hope it is platform independent except for
the GUI and the deployment of the DLLs.

Separating all the parts so that they can be separately compiled has been a
challenge, because I wanted the conversion DLL and the user interfaces
not to use directly any of the chemistry - they do not #include mol.h.
This has meant the use of C++ a bit more adventurous than in the current
code. For instance I found it necessary to use a smart pointer from
the Boost library. This is not part of the standard language (although
pretty close).  I also need to point out that I am not a C++ expert -
but it is all working ok at present.
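
A rough sketch of how a conversion loop can avoid including any
chemistry header: it handles objects only through an opaque base type
held in a smart pointer. The names ChemObject, Format, convert, ToyMol
and TitleFormat are hypothetical, and std::shared_ptr (the standard
library descendant of the Boost smart pointer) is used as a stand-in;
the actual code behind the link above may be organised differently.

```cpp
#include <iostream>
#include <memory>
#include <sstream>
#include <string>

// Hypothetical opaque base: the conversion code knows only this type.
class ChemObject {
public:
    virtual ~ChemObject() = default;
};

// Hypothetical format interface expressed purely in terms of ChemObject.
class Format {
public:
    virtual ~Format() = default;
    virtual std::shared_ptr<ChemObject> read(std::istream& in) = 0;
    virtual void write(const ChemObject& obj, std::ostream& out) = 0;
};

// Conversion loop: compiles without any chemistry header.
inline int convert(Format& inFmt, Format& outFmt,
                   std::istream& in, std::ostream& out) {
    int count = 0;
    while (std::shared_ptr<ChemObject> obj = inFmt.read(in)) {
        outFmt.write(*obj, out);
        ++count;
    }
    return count;  // number of objects converted
}

// Chemistry side, which would sit in a separate translation unit or DLL.
struct ToyMol : ChemObject {
    std::string title;
};

class TitleFormat : public Format {
public:
    std::shared_ptr<ChemObject> read(std::istream& in) override {
        std::string line;
        if (!std::getline(in, line) || line.empty()) return nullptr;
        auto mol = std::make_shared<ToyMol>();
        mol->title = line;
        return mol;
    }
    void write(const ChemObject& obj, std::ostream& out) override {
        // Only the format knows the concrete type; the converter does not.
        out << static_cast<const ToyMol&>(obj).title << '\n';
    }
};
```

The shared pointer lets the converter own and release objects whose
concrete type it never sees, since deletion goes through the virtual
destructor of the base class.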

Using the DLLs, existing applications can add a much broader input format
compatibility while not needing to be recompiled when the OB code changes.

To illustrate the use of a non-OBMol conversion I have added a format
for converting to and from a RXN file describing a reaction.

The Windows interface has a novel feature that uses the text description
of the various conversion options (previously output as help in the command
line interface) to dynamically construct a set of checkboxes, etc
appropriate to the requested file format. You can try a statically-linked
compiled version of the GUI-driven framework with a few formats by just
downloading, extracting and running
http://www.arcl02.dsl.pipex.com/OB/OBGUIs.zip (407K)
It should work on any 32bit Windows system.

It may be that making changes like this to a project that puts the emphasis
on the chemistry rather than programming is a bit over the top. Is it worth
developing a non-backward compatible framework like this any
further?

Chris Morley

--__--__--

Message: 2
Date: Sun, 04 Jan 2004 17:34:04 +0000
To: <ope...@li...>
From: Peter Murray-Rust <pm...@ca...>
Subject: Re: [Open Babel] Suggested modified conversion framework

At 15:09 04/01/2004 +0000, Chris Morley wrote:

 >>"Geoff Hutchison" wrote (some time back)
 >
 >>> > So as a self-proclaimed "maintainer" of the project, I have to keep
 >>> > some idea in the back of my head how we'll get to 2.0, 2.1, 2.2, 3.0?,
 >>> > etc. releases.
 >
 >>....
 >
 >>> >    (What new formats are needed? What new features? What architecture
 >>> > changes are needed?)
 >>> > 3) Discussions on plans/roadmaps for things that don't go into the 2.0
 >>> > release.
 >>> >
 >
 >>
 >>Here are some suggestions for longer-term mods to the conversion process
 >>in OpenBabel. I feel that, although they may not be backward compatible,
 >>these features would be desirable to provide flexibility and maintainability
 >>for the future. Most have been previously discussed.

I'd like to support the discussion here and encourage refactoring of babel.
Having spent the last ca 2 weeks rewriting the C++ support for CML (it's
virtually ready), I think it's critical that Babel's design evolves along
modular lines as suggested here and elsewhere.

My vision of babel development is that it should be an API/plugin type of
approach. A developer should be able to write the readFoo and writeFoo
modules by using an API rather than having to understand the whole
architecture of the program. This depends, however, on having very clear
and open architecture and clear understanding of the semantics/ontology
(i.e. exactly what each piece of information means).

I am currently going through this process with CML - it now has about 100
elements ("objects"). Probably about half of these correspond to concepts
in Babel. I am optimistic that most of the concepts in chemistry are
universal - the difficulties lie in different representations. A few
concepts (e.g. aromaticity) depend critically on the algorithms used and so
there is a need for these to be spelled out clearly. (I do not care whether
pyrrole is aromatic or not. However, if system A decrees that it is, and
system B does not, we may need both those algorithms to convert between A's
representation and B's.) Such concepts therefore depend on "perception" and
it is critical that the perception is modularised (and in principle
variable on demand).
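
One way to picture perception that is "variable on demand" is to pass
the perception routine in as a parameter. The sketch below is only an
illustration: ToyMol, AromaticityModel, makeModels and isAromatic are
hypothetical names, and the two "models" are hard-coded name lookups
standing in for real algorithms, chosen so that systems A and B
disagree about pyrrole.

```cpp
#include <functional>
#include <map>
#include <string>

// Toy molecule: just a name, enough to show the dispatch.
struct ToyMol {
    std::string name;
};

// A perception routine answers "is this aromatic?" under one convention.
using AromaticityModel = std::function<bool(const ToyMol&)>;

// Hypothetical conventions "A" and "B" that disagree about pyrrole.
inline std::map<std::string, AromaticityModel> makeModels() {
    return {
        {"A", [](const ToyMol& m) {
             return m.name == "benzene" || m.name == "pyrrole";
         }},
        {"B", [](const ToyMol& m) { return m.name == "benzene"; }},
    };
}

// The caller chooses the model; the molecule itself stores no verdict.
inline bool isAromatic(const ToyMol& mol, const AromaticityModel& model) {
    return model(mol);
}
```

A converter from A's representation to B's would then ask for both
models by name and translate wherever their answers differ.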

Most concepts are easier - they depend on careful definition rather than
perception - so it is important to define carefully what is meant by (say)
hydrogen count, e.g. in B2H6.

The core of OB, therefore, is a representation of these concepts. (Whether
it is in C++, Java, XML, UML or RDF/OWL is probably unimportant.) At present
the OB core is a mixture of the data fields in mol.h and the ancillary
files (e.g. aromatic.txt). It is important that developers are able to find
the concepts they need quickly and accurately - then writing code is much
easier. In fact I am working towards a system where CML++ code is generated
automatically from the schema.

A Foo developer therefore could follow the following steps:
- identify the concepts in Foo
- map them onto Babel API concepts.
- where they map precisely code the Foo syntax onto the OB API. This can be
almost trivial.

Where they do not match, the developer has the options:
- ignore the concept. A good example is that OB ignores bibliographic
info. The information is then lost in the conversion process.
- convert the data to an equivalent OB concept. Examples may be wedge/hatch
bonds converted to atom Parities (though this is not always possible - some
wedges do not correspond to atom-centered stereo). Conversion might be
provided by babel or might be added by the Foo developer.
- write code to add information (an example is molecular formula/mass - not
supported in OB) which can be algorithmically generated.

Where possible it will help if the concepts and representations are
consistent over the OpenSource chemistry community.

 >>- A clearer separation of functions - the  chemistry needs to be more
 >>separated from the conversion. (e.g. OBMol should not be where the
 >>input and output file formats are stored). The FileFormat class could
 >>be beefed up to do this.

Pattern-based design suggests that specialist modules should be created to
manage generic tasks and subclassed where necessary. Thus in CML software
there are decorators which add functionality to classes (e.g. a
moleculeDecorator can wrap a molecule and add getMolecularMass() to it).
Similarly there are serializers (writers) for output and eventReaders
(SAX-like) for input. Each of these is subclassed for different file
formats. A typical pattern for Foo could be

FooReader extends AbstractMolReader implements MolReader
FooWriter extends AbstractMolWriter implements MolWriter
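
Rendered in C++, the same pattern might look roughly like the sketch
below. The interface names follow the two lines above; the Molecule
struct, the method signatures, the parseRecord hook and the one-line
"Foo" syntax are all assumptions made for the example.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Stand-in for the real molecule class.
struct Molecule {
    std::string title;
};

// Interfaces (the "implements" part of the pattern).
class MolReader {
public:
    virtual ~MolReader() = default;
    virtual bool read(std::istream& in, Molecule& mol) = 0;
};

class MolWriter {
public:
    virtual ~MolWriter() = default;
    virtual void write(std::ostream& out, const Molecule& mol) = 0;
};

// Shared behaviour (the "extends" part): skip blank lines, then delegate.
class AbstractMolReader : public MolReader {
public:
    bool read(std::istream& in, Molecule& mol) override {
        std::string line;
        while (std::getline(in, line)) {
            if (!line.empty()) return parseRecord(line, mol);
        }
        return false;  // end of input
    }
protected:
    virtual bool parseRecord(const std::string& record, Molecule& mol) = 0;
};

// Format-specific code: in this toy Foo syntax a record is just a title.
class FooReader : public AbstractMolReader {
protected:
    bool parseRecord(const std::string& record, Molecule& mol) override {
        mol.title = record;
        return true;
    }
};

class FooWriter : public MolWriter {
public:
    void write(std::ostream& out, const Molecule& mol) override {
        out << "FOO " << mol.title << '\n';
    }
};
```

The Foo developer writes only parseRecord and write; the record loop
and end-of-input handling stay in the shared base class.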

 >>- Each format  needs to be self contained. A new format should not
 >>require any changes  in old code. (This is what abstract base classes
 >>are for.) In a Windows system you might want to have each as a
 >>precompiled DLL - a plugin - which would ease upgrading and allow
 >>installation only of relevant formats.  I guess something similar is
 >>possible in Unix. Even without this feature the formats would be
 >>dynamic - information about them and their options would be retrieved
 >>at run time.
 >>
 >>- The user interface for conversion would have a clear interface to the
 >>conversion process itself to allow alternatives (GUIs etc). This would
 >>include handling of formats (file extensions, options) in a dynamic way.
 >>
 >>- The input and output routines need to be more aware of the conversion
 >>process so that they can adjust. Examples are the previously discussed
 >>need to deal on-the-fly with generated molecules during CML input, and
 >>the need for a conditional <cml>...</cml> wrapper during output.

This is a generic problem for multiple molecules and could be something like

MolWriter.setMultipleMolecules(bool)
MolWriter.addOutputMolecule(mol)        // fails unless MolWriter allows multiple mols.
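
A small C++ sketch of that idea, combined with the conditional
<cml>...</cml> wrapper from the quoted point. The class name
CmlLikeWriter, the finish() method and the emitted markup are
assumptions for illustration (the output is CML-like, not checked
against the schema), and the sketch assumes setMultipleMolecules() is
called before the first molecule is written.

```cpp
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-in for the real molecule class.
struct Molecule {
    std::string id;
};

// Hypothetical writer that adds a wrapper element only when needed.
class CmlLikeWriter {
public:
    explicit CmlLikeWriter(std::ostream& out) : out_(out) {}

    // Must be called before the first molecule is written.
    void setMultipleMolecules(bool allow) { multiple_ = allow; }

    void addOutputMolecule(const Molecule& mol) {
        if (count_ >= 1 && !multiple_) {
            throw std::logic_error(
                "writer is not configured for multiple molecules");
        }
        if (count_ == 0 && multiple_) out_ << "<cml>\n";  // open wrapper once
        out_ << "<molecule id=\"" << mol.id << "\"/>\n";
        ++count_;
    }

    // Closes the wrapper if one was opened.
    void finish() {
        if (multiple_ && count_ > 0) out_ << "</cml>\n";
    }

private:
    std::ostream& out_;
    bool multiple_ = false;
    int count_ = 0;
};
```

A single molecule is written bare, while a multi-molecule run is
wrapped; a second molecule sent to a single-molecule writer raises an
error rather than producing malformed output.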

 >>- The conversion framework needs to handle more than just OBMol.
 >>I'm sure use would be made of the facility to convert different types of
 >>molecule, sets of molecules(conformers?), reactions, sets of reactions, etc.

Yes.

It is important to have a clear data structure for these. CML has been
extended to support these concepts.

 >>- There should be more support for non-expert users. There is a big
 >>activation barrier to using the program, which is not appropriate if it is
 >>to be used to just convert the format of some files. I realise precompiled
 >>code is not the Unix way, but Windows (and Mac?) users expect it.
 >>It would be nice to support it on appropriate platforms, while keeping
 >>a single version of the source code.

Agreed. We have mounted some *.exe on our site - but they tend to get
dated. It is a really tough problem even for smart people to compile C++ on
Windows - make and configure are useless. Note that sourceforge has compile
farms so it should be possible to get a whole range of compilers.

The main problem is commitment to making this happen. It's hard work and
not normally recognised by those outside the development process. If you
write a better architecture and lose 1% functionality you get few thanks!

P.

Peter Murray-Rust

Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069
-- 
Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW:    http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action.
                          E. Hemingway