From: Joerg K. W. <we...@in...> - 2004-01-07 08:37:41
Message: 1
Reply-To: "Chris Morley" <c.m...@ga...>
From: "Chris Morley" <c.m...@ga...>
To: <ope...@li...>
Date: Sun, 4 Jan 2004 15:09:00 -0000
Subject: [Open Babel] Suggested modified conversion framework

"Geoff Hutchison" wrote (some time back)
>> So as a self-proclaimed "maintainer" of the project, I have to keep
>> some idea in the back of my head how we'll get to 2.0, 2.1, 2.2, 3.0?,
>> etc. releases.
....
>> (What new formats are needed? What new features? What architecture
>> changes are needed?)
>> 3) Discussions on plans/roadmaps for things that don't go into the 2.0
>> release.
>>

Here are some suggestions for longer-term mods to the conversion process in OpenBabel. I feel that, although they may not be backward compatible, these features would be desirable to provide flexibility and maintainability for the future. Most have been previously discussed.

- A clearer separation of functions - the chemistry needs to be more separated from the conversion. (e.g. OBMol should not be where the input and output file formats are stored). The FileFormat class could be beefed up to do this.

- Each format needs to be self-contained. A new format should not require any changes in old code. (This is what abstract base classes are for.) In a Windows system you might want to have each as a precompiled DLL - a plugin - which would ease upgrading and allow installation only of relevant formats. I guess something similar is possible in Unix. Even without this feature the formats would be dynamic - information about them and their options would be retrieved at run time.

- The user interface for conversion would have a clear interface to the conversion process itself to allow alternatives (GUIs etc). This would include handling of formats (file extensions, options) in a dynamic way.

- The input and output routines need to be more aware of the conversion process so that they can adjust.
Examples are the previously discussed need to deal on-the-fly with generated molecules during CML input, and the need for a conditional <cml>...</cml> wrapper during output.

- The conversion framework needs to handle more than just OBMol. I'm sure use would be made of the facility to convert different types of molecule, sets of molecules (conformers?), reactions, sets of reactions, etc.

- There should be more support for non-expert users. There is a big activation barrier to using the program, which is not appropriate if it is to be used to just convert the format of some files. I realise precompiled code is not the Unix way, but Windows (and Mac?) users expect it. It would be nice to support it on appropriate platforms, while keeping a single version of the source code.

I've put together some working code which implements all of these features, described in more detail at
http://www.arcl02.dsl.pipex.com/OB/obframework.html

It is written so that it can be deployed as separate DLLs containing:
- the main chemistry (the code is almost as at present);
- the conversion process (a new class);
- one or more formats (can be the existing code with a small wrapper).
A user interface exe file makes use of these DLLs.

The console interface feels much the same as at present and there is a Windows GUI interface which is a drop-in replacement. Alternatively, the code can be compiled together, as at present, without changing the source code. I hope it is platform independent except for the GUI and the deployment of the DLLs.

Separating all the parts so that they can be separately compiled has been a challenge, because I wanted the conversion DLL and the user interfaces not to use any of the chemistry directly - they do not #include mol.h. This has meant using C++ that is a bit more adventurous than in the current code. For instance I found it necessary to use a smart pointer from the Boost library. This is not part of the standard language (although pretty close).
I also need to point out that I am not a C++ expert - but it is all working ok at present.

Using the DLLs, existing applications can gain much broader input format compatibility while not needing to be recompiled when the OB code changes.

To illustrate the use of a non-OBMol conversion I have added a format for converting to and from a RXN file describing a reaction.

The Windows interface has a novel feature that uses the text description of the various conversion options (previously output as help in the command line interface) to dynamically construct a set of checkboxes, etc. appropriate to the requested file format.

You can try a statically-linked compiled version of the GUI-driven framework with a few formats by just downloading, extracting and running
http://www.arcl02.dsl.pipex.com/OB/OBGUIs.zip (407K)
It should work on any 32bit Windows system.

It may be that making changes like this to a project that puts the emphasis on the chemistry rather than programming is a bit over the top. Is it worth developing a non-backward compatible framework like this any further?

Chris Morley

--__--__--

Message: 2
Date: Sun, 04 Jan 2004 17:34:04 +0000
To: <ope...@li...>
From: Peter Murray-Rust <pm...@ca...>
Subject: Re: [Open Babel] Suggested modified conversion framework

At 15:09 04/01/2004 +0000, Chris Morley wrote:
>>"Geoff Hutchison" wrote (some time back)
>
>>> > So as a self-proclaimed "maintainer" of the project, I have to keep
>>> > some idea in the back of my head how we'll get to 2.0, 2.1, 2.2, 3.0?,
>>> > etc. releases.
>
>>....
>
>>> > (What new formats are needed? What new features? What architecture
>>> > changes are needed?)
>>> > 3) Discussions on plans/roadmaps for things that don't go into the 2.0
>>> > release.
>>> >
>
>>
>>Here are some suggestions for longer-term mods to the conversion process
>>in OpenBabel.
>>I feel that, although they may not be backward compatible,
>>these features would be desirable to provide flexibility and maintainability
>>for the future. Most have been previously discussed.

I'd like to support the discussion here and encourage refactoring of babel. Having spent the last ca 2 weeks rewriting the C++ support for CML (it's virtually ready), I think it's critical that Babel's design evolves along modular lines as suggested here and elsewhere.

My vision of babel development is that it should be an API/plugin type of approach. A developer should be able to write the readFoo and writeFoo modules by using an API rather than having to understand the whole architecture of the program. This depends, however, on having a very clear and open architecture and a clear understanding of the semantics/ontology (i.e. exactly what each piece of information means).

I am currently going through this process with CML - it now has about 100 elements ("objects"). Probably about half of these correspond to concepts in Babel. I am optimistic that most of the concepts in chemistry are universal - the difficulties lie in different representations. A few concepts (e.g. aromaticity) depend critically on the algorithms used and so there is a need for these to be spelled out clearly. (I do not care whether pyrrole is aromatic or not. However, if system A decrees that it is, and system B does not, we may need both those algorithms to convert between A's representation and B's.) Such concepts therefore depend on "perception" and it is critical that the perception is modularised (and in principle variable on demand). Most concepts are easier - they depend on careful definition rather than perception - so it is important to define carefully what is meant by (say) hydrogen count, e.g. in B2H6.

The core of OB, therefore, is a representation of these concepts. (Whether it is in C++, Java, XML, UML or RDF/OWL is probably unimportant.)
At present the OB core is a mixture of the data fields in mol.h and the ancillary files (e.g. aromatic.txt). It is important that developers are able to find the concepts they need quickly and accurately - then writing code is much easier. In fact I am working towards a system where CML++ code is generated automatically from the schema.

A Foo developer could therefore follow these steps:
- identify the concepts in Foo
- map them onto Babel API concepts.
- where they map precisely, code the Foo syntax onto the OB API. This can be almost trivial.
- where they do not match, the developer has the options:
  - ignore the concept. A good example is that OB ignores bibliographic info. The information is then lost in the conversion process.
  - convert the data to an equivalent OB concept. Examples may be wedge/hatch bonds converted to atom Parities (though this is not always possible - some wedges do not correspond to atom-centered stereo). Conversion might be provided by babel or might be added by the Foo developer.
  - write code to add information (an example is molecular formula/mass - not supported in OB) which can be algorithmically generated.

Where possible it will help if the concepts and representations are consistent over the OpenSource chemistry community.

>>- A clearer separation of functions - the chemistry needs to be more
>>separated from the conversion. (e.g. OBMol should not be where the
>>input and output file formats are stored). The FileFormat class could
>>be beefed up to do this.

Pattern-based design suggests that specialist modules should be created to manage generic tasks and subclassed where necessary. Thus in CML software there are decorators which add functionality to classes (e.g. a moleculeDecorator can wrap a molecule and add getMolecularMass() to it). Similarly there are serializers (writers) for output and eventReaders (SAX-like) for input. Each of these is subclassed for different file formats.
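As a rough illustration of the decorator and serializer ideas just described, here is a minimal C++ sketch. It is not code from Open Babel or from the CML software: the `Molecule` struct, the small mass table, and the line-based "Foo" output layout are all hypothetical, chosen only to keep the example self-contained. It shows a decorator adding `getMolecularMass()` without touching the molecule class, and an abstract writer subclassed for one format.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical minimal molecule: just a list of element symbols.
struct Molecule {
    std::vector<std::string> atoms;
};

// Decorator: wraps a molecule and adds a derived property without
// changing the Molecule class itself.
class MoleculeDecorator {
public:
    explicit MoleculeDecorator(const Molecule& mol) : mol_(mol) {}

    // Sums rounded atomic masses from a tiny illustrative table.
    // Symbols missing from the table trip the assertion in this sketch.
    double getMolecularMass() const {
        static const std::map<std::string, double> kMass = {
            {"H", 1.008}, {"C", 12.011}, {"N", 14.007}, {"O", 15.999}};
        double total = 0.0;
        for (const std::string& symbol : mol_.atoms) {
            auto it = kMass.find(symbol);
            assert(it != kMass.end() && "element not in this sketch's table");
            total += it->second;
        }
        return total;
    }

private:
    const Molecule& mol_;
};

// Abstract serializer; each file format provides its own subclass.
class MolWriter {
public:
    virtual ~MolWriter() = default;
    virtual std::string write(const Molecule& mol) const = 0;
};

// Hypothetical "Foo" format: atom count on the first line,
// then one element symbol per line.
class FooWriter : public MolWriter {
public:
    std::string write(const Molecule& mol) const override {
        std::ostringstream out;
        out << mol.atoms.size() << "\n";
        for (const std::string& symbol : mol.atoms) out << symbol << "\n";
        return out.str();
    }
};
```

With `Molecule water{{"O", "H", "H"}};`, wrapping it in `MoleculeDecorator` gives a mass of about 18.015, and `FooWriter().write(water)` returns the four-line string `"3\nO\nH\nH\n"`. A reader side would follow the same shape, with an abstract reader and one subclass per format.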
A typical pattern for Foo could be

    FooReader extends AbstractMolReader implements MolReader
    FooWriter extends AbstractMolWriter implements MolWriter

>>- Each format needs to be self-contained. A new format should not
>>require any changes in old code. (This is what abstract base classes
>>are for.) In a Windows system you might want to have each as a
>>precompiled DLL - a plugin - which would ease upgrading and allow
>>installation only of relevant formats. I guess something similar is
>>possible in Unix. Even without this feature the formats would be
>>dynamic - information about them and their options would be retrieved
>>at run time.
>>
>>- The user interface for conversion would have a clear interface to the
>>conversion process itself to allow alternatives (GUIs etc). This would
>>include handling of formats (file extensions, options) in a dynamic way.
>>
>>- The input and output routines need to be more aware of the conversion
>>process so that they can adjust. Examples are the previously discussed
>>need to deal on-the-fly with generated molecules during CML input, and
>>the need for a conditional <cml>...</cml> wrapper during output.

This is a generic problem for multiple molecules and could be something like

    MolWriter.setMultipleMolecules(bool)
    MolWriter.addOutputMolecule(mol) // fails unless MolWriter allows multiple mols.

>>- The conversion framework needs to handle more than just OBMol.
>>I'm sure use would be made of the facility to convert different types of
>>molecule, sets of molecules (conformers?), reactions, sets of reactions, etc.

Yes. It is important to have a clear data structure for these. CML has been extended to support these concepts.

>>- There should be more support for non-expert users. There is a big
>>activation barrier to using the program, which is not appropriate if it is
>>to be used to just convert the format of some files. I realise precompiled
>>code is not the Unix way, but Windows (and Mac?) users expect it.
>>It would be nice to support it on appropriate platforms, while keeping
>>a single version of the source code.

Agreed. We have mounted some *.exe on our site - but they tend to get dated. It is a really tough problem even for smart people to compile C++ on Windows - make and configure are useless. Note that SourceForge has compile farms so it should be possible to get a whole range of compilers.

The main problem is commitment to making this happen. It's hard work and not normally recognised by those outside the development process. If you write a better architecture and lose 1% functionality you get few thanks!

P.

Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069

--
Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action.

E. Hemingway