From: Joerg K. W. <we...@in...> - 2004-01-07 08:39:31
Joerg K. Wegner wrote:
> Message: 1
> Reply-To: "Chris Morley" <c.m...@ga...>
> From: "Chris Morley" <c.m...@ga...>
> To: <ope...@li...>
> Date: Sun, 4 Jan 2004 15:09:00 -0000
> Subject: [Open Babel] Suggested modified conversion framework
>
> "Geoff Hutchison" wrote (some time back)
>
> >> So as a self-proclaimed "maintainer" of the project, I have to keep
> >> some idea in the back of my head how we'll get to 2.0, 2.1, 2.2, 3.0?,
> >> etc. releases.
>
> ....
>
> >> (What new formats are needed? What new features? What architecture
> >> changes are needed?)
> >> 3) Discussions on plans/roadmaps for things that don't go into the 2.0
> >> release.
>
> Here are some suggestions for longer-term mods to the conversion process
> in OpenBabel. I feel that, although they may not be backward compatible,
> these features would be desirable to provide flexibility and
> maintainability for the future. Most have been previously discussed.
>
> - A clearer separation of functions - the chemistry needs to be more
> separated from the conversion. (e.g. OBMol should not be where the
> input and output file formats are stored). The FileFormat class could
> be beefed up to do this.
>
> - Each format needs to be self contained. A new format should not
> require any changes in old code. (This is what abstract base classes
> are for.) In a Windows system you might want to have each as a
> precompiled DLL - a plugin - which would ease upgrading and allow
> installation only of relevant formats. I guess something similar is
> possible in Unix. Even without this feature the formats would be
> dynamic - information about them and their options would be retrieved
> at run time.
>
> - The user interface for conversion would have a clear interface to the
> conversion process itself to allow alternatives (GUIs etc). This would
> include handling of formats (file extensions, options) in a dynamic way.
>
> - The input and output routines need to be more aware of the conversion
> process so that they can adjust. Examples are the previously discussed
> need to deal on-the-fly with generated molecules during CML input, and
> the need for a conditional <cml>...</cml> wrapper during output.
>
> - The conversion framework needs to handle more than just OBMol.
> I'm sure use would be made of the facility to convert different types of
> molecule, sets of molecules (conformers?), reactions, sets of reactions,
> etc.
>
> - There should be more support for non-expert users. There is a big
> activation barrier to using the program, which is not appropriate if it is
> to be used to just convert the format of some files. I realise precompiled
> code is not the Unix way, but Windows (and Mac?) users expect it.
> It would be nice to support it on appropriate platforms, while keeping
> a single version of the source code.
>
> I've put together some working code which implements all of these
> features, described in more detail at
> http://www.arcl02.dsl.pipex.com/OB/obframework.html
>
> It is written so that it can be deployed as separate DLLs containing:
> the main chemistry (the code is almost as at present);
> the conversion process (a new class);
> one or more formats (can be the existing code with a small wrapper);
>
> A user interface exe file makes use of these DLLs. The console interface
> feels much the same as at present and there is a Windows GUI interface
> which is a drop-in replacement.
>
> Alternatively, the code can be compiled together, as at present, without
> changing the source code. I hope it is platform independent except for
> the GUI and the deployment of the DLLs.
>
> Separating all the parts so that they can be separately compiled has been a
> challenge, because I wanted the conversion DLL and the user interfaces
> not to use directly any of the chemistry - they do not #include mol.h.
> This has meant the use of C++ a bit more adventurous than in the current
> code. For instance I found it necessary to use a smart pointer from
> the Boost library. This is not part of the standard language (although
> pretty close). I also need to point out that I am not a C++ expert -
> but it is all working ok at present.
>
> Using the DLLs, existing applications can add a much broader input format
> compatibility while not needing to be recompiled when the OB code changes.
>
> To illustrate the use of a non-OBMol conversion I have added a format
> for converting to and from a RXN file describing a reaction.
>
> The Windows interface has a novel feature that uses the text description
> of the various conversion options (previously output as help in the command
> line interface) to dynamically construct a set of checkboxes, etc
> appropriate to the requested file format. You can try a statically-linked
> compiled version of the GUI-driven framework with a few formats by just
> downloading, extracting and running
> http://www.arcl02.dsl.pipex.com/OB/OBGUIs.zip (407K)
> It should work on any 32bit Windows system.
>
> It may be that making changes like this to a project that puts the emphasis
> on the chemistry rather than programming is a bit over the top. Is it worth
> developing a non-backward compatible framework like this any further?
>
> Chris Morley
>
> --__--__--
>
> Message: 2
> Date: Sun, 04 Jan 2004 17:34:04 +0000
> To: <ope...@li...>
> From: Peter Murray-Rust <pm...@ca...>
> Subject: Re: [Open Babel] Suggested modified conversion framework
>
> At 15:09 04/01/2004 +0000, Chris Morley wrote:
>
> >>"Geoff Hutchison" wrote (some time back)
>
> >>> > So as a self-proclaimed "maintainer" of the project, I have to keep
> >>> > some idea in the back of my head how we'll get to 2.0, 2.1, 2.2, 3.0?,
> >>> > etc. releases.
>
> >>....
>
> >>> > (What new formats are needed? What new features? What architecture
> >>> > changes are needed?)
> >>> > 3) Discussions on plans/roadmaps for things that don't go into the 2.0
> >>> > release.
>
> >>Here are some suggestions for longer-term mods to the conversion process
> >>in OpenBabel. I feel that, although they may not be backward compatible,
> >>these features would be desirable to provide flexibility and
> >>maintainability for the future. Most have been previously discussed.
>
> I'd like to support the discussion here and encourage refactoring of babel.
> Having spent the last ca 2 weeks rewriting the C++ support for CML (it's
> virtually ready), I think it's critical that Babel's design evolves along
> modular lines as suggested here and elsewhere.
>
> My vision of babel development is that it should be an API/plugin type of
> approach. A developer should be able to write the readFoo and writeFoo
> modules by using an API rather than having to understand the whole
> architecture of the program. This depends, however, on having very clear
> and open architecture and clear understanding of the semantics/ontology
> (i.e. exactly what each piece of information means).
>
> I am currently going through this process with CML - it now has about 100
> elements ("objects"). Probably about half of these correspond to concepts
> in Babel. I am optimistic that most of the concepts in chemistry are
> universal - the difficulties lie in different representations. A few
> concepts (e.g. aromaticity) depend critically on the algorithms used and so
> there is a need for these to be spelled out clearly. (I do not care whether
> pyrrole is aromatic or not. However, if system A decrees that it is, and
> system B does not, we may need both those algorithms to convert between A's
> representation and B's.) Such concepts therefore depend on "perception" and
> it is critical that the perception is modularised (and in principle
> variable on demand).
>
> Most concepts are easier - they depend on careful definition rather than
> perception - so it is important to define carefully what is meant by (say)
> hydrogen count, e.g. in B2H6.
>
> The core of OB, therefore, is a representation of these concepts. (Whether
> it is in C++, Java, XML, UML or RDF/OWL is probably unimportant.) At present
> the OB core is a mixture of the data fields in mol.h and the ancillary
> files (e.g. aromatic.txt). It is important that developers are able to find
> the concepts they need quickly and accurately - then writing code is much
> easier. In fact I am working towards a system where CML++ code is generated
> automatically from the schema.
>
> A Foo developer therefore could follow the following steps:
> - identify the concepts in Foo
> - map them onto Babel API concepts.
> - where they map precisely code the Foo syntax onto the OB API. This can be
> almost trivial.
>
> Where they do not match, the developer has the options:
> - ignore the concept. A good example is that OB ignores bibliographic
> info. The information is then lost in the conversion process.
> - convert the data to an equivalent OB concept. Examples may be wedge/hatch
> bonds converted to atom Parities (though this is not always possible - some
> wedges do not correspond to atom-centered stereo). Conversion might be
> provided by babel or might be added by the Foo developer.
> - write code to add information (an example is molecular formula/mass - not
> supported in OB) which can be algorithmically generated.
>
> Where possible it will help if the concepts and representations are
> consistent over the OpenSource chemistry community.
>
> >>- A clearer separation of functions - the chemistry needs to be more
> >>separated from the conversion. (e.g. OBMol should not be where the
> >>input and output file formats are stored). The FileFormat class could
> >>be beefed up to do this.
>
> Pattern-based design suggests that specialist modules should be created to
> manage generic tasks and subclassed where necessary. Thus in CML software
> there are decorators which add functionality to classes (e.g. a
> moleculeDecorator can wrap a molecule and add getMolecularMass() to it).
> Similarly there are serializers (writers) for output and eventReaders
> (SAX-like) for input. Each of these is subclassed for different file
> formats. A typical pattern for Foo could be
>
>     FooReader extends AbstractMolReader implements MolReader
>     FooWriter extends AbstractMolWriter implements MolWriter
>
> >>- Each format needs to be self contained. A new format should not
> >>require any changes in old code. (This is what abstract base classes
> >>are for.) In a Windows system you might want to have each as a
> >>precompiled DLL - a plugin - which would ease upgrading and allow
> >>installation only of relevant formats. I guess something similar is
> >>possible in Unix. Even without this feature the formats would be
> >>dynamic - information about them and their options would be retrieved
> >>at run time.
> >>
> >>- The user interface for conversion would have a clear interface to the
> >>conversion process itself to allow alternatives (GUIs etc). This would
> >>include handling of formats (file extensions, options) in a dynamic way.
> >>
> >>- The input and output routines need to be more aware of the conversion
> >>process so that they can adjust. Examples are the previously discussed
> >>need to deal on-the-fly with generated molecules during CML input, and
> >>the need for a conditional <cml>...</cml> wrapper during output.
>
> This is a generic problem for multiple molecules and could be something
> like
>
>     MolWriter.setMultipleMolecules(bool)
>     MolWriter.addOutputMolecule(mol) // fails unless MolWriter allows multiple mols.
>
> >>- The conversion framework needs to handle more than just OBMol.
> >>I'm sure use would be made of the facility to convert different types of
> >>molecule, sets of molecules (conformers?), reactions, sets of reactions, etc.
>
> Yes.
>
> It is important to have a clear data structure for these. CML has been
> extended to support these concepts.
>
> >>- There should be more support for non-expert users. There is a big
> >>activation barrier to using the program, which is not appropriate if it is
> >>to be used to just convert the format of some files. I realise precompiled
> >>code is not the Unix way, but Windows (and Mac?) users expect it.
> >>It would be nice to support it on appropriate platforms, while keeping
> >>a single version of the source code.
>
> Agreed. We have mounted some *.exe on our site - but they tend to get
> dated. It is a really tough problem even for smart people to compile C++ on
> Windows - make and configure are useless. Note that sourceforge has compile
> farms so it should be possible to get a whole range of compilers.
>
> The main problem is commitment to making this happen. It's hard work and
> not normally recognised by those outside the development process. If you
> write a better architecture and lose 1% functionality you get few thanks!
>
> P.
>
> Peter Murray-Rust
> Unilever Centre for Molecular Informatics
> Chemistry Department, Cambridge University
> Lensfield Road, CAMBRIDGE, CB2 1EW, UK
> Tel: +44-1223-763069

Hi,

Happy new year to all!

Interesting ... I see two basic problems:
The molecule representation and its definition are based on four expert
systems, and for conversion you need some kind of atom typer.

OpenBabel/JOELib process to assign atom types:
http://www-ra.informatik.uni-tuebingen.de/software/joelib/tutorial/atomtyper.html

So what is the problem?
1. Some algorithms are required, like SSSR, SMARTS (also partly based on
   this assignment process), aromaticTyper.
   Some code fragments are modified versions of already published
   algorithms, but not standalone enough to publish these fragments again.
2. Definitions are needed for the assignment, such as aromatic.txt,
   atomtype.txt, phmodel.txt ..., which are also based on SMARTS. That's
   fine: users can define their own protonation model.

FINALLY:
1. From an object-oriented design point of view the separation is
   recommended! This will be a great benefit! Good work!
2. As long as no abstract definition (pseudocode or something else) of the
   assignment process is available, separating the molecules from the
   conversion is NOT possible. Special cases can be treated, but nothing
   else. Otherwise every molecule must have its own atom typer, which is,
   in my opinion, a huge performance problem. But what about an
   atomTyperCache, instead of Singleton classes (JOELib) or static
   data/methods (C++) as in the current implementations?

I do not know of any publication or exact definition of what an
OpenBabel/JOELib molecule is, because this is really complex.
One possibility would be to define a huge molecular data set with atom
types and formulate a classification+optimization problem. This is the most
transparent and most correct way for a computer scientist, but who will
create and publish such a huge database (manpower?).
What about tautomers in this database? Can this be another classification
task?

I would like to have such a database, because I'm interested in the
optimization and data mining approach, but time is scarce ...

Regards, Joerg

--
Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action.
E. Hemingway
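
P.S. To make the atomTyperCache idea a bit more concrete, here is a minimal
sketch in C++. It is only an illustration under stated assumptions: the
names AtomTyper, AtomTyperCache, Loader and get() are hypothetical and are
not part of OpenBabel or JOELib, and the "typer" is a placeholder that only
remembers which definition file it was built from. The point is that the
cache is an ordinary object handed to whoever needs it, so typers are
shared between molecules without a Singleton or static data. It uses
std::shared_ptr; a Boost smart pointer could play the same role.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical placeholder for an atom typer built from one definition
// file (e.g. "aromatic.txt"). A real typer would hold parsed SMARTS rules;
// this one only remembers the name of its definition.
class AtomTyper {
public:
    explicit AtomTyper(std::string definition)
        : definition_(std::move(definition)) {}
    const std::string& definition() const { return definition_; }

private:
    std::string definition_;
};

// Hypothetical cache: owned by the code that drives the conversion and
// passed around explicitly, instead of being a global Singleton.
class AtomTyperCache {
public:
    // The loader does the expensive work of building a typer from a
    // definition; it is injected so the cache itself knows no chemistry.
    using Loader =
        std::function<std::shared_ptr<AtomTyper>(const std::string&)>;

    explicit AtomTyperCache(Loader loader) : loader_(std::move(loader)) {}

    // Return the shared typer for a definition, building it on first use.
    std::shared_ptr<AtomTyper> get(const std::string& definition) {
        auto it = typers_.find(definition);
        if (it != typers_.end()) {
            return it->second;  // already built: reuse the same instance
        }
        auto typer = loader_(definition);  // built once per definition
        typers_.emplace(definition, typer);
        return typer;
    }

    // Number of distinct definitions currently cached.
    std::size_t size() const { return typers_.size(); }

private:
    Loader loader_;
    std::map<std::string, std::shared_ptr<AtomTyper>> typers_;
};
```

With this layout every molecule that asks for "aromatic.txt" gets the same
typer instance, and a user-defined protonation model is simply another key
in the cache. Whether this is fast enough, and how it behaves with several
threads, is not addressed by the sketch.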