Re: [Cdk-devel] CDK and JUMBO (Meeting between Christoph and PMR)

On Monday 11 November 2002 15:15, Peter Murray-Rust wrote:
> Christoph and I met last weekend near Berlin (for conference reasons) but
> managed to spend most of a day discussing CDK (and JUMBO) strategy. We
> believe there is a growing interest in OpenSource in chemistry and that
> these projects provide library functionality for a range of developments.
> [We appreciate that JMOL and JChemPaint are closely linked to this but
> didn't look at them in detail].

As the new Jmol project leader, I will move Jmol towards CDK and in the end 
completely integrate it with CDK.

> These are my own recollections from what we talked about - I'd be
> interested in feedback
>
> There is a real need for a toolkit in Java and CDK provides a useful - but
> not complete - amount of functionality. In many cases this is complementary
> to what JUMBO provides and we thought it important not to duplicate our
> efforts. Duplications not only waste scarce resource but they can lead to
> incompatibilities. There are a number of features in JUMBO which are more
> comprehensively done in CDK and I will switch to CDK for those rather than
> duplicate. Similarly there are reverse directions where we will continue
> development in JUMBO and offer those to CDK users. As a result the
> JUMBOLibrary will be refactored and be considerably smaller and I see this
> as a good thing.
>
> Some of the current strengths are:
>
> CDK:
>
> chemical perception (aromaticity, tautomerism, etc.)
> chemical topology (ringsets, etc.)
> substructure search and isomorphism
> structure generation and layout
> SMILES
>
> JUMBO:
>
> XML, SAX and DOM
> complex documents and legacy file formats
> metadata and dictionaries
> spectra and their interactive rendering
> [We shall therefore not currently develop the SMILES, chemical perception
> and layout in JUMBO].
>
> Both systems have ancillary support for graphics, maths, chemical
> resources, etc. and it may be valuable to harmonise some of these.
>
> We have started to use CDK - as users as well as developers - and offer
> some feedback. Please take this in a positive spirit :-)
>
> It has been easy to create some applications and problematic for others. We
> have used the layout and fingerprinter and these have performed well.
> However we have had problems with i/o routines and also with atomTypes. We
> like the unit test approach but wonder whether all modules have got a test
> harness. More specifically:
> - the i/o routines normally only extract a subset of the fields within a
> file. As examples MDLReader does not extract the 11 fields for each atom
> and does not read the 2D/3D flag while CMLReader does not read atomRef in
> stringArray nor hydrogenCount. None of the I/O routines throw exceptions
> and we feel it is critical that I/O detects *all* violations of the file
> specification (this is the approach we try to take in JUMBOLib). If the
> routines do not process all input then it must be clearly stated what is
> omitted.

Yes, documentation is still too limited... About those missing features... CDK 
has been mostly driven by need, which is quite common in open source 
projects... for some features, like reading some info from a file, there has 
been no need yet, and thus they have not been added yet... 

I've noted the remarks about the CMLReader and will fix that soon (I consider 
not reading some info from a file a bug...) About the MDLReader I've got bigger 
plans... Recently, a newer version of the format has been "published" (on 
their website), being V3000 (indeed, I also do not know why they did not just 
use V2002 ;)... Anyway, I'll update the MDLReader soon and include reading of 
many more fields...
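
To make the strict-reading request above concrete, here is a minimal sketch 
in plain Java. It assumes a toy "symbol x y z" line format; the class, 
method and exception names are hypothetical and are not part of CDK or 
JUMBO. The point is only that every deviation from the expected layout 
raises an exception instead of being skipped silently.

```java
// Hypothetical strict reader for a toy "symbol x y z" atom line.
// This is NOT CDK's MDLReader or CMLReader; all names are illustrative.
class StrictAtomLineReader {

    // Checked exception, so callers cannot ignore a malformed file.
    static class FormatException extends Exception {
        FormatException(String message) {
            super(message);
        }
    }

    // Assumed layout: one element symbol followed by three coordinates.
    static final int EXPECTED_FIELDS = 4;

    // Parses one line and rejects any deviation from the assumed layout.
    static double[] parseCoordinates(String line, int lineNumber)
            throws FormatException {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != EXPECTED_FIELDS) {
            throw new FormatException("line " + lineNumber + ": expected "
                    + EXPECTED_FIELDS + " fields but found " + fields.length);
        }
        double[] xyz = new double[3];
        for (int i = 0; i < 3; i++) {
            try {
                xyz[i] = Double.parseDouble(fields[i + 1]);
            } catch (NumberFormatException e) {
                throw new FormatException("line " + lineNumber
                        + ": field " + (i + 2) + " is not a number: "
                        + fields[i + 1]);
            }
        }
        return xyz;
    }
}
```

A real reader would check far more than the field count, but the same 
pattern applies: report the line and the field that broke the specification.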

> - the atomType seems fragile. I have tried to use
> SaturationChecker.saturateWithHydrogen() and this throws a number of
> nullPEs which appear to be because the AtomTypeFactory doesn't return
> complete atomTypes.
> - have all modules been tested? (even if they aren't included in the unit
> test). For example in trying to debug the use of
> SaturationChecker.saturateWithHydrogen() it would be very useful to be able
> to run a test which was known to have succeeded previously. In this way we
> could tell whether it was incorrect input or a bug. Note that ca 50% of
> software problems are due to bad input so that it is important to provide
> input which is known to work. In writing jumbo.euclid (maths routines) I
> wrote a main() routine for each class which exercised every public method.
> This has the additional benefit of showing users how the modules can be
> used.
> - some routines do not have useful Javadoc and it is impossible to
> work out what they do.

This is an important note. No, not all classes are properly tested yet. We 
use JUnit for this purpose, but not all classes have corresponding unit 
tests, and even where they do, the tests often do not cover all methods one 
by one... This has been on my todo list for a few months now, but I focused 
on setting up a system for quality assurance for the JavaDoc first... JUnit 
testing is next...
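
As an illustration of the "main() routine that exercises every public 
method" idea quoted above, here is a minimal sketch in plain Java. The 
Point2 class is hypothetical and is not taken from jumbo.euclid or CDK; it 
only shows how a class can ship with input that is known to work.

```java
// Hypothetical maths helper with a built-in self-test.
// NOT taken from jumbo.euclid or CDK; all names are illustrative.
class Point2 {
    final double x;
    final double y;

    public Point2(double x, double y) {
        this.x = x;
        this.y = y;
    }

    // Returns the component-wise sum of this point and another.
    public Point2 plus(Point2 other) {
        return new Point2(x + other.x, y + other.y);
    }

    // Returns the Euclidean distance to another point.
    public double distanceTo(Point2 other) {
        return Math.hypot(x - other.x, y - other.y);
    }

    // Fails loudly so that a broken method cannot go unnoticed.
    private static void check(boolean condition, String what) {
        if (!condition) {
            throw new IllegalStateException("self-test failed: " + what);
        }
    }

    // Exercises each public method with input that is known to work.
    public static void main(String[] args) {
        Point2 a = new Point2(0.0, 0.0);
        Point2 b = new Point2(3.0, 4.0);
        check(Math.abs(a.distanceTo(b) - 5.0) < 1e-12, "distanceTo");
        Point2 sum = a.plus(b);
        check(sum.x == 3.0 && sum.y == 4.0, "plus");
        System.out.println("Point2 self-test passed");
    }
}
```

The same checks translate directly into a JUnit test method, which has the 
advantage that all classes can be run from one test suite.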

> I know from personal experience how much effort it is to put this together
> but it is important that users can rely on what they find in a library.
>
> I will report some of these as bugs.

Yes, I would appreciate that very much... having a bug tracking system is a 
nice tool to identify short term goals (fixing bugs...)

Egon