CML development [was Re: [cml/ccml-discuss] CML format and ChemAxon]

At 11:37 13/04/2005, Jozsef Kovacs wrote:
>Dear Peter,
>
>I'm Jozsef Kovacs from ChemAxon and I'm going to work on the enhancement 
>of our CML. You had some concern about our CML compatibility (Date: Thu, 
>07 Apr 2005 08:54:05 +0100). We support the subset of the CML elements 
>that are relevant for our applications. We have also added certain new 
>attributes that are not in your Schema. The reason is that our 
>applications have many features that are not supported in your Schema. 
>This means, that we can import the CML files created in different 
>applications but we ignore those elements that we don't support. We 
>assumed that other applications ignore our attributes too.
>However, if this assumption is not correct, we are ready to modify our 
>code. We are about to remove the attributes that don't belong to your 
>Schema. You have mentioned that "the order of atoms may be important". 
>Could you tell me more about it?
>Anyway, the ChemAxon's goal is to meet the newest industry standards, and 
>it would be a pleasure to cooperate with you in fixing this problem. We 
>look forward to your comments.
>
>Best regards,
>Jozsef Kovacs

Many thanks Jozsef,
         This is an excellent suggestion and we accept your motivation and 
commitment. Public discussion is extremely important and very valuable. We 
will respond positively, hopefully rapidly, and enthusiastically to 
questions raised here. We expect that there will be contributions from a 
range of people.

         CML is developed through an unwritten process (rather like the
British Constitution, which has never been written down). There are certain
fundamentals, rather similar to the IETF's "rough consensus and running
code". The essentials include:
- CML must be conforming XML. (I have seen things calling themselves "CML" 
which did not even parse in generic XML tools).
- where possible CML uses emerging W3C technology rather than inventing its 
own. Thus we use DOM, SAX, RDF, RSS, XSLT, XSD, etc.
- CML interoperates with other XML languages through XML namespaces.
- the definition of CML is taken from the publications in peer-reviewed
literature. This means that the latest formal specification is JCICS 2003.
We all intend to conform to that.
- CML is an Open process to the extent that it is published, we receive 
contributions and acknowledge them, we promote and applaud 
interoperability. Where possible everything, including discussions like 
this, is openly visible and much should be made available through an Open
redistribution license such as Creative Commons and the
Budapest/Berlin/Bethesda declarations of Open Access. However, it is not
like Open source, as you are not allowed to modify the definition of the
specification. CML welcomes non-Open conformant implementations - and does
not regard them as morally inferior. However CML cannot use closed source 
for its conformance testing.
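
The first essential above (CML must parse in generic XML tools) can be checked with any standard parser. The following is only a minimal sketch in Python using the standard library; the namespace URI and the helper name is_well_formed are assumptions for illustration, not something defined in this message.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace URI, for illustration only; use the URI given
# by the specification you actually target.
CML_NS = "http://www.xml-cml.org/schema"


def is_well_formed(text):
    """Return True if `text` parses with a generic XML parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A well-formed fragment with a namespace declaration.
good = '<molecule xmlns="%s" id="m1"><atomArray/></molecule>' % CML_NS
# Calls itself a molecule, but atomArray is never closed.
bad = '<molecule id="m1"><atomArray></molecule>'

print(is_well_formed(good), is_well_formed(bad))
```

Well-formedness is only the first step; it says nothing about whether the document conforms to the CML schema.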

In the design of CML there are certain principles.
- explicit semantics are preferable to implicit semantics, even at the cost 
of some verbosity
- a feature should have been exposed to the community before being 
incorporated in the publication
- new features are resisted until it is impossible to refuse them. We avoid 
bloat
- therefore there is usually an experimental specification before the next 
formalisation.
- features are not removed - this would be difficult for existing 
applications - but they may be obsoleted.
- elements and attributes should - as far as possible - be 
context-independent. This means subsets of the schema can be used. For 
example a theochem application might only use molecule and atoms (no bonds, 
or anything else).
- in all CML applications the default semantics of elements and attributes 
must be identical. Thus formalCharge represents an integral number of 
electrons removed from or added to an atom. However the convention 
attribute allows additional semantics to be added. For example there is
little communal systematisation of bond orders and types. CML uses
"1"/"2"/"3" (or "S"/"D"/"T") for "normal" single/double/triple bonds. Other
values are allowed but should have a convention. Thus "4" could mean
aromatic for convention="MDL" and quadruple for others.
- CML applications may ignore foreign namespaces. For example a 
cml:molecule could contain an SVG element, or an SVG document could contain 
a cml:molecule
- prefixes (e.g. cml:) are NOT hardcoded. They must be accompanied by a 
namespace declaration.
- additional elements and attributes in the CML namespace are NOT allowed. 
It would be easy for files to collide if this were allowed.
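
Three of the design principles above (prefixes are not hardcoded, foreign namespaces may be ignored, non-default bond orders need a convention) can be sketched in a few lines of Python. This is an illustration under stated assumptions: the namespace URI, the sample document and the function names are hypothetical, and treating a "4" without a convention as "unknown" is one possible reading rather than a rule from the specification.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # assumed URI, for illustration

# The prefix "c" is arbitrary; only the namespace URI it is bound to matters.
DOC = """
<c:molecule xmlns:c="http://www.xml-cml.org/schema"
            xmlns:svg="http://www.w3.org/2000/svg" id="m1">
  <svg:g/>
  <c:bondArray>
    <c:bond atomRefs2="a1 a2" order="2"/>
    <c:bond atomRefs2="a2 a3" order="4" convention="MDL"/>
    <c:bond atomRefs2="a3 a4" order="4"/>
  </c:bondArray>
</c:molecule>
"""

# Default semantics shared by all CML applications.
DEFAULT_ORDERS = {
    "1": "single", "S": "single",
    "2": "double", "D": "double",
    "3": "triple", "T": "triple",
}


def describe_bond(bond):
    """Interpret a bond order, consulting the convention for other values."""
    order = bond.get("order")
    convention = bond.get("convention")
    if order in DEFAULT_ORDERS:
        return DEFAULT_ORDERS[order]
    if convention == "MDL" and order == "4":
        return "aromatic"
    # No convention we understand: report rather than guess.
    return "unknown (order=%r, convention=%r)" % (order, convention)


def bond_descriptions(text):
    root = ET.fromstring(text)
    # Match on the namespace URI, never on the prefix; the svg:g element
    # is in a foreign namespace and is simply not visited.
    return [describe_bond(b) for b in root.iter("{%s}bond" % CML_NS)]


print(bond_descriptions(DOC))
```

The same code works unchanged if the document uses the prefix cml:, no prefix at all (a default namespace), or anything else.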

In the development of CML software there are also certain principles.
- CML itself is not software. The equivalent of a bug is an inconsistency, 
and of a feature is an unhappy piece of design which is ugly or difficult 
to use.
- software should strive to be conformant. We intend to produce conformance 
tools in the near future.
- it should be easy to develop simple applications of CML. A CML processor 
may ignore elements and attributes if the interpretation does not depend on 
them. For example some current CML software does not interpret reaction or 
spectrum.
- in principle all XML input should be passed to the output if required. 
However this requires significant DOM programming and the W3C DOM is not 
user-friendly. Therefore some CML processing may lose information. Ideally 
a roundtrip of readCML->writeCML->readCML should be lossless, but this is 
difficult to achieve.
- All CML software should interpret information in the same way (unless it 
ignores it). It should not invent local semantics. Thus if 100 molecules
are concatenated in a CML document the semantics are just that - 100
concatenated molecules. They are not necessarily snapshots on a dynamics 
trajectory, different experimental observations, etc. We are developing RDF 
as the method to annotate complex compound documents.
- No conformant CML file should cause CML-aware software to crash, and
error messages should be as informative as possible - e.g. "FOOBAR does
not support the CML reaction element | the CML array syntax | the CML
map/link vocabulary, etc. and this document will not be processed", or
"PLINGE has detected multiple CML molecule elements and displayed each in a
separate panel. CML spectrum elements are ignored". Users then have a
slightly better idea of what has happened.
- CML documents range over a very wide variety - molecules, comp chem 
output, instruction manuals, synthetic recipes, journal articles, etc. 
Multiple namespaces will be common. There is no default "best" way to 
display or process these and there is unlikely to be a "CML browser" that 
does everything. However there are likely to be generic tools which manage 
compound documents and which can accept CML plugins to display chemistry in 
foreign contexts.
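
The point about informative messages can be made concrete with a small sketch. The Python below is a hypothetical illustration, not the behaviour of any existing tool: the set of supported elements, the namespace URI and the function names are all assumptions. It skips foreign namespaces, counts molecules, and reports which CML elements it does not interpret instead of failing.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # assumed URI, for illustration
# Hypothetical subset of CML that this imaginary application interprets.
SUPPORTED = {"cml", "molecule", "atomArray", "atom", "bondArray", "bond"}

DOC = """
<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1"/>
  <molecule id="m2"/>
  <spectrum id="s1"/>
  <svg:g xmlns:svg="http://www.w3.org/2000/svg"/>
</cml>
"""


def split_tag(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def survey(text, app_name="FOOBAR"):
    """Return user-facing messages describing what was and was not used."""
    root = ET.fromstring(text)
    molecules = 0
    ignored = set()
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip comments or processing instructions, if present
        uri, local = split_tag(el.tag)
        if uri != CML_NS:
            continue  # foreign namespace: not our business
        if local == "molecule":
            molecules += 1
        elif local not in SUPPORTED:
            ignored.add(local)
    messages = ["%s has detected %d CML molecule element(s)."
                % (app_name, molecules)]
    for name in sorted(ignored):
        messages.append("%s does not support the CML %s element; "
                        "it was ignored." % (app_name, name))
    return messages


for line in survey(DOC):
    print(line)
```

Note that this sketch does not attempt the lossless roundtrip discussed above; it only reads.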

In general the CML in JCICS2003 has stood the test and there are very few 
immediate needs to change the vocabulary in major ways. About
2-3 (unintentionally) implicit semantics have been formalised by a new
attribute. The creation of "CMLReact" has involved 2-3 additional elements 
and these will be submitted for publication shortly. CMLComp is being 
informed by many marked up outputs and the main need is for the semantics 
of basisSet to be enlarged. The solid state will be explored over the next 
year in a funded project. Most of the work involves consolidating and 
firming up the semantics on current vocabulary. This is hard, because 
chemistry is very sloppy over its information, but we are making progress. 
The primary mechanism is the JUMBO toolkit which is element/attribute 
centred. The semantics of every element is explored and most have been 
done. For each we create a range of unit tests - currently over 400. This 
will be amplified by conformance tests.

We also now need to create communal dictionaries for the common uses of 
dictRef. Some of these have been collected for reactions, but they are also 
required for common CML concepts. Here again anyone cane create their own 
namespaced dictionaries; if there is communal agreement, terms in these may 
be raised to the communal CML area.
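
As a rough sketch of how namespaced dictionaries might work in practice: a dictRef value such as "foo:logP" has a prefix that is bound to a dictionary namespace by an ordinary xmlns declaration. The Python below resolves each dictRef to a (namespace URI, term) pair. The dictionary URI http://example.org/dict/foo, the sample document and the function name are hypothetical, and the flat prefix map ignores prefix re-declaration in nested scopes.

```python
import io
import xml.etree.ElementTree as ET

DOC = """
<molecule xmlns="http://www.xml-cml.org/schema"
          xmlns:foo="http://example.org/dict/foo" id="m1">
  <property dictRef="foo:logP"><scalar>1.2</scalar></property>
</molecule>
"""


def resolve_dict_refs(text):
    """Return (dictionary namespace URI, term) for each dictRef found."""
    ns_map = {}  # prefix -> URI; simplification: no scoping
    refs = []
    source = io.BytesIO(text.encode("utf-8"))
    # "start-ns" events report namespace declarations, which the normal
    # ElementTree API discards after parsing.
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            ns_map[prefix] = uri
        else:
            ref = payload.get("dictRef")
            if ref and ":" in ref:
                prefix, term = ref.split(":", 1)
                # An undeclared prefix resolves to None.
                refs.append((ns_map.get(prefix), term))
    return refs


print(resolve_dict_refs(DOC))
```

Two groups can then use the same local term without colliding, because the namespace URI distinguishes their dictionaries.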

Some areas of CML are more explored than others. Thus we have intensively 
explored reactions and are reasonably confident that the specification is
robust. We are exploring spectra but have some way to go. We have much 
experience with comp chem calculations on geometry optimisation and 
properties, but little on dynamics and ensembles. Recently we have made 
major advances in crystallography.

CML is a meritocracy and participants are honoured by their contribution - 
see Eric Raymond's "Homesteading the Noosphere". Our methods are Open and 
we aim for interoperability. Very recently we have decided to define 
Web-service and related APIs to build large networked applications - see
http://wwmm.ch.cam.ac.uk/presentations/acs2005 for a summary of some of 
this. We intend to summarise these, probably on the QSAR list. We cannot 
include non-Open software under "Blue Obelisk" but we can - if resources 
allow - highlight non-Open software that interoperates via CML.

Thank you very much for catalysing this discussion. All members of this 
list are equally welcome and all contributions are taken in a positive 
spirit. Henry and I have moderated 30,000 emails in XML-DEV without a 
flamewar or spam. You may wish to ask questions, make suggestions, recount 
your experiences, etc. You may make product announcements *to the extent 
that they inform the list community about CML and steer clear of hype and 
vapourware*. For example it would be very useful to know that FOOBAR's 
parser could read 10,000 CML files per minute, or that they had a 
CML-compliant format for publishing logP, or that they had an Openly 
accessible dictionary of properties, but not that they could calculate
10,000 logP per minute or that they had a secret algorithm for clustering
molecules.

Please also note that JUMBO is Open source and interoperates with other 
Open (Blue Obelisk) groups (e.g. 
CDK/JOELib/QSAR/JChemPaint/Jmol/Octet/Openbabel). Messages are
sometimes crossposted there, but should generally be consistent with the 
core philosophy of those lists.

There are many things I have not written and this may be a good time to 
start introducing CML from scratch to some list members.

P.

PS. My part of this mail is re-usable under Creative Commons. Other parts 
might be re-usable under "fair use".

(please note I am no longer at Nottingham, but at Cambridge).

Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069 Fax: +44 1223 763076