From: Peter Murray-R. <pm...@ca...> - 2005-04-13 13:47:58
At 11:37 13/04/2005, Jozsef Kovacs wrote:
>Dear Peter,
>
>I'm Jozsef Kovacs from ChemAxon and I'm going to work on the enhancement
>of our CML. You had some concern about our CML compatibility (Date: Thu,
>07 Apr 2005 08:54:05 +0100). We support the subset of the CML elements
>that are relevant for our applications. We have also added certain new
>attributes that are not in your Schema. The reason is that our
>applications have many features that are not supported in your Schema.
>This means, that we can import the CML files created in different
>applications but we ignore those elements that we don't support. We
>assumed that other applications ignore our attributes too.
>However, if this assumption is not correct, we are ready to modify our
>code. We are about to remove the attributes that don't belong to your
>Schema. You have mentioned that "the order of atoms may be important".
>Could you tell me more about it?
>Anyway, the ChemAxon's goal is to meet the newest industry standards, and
>it would be a pleasure to cooperate with you in fixing this problem. We
>look forward to your comments.
>
>Best regards,
>Jozsef Kovacs

Many thanks Jozsef,

This is an excellent suggestion and we accept your motivation and
commitment. Public discussion is extremely important and very valuable.
We will respond positively, hopefully rapidly, and enthusiastically to
questions raised here. We expect that there will be contributions from a
range of people.

The creation of CML rests on an unwritten process (rather like the
British Constitution, which is never written). There are certain
fundamentals, rather similar to the IETF's "general consensus and running
code". The essentials include:

- CML must be conforming XML. (I have seen things calling themselves
  "CML" which did not even parse in generic XML tools).
- where possible CML uses emerging W3C technology rather than inventing
  its own. Thus we use DOM, SAX, RDF, RSS, XSLT, XSD, etc.
- CML interoperates with other XML languages through XML namespaces
- the definition of CML is taken from the publications in peer-reviewed
  literature. This means that the latest formal specification is JCICS
  2003. We all intend to conform to that
- CML is an Open process to the extent that it is published, we receive
  contributions and acknowledge them, we promote and applaud
  interoperability. Where possible everything, including discussions
  like this, is openly visible and much should be made available through
  Open redistribution licenses such as Creative Commons and the
  Budapest/Berlin/Bethesda declarations of Open Access. However it is not
  like Open source as you are not allowed to modify the definition of the
  specification. CML welcomes non-Open conformant implementations - and
  does not regard them as morally inferior. However CML cannot use closed
  source for its conformance testing.

In the design of CML there are certain principles.

- explicit semantics are preferable to implicit semantics, even at the
  cost of some verbosity
- a feature should have been exposed to the community before being
  incorporated in the publication
- new features are resisted until it is impossible to refuse them. We
  avoid bloat - therefore there is usually an experimental specification
  before the next formalisation.
- features are not removed - this would be difficult for existing
  applications - but they may be obsoleted.
- elements and attributes should - as far as possible - be
  context-independent. This means subsets of the schema can be used. For
  example a theochem application might only use molecule and atoms (no
  bonds, or anything else).
- in all CML applications the default semantics of elements and
  attributes must be identical. Thus formalCharge represents an integral
  number of electrons removed from or added to an atom. However the
  convention attribute allows additional semantics to be added. For
  example there is little communal systematisation of bond orders and
  types.
  CML uses "1"/"2"/"3" (or "S"/"D"/"T") for "normal
  single/double/triple" bonds. Other values are allowed but should have a
  convention. Thus "4" could mean aromatic for convention="MDL" and
  quadruple for others.
- CML applications may ignore foreign namespaces. For example a
  cml:molecule could contain an SVG element, or an SVG document could
  contain a cml:molecule
- prefixes (e.g. cml:) are NOT hardcoded. They must be accompanied by a
  namespace declaration.
- additional elements and attributes in the CML namespace are NOT
  allowed. It would be easy for files to collide if this were allowed.

In the development of CML software there are also certain principles.

- CML itself is not software. The equivalent of a bug is an
  inconsistency, and of a feature is an unhappy piece of design which is
  ugly or difficult to use.
- software should strive to be conformant. We intend to produce
  conformance tools in the near future.
- it should be easy to develop simple applications of CML. A CML
  processor may ignore elements and attributes if the interpretation does
  not depend on them. For example some current CML software does not
  interpret reaction or spectrum.
- in principle all XML input should be passed to the output if required.
  However this requires significant DOM programming and the W3C DOM is
  not user-friendly. Therefore some CML processing may lose information.
  Ideally a roundtrip of readCML->writeCML->readCML should be lossless,
  but this is difficult to achieve.
- All CML software should interpret information in the same way (unless
  it ignores it). It should not invent local semantics. Thus if 100
  molecules are concatenated in a CML document the semantics are just
  that - 100 concatenated molecules. They are not necessarily snapshots
  on a dynamics trajectory, different experimental observations, etc. We
  are developing RDF as the method to annotate complex compound
  documents.
- No conformant CML file should cause CML-aware software to crash, and
  error messages should be as informative as possible - e.g. "FOOBAR does
  not support the CML reaction element | the CML array syntax | the CML
  map/link vocabulary, etc. and this document will not be processed",
  "PLINGE has detected multiple CML molecule elements and displayed each
  in a separate panel. CML spectrum elements are ignored". Then users
  know slightly better.
- CML documents range over a very wide variety - molecules, comp chem
  output, instruction manuals, synthetic recipes, journal articles, etc.
  Multiple namespaces will be common. There is no default "best" way to
  display or process these and there is unlikely to be a "CML browser"
  that does everything. However there are likely to be generic tools
  which manage compound documents and which can accept CML plugins to
  display chemistry in foreign contexts.

In general the CML in JCICS 2003 has stood the test and there are very
few immediate needs to change the vocabulary in major ways. About 2-3
(unintentionally) implicit semantics have been formalised by a new
attribute. The creation of "CMLReact" has involved 2-3 additional
elements and these will be submitted for publication shortly. CMLComp is
being informed by many marked up outputs and the main need is for the
semantics of basisSet to be enlarged. The solid state will be explored
over the next year in a funded project.

Most of the work involves consolidating and firming up the semantics on
current vocabulary. This is hard, because chemistry is very sloppy over
its information, but we are making progress. The primary mechanism is the
JUMBO toolkit which is element/attribute centred. The semantics of every
element is explored and most have been done. For each we create a range
of unit tests - currently over 400. This will be amplified by conformance
tests.

We also now need to create communal dictionaries for the common uses of
dictRef.
Some of these have been collected for reactions, but they are also
required for common CML concepts. Here again anyone can create their own
namespaced dictionaries; if there is communal agreement, terms in these
may be raised to the communal CML area.

Some areas of CML are more explored than others. Thus we have intensively
explored reactions and are reasonably confident that the specification is
robust. We are exploring spectra but have some way to go. We have much
experience with comp chem calculations on geometry optimisation and
properties, but little on dynamics and ensembles. Recently we have made
major advances in crystallography.

CML is a meritocracy and participants are honoured by their contribution
- see Eric Raymond's "Homesteading the Noosphere". Our methods are Open
and we aim for interoperability. Very recently we have decided to define
Web-service and related APIs to build large networked applications - see
http://wwmm.ch.cam.ac.uk/presentations/acs2005 for a summary of some of
this. We intend to summarise these, probably on the QSAR list. We cannot
include non-Open software under "Blue Obelisk" but we can - if resources
allow - highlight non-Open software that interoperates via CML.

Thank you very much for catalysing this discussion. All members of this
list are equally welcome and all contributions are taken in a positive
spirit. Henry and I have moderated 30,000 emails in XML-DEV without a
flamewar or spam.

You may wish to ask questions, make suggestions, recount your
experiences, etc. You may make product announcements *to the extent that
they inform the list community about CML and steer clear of hype and
vapourware*.
For example it would be very useful to know that FOOBAR's parser could
read 10,000 CML files per minute, or that they had a CML-compliant format
for publishing logP, or that they had an Openly accessible dictionary of
properties, but not that they could calculate 10,000 logP per minute or
that they had a secret algorithm for clustering molecules.

Please also note that JUMBO is Open source and interoperates with other
Open (Blue Obelisk) groups (e.g.
CDK/JOELib/QSAR/JChemPaint/Jmol/Octet/Openbabel). Messages are sometimes
crossposted there, but should generally be consistent with the core
philosophy of those lists.

There are many things I have not written and this may be a good time to
start introducing CML from scratch to some list members.

P.

PS. My part of this mail is re-usable under Creative Commons. Other parts
might be re-usable under "fair use" (please note I am no longer at
Nottingham, but at Cambridge).

Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069 Fax: +44 1223 763076
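
A minimal sketch of three of the software principles above (prefixes are
not hardcoded, foreign namespaces may be ignored, and unsupported CML
elements should produce an informative message rather than a crash),
using Python's standard xml.etree.ElementTree. The namespace URI, the
SUPPORTED set, the sample document and the read_atoms helper are all
assumptions made for illustration; they are not taken from the CML
specification or from JUMBO.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI, for illustration only; use the URI required by
# the CML specification you are targeting.
CML_NS = "http://www.xml-cml.org/schema"

# Hypothetical subset of CML element names this toy reader understands.
SUPPORTED = {"cml", "molecule", "atomArray", "atom", "bondArray", "bond"}

# Sample document: the CML namespace is bound to the arbitrary prefix "c",
# and the molecule contains a foreign SVG element and a spectrum element.
DOC = """
<c:molecule xmlns:c="http://www.xml-cml.org/schema"
            xmlns:svg="http://www.w3.org/2000/svg" id="m1">
  <c:atomArray>
    <c:atom id="a1" elementType="C"/>
    <c:atom id="a2" elementType="O"/>
  </c:atomArray>
  <svg:g><svg:circle r="1"/></svg:g>
  <c:spectrum id="s1"/>
</c:molecule>
"""


def split_tag(tag):
    """Split ElementTree's '{uri}local' tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def read_atoms(xml_text):
    """Return (atoms, messages) for a document that may mix namespaces."""
    root = ET.fromstring(xml_text)
    atoms, messages = [], []
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # skip comments or processing instructions, if present
        uri, local = split_tag(elem.tag)
        if uri != CML_NS:
            continue  # foreign namespace: ignore it, whatever its prefix
        if local not in SUPPORTED:
            # Report what was skipped instead of failing.
            messages.append(
                f"CML element '{local}' is not supported and was ignored"
            )
            continue
        if local == "atom":
            atoms.append((elem.get("id"), elem.get("elementType")))
    return atoms, messages


if __name__ == "__main__":
    atoms, messages = read_atoms(DOC)
    print(atoms)
    for message in messages:
        print(message)
```

Because elements are matched on the namespace URI and never on the
prefix, the same atoms are found if the document binds the CML namespace
to "chem:" or to any other prefix.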