[cml/ccml-discuss] Re: [Open Babel] Re: New CML code


On Fri, 2005-07-29 at 16:22 +0100, Chris Morley wrote:

> I have started from scratch again with the OB interface to cmlpp 
> and now have a format which can read CML using DOM.

Is this a W3C DOM? IMO there is no need to have complete W3C
functionality - a simple DOM will do.

>  At present it 
> uses the basic atom attributes: id, elementType, hydrogenCount, 
> spinMultiplicity, formalCharge, isotopeNumber, x2,y2,x3,y3,z3. 

I would add occupancy...

> And, at last, it handles CML files with multiple molecules and can 
> pick molecules out of CMLReact files. (It ignores molecule 
> elements with the ref attribute.)

Noted.

> 
> I started on using cmlpp to output CML, but realised that a 
> stand-alone function writing the XML for CML2 directly was no 
> longer. This is no doubt heresy to Java programmers with 
> essentially built in support for XML and sounds a bit like people 
> who used to advocate code with GOTOs because it was shorter and 
> "more efficient". But I don't think the crude way would be any 
> more difficult to extend or maintain. However, we could use 
> something more trendy if it offends too much or if you don't like 
> punctuation characters (sometimes up to 9 consecutively!). It 
> could be extended easily but inelegantly to writing CML1.
> 
I think it's worth having a serializer class (see xom.nu for
functionality). It manages the following:
* escaping XML characters
* namespaces (we now have to manage them properly - XOM does an
excellent job)
* prettyprinting
* adding prolog and XMLDeclaration

It can also be subclassed...
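
As a rough illustration of the first item only (escaping), something along
these lines would do in C++. This is a minimal sketch: escapeXml is a
hypothetical name, not a function from XOM, cmlpp or OB, and it covers only
the five predefined XML entities.

```cpp
#include <string>

// Hypothetical helper (not part of XOM, cmlpp or OB): replace the five
// characters that have predefined entities in XML.
std::string escapeXml(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}
```

A hand-written writer would have to call this on every attribute value and
text node, which is one reason to keep it in a single serializer class.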

> cmlpp has a parser, DOM model and a set of CML classes. I found it 
> a bit uneven and inconsistent, and have had to make small changes. 

Not surprised.

> Peter, are you still intending the cmlpp module to be shared with 
> programs other than OB? 

Yes - it is also required for InChI

> How much could it be altered? It 
> presumably would use the automatically generated functions 
> described above. The XML namespace handling could be extended. I 
> guess that compound XML documents with objects defined by many 
> schemas will become increasingly common, 

yes

> and OB should provide the 
> capability to abstract the appropriate chemical parts (and not 
> just molecules). Doing namespaces properly would help this.
> 
I have been sloppy about namespaces until now. Now we have to do it
properly. Every CML document should now strive to have a namespace.

> Large documents will need to be handled and I think that we are 
> agreed that SAX parsing is the way to go. We don't want to have to 
> build a DOM of somebody's thesis to abstract a few molecules or 
> spectra.
> 
Agreed

> Using libxml2 as a parser would be a safer route than cmlpp.

Agreed

>  It is 
> written in C but the amount of messy interface would not be great. 

I am happy to hope this will work.

>   The C++ wrapper I have seen does SAX1 but not SAX2 and I agree 
> with Geoff that we should go with the basic libxml.

Yes

>  The compiled 
> DLL with its C interface seems to work for me in Windows, but I 
> haven't done anything extensive with it. I have not found any 
> examples of libxml2 SAX2 code, which would be useful as a start.
> 
The main advantage of SAX2 is the namespaces.

> How schema-derived code in SAX would be applied is not so clear to 
> me. The SAX events have to be filtered for namespace and then by 
> element name. 

SAX2 does them at the same time. 

If you are doing all your own processing then SAX is all you need. If
you are also building a DOM then it is useful to have a NodeFactory (see
XOM). I have an autogenerated CMLNodefactory which can generate instances
of the classes (the subclasses may not be required for OB).
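
To show the NodeFactory idea in C++ terms, here is a minimal sketch. All
the names (Node, AtomNode, NodeFactory, add, make) are hypothetical and are
not the XOM or cmlpp API. The factory maps an element name to a function
that builds the matching subclass, and falls back to a generic node for
names it does not know.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical generic DOM node; real code would also hold attributes and children.
struct Node {
    std::string name;
    explicit Node(const std::string& n) : name(n) {}
    virtual ~Node() {}
    virtual std::string kind() const { return "generic"; }
};

// Hypothetical CML-specific subclass for <atom> elements.
struct AtomNode : Node {
    AtomNode() : Node("atom") {}
    std::string kind() const override { return "atom"; }
};

// Maps element names to maker functions; unknown names get a plain Node.
class NodeFactory {
public:
    using Maker = std::function<std::unique_ptr<Node>()>;

    void add(const std::string& elementName, Maker maker) {
        makers_[elementName] = std::move(maker);
    }

    std::unique_ptr<Node> make(const std::string& elementName) const {
        auto it = makers_.find(elementName);
        if (it != makers_.end()) return it->second();
        return std::unique_ptr<Node>(new Node(elementName));
    }

private:
    std::map<std::string, Maker> makers_;
};
```

The parser would call make() from its start-element callback, so the DOM
builder itself never needs to know which CML subclasses exist.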

> I was envisaging the interface to the parser being 
> done in a XMLFormat class, which would also do the namespace 
> filtering. The separation by element would be done in derived 
> format classes like (a new) CMLFormat. These classes would be 
> registered at startup with XMLFormat by namespace, like formats 
> are by file extension.
> 
Sounds good
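
A minimal sketch of how I read the proposal, so we can check we mean the
same thing. The class names here (XMLHandler, XMLFormatRegistry,
CMLHandler) are hypothetical, the namespace URI is only an example value,
and the parser itself is left out: dispatch() stands in for whatever a
SAX2 start-element callback would pass on (namespace URI plus local name).

```cpp
#include <map>
#include <string>

// Hypothetical interface for a per-namespace format (e.g. a new CMLFormat).
class XMLHandler {
public:
    virtual ~XMLHandler() {}
    // Returns true if the element was recognised and handled.
    virtual bool startElement(const std::string& localName) = 0;
};

// Hypothetical registry playing the role of XMLFormat: handlers are
// registered by namespace URI, much as formats are by file extension.
class XMLFormatRegistry {
public:
    void registerHandler(const std::string& nsUri, XMLHandler* handler) {
        handlers_[nsUri] = handler;  // the registry does not own the handler
    }

    // Filter by namespace first, then let the handler separate by element name.
    bool dispatch(const std::string& nsUri, const std::string& localName) const {
        auto it = handlers_.find(nsUri);
        if (it == handlers_.end()) return false;  // foreign namespace: ignore
        return it->second->startElement(localName);
    }

private:
    std::map<std::string, XMLHandler*> handlers_;
};

// Hypothetical CML handler that only counts <molecule> elements.
class CMLHandler : public XMLHandler {
public:
    int molecules = 0;
    bool startElement(const std::string& localName) override {
        if (localName != "molecule") return false;
        ++molecules;
        return true;
    }
};
```

With this shape, elements from other namespaces in a compound document
are simply skipped unless some other format has registered for them.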

> Although it looks like it has two dead ends, I've put the 
> cmlpp/DOM reading and standalone writing version of CMLFormat in 
> cvs in src/formats/cmlppformat.cpp.
> 

I'll try to have a look over the w/e

> Chris

P.

-- 
Peter Murray-Rust
Unilever Centre for Molecular Informatics,
Department of Chemistry, University of Cambridge
Cambridge, CB2 1EW, UK
Tel: +44-1223-760369