From: Peter Murray-R. <pm...@ca...> - 2005-07-29 21:09:55
On Fri, 2005-07-29 at 16:22 +0100, Chris Morley wrote:
> I have started from scratch again with the OB interface to cmlpp
> and now have a format which can read CML using DOM.

Is this a W3C DOM? IMO there is no need to have complete W3C
functionality - a simple DOM will do.

> At present it
> uses the basic atom attributes: id, elementType, hydrogenCount,
> spinMultiplicity, formalCharge, isotopeNumber, x2,y2,x3,y3,z3.

I would add occupancy...

> And, at last, it handles CML files with multiple molecules and can
> pick molecules out of CMLReact files. (It ignores molecule
> elements with the ref attribute.)

Noted.

>
> I started on using cmlpp to output CML, but realised that a
> stand-alone function writing the XML for CML2 directly was no
> longer. This is no doubt heresy to Java programmers with
> essentially built in support for XML and sounds a bit like people
> who used to advocate code with GOTOs because it was shorter and
> "more efficient". But I don't think the crude way would be any
> more difficult to extend or maintain. However, we could use
> something more trendy if it offends too much or if you don't like
> punctuation characters (sometimes up to 9 consecutively!). It
> could be extended easily but inelegantly to writing CML1.
>

I think it's worth having a serializer class (see xom.nu for
functionality). It manages the following:
* escaping XML characters
* namespaces (we now have to manage them properly - XOM does an
  excellent job)
* prettyprinting
* adding prolog and XMLDeclaration
It can also be subclassed...

> cmlpp has a parser, DOM model and a set of CML classes. I found it
> a bit uneven and inconsistent, and have had to make small changes.

Not surprised

> Peter, are you still intending the cmlpp module to be shared with
> programs other than OB?

Yes - it is also required for InChI

> How much could it be altered? It
> presumably would use the automatically generated functions
> described above. The XML namespace handling could be extended.
> I guess that compound XML documents with objects defined by many
> schemas will become increasingly common,

yes

> and OB should provide the
> capability to abstract the appropriate chemical parts (and not
> just molecules). Doing namespaces properly would help this.
>

I have been sloppy about namespaces until now. Now we have to do it
properly. Every CML document should now strive to have a namespace.

> Large documents will need to be handled and I think that we are
> agreed that SAX parsing is the way to go. We don't want to have to
> build a DOM of somebody's thesis to abstract a few molecules or
> spectra.
>

Agreed

> Using libxml2 as a parser would be a safer route than cmlpp.

Agreed

> It is
> written in C but the amount of messy interface would not be great.

I am happy to hope this will work.

> The C++ wrapper I have seen does SAX1 but not SAX2 and I agree
> with Geoff that we should go with the basic libxml.

Yes

> The compiled
> DLL with its C interface seems to work for me in Windows, but I
> haven't done anything extensive with it. I have not found any
> examples of libxml2 SAX2 code, which would be useful as a start.
> The main advantage of SAX2 is the namespaces.
> How schema-derived code in SAX would be applied is not so clear to
> me. The SAX events have to be filtered for namespace and then by
> element name.

SAX2 does them at the same time. If you are doing all your own
processing then SAX is all you need. If you are also building a DOM
then it is useful to have a NodeFactory (see XOM). I have autogenerated
a CMLNodefactory which can generate instances of classes (the
subclasses may not be required for OB).

> I was envisaging the interface to the parser being
> done in a XMLFormat class, which would also do the namespace
> filtering. The separation by element would be done in derived
> format classes like (a new) CMLFormat. These classes would be
> registered at startup with XMLFormat by namespace, like formats
> are by file extension.
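The registration scheme in the quoted paragraph above can be sketched roughly as follows. This is only an illustration under assumptions, not the OB design: XMLFormat and CMLFormat are the names from the message, but registerNamespace, dispatch, startElement and the counters are hypothetical, and the namespace URI is used purely as an example key. The parser callbacks (libxml2 SAX2 or otherwise) are left out; dispatch stands in for whatever a start-element callback would call.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical base class: owns the namespace -> handler table.
class XMLFormat {
public:
    virtual ~XMLFormat() = default;

    // Called for every start-element event in this handler's namespace.
    virtual void startElement(const std::string& localName) = 0;

    // Register a handler under a namespace URI, in the way formats
    // are registered by file extension.
    static void registerNamespace(const std::string& uri, XMLFormat* handler) {
        assert(handler != nullptr);
        registry()[uri] = handler;
    }

    // Filter by namespace first; the derived class then separates
    // events by element name. Returns false if nobody claimed the URI.
    static bool dispatch(const std::string& uri, const std::string& localName) {
        auto it = registry().find(uri);
        if (it == registry().end()) return false;
        it->second->startElement(localName);
        return true;
    }

private:
    static std::map<std::string, XMLFormat*>& registry() {
        static std::map<std::string, XMLFormat*> table;
        return table;
    }
};

// Hypothetical derived format: separates events by element name.
class CMLFormat : public XMLFormat {
public:
    int molecules = 0;
    int atoms = 0;

    void startElement(const std::string& localName) override {
        if (localName == "molecule") ++molecules;
        else if (localName == "atom") ++atoms;
        // Other CML elements are ignored in this sketch.
    }
};
```

A caller would create one CMLFormat at startup, register it under the CML namespace URI, and forward each namespaced start-element event to XMLFormat::dispatch; events from other namespaces simply return false and are skipped.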
>

Sounds good

> Although it looks like it has two dead ends, I've put the
> cmlpp/DOM reading and standalone writing version of CMLFormat in
> cvs in src/formats/cmlppformat.cpp.
>

I'll try to have a look over the w/e

> Chris

P.

--
Peter Murray-Rust
Unilever Centre for Molecular Informatics,
Department of Chemistry, University of Cambridge
Cambridge, CB2 1EW, UK
Tel: +44-1223-760369
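As a footnote to the serializer point in the message: two of the listed duties (escaping XML characters, adding the XML declaration) are small enough to sketch even in a stand-alone writer. The code below is a minimal sketch under assumptions, not XOM's or cmlpp's API: escapeXml and writeElement are hypothetical names, the output is a single element with a default namespace, and prettyprinting and prefixed namespaces are not attempted.

```cpp
#include <cassert>
#include <string>

// Escape the five characters that XML reserves, so the result is
// safe in both text content and double-quoted attribute values.
std::string escapeXml(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}

// Hypothetical helper: emit the XML declaration followed by one
// element in a default namespace, with an id attribute and text.
std::string writeElement(const std::string& name,
                         const std::string& nsUri,
                         const std::string& id,
                         const std::string& text) {
    assert(!name.empty());
    std::string out = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    out += "<" + name;
    out += " xmlns=\"" + escapeXml(nsUri) + "\"";
    out += " id=\"" + escapeXml(id) + "\"";
    out += ">" + escapeXml(text) + "</" + name + ">\n";
    return out;
}
```

Routing every attribute value and text node through one escaping function is the main thing a serializer class buys over hand-written output; the rest of the list (namespaces, prettyprinting) would sit on top of the same entry point.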