From: Chris M. <c.m...@ga...> - 2005-07-29 15:22:47
Peter Murray-Rust wrote:
> On Thu, 2005-07-28 at 16:14 -0400, Geoffrey Hutchison wrote:
...
> The relevance of this for OB is that the base JUMBO code is
> autogenerated from the schema. This drastically reduces maintenance and
> errors. The code generator is now working and will soon form another
> branch on cml.sf.net. It is possible to XMLvalidate any CML document
> against the Schema.
>
> Until now the code for CMLPP in OB has been handcrafted. We now intend
> to try to autogenerate it so it works on a simple DOM or SAX framework.
> Since OB only requires to read and static structures a very simple dom
> should do (see below in my first message).
>
> There are some strategic questions for OB. In principle CML is capable
> of holding most of the union of the (small-molecule) data formats in OB.
> However most conversions will be lossy, so it is likely that a much
> smaller subset will map onto OB.
>
> If we can therefore work out a realistic subset of OB elements and
> attributes we can create a reference schema.
>
>
>>(As background for everyone else, Peter and I have been discussing
>>improved CML code for Open Babel 2.0.)
>>
>>On Jul 28, 2005, at 2:25 PM, Peter Murray-Rust wrote:
>>
>>
>>>I have now virtually finished the Java code generator. This would be the
>>>most appropriate way to generate C++ if possible. I am not familiar with
>>>libxml or xercesC++...
>>>* do they have subclassable elements (Xerces J and other Java do NOT.
>>>This is a real mess).
>>>* do they support SAX (I assume so)
>>>...
>>>I have now had a weblook at libxml++ and xerces and would go with libxml
>>>if it makes sense (weaker license than OB).
>>
>>I think using libxml2 would be a good idea for XML-based file formats
>>in Open Babel. (I don't think we really need the C++ version -- it's
>>easy to call C code from C++.) The advantage of using the C version
>>is that it's available as a standard library on recent versions of
>>Linux and Mac OS X.
>>(I'm not sure about Windows, but it certainly
>>compiles there easily.)
>
>
> I agree this seems useful and we should try this first. Does anyone on
> the list have experience of libxml2?
>
>
>>- It does support SAX2 and some form of SAX1 (if desired).
>>- It does support DOM, although "not the API"
>> (I'm not sure if this is really a big deal for us right now and I
>>have yet to figure out what they mean by that.)
>>- It also offers XmlTextReader for handling large documents (as an
>>alternative to SAX)
>>
>>The C++ version does support subclasses, but I do think the C version
>>would be better for us.
>>
>
> If we can create C++ classes which wrap the libxml this should be
> possible.
>
>
>>>
>>>The main functionality is
>>>new element()
>>>void element.appendChild(element)
>>>void element.setAttributeValue(string, string)
>>>string element.getAttributeValue(string)
>>>void element.setText(string)
>>> ...
>>> However we should definitely use a SAX parser.
>>
> It would be useful to start mapping out a minimal design for the C++
> classes and I can look into generating it. It may not have all the
> functionality of JUMBO but it should at least replicate OB100
>

I have started from scratch again with the OB interface to cmlpp and now
have a format which can read CML using DOM. At present it uses the basic
atom attributes: id, elementType, hydrogenCount, spinMultiplicity,
formalCharge, isotopeNumber, x2, y2, x3, y3, z3. And, at last, it handles
CML files with multiple molecules and can pick molecules out of CMLReact
files. (It ignores molecule elements with the ref attribute.)

I started on using cmlpp to output CML, but realised that a stand-alone
function writing the XML for CML2 directly was no longer than the
cmlpp-based code. This is no doubt heresy to Java programmers, with their
essentially built-in support for XML, and sounds a bit like people who
used to advocate code with GOTOs because it was shorter and "more
efficient".
But I don't think the crude way would be any more difficult to
extend or maintain. However, we could use something more trendy if it
offends too much or if you don't like punctuation characters (sometimes
up to 9 consecutively!). It could be extended easily but inelegantly to
writing CML1.

cmlpp has a parser, a DOM model and a set of CML classes. I found it a bit
uneven and inconsistent, and have had to make small changes. Peter, are
you still intending the cmlpp module to be shared with programs other
than OB? How much could it be altered? It presumably would use the
automatically generated functions described above.

The XML namespace handling could be extended. I guess that compound XML
documents with objects defined by many schemas will become increasingly
common, and OB should provide the capability to abstract the appropriate
chemical parts (and not just molecules). Doing namespaces properly would
help this.

Large documents will need to be handled and I think that we are agreed
that SAX parsing is the way to go. We don't want to have to build a DOM
of somebody's thesis to abstract a few molecules or spectra.

Using libxml2 as a parser would be a safer route than cmlpp. It is
written in C but the amount of messy interface would not be great. The
C++ wrapper I have seen does SAX1 but not SAX2 and I agree with Geoff
that we should go with the basic libxml. The compiled DLL with its C
interface seems to work for me in Windows, but I haven't done anything
extensive with it. I have not found any examples of libxml2 SAX2 code,
which would be useful as a start.

How schema-derived code in SAX would be applied is not so clear to me.
The SAX events have to be filtered for namespace and then by element
name. I was envisaging the interface to the parser being done in an
XMLFormat class, which would also do the namespace filtering.
The separation by element would be done in derived format classes like
(a new) CMLFormat. These classes would be registered at startup with
XMLFormat by namespace, like formats are by file extension.

Although it looks like it has two dead ends, I've put the cmlpp/DOM
reading and standalone writing version of CMLFormat in CVS in
src/formats/cmlppformat.cpp.

Chris
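
For illustration only, the filtering described above (first by namespace
in XMLFormat, then by element name in a format class registered for that
namespace) might be sketched roughly as follows. This is a sketch under
assumptions, not the code in cvs: XMLNamespaceHandler, registerNamespace,
onStartElement, onEndElement, MoleculeSummary, findAttr and demo are
hypothetical names, the namespace URIs are placeholders, and the libxml2
wiring is left out. In a real build the onStartElement/onEndElement
methods would presumably be called from libxml2's SAX2 element callbacks
(startElementNs/endElementNs); here the events are fed in by hand.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Attributes as (name, value) pairs; a stand-in for what the parser delivers.
using Attributes = std::vector<std::pair<std::string, std::string>>;

// Returns a pointer to the named attribute's value, or nullptr if absent.
static const std::string* findAttr(const Attributes& attrs, const std::string& name) {
    for (const auto& a : attrs)
        if (a.first == name) return &a.second;
    return nullptr;
}

// Hypothetical interface implemented by each per-namespace format class.
class XMLNamespaceHandler {
public:
    virtual ~XMLNamespaceHandler() = default;
    virtual void startElement(const std::string& localName, const Attributes& attrs) = 0;
    virtual void endElement(const std::string& localName) = 0;
};

// Hypothetical XMLFormat: sits next to the parser and filters by namespace.
class XMLFormat {
public:
    // Register a handler by namespace URI, much as formats are by extension.
    void registerNamespace(const std::string& uri, std::shared_ptr<XMLNamespaceHandler> h) {
        assert(h != nullptr);
        handlers_[uri] = std::move(h);
    }
    // Events in unregistered namespaces are silently dropped.
    void onStartElement(const std::string& uri, const std::string& localName,
                        const Attributes& attrs) {
        auto it = handlers_.find(uri);
        if (it != handlers_.end()) it->second->startElement(localName, attrs);
    }
    void onEndElement(const std::string& uri, const std::string& localName) {
        auto it = handlers_.find(uri);
        if (it != handlers_.end()) it->second->endElement(localName);
    }
private:
    std::map<std::string, std::shared_ptr<XMLNamespaceHandler>> handlers_;
};

struct MoleculeSummary {
    std::string id;
    std::vector<std::string> atomElements;  // elementType of each atom
};

// Sketch of a CMLFormat that separates events by element name.
// Nested molecules are not handled properly in this simplified version.
class CMLFormat : public XMLNamespaceHandler {
public:
    void startElement(const std::string& localName, const Attributes& attrs) override {
        if (localName == "molecule") {
            // Molecule elements carrying a ref attribute are ignored.
            const bool isRef = findAttr(attrs, "ref") != nullptr;
            keep_.push_back(!isRef);
            if (!isRef) {
                const std::string* id = findAttr(attrs, "id");
                molecules_.push_back(MoleculeSummary{id ? *id : std::string(), {}});
            }
        } else if (localName == "atom" && !keep_.empty() && keep_.back()) {
            const std::string* el = findAttr(attrs, "elementType");
            molecules_.back().atomElements.push_back(el ? *el : std::string());
        }
    }
    void endElement(const std::string& localName) override {
        if (localName == "molecule" && !keep_.empty()) keep_.pop_back();
    }
    const std::vector<MoleculeSummary>& molecules() const { return molecules_; }
private:
    std::vector<MoleculeSummary> molecules_;
    std::vector<bool> keep_;  // one flag per open molecule: false if it was a ref
};

// Feeds a hand-written event stream, standing in for a real SAX2 parser.
std::vector<MoleculeSummary> demo() {
    const std::string cmlNs = "http://www.xml-cml.org/schema";   // assumed URI
    const std::string otherNs = "http://www.w3.org/1999/xhtml";  // some other schema
    auto cml = std::make_shared<CMLFormat>();
    XMLFormat xml;
    xml.registerNamespace(cmlNs, cml);

    xml.onStartElement(otherNs, "p", {});  // dropped: namespace not registered
    xml.onStartElement(cmlNs, "molecule", {{"id", "m1"}});
    xml.onStartElement(cmlNs, "atom", {{"id", "a1"}, {"elementType", "C"}});
    xml.onEndElement(cmlNs, "atom");
    xml.onStartElement(cmlNs, "atom", {{"id", "a2"}, {"elementType", "O"}});
    xml.onEndElement(cmlNs, "atom");
    xml.onEndElement(cmlNs, "molecule");
    xml.onStartElement(cmlNs, "molecule", {{"ref", "m1"}});  // ignored: ref
    xml.onEndElement(cmlNs, "molecule");
    xml.onEndElement(otherNs, "p");
    return cml->molecules();
}
```

Keeping the namespace lookup in one place means a derived format class
only ever sees elements from its own schema, so picking chemical parts
out of a compound document needs no extra checks in each format.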