From: Clark C . E. <cc...@cl...> - 2002-07-18 13:41:24
|
A small talk as to why YAML is better than XML for data serialization. ----- Forwarded message from "Clark C . Evans" <cc...@cl...> ----- Date: Thu, 18 Jul 2002 09:26:48 -0400 From: "Clark C . Evans" <cc...@cl...> To: James Kew <jam...@bt...> Cc: pyt...@py... Subject: Re: XML overuse? (was Re: Python to XML to Python conversion) On Wed, Jul 17, 2002 at 11:52:50PM +0100, James Kew wrote: | > The information model fits documents well, but is a poor match | > for object serialization, which is 90% of the use cases | > programmers face. | | Um: 90%? What sort of use cases do you see programmers forcing XML into? | Just curious: I fall into the "XML as poor-man's database/parser" camp at | the moment but I'm finding that for a poor man's solution it does quite a | good job with not much programmer effort to glue it together. XML has its roots in structured document processing, and is a descendent of SGML. For example, the research reports by Gartner Group are primarly text, but there are specific tags to mark-up features of the report: chapters to generate a table of contents, keywords to make an index, vendor names to enable better searching, etc. The reports are highly structured with tags, each tag having a beginning and an ending. Furthermore, there is also information which must be attached to a particular series of characters, such as an editorial comment, but must not appear in print. All in all, structured document processing is a rather complicated beast and SGML set out to tackle this problem. SGML thus had many features which supported these requirements. It had attributes for out-of-band information which must be attached to a sequence of characters but not be printed. It allows for mixed content, so that a paragraph for instance can contain a series of untagged characters followed by a series of characters tagged bold. Also, SGML allows for named lists, so that a chapter could be defined, for example, as a series of tables, paragraphs and figures. SGML is also character based, since documents are in essence a large blob of characters "marked-up" SGML also had lots of features which enabled human-editing, it allowed you to skip end tags, it even allowed you to skip intermediate tags so that if a chapter couldn't contain characters directly (characters must be wrapped with a paragraph) the parser would implicitly include a paragraph anyway. These extra syntax features did wonders for SGML's flexibility and in no small way were responsible for HTML's success. However, the implicit and missing end tags required that a parser know the document type definition before it could parse an SGML text. Further, these implicit thingys made it hard to write parsers. Therefore, there was a simplification movement in HTML land where the strcutral features (attributes, mixed content, named list) components of SGML were kept but the features which required a DTD and made parsing complex (implicit tags, optional end tags, etc) were dropped. This simplified SGML was dubbed XML and was then markeded as HTML-next-generation. The marketing for XML has been enormous, but at it's core, it is still primarly a structured document markup language. Due to XML's popularity, lots of people have tried to get it to work for other things. A few people have made XML databases and others have used XML for object serialization and invocation (SOAP/XML-RPC) and it has had many other uses. However, most of these uses tend to use a vastly simplified subset of XML and indeed impose additional constraints on XML as far as particular attributes, etc. These additional attributes/constrains are often needed to model native datastructures of modern languages and they include: (a) a way to specify node type, (b) a way to express that a node occurs more than once in the graph-serialized-as-a-tree, (c) a manner to restrict mixed content which does not usually occur in modern languages, (d) restrictions on named list model are also common. However, even with these constraints and fix-ups, at its heart, XML is a much more complicated beast and this complexity is reflected in the DOM and SAX interface. Since this is the primary interface used by programmers, programmers must grapple with documentisms even if they don't need structured document features. In summary, I'm not saying that XML is bad. It's is fantastic for structured document processing (I have direct experience here). However, just beacuse it has had great success in this domain doesn't mean that this success will be long-lasting in other domains. I see people using XML for lots of purposes it was never designed for; certainly it is flexible enough to do it, but the question is: At what price? With XML the price is pretty steep, especially for "object serialization" requirements where attributes, mixed-content, and named-lists arn't needed and where other things such as typing, graph links, map/lists, and treating characters as a whole scalar (rather than as chunks of characters) is what you want. So, that said, YAML (YAML Ain't Markup Language, http://yaml.org) was designed to meet the needs of object serialization directly. In this domain, I must say, it is much much better than XML. Just like in the document serialization domain for which XML was designed, YAML would not work very well at all... YAML isn't markup. In YAML you have dictionaries, lists, and scalars; you don't have chararacters that are tagged. The difference may seem subtle, but the actual impact is huge. It's a completely different mid-set. For a programmer with serialization needs, YAML fits the bill perfectly while XML requires quite a bit of effort to make work. The only down side of YAML is that it isn't buzz-word compliant and the implementation's aren't quite mature yet. The implementations will come along (the native python one isn't bad at all). And hopefully buzz-word compliance will come along eventually, till then there is a subsetted XML mapping of YAML (http://yaml.org/xml.html) which you can use. I'll patch up the python parser to read/write from this XML format within a few more weeks. This way those team members which have to have to be buzz-word compliant can do so. For my day job, I'm more interested in getting the job done... Best, Clark ----- End forwarded message ----- -- Clark C. Evans Axista, Inc. http://www.axista.com 800.926.5525 XCOLLA Collaborative Project Management Software |