Markup pretty-printer Wiki

OmniMark library to safely indent markup for SGML or XML serialization

Status: Beta

Brought to you by: omnimark-code

Printing Markup Prettily

Any OmniMark veteran has probably written (or copied) the following OmniMark element rule hundreds of times:

element #implied
   output "<%q"
   repeat over specified attributes as a
      output " " || key of attribute a || "=%"" || attribute a || "%""
   again
   output ">%c</%q>"

This simple rule is meant to reproduce the original markup in textual form, but that is not exactly what it does: several corner cases make the output malformed.

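An attribute value that contains quotes, ampersands, or less-than characters leads to a malformed start tag, because the parser has already resolved entity references and the rule writes the resulting characters out verbatim. For example,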

<a text="He yelled &quot;help&quot;">

would become

<a text="He yelled "help"">

Element content that contains ampersands or less-than characters is likewise written out unescaped, which also produces malformed output:

<a>3*a&lt;b</a>

would be formatted as

<a>3*a<b</a>
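
Both failures come from writing parsed values back out without re-escaping them. As a language-neutral illustration (in Python rather than OmniMark, and not part of this library), the following minimal sketch shows the escaping step that the naive rule skips. The function names `escape_text` and `escape_attribute` are hypothetical; `escape` is the standard helper from `xml.sax.saxutils`.

```python
from xml.sax.saxutils import escape


def escape_text(data: str) -> str:
    """Escape element content: & and < (escape() also rewrites > as &gt;)."""
    return escape(data)


def escape_attribute(value: str) -> str:
    """Escape a value that will be written between double quotes."""
    # Hypothetical helper: the extra mapping adds the quote character
    # to the characters escape() already handles.
    return escape(value, {'"': "&quot;"})


# The two examples from the text, serialized safely.
print('<a text="' + escape_attribute('He yelled "help"') + '">')
print("<a>" + escape_text("3*a<b") + "</a>")
```

With this step in place, the two examples above should round-trip to their original escaped forms instead of the malformed ones.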

There are other things that could go wrong:

You may want to escape all non-ASCII characters using either numeric character references or named entities. Tab and newline characters in attribute values must in fact be escaped, because otherwise an XML parser reading the output will normalize them to spaces.
If the element was empty, we should not output a start tag followed by an end tag. For an XML document instance, using an empty-element tag instead is a matter of aesthetics; but if we're trying to reproduce an SGML document, elements declared EMPTY are not allowed to have end tags at all.
The original markup may have contained comments, processing instructions, entity references, or marked sections. The rule above would drop all of them from the output.
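
As a rough illustration of the first two points, here is a minimal sketch in Python rather than OmniMark. It is not the library's code: the names `escape_attribute_ascii` and `serialize` are hypothetical, and the `declared_empty` flag stands in for information that would really come from the DTD.

```python
def escape_attribute_ascii(value: str) -> str:
    """Escape an attribute value for output inside double quotes.

    Markup characters become predefined entities; tab, newline, carriage
    return, and every non-ASCII character become numeric character references.
    """
    named = {"&": "&amp;", "<": "&lt;", '"': "&quot;"}
    out = []
    for ch in value:
        if ch in named:
            out.append(named[ch])
        elif ch in "\t\n\r" or ord(ch) > 127:
            out.append(f"&#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)


def serialize(name: str, content: str, *, sgml: bool, declared_empty: bool) -> str:
    """Serialize one element whose content has already been escaped."""
    if sgml:
        # SGML: an element declared EMPTY gets a start tag and no end tag.
        return f"<{name}>" if declared_empty else f"<{name}>{content}</{name}>"
    # XML: the empty-element tag is optional, but tidier than <x></x>.
    return f"<{name}/>" if content == "" else f"<{name}>{content}</{name}>"


print(escape_attribute_ascii("caf\u00e9\tau lait"))
print(serialize("br", "", sgml=False, declared_empty=True))
print(serialize("br", "", sgml=True, declared_empty=True))
```

The sketch deliberately ignores attributes in `serialize` and says nothing about comments, processing instructions, entity references, or marked sections, which need their own handling.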

Even if we take care to always produce well-formed markup that keeps parsers happy, humans may still find the result difficult to read. At that point, however, we move from serializing markup so that it can be fed to a parser to pretty-printing it for human consumption. The rest of this article discusses both tasks, covering XML and SGML in turn.
