Markup pretty-printer Wiki

OmniMark library to safely indent markup for SGML or XML serialization

Status: Beta

Brought to you by: omnimark-code

Serializing SGML

From the point of view of a mere serializer, SGML is a very different beast from XML. A parsed SGML document cannot be exactly reproduced. That is true of XML to a lesser extent, as we have already seen. With SGML, however, there are many more things that the parser either elides from the raw input or inserts where they should have been:

There are many shortcuts available to an SGML document author that are invisible on the other side of the parser: the element

<greeting>Hello, World!</greeting>

for example, could also be written as

<greeting>Hello, World!</>

or even

<greeting/Hello, World!/

An entire element tag can be left out if declared as omissible in the DTD. The SGML parser will happily insert it wherever its presence is required.
Single-token attribute values do not need to be quoted.
Shortrefs may be used instead of standard element tags.
SGML is case-insensitive by default, and the parser reports all tokens as though they were uppercased.

For the reasons listed above and others besides, the generated SGML will of necessity be heavily normalized and XML-like. Our main goal will be to make it valid.

There are four additional considerations in element tag formatting compared to the XML case:

If the element is declared as empty or if a conref attribute is specified, the end tag must not be output.
For elements with a declared content model of cdata, instead of translating the content we merely make sure it contains no </ delimiters.
If there are any shortrefs associated with the element, we turn them off.
And finally, a small concession for human readers: we insert newlines between element tags where the SGML parser has stripped all whitespace from the input.

The resulting element rule has grown quite a bit:

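element #implied
   ; Start tag: the element name, then every specified attribute, always quoted.
   output "<%q"
   using group "translate attributes"
   repeat over specified attributes as a
      output " " || key of attribute a || '="%v(a)"'
   again
   output ">"
   ; Turn off any shortref map associated with the element.
   output "<!USEMAP #EMPTY>"
      unless usemap is (#empty | #none)
   ; In element content the parser has stripped the whitespace, so add a newline.
   output "%n"
      when content is element

   ; Declared CDATA content is not translated, only checked for "</".
   do when content is cdata
      repeat scan "%zc"
      match any ++ => rest lookahead "</"
         output rest
      match "</"
         not-reached message "A CDATA element content cannot contain the string '</'"
      again
   else
      output "%c"
   done
   ; No end tag for EMPTY elements or when a conref attribute is specified.
   output "</%q>"
      unless content is (empty | conref)
   output "%n"
      when content of parent is element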
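
As a rough cross-check of the four considerations, the same decisions can be modelled in a few lines of Python. This is only an illustrative sketch and not part of the library: the Element class, its field names and the serialize function are hypothetical stand-ins for what the SGML parser reports, and the escaping performed by the translation rules is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Element:
    """Hypothetical stand-in for what the parser reports about one element."""
    name: str                                  # as reported by the parser (uppercased)
    attributes: Dict[str, str] = field(default_factory=dict)  # specified attributes only
    children: List[Union["Element", str]] = field(default_factory=list)
    # One of "element", "mixed", "cdata", "empty", "conref" (assumed labels).
    declared_content: str = "mixed"
    has_shortref_map: bool = False


def serialize(el: Element, parent_content: Optional[str] = None) -> str:
    """Return normalized, XML-like SGML for el (no escaping is attempted)."""
    out = ["<" + el.name]
    for key, value in el.attributes.items():
        out.append(' %s="%s"' % (key, value))  # always quote attribute values
    out.append(">")

    if el.has_shortref_map:
        out.append("<!USEMAP #EMPTY>")         # turn shortrefs off
    if el.declared_content == "element":
        out.append("\n")                       # whitespace here was stripped by the parser

    if el.declared_content == "cdata":
        text = "".join(el.children)            # CDATA content holds text only
        if "</" in text:
            raise ValueError("CDATA element content cannot contain '</'")
        out.append(text)
    else:
        for child in el.children:
            if isinstance(child, str):
                out.append(child)
            else:
                out.append(serialize(child, el.declared_content))

    if el.declared_content not in ("empty", "conref"):
        out.append("</" + el.name + ">")       # no end tag for EMPTY or conref
    if parent_content == "element":
        out.append("\n")
    return "".join(out)


doc = Element("DOC", declared_content="element", children=[
    Element("P", {"ID": "X1"}, ["Hello"]),
    Element("BR", declared_content="empty"),
])
print(serialize(doc))
```

The document element is serialized with no parent, so in this model it gets no trailing newline. How the OmniMark rule treats that case is not covered by the sketch.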