As even the birds know by now, XML is much simpler than (and therefore obviously superior to) SGML. All the things we need to take care of have already been listed in the introduction. Let's handle them one by one.
First, we need to escape all the meta-characters present in the content, and translate rules are perfect for the purpose:

   translate "&"
      output "&amp;"

   translate "<"
      output "&lt;"
Some people prefer to escape the greater-than characters as well, for symmetry, but we'll keep our serializer minimal. To escape non-ASCII characters in the input, we need another translate rule. We shall assume the input is UTF-8 encoded, so we need to import the omutf8 library module:

   include "utf8pat.xin"

   translate (lookahead [any \ "%10#%13#" | "%32#" to "%127#"]) (utf8-char => c)
      output "&#" || "d" % utf8-char-number c || ";"
The translate rules above can handle both the element content and the attribute values, because they largely require the same treatment. The following rule, however, is better invoked for attribute values only, because the quotes and newlines in the content look much better when they're left unescaped. Inside an attribute, on the other hand, an unescaped quote character would cause a markup error, while a tab or a newline character would be indistinguishable from a space character.

   group "translate attributes"

   translate '"'
      output "&quot;"

   translate ["%9#%10#%13#"] => whitespace-char
      output "&#x" || "16rd" % binary whitespace-char || ";"

   group #implied
In order to actually activate these translate rules on attribute values, we must replace the attribute a expression in our element rule by "%v(a)" and select the proper group. And we can take care of the empty elements at the same time:

   element #implied
      output "<%q"
      using group "translate attributes"
         repeat over specified attributes as a
            output " " || key of attribute a || '="%v(a)"'
         again
      output content is empty-tag -> "/>%c" | ">%c</%q>"
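To illustrate (the element and its data are invented for the example), an element written in the input as <note status='a "quoted" word'>bread & butter</note>, together with an empty element, would come out roughly as:

   <note status="a &quot;quoted&quot; word">bread &amp; butter</note>
   <br/>

The attribute value goes through the "translate attributes" group, while the content goes through the default translate rules.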
Elements and textual content are not the only things we need to handle. If there are any external entity references in the input, they must be processed. External data entities are easier to handle; the following rule simply reproduces the original entity reference:
   external-data-entity #implied
      output "&%q;"
Let us now look into external text entities. In the absence of any user-defined rule, OmniMark would attempt to expand the entity value and we'd get a monolithic XML document as the result. We may instead want to reproduce the top-level document with unexpanded entity references. In that case, we cannot simply output a reference to the same entity, because the output of an external-text-entity rule is fed back to the parser. The effect would be a self-referential loop, which the parser would detect and report as a markup error. To prevent this, we must redirect the output to the stream where the parser output is going. If that stream is #main-output, for example, the rule would be:

   external-text-entity #implied when entity is general
      put #main-output "&%q;"
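For example (the entity and file names are invented), given a document that declares an external text entity and references it in the content,

   <!DOCTYPE book [ <!ENTITY chap1 SYSTEM "chap1.xml"> ]>
   <book>&chap1;</book>

the rule above keeps the reference intact, so the root element is serialized as <book>&chap1;</book>; without it, the parser would expand the reference and inline the entire contents of chap1.xml.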
Our markup is now guaranteed to be well-formed. We can stop at this point, and we'll have a perfectly good XML normalizer: the output won't be quite the same as the input, but to most applications out there that don't care for comments and processing instructions, it will look exactly the same. The rest is optional, and depends on what kind of output is needed.
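None of the rules above run on their own; they need a process rule to feed the parser. A minimal driver could look like the following sketch. The file name is a placeholder, and the parser output goes to #main-output, as assumed by the external-text-entity rule above:

   ; Minimal driver sketch: parse the input document and let the
   ; rules above serialize it to #main-output.
   process
      do xml-parse scan file "input.xml"
         output '<?xml version="1.0"?>%n'
         output "%c"
      done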
If any CDATA marked sections were present in the input, our normalizer would convert them to normal content with individual characters escaped. If we want to preserve the marked sections, we need the following rule:

   marked-section cdata
      output "<![CDATA["
      repeat scan "%zc"
      match any ++ => rest lookahead ("]]>" | value-end)
         output rest
      match "]]>"
         not-reached message "A CDATA marked section cannot contain the string ']]>'"
      again
      output "]]>"
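As an illustration (the fragment is ours), with this rule in place a marked section such as

   <![CDATA[if (a < b && c > 0) return;]]>

passes through verbatim; without it, the same text would be serialized as ordinary escaped content, along the lines of

   if (a &lt; b &amp;&amp; c > 0) return;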
The rule uses %zc instead of %c because the latter would run the translate rules on the content of the marked section, which is not what we want.
Finally, we need a few more rules if we want to reproduce all the comments and processing instructions exactly the way they were in the input XML document:

   markup-comment
      output "<!--"
      repeat scan "%zc"
      match [any \ "-"]+ => rest
         output rest
      match "-" lookahead ("-" | value-end)
         not-reached message "An XML comment cannot contain the string '--'"
      match "-"
         output "-"
      again
      output "-->"
      output "%n" when number of current elements = 0

   processing-instruction any* => pi
      output "<?%g(pi)?>"
      output "%n" when number of current elements = 0
All the rules above still can't reproduce the input XML document exactly. There are some things that the XML parser simply cannot pass to the application. For example, we assume that there is only a single space before each attribute, but in fact there could have been any amount of whitespace there. We also cannot tell if single or double quotes were used for attribute values. Finally, the internal entity references are silently replaced by the XML parser and we have no way of reproducing them.
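To make this concrete (the fragment is invented for the example), suppose the input declares <!ENTITY greeting "Hello"> in its internal subset and contains

   <p   class = 'intro'>&greeting;, world</p>

Our serializer would give this back as

   <p class="intro">Hello, world</p>

with the extra whitespace collapsed, the single quotes turned into double quotes, and the internal entity reference expanded.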
One thing that we can guarantee is that the XML serialization process is idempotent: if we re-parse and re-serialize the serialized XML output, we'll get the same result.