#374 @xml:space

John P. McCaskey

I suggest the following changes to the documentation for @xml:space:

default: A processor should treat XML whitespace as appropriate.

preserve: A processor should preserve all XML whitespace.


Use of "preserve" is seldom appropriate in TEI documents. Whatever could be accomplished by using it is usually accomplished with less risk and more precision by using native TEI elements such as <l>, <lb>, <space> and others.

Neither XML nor TEI specifies default processing for XML whitespace. Examples in the TEI P5 Guidelines, however, presume that such whitespace will be normalized according to conventional rules for normalizing whitespace in mixed-content XML structures. If encoders expect applications to process whitespace otherwise, this should be noted in <encodingStmt>. Unless indicated otherwise by such a statement or by xml:space='preserve', consumers of TEI encodings should normalize XML whitespace.

For further background and recommendations, see XML Whitespace in the TEI Wiki (http://wiki.tei-c.org/index.php/XML_Whitespace) and the XML specification (http://www.w3.org/TR/REC-xml/#sec-white-space).
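
To make the proposed note concrete, here is a minimal sketch in Python of one reading of "normalize XML whitespace unless xml:space='preserve' applies". The ticket does not spell out the conventional rules, so the behavior below is an assumption: every run of spaces, tabs, and line breaks is collapsed to a single space, and nothing is trimmed at element boundaries. The names `collapse` and `normalize` are hypothetical; only the standard-library `xml.etree.ElementTree` and `re` modules are used.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the predeclared XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse(text):
    """Collapse each run of XML whitespace to one space (None stays None)."""
    return None if text is None else _WS_RUN.sub(" ", text)


def normalize(elem, preserve=False):
    """Collapse whitespace in place, skipping xml:space='preserve' subtrees.

    Assumption: "normalizing" means collapsing runs only; leading and
    trailing whitespace is not trimmed here.
    """
    # xml:space is inherited: without the attribute, the parent's mode applies.
    mode = elem.get(XML_SPACE)
    if mode is not None:
        preserve = (mode == "preserve")
    if not preserve:
        elem.text = collapse(elem.text)
    for child in elem:
        normalize(child, preserve)
        # A child's tail is character data of this element, so this
        # element's mode governs it, not the child's.
        if not preserve:
            child.tail = collapse(child.tail)
    return elem


doc = ET.fromstring(
    "<p>Some\n   <hi>wrapped</hi>\n   text\n"
    "<eg xml:space='preserve'>a   b</eg></p>"
)
normalize(doc)
print(ET.tostring(doc, encoding="unicode"))
```

In this sketch the line breaks and indentation around <hi> collapse to single spaces, while the three spaces inside the (illustrative) <eg> element are left alone. A fuller implementation would also need a policy for whitespace at the start and end of block-level elements, which is deliberately left out here.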


  • Lou Burnard

    • milestone: --> GREEN
    • status: open --> open-accepted
  • Lou Burnard

    • assigned_to: nobody --> rwelzenb
  • compare to http://purl.org/TEI/BUGS/3223636

    The language explaining the values "default" and "preserve" has been clarified, but not exactly as John proposes here. Instead, we took the language directly from the XML spec:

    default: signals that the application's default white-space processing modes are acceptable
    preserve: indicates the intent that applications preserve all white space

    The meaningful difference here is in the explanation for default. I think the version we used is preferable to John's above because it is clear that an application's handling of whitespace is based on its default behavior (as opposed to an undefined notion of "appropriate").

    Based on the comments on the other ticket, we added a note suggesting, somewhat ominously and cryptically, that many parsers don't handle xml:space correctly. But now I don't think that sentence is sufficient to be useful. I thought of adding something like "Therefore, although xml:space signals an intention about whitespace, it cannot guarantee that whitespace really will be handled as indicated" but that seemed redundant.

    John's suggested note above certainly clarifies the situation, but to me it feels quite heavy-handed for an attribute spec. It seems like the spec should indicate what is and is not allowed, but should not comment on how typical it is to use this attribute. If we incorporate this language, I recommend that we instead add it to the prose of the Guidelines.

  • I agree it is more appropriate to discuss this in the Guidelines rather than in just the spec and a wiki.

    There are two big problems that I am trying to address.

    The first is that there really are landmines here. If the encoding team and the stylesheet writers are working closely enough together and everyone is willing to take a try-this-and-see-what-happens, we'll-find-workarounds-when-something-breaks approach, the mines can be circumvented.

    But the situation is then precarious. If another team tries reading those TEI files, things can blow up. Indeed, after I broached this on the mailing list, I got a message from someone saying: Wow, thanks! Now I see why we had a whole bunch of what we thought were fine TEI files get subtly but then irreversibly corrupted in a transfer to a new system.

    The situation is even more precarious given the conventions that have developed among (I'd say most) XSL programmers -- conventions that mis-process TEI files. So the best practice now is: If you XML encoders are assuming some specific space handling, you darn well better tell us programmers what it is.

    The second problem is that examples in the Guidelines *universally assume space will get processed a certain way and this is nowhere documented*. I think that is a big problem.

    I got to all of this because I got bit. I am writing a stylesheet intended to work with TEI files I don't control. I finally figured out that those files were expecting me to do something I wasn't doing and which was nowhere documented. It took me a long time to figure out the unspecified algorithm. It's a standard algorithm, but it took me a long time to find.

    The best practice for definers of XML vocabularies is to tell producers how space is to be encoded and tell consumers what that encoding is. If that can't be standardized (I think it easily can be for TEI, as the examples in the Guidelines indicate), then they should provide a mechanism for the coordination.
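
The kind of silent corruption described in this comment can be illustrated with a small Python sketch. The comment does not say which programming conventions are meant; the one assumed here, purely for illustration, is discarding every text node that contains only whitespace, as a blanket whitespace-stripping rule in a stylesheet would. The helper name `strip_whitespace_only` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_whitespace_only(elem):
    """Drop, in place, every text node that contains only whitespace."""
    if elem.text is not None and not elem.text.strip():
        elem.text = None
    for child in elem:
        strip_whitespace_only(child)
        # The tail is the text that follows the child inside its parent.
        if child.tail is not None and not child.tail.strip():
            child.tail = None
    return elem


# Mixed content: the single space between the two <hi> elements is significant.
source = "<p><hi>very</hi> <hi>big</hi></p>"

before = "".join(ET.fromstring(source).itertext())
after = "".join(strip_whitespace_only(ET.fromstring(source)).itertext())

print(repr(before))
print(repr(after))
```

Under that assumption the space separating the two words is a whitespace-only text node, so it is dropped and the words run together. The input is valid and nothing fails, which is why this sort of damage is easy to miss until the files have already been transformed and stored.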

  • James Cummings

    Council (Oxford 2012-09) accepts that clarification may be necessary, but some has already taken place with regard to @xml:space for the next release. RW is assigned to double check the changes from an earlier ticket on this and report back to Council if additional changes are needed before implementing them.

    • status: open-accepted --> closed-accepted