#500 Re-open @xml:space - ID: 3554294

closed-accepted
nobody
5
2013-01-26
2013-01-15
John P. McCaskey
No

The explanations added for @xml:space still need work. I'm afraid the author was laboring under misunderstandings about @xml:space comparable to the ones I had on my first of many forays into the whitespace forest.

The three bulleted points are untrue. And the advice about using "preserve" in transcriptions, if taken literally, is sure to mislead.

Some clarifications, reminders, principles:

- XML defines only "preserve" and "default" for @xml:space. It does not have, for example, xsd's "collapse" or "replace."
- "preserve", "collapse", "normalize", and "trim" are different.
- XML does not define the behavior of "default". If an XML processor's handling of "default" is predictable, it is for some other reason -- convention, quirks everyone knows about, programming culture, standard settings, some other spec, whatever.
- There is no such thing as a whitespace character. There are whitespace characters (plural).

Bullet 1:

Yes, the behavior of XML processors is generally predictable, but the expected behavior is not to PRESERVE but to COLLAPSE whitespace. In text nodes, XML considers whitespace significant, but not the individual whitespace characters. So processors will generally COLLAPSE "carriage return - tab - space - space - space" and treat that string as if it had been one space character. To PRESERVE that sequence means to retain all five characters as is. (REPLACE is to convert it into five space characters.)
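
To make the three terms concrete, here is a minimal Python sketch applied to the five-character sequence mentioned above. The function names are illustrative only and follow the usage in this post, not the definitions of any particular specification.

```python
import re

# The sequence from the post: carriage return, tab, space, space, space.
SAMPLE = "\r\t   "


def preserve(text: str) -> str:
    """Keep every whitespace character exactly as it is."""
    return text


def replace(text: str) -> str:
    """Turn each tab, linefeed, or carriage return into a space, one for one."""
    return re.sub(r"[\t\n\r ]", " ", text)


def collapse(text: str) -> str:
    """Turn each run of whitespace characters into a single space."""
    return re.sub(r"[\t\n\r ]+", " ", text)


print(repr(preserve(SAMPLE)))  # '\r\t   '  (five characters, unchanged)
print(repr(replace(SAMPLE)))   # '     '    (five spaces)
print(repr(collapse(SAMPLE)))  # ' '        (one space)
```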

Bullet 2:

PRESERVE and COLLAPSE are also different from TRIM and NORMALIZE. Whether the processor will trim -- and that's the crucial and tricky piece in mixed-content documents -- is less predictable. A stock XSL transformation will, but one designed for mixed-content documents might not. But again, no processor will by default preserve -- collapse yes, trim probably (to frequent disappointment), preserve definitely not.
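
A short Python sketch of why trimming disappoints in mixed content. It assumes "normalize" means collapse followed by trim, which is what XPath's normalize-space() does; the sample strings and function names are made up for illustration.

```python
import re

XML_WS = "\t\n\r "  # the four characters XML treats as whitespace


def trim(text: str) -> str:
    """Remove leading and trailing whitespace, leave the interior alone."""
    return text.strip(XML_WS)


def collapse(text: str) -> str:
    """Turn each run of whitespace characters into a single space."""
    return re.sub(r"[\t\n\r ]+", " ", text)


def normalize(text: str) -> str:
    """Collapse, then trim."""
    return trim(collapse(text))


SAMPLE = "\n    Baron  Lytton of\n\tKnebworth\n  "
print(repr(trim(SAMPLE)))       # 'Baron  Lytton of\n\tKnebworth'
print(repr(collapse(SAMPLE)))   # ' Baron Lytton of Knebworth '
print(repr(normalize(SAMPLE)))  # 'Baron Lytton of Knebworth'

# Three adjacent text nodes, as in <hi>salt</hi> and <hi>pepper</hi>.
# Trimming each node separately loses the spaces around "and".
pieces = ["salt", " and ", "pepper"]
print("".join(trim(p) for p in pieces))      # saltandpepper
print("".join(collapse(p) for p in pieces))  # salt and pepper
```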

Bullet 3 and the list example:

It is not true that XML generally assumes whitespace between elements is insignificant. By default XSL must preserve whitespace nodes as significant. To reverse this, the programmer must insert <xsl:strip-space elements="*"/>. Many, many XSL programmers have never dealt with mixed-content XML and very, very few with a hybrid like TEI, where mixed-content and structured vary element by element. Most XSL programmers are trained to work in corporate IT departments where structured data is all they will ever see. They will insert the global strip-space command because that's what they've always done and forget why or even that it is there.

So a lot of XML programming culture assumes whitespace between elements is insignificant, but the programming tools by default assume the exact opposite.

It is wrong to say that "not all processors can detect [the significance of inter-element whitespace] reliably." Processors will treat the whitespace exactly as told. There is no detecting for the processor to do. If <xsl:strip-space> (or some other schema-communicating instruction) tells them to strip space, they will. Otherwise they won't. Authors of vocabularies are responsible for telling processors which elements are mixed-content and which are structured. (Sebastian posted this once, but I don't believe it became a standard part of a release.)
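
The "exactly as told" behavior can be imitated in a few lines of Python. This is only a sketch that uses the standard library's ElementTree as a stand-in for an XSLT processor; strip_space is a hypothetical helper, not part of any TEI or XSLT tool. It removes whitespace-only text nodes only inside the elements it is told are structured and touches nothing else.

```python
import xml.etree.ElementTree as ET

XML_WS = "\t\n\r "  # the four characters XML treats as whitespace


def is_ws_only(text):
    """True for a text node that holds nothing but whitespace."""
    return text is not None and text.strip(XML_WS) == ""


def strip_space(root, structured):
    """Drop whitespace-only text nodes whose parent is a listed element."""
    for elem in root.iter():
        if elem.tag not in structured:
            continue  # not listed: leave every text node as it is
        if is_ws_only(elem.text):
            elem.text = None
        for child in elem:
            # A child's tail is a text node belonging to elem.
            if is_ws_only(child.tail):
                child.tail = None


doc = ET.fromstring(
    "<p>See <choice>\n  <sic>1724</sic>\n  <corr>1728</corr>\n</choice>"
    " <hi>here</hi> <hi>now</hi></p>"
)
strip_space(doc, {"choice"})
result = ET.tostring(doc, encoding="unicode")
print(result)
```

The space between the two <hi> elements survives because <p> was not listed as structured. Nothing is detected; the list decides.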

TEI makes such communication difficult, because some elements are mixed-content, some are structured, and -- to make it worse -- some are defined as mixed-content yet treated, even in the Guidelines, as structured. The document will appear conformant but then get corrupted downstream when the consumers assume the element was used as spec'd. (An error like this occurred to one TEI user who contacted me after they corrupted a whole batch of TEI files and didn't realize it soon enough and couldn't even figure out what went wrong.)

The case of transcription:

This sounds incorrect and a trap for the unwary. PRESERVE means "do not REMOVE whitespace characters." It says nothing about INSERTING whitespace between elements where no whitespace exists -- and it shouldn't need to. No consumer of an XML file should ever do that. That would add a node to the tree. Bad. Maybe some application does that as part of a suite of tools, but then that application is not an XML editor. It's an XML changer.

So there should be no need to add 'preserve' just to stop downstream apps from inserting whitespace nodes. But also, adding 'preserve' to a DIV tells processors to retain every space character, tab, and carriage return anywhere in that DIV. It's hard to imagine a scenario where that is intended.

Anyway, anyway -- the attempt to better explain @xml:space, I'm afraid, makes matters worse. Whitespace handling in XML is difficult enough. In TEI it's even more so. The Guidelines need to be super accurate.

Whoever gets stuck with writing a short bit for that passage really needs to understand, I think, everything it took me so long to figure out and that I put in http://wiki.tei-c.org/index.php/XML_Whitespace -- as well as all that is in the external resources linked there, including the discussion about @xml:space on xml-dev, where some really top XML experts got challenged with how to apply xml:space in an architecture like TEI's.

I don't think the new passage needs just a few tweaks. If the subtleties of @xml:space can trip up someone as experienced as whoever wrote that passage, then general readers need specific, careful, and accurate guidance indeed. I think the passage needs substantive reworking.

Discussion

1 2 3 > >> (Page 1 of 3)
  • note that the resource John mentions _is_ part of the standard release, it's called stripspace.xsl.model - see http://www.tei-c.org/Guidelines/P5/

    not commenting (yet) on the rest of the issue.

     
  • Oh, that's excellent, Sebastian. Thanks!

    (typo on that page: "fragmeht")

     
  • James Cummings
    2013-01-15

    Hi John,

    Would you be willing to write a version of how you think this particular bit of the Guidelines should read? If so, please post it here. If you happen to be able to do so in the next 18-20 hours or so, it might be able to get into the next release of the Guidelines. I'll have to run it past the rest of the Council, of course. There is a basic style guide at http://www.tei-c.org/Activities/Council/Working/tcw24.xml

    -James

     
  • In the next 18-20 hours? I think so. I'll give it a shot.

     
  • James Cummings
    2013-01-15

    Great! Just post it here and I'll point Council to it when you have. (Obviously we reserve the right to rephrase it even more, etc.) But, as you've pointed out, this is a very difficult and thorny issue and we do want to issue clear guidance for TEI users.

    -James

     
  • Here is a draft. I recommend it get its own heading, 1.3.1.1.5 XML Whitespace.

    ------------------------------------------------------

    The global attribute @xml:space provides a mechanism for signaling to users of an XML file how whitespace, that is, consecutive tab (#x09), space (#x20), carriage return (#x0D) and/or linefeed (#x0A) characters, should be treated. There are only two allowed values, “preserve” and “default”. The first indicates that whitespace in a text node—every carriage return, every tab, etc.—should be maintained as is. The second, or just exclusion of the attribute, indicates that whitespace should be handled as appropriate. What is deemed appropriate is left unspecified by the XML Recommendation.

    These Guidelines do not specify a default behavior, but examples in the Guidelines generally presume one of two behaviors. For an element that can contain only other elements, a so-called structured element, it is presumed that whitespace has no semantic significance and can and indeed will be removed. For example, in a <choice> element, such as
    <choice>
    <sic>1724</sic>
    <corr>1728</corr>
    </choice>
    no non-whitespace characters are allowed between opening of the <choice> tag and opening of the <sic> tags or between the <sic> and <corr> tags, so any space there has no significance and can be treated as if not there at all. A list of such structured elements is included in the TEI release file stripspace.xsl.model, formatted there for use as an <xsl:strip-space> command for XSL stylesheets.

    For an element that can contain both text and other elements, a so-called mixed-content element, examples in the Guidelines presume that whitespace will be normalized. This means that all space, carriage return, linefeed, and tab characters are converted into spaces, all consecutive spaces are then deleted and replaced by one space, and then space immediately after a start tag or immediately before an end tag is deleted. The result is that this encoding,
    <persName>
      <forename>Edward</forename>
      <forename>George</forename>
      <surname type="linked">Bulwer-Lytton</surname>,
      <roleName>Baron Lytton of
        <placeName>Knebworth</placeName>
      </roleName>
    </persName>
    will produce “Edward George Bulwer-Lytton, Baron Lytton of Knebworth”. The space before his name has been removed, a space included between his forenames, the comma preserved, the newlines after his name removed, etc.

    The <persName> element is a mixed-content element; punctuation such as the comma in the above example lies outside any of the child elements. The <address> element, on the other hand, is like <list> and <choice>, a structured element. (These three are included in stripspace.xsl.model.) No punctuation is allowed outside the elements within <address> and the presumed processing behavior is that any space between its components will get removed. An application processing an <address> element is then responsible for adding any necessary space or punctuation between the components of the address. Treating a structured element as a mixed-content one, or vice versa, should be done with care. If it is done, the schema should be customized to record the fact.

    Preference for a default whitespace processing other than that described above can be indicated in <encodingDesc>. Preference for strict retention of every space, tab, carriage return and linefeed character in a text node can be signaled with @xml:space="preserve". Had this been done on the <persName> example above, the man’s name would get presented on five lines, indented, and with a blank line following. The @xml:space="preserve" attribute is rarely used in TEI documents because whatever could be accomplished by doing so is generally accomplished with less risk and more precision by using native TEI elements such as <l>, <lb>, <space>, and others.
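
The normalization steps described in the draft above can be checked against the <persName> example with a short Python sketch. It assumes hierarchical indentation in the source and uses the standard library's ElementTree; normalize_mixed and rendered are hypothetical names, and the code illustrates the described steps rather than the behavior of any actual TEI processor.

```python
import re
import xml.etree.ElementTree as ET

SOURCE = """<persName>
  <forename>Edward</forename>
  <forename>George</forename>
  <surname type="linked">Bulwer-Lytton</surname>,
  <roleName>Baron Lytton of
    <placeName>Knebworth</placeName>
  </roleName>
</persName>"""

WS_RUN = re.compile(r"[\t\n\r ]+")


def collapse(text):
    """Convert whitespace characters to spaces and squeeze runs to one."""
    return WS_RUN.sub(" ", text or "")


def normalize_mixed(elem):
    """Apply the draft's normalization to elem and all its descendants."""
    children = list(elem)
    # Space immediately after the start tag is deleted.
    text = collapse(elem.text).lstrip(" ")
    if not children:
        # No children: this text also sits right before the end tag.
        text = text.rstrip(" ")
    elem.text = text
    for index, child in enumerate(children):
        normalize_mixed(child)
        tail = collapse(child.tail)
        if index == len(children) - 1:
            # Space immediately before elem's end tag is deleted.
            tail = tail.rstrip(" ")
        child.tail = tail


def rendered(elem):
    """Concatenate all text nodes in document order."""
    return "".join(elem.itertext())


normalized = ET.fromstring(SOURCE)
normalize_mixed(normalized)
preserved = ET.fromstring(SOURCE)  # left untouched, as with "preserve"

print(repr(rendered(normalized)))
print(repr(rendered(preserved)))
```

The first line printed is the single-line name given in the draft. The second keeps every line break and every indent.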

     
  • I have marked this up as normal ODD XML, and append it below. I have now read it three times, and I believe it tells the truth.

    <div>
    <head>White space</head>
    <p>The global attribute <att>xml:space</att> provides a mechanism for signaling to users
    of an XML file how whitespace, that is, consecutive tab (#x09), space
    (#x20), carriage return (#x0D) and/or linefeed (#x0A) characters, should be
    treated. There are only two allowed values, <val>preserve</val> and
    <val>default</val>. The first indicates that whitespace in a text node—every
    carriage return, every tab, etc.—should be maintained as is. The second,
    or just exclusion of the attribute, indicates that whitespace should be
    handled as appropriate. What is deemed appropriate is left unspecified by
    the XML Recommendation.</p>
    <p>These Guidelines do not specify a default behavior, but examples in the
    Guidelines generally presume one of two behaviors. For an element that can
    contain only other elements, a so-called structured element, it is presumed
    that whitespace has no semantic significance and can and indeed will be
    removed. For example, in a <gi>choice</gi> element, such as
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <choice>
    <sic>1724</sic>
    <corr>1728</corr>
    </choice>
    </egXML>
    no non-whitespace characters are allowed between opening of the <gi>choice</gi>
    tag and opening of the <gi>sic</gi> tags or between the <gi>sic</gi> and <gi>corr</gi> tags, so
    any space there has no significance and can be treated as if not there at all.
    A list of such structured elements is included in the TEI release file
    stripspace.xsl.model, formatted there for use as an <gi>xsl:strip-space</gi>
    command for XSL stylesheets.</p>
    <p>For an element that can contain both text and other elements, a so-called
    mixed-content element, examples in the Guidelines presume that whitespace
    will be normalized. This means that all space, carriage return, linefeed,
    and tab characters are converted into spaces, all consecutive spaces are
    then deleted and replaced by one space, and then space immediately after a
    start tag or immediately before an end tag is deleted. The result is that
    this encoding,
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <persName>
    <forename>Edward</forename>
    <forename>George</forename>
    <surname type="linked">Bulwer-Lytton</surname>,
    <roleName>Baron Lytton of
    <placeName>Knebworth</placeName>
    </roleName>
    </persName>
    </egXML>
    represents <q>Edward George Bulwer-Lytton, Baron Lytton of Knebworth</q>.
    The space before his name has been removed, a space included between his
    forenames, the comma preserved, the newlines after his name removed, etc. </p>
    <p>The <gi>persName</gi> element is a mixed-content element; punctuation such as the
    comma in the above example lies outside any of the child elements. The
    <gi>address</gi> element, on the other hand, is like <gi>list</gi> and <gi>choice</gi>, a
    structured element. (These three are included in stripspace.xsl.model.) No
    punctuation is allowed outside the elements within <gi>address</gi> and the
    presumed processing behavior is that any space between its components will
    get removed. An application processing an <gi>address</gi> element is then
    responsible for adding any necessary space or punctuation between the
    components of the address. Treating a structured element as a mixed-content
    one, or vice versa, should be done with care. If it is done, the schema
    should be customized to record the fact.</p>
    <p>Preference for a default whitespace processing other than that described
    above can be indicated in <gi>encodingDesc</gi>. Preference for strict retention
    of every space, tab, carriage return and linefeed character in a text node
    can be signaled with <code>xml:space="preserve"</code>. Had this been done on the
    <gi>persName</gi> example above, the man’s name would get presented on five
    lines, indented, and with a blank line following. The
    <code>xml:space="preserve"</code> attribute is rarely used in TEI documents because
    whatever could be accomplished by doing so is generally accomplished with
    less risk and more precision by using native TEI elements such as <gi>l</gi>,
    <gi>lb</gi>, <gi>space</gi>, and others.</p>
    </div>

     
  • Kevin Hawkins
    2013-01-16

    Proposed text looks fine to me. In the original ticket, it refers to a new passage that was added recently regarding @xml:space. While I could look for old tickets to see what this might be, I think it would be best if John could state in a new comment on this ticket exactly what text he proposes removing from the Guidelines in addition to adding the new section. Then we can really make sure we're all in agreement.

     
  • In the example, be sure the XML is hierarchically indented. In the <choice> example, <sic> and <corr> should be indented. In the <persName> example, all the sub-elements should be indented. If you can make the whole block -- the <choice> and the <persName> -- indented as well, all the better. The indents are needed to illustrate points in the text.

    In the persName example, you changed "produce" to "represents". The point is that a certain processing will render a certain encoding a certain way. The point is about how a certain representation will be presented. To me "represents" is a pre-rendering concept. How about "will be rendered as" or "will be presented as"? (The second would match "would get presented" used in the final paragraph.)

    Add the second comma here: "on the other hand, is, like"

    Add this comma: "<gi>address</gi>, and"

     
  • the pretty-printing of <egXML> is automatic, it will do the right thing

    produce vs represents: we can't talk about output here. the question is to do
    with what you see in the original, and how you are encoding it. but i see what you mean.
    it's hard.

     