#500 Re-open @xml:space - ID: 3554294

closed-accepted
nobody
5
2013-01-26
2013-01-15
No

The explanations added for @xml:space still need work. I'm afraid the author was laboring under misunderstandings about @xml:space comparable to the ones I had on my first of many forays into the whitespace forest.

The three bulleted points are untrue. And the advice about using "preserve" in transcriptions, if taken literally, is sure to mislead.

Some clarifications, reminders, principles:

- XML defines only "preserve" and "default" for @xml:space. It does not have, for example, xsd's "collapse" or "replace."
- "preserve", "collapse", "normalize", and "trim" are different.
- XML does not define "default". If an XML processor's handling of "default" is predictable it is for some other reason -- convention, quirks everyone knows about, programming culture, standard settings, some other spec, whatever.
- There is no such thing as a whitespace character. There are whitespace characters (plural).

Bullet 1:

Yes, the behavior of XML processors is generally predictable, but the expected behavior is not to PRESERVE but to COLLAPSE whitespace. In text nodes, XML considers whitespace significant, but not white space characters. So processors will generally COLLAPSE "carriage return - tab - space - space - space" and treat that string as if it had been one space character. To PRESERVE that sequence means to retain all five characters as is. (REPLACE is to convert it into five space characters.)

Bullet 2:

PRESERVE and COLLAPSE are also different from TRIM and NORMALIZE. Whether the processor will trim -- and that's the crucial and tricky piece in mixed-content documents -- is less predictable. A stock XSL transformation will, but one designed for mixed-content documents might not. But again, no processor will by default preserve -- collapse yes, trim probably (to frequent disappointment), preserve definitely not.
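
To make the distinctions in Bullets 1 and 2 concrete, here is a minimal Python sketch of the five operations as the terms are used in this ticket. The function names are hypothetical and illustrative only; they are not part of any XML processor or spec, and "normalize" is taken here to mean collapse followed by trim.

```python
import re

XML_WS = " \t\r\n"  # the four XML whitespace characters

def preserve(s):
    # Retain every whitespace character exactly as written.
    return s

def replace(s):
    # Convert each tab, carriage return, and linefeed into one space.
    return re.sub(r"[\t\r\n]", " ", s)

def collapse(s):
    # Treat each run of whitespace characters as a single space.
    return re.sub(r"[ \t\r\n]+", " ", s)

def trim(s):
    # Remove leading and trailing whitespace only.
    return s.strip(XML_WS)

def normalize(s):
    # Assumption: normalize = collapse, then trim.
    return trim(collapse(s))

sample = " \r\t  Edward\n\tGeorge \n"
print(repr(preserve(sample)))   # ' \r\t  Edward\n\tGeorge \n'
print(repr(replace(sample)))    # '     Edward  George  '
print(repr(collapse(sample)))   # ' Edward George '
print(repr(trim(sample)))       # 'Edward\n\tGeorge'
print(repr(normalize(sample)))  # 'Edward George'
```

Note that collapse alone leaves one leading and one trailing space; only trim (or normalize) removes them, which is the piece that matters in mixed content.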

Bullet 3 and the list example:

It is not true that XML generally assumes whitespace between elements is insignificant. By default XSL must preserve whitespace nodes as significant. To reverse this, the programmer must insert <xsl:strip-space elements="*"/>. Many, many XSL programmers have never dealt with mixed-content XML and very, very few with a hybrid like TEI, where mixed-content and structured vary element by element. Most XSL programmers are trained to work in corporate IT departments where structured data is all they will ever see. They will insert the global strip-space command because that's what they've always done and forget why or even that it is there.

So a lot of XML programming culture assumes whitespace between elements is insignificant, but the programming tools by default assume the exact opposite.

It is wrong to say that "not all processors can detect [the significance of inter-element whitespace] reliably." Processors will treat the whitespace exactly as told. There is no detecting for the processor to do. If <xsl:strip-space> (or some other schema-communicating instruction) tells them to strip space, they will. Otherwise they won't. Authors of vocabularies are responsible for telling processors which elements are mixed-content and which are structured. (Sebastian posted this once, but I don't believe it became a standard part of a release.)
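
The same point can be seen outside XSL. The sketch below uses Python's standard xml.etree.ElementTree parser, which keeps whitespace-only text between elements unless code explicitly removes it. The helper strip_space is a hypothetical stand-in that only mimics the effect of <xsl:strip-space> for a named set of elements; it is not an XSLT implementation.

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"

def strip_space(root, element_names):
    """Drop whitespace-only text nodes that are direct children of the
    named elements (hypothetical helper, loosely like xsl:strip-space)."""
    for elem in root.iter():
        if elem.tag not in element_names:
            continue
        # Text before the first child element.
        if elem.text is not None and elem.text.strip(XML_WS) == "":
            elem.text = None
        # Text following each child element.
        for child in elem:
            if child.tail is not None and child.tail.strip(XML_WS) == "":
                child.tail = None
    return root

doc = ET.fromstring("<choice>\n  <sic>1724</sic>\n  <corr>1728</corr>\n</choice>")

# By default nothing is stripped: the inter-element whitespace is still there.
before = ET.tostring(doc, encoding="unicode")

# Only after being told which elements are structured is it removed.
strip_space(doc, {"choice"})
after = ET.tostring(doc, encoding="unicode")

print(repr(before))
print(repr(after))  # '<choice><sic>1724</sic><corr>1728</corr></choice>'
```

The parser does no detecting here either: the set of element names passed in plays the role of the schema-communicating instruction.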

TEI makes such communication difficult, because some elements are mixed-content, some are structured, and -- to make it worse -- some are defined as mixed-content yet treated, even in the Guidelines, as structured. The document will appear conformant but then get corrupted downstream when the consumers assume the element was used as spec'd. (An error like this occurred to one TEI user who contacted me after they corrupted a whole batch of TEI files and didn't realize it soon enough and couldn't even figure out what went wrong.)

The case of transcription:

This sounds incorrect and a trap for the unwary. PRESERVE means "do not REMOVE whitespace characters." It says nothing about INSERTING whitespace between elements where no whitespace exists -- and it shouldn't need to. No consumer of an XML file should ever do that. That would add a node to the tree. Bad. Maybe some application does that as part of a suite of tools, but then that application is not an XML editor. It's an XML changer.

So there should be no need to add 'preserve' just to stop downstream apps from inserting whitespace nodes. But also, adding 'preserve' to a DIV tells processors to retain every space character, tab, and carriage return anywhere in that DIV. It's hard to imagine a scenario where that is intended.

Anyway, anyway -- the attempt to better explain @xml:space, I'm afraid, makes matters worse. Whitespace handling in XML is difficult enough. In TEI it's even more so. The Guidelines need to be super accurate.

Whoever gets stuck with writing a short bit for that passage really needs to understand, I think, everything it took me so long to figure out and that I put in http://wiki.tei-c.org/index.php/XML_Whitespace -- as well as all that is in the external resources linked there, including the discussion about @xml:space on xml-dev, where some really top XML experts got challenged with how to apply xml:space in an architecture like TEI's.

I don't think the new passage needs just a few tweaks. If the subtleties of @xml:space can trip up someone as experienced as whoever wrote that passage, then general readers need specific, careful, and accurate guidance indeed. I think the passage needs substantive reworking.

Discussion

  • Sebastian Rahtz

    Sebastian Rahtz - 2013-01-15

    note that the resource John mentions _is_ part of the standard release, it's called stripspace.xsl.model - see http://www.tei-c.org/Guidelines/P5/

    not commenting (yet) on the rest of the issue.

     
  • John P. McCaskey

    Oh, that's excellent, Sebastian. Thanks!

    (typo on that page: "fragmeht")

     
  • James Cummings

    James Cummings - 2013-01-15

    Hi John,

    Would you be willing to write a version of how you think this particular bit of the Guidelines should read? If so please post here. If you happen to be able to do so in the next 18-20 hours or so it might be able to get into the next release of the Guidelines. I'll have to run it past the rest of the Council of course. There is a basic style guide at http://www.tei-c.org/Activities/Council/Working/tcw24.xml

    -James

     
  • John P. McCaskey

    In the next 18-20 hours? I think so. I'll give it a shot.

     
  • James Cummings

    James Cummings - 2013-01-15

    Great! Just post it here and I'll point Council to it when you have. (Obviously we reserve the right to rephrase it even more, etc.) But, as you've pointed out this is a very difficult and thorny issue and we do want to issue clear guidance for TEI users.

    -James

     
  • John P. McCaskey

    Here is a draft. I recommend it get its own heading, 1.3.1.1.5 XML Whitespace.

    ------------------------------------------------------

    The global attribute @xml:space provides a mechanism for signaling to users of an XML file how whitespace, that is, consecutive tab (#x09), space (#x20), carriage return (#x0D) and/or linefeed (#x0A) characters, should be treated. There are only two allowed values, “preserve” and “default”. The first indicates that whitespace in a text node—every carriage return, every tab, etc.—should be maintained as is. The second, or just exclusion of the attribute, indicates that whitespace should be handled as appropriate. What is deemed appropriate is left unspecified by the XML Recommendation.

    These Guidelines do not specify a default behavior, but examples in the Guidelines generally presume one of two behaviors. For an element that can contain only other elements, a so-called structured element, it is presumed that whitespace has no semantic significance and can and indeed will be removed. For example, in a <choice> element, such as
    <choice>
      <sic>1724</sic>
      <corr>1728</corr>
    </choice>
    no non-whitespace characters are allowed between opening of the <choice> tag and opening of the <sic> tags or between the <sic> and <corr> tags, so any space there has no significance and can be treated as if not there at all. A list of such structured elements is included in the TEI release file stripspace.xsl.model, formatted there for use as an <xsl:strip-space> command for XSL stylesheets.

    For an element that can contain both text and other elements, so-called mixed-content elements, examples in the Guidelines presume that whitespace will be normalized. This means that all space, carriage return, linefeed, and tab characters are converted into spaces, all consecutive spaces are then deleted and replaced by one space, and then space immediately after a start tag or immediately before an end tag is deleted. The result is that this encoding,
    <persName>
      <forename>Edward</forename>
      <forename>George</forename>
      <surname type="linked">Bulwer-Lytton</surname>,
      <roleName>Baron Lytton of
        <placeName>Knebworth</placeName>
      </roleName>
    </persName>
    will produce “Edward George Bulwer-Lytton, Baron Lytton of Knebworth”. The space before his name has been removed, a space included between his forenames, the comma preserved, the newlines after his name removed, etc.

    The <persName> element is a mixed-content element; punctuation such as the comma in the above example lies outside any of the child elements. The <address> element, on the other hand, is like <list> and <choice>, a structured element. (These three are included in stripspace.xsl.model.) No punctuation is allowed outside the elements within <address> and the presumed processing behavior is that any space between its components will get removed. An application processing an <address> element is then responsible for adding any necessary space or punctuation between the components of the address. Treating a structured element as a mixed-content one, or vice versa, should be done with care. If it is done, the schema should be customized to record the fact.

    Preference for a default whitespace processing other than that described above can be indicated in <encodingDesc>. Preference for strict retention of every space, tab, carriage return and linefeed character in a text node can be signaled with @xml:space=”preserve”. Had this been done on the <persName> example above, the man’s name would get presented on five lines, indented, and with a blank line following. The @xml:space=”preserve” attribute is rarely used in TEI documents because whatever could be accomplished by doing so is generally accomplished with less risk and more precision by using native TEI elements such as <l>, <lb>, <space>, and others.

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2013-01-16

    I have marked this up as normal ODD XML, and append it below. I have now read it three times, and I believe it tells the truth

    <div>
    <head>White space</head>
    <p>The global attribute <att>xml:space</att> provides a mechanism for signaling to users
    of an XML file how whitespace, that is, consecutive tab (#x09), space
    (#x20), carriage return (#x0D) and/or linefeed (#x0A) characters, should be
    treated. There are only two allowed values, <val>preserve</val> and
    <val>default</val>. The first indicates that whitespace in a text node—every
    carriage return, every tab, etc.—should be maintained as is. The second,
    or just exclusion of the attribute, indicates that whitespace should be
    handled as appropriate. What is deemed appropriate is left unspecified by
    the XML Recommendation.</p>
    <p>These Guidelines do not specify a default behavior, but examples in the
    Guidelines generally presume one of two behaviors. For an element that can
    contain only other elements, a so-called structured element, it is presumed
    that whitespace has no semantic significance and can and indeed will be
    removed. For example, in a <gi>choice</gi> element, such as
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <choice>
    <sic>1724</sic>
    <corr>1728</corr>
    </choice>
    </egXML>
    no non-whitespace characters are allowed between opening of the <gi>choice</gi>
    tag and opening of the <gi>sic</gi> tags or between the <gi>sic</gi> and <gi>corr</gi> tags, so
    any space there has no significance and can be treated as if not there at all.
    A list of such structured elements is included in the TEI release file
    stripspace.xsl.model, formatted there for use as an <gi>xsl:strip-space</gi>
    command for XSL stylesheets.</p>
    <p>For an element that can contain both text and other elements, so-called
    mixed-content elements, examples in the Guidelines presume that whitespace
    will be normalized. This means that all space, carriage return, linefeed,
    and tab characters are converted into spaces, all consecutive spaces are
    then deleted and replaced by one space, and then space immediately after a
    start tag or immediately before an end tag is deleted. The result is that
    this encoding,
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <persName>
    <forename>Edward</forename>
    <forename>George</forename>
    <surname type="linked">Bulwer-Lytton</surname>,
    <roleName>Baron Lytton of
    <placeName>Knebworth</placeName>
    </roleName>
    </persName>
    </egXML>
    represents <q>Edward George Bulwer-Lytton, Baron Lytton of Knebworth</q>.
    The space before his name has been removed, a space included between his
    forenames, the comma preserved, the newlines after his name removed, etc. </p>
    <p>The <gi>persName</gi> element is a mixed-content element; punctuation such as the
    comma in the above example lies outside any of the child elements. The
    <gi>address</gi> element, on the other hand, is like <gi>list</gi> and <gi>choice</gi>, a
    structured element. (These three are included in stripspace.xsl.model.) No
    punctuation is allowed outside the elements within <gi>address</gi> and the
    presumed processing behavior is that any space between its components will
    get removed. An application processing an <gi>address</gi> element is then
    responsible for adding any necessary space or punctuation between the
    components of the address. Treating a structured element as a mixed-content
    one, or vice versa, should be done with care. If it is done, the schema
    should be customized to record the fact.</p>
    <p>Preference for a default whitespace processing other than that described
    above can be indicated in <gi>encodingDesc</gi>. Preference for strict retention
    of every space, tab, carriage return and linefeed character in a text node
    can be signaled with <code>xml:space=”preserve”</code>. Had this been done on the
    <gi>persName</gi> example above, the man’s name would get presented on five
    lines, indented, and with a blank line following. The
    <code>xml:space=”preserve”</code> attribute is rarely used in TEI documents because
    whatever could be accomplished by doing so is generally accomplished with
    less risk and more precision by using native TEI elements such as <gi>l</gi>,
    <gi>lb</gi>, <gi>space</gi>, and others.</p>
    </div>

     
  • Kevin Hawkins

    Kevin Hawkins - 2013-01-16

    Proposed text looks fine to me. In the original ticket, it refers to a new passage that was added recently regarding @xml:space. While I could look for old tickets to see what this might be, I think it would be best if John could state in a new comment on this ticket exactly what text he proposes removing from the Guidelines in addition to adding the new section. Then we can really make sure we're all in agreement.

     
  • John P. McCaskey

    In the example, be sure the XML is hierarchically indented. In the <choice> example, <sic> and <corr> should be indented. In the <persName> example, all the sub-elements should be indented. If you can make the whole block -- the <choice> and the <persName> -- indented as well, all the better. The indents are needed to illustrate points in the text.

    In the persName example, you changed "produce" to "represents". The point is that a certain processing will render a certain encoding a certain way. The point is about how a certain representation will be presented. To me "represents" is a pre-rendering concept. How about "will be rendered as" or "will be presented as"? (The second would match "would get presented" used in the final paragraph.)

    Add the second comma here : "on the other hand, is, like"

    Add this comma: "<gi>address</gi>, and"

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2013-01-16

    the pretty-printing of <egXML> is automatic, it will do the right thing

    produce vs represents: we can't talk about output here. the question is to do
    with what you see in the original, and how you are encoding it. but i see what you mean.
    it's hard.

     
  • John P. McCaskey

    To answer Kevin:

    I propose that in the version of 1.3.1.1.4 Other global attributes, contained in this revision: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ST.html#STEC, my submission here replace the passage beginning "The XML Recommendation defines whitespace as . . ." and ending " . . . is not then permitted to introduce additional spaces," (which is the very end of 1.3.1.1.4).

    I also propose that 1.3.1.1.4 be split into one section for @xml:base and one for @xml:space. Sebastian's proposal that the second be headed "White space" is fine.

    In the earlier Feature Request ticket, 3554294, I suggested the entry for @xml:space on the reference page for global attributes be changed to this:

    ----

    default: A processor should treat XML whitespace as appropriate.

    preserve: A processor should preserve all XML whitespace.

    Note:

    Use of "preserve" is seldom appropriate in TEI documents. Whatever could be accomplished by using it is usually accomplished with less risk and more precision by using native TEI elements such as <l>, <lb>, <space> and others.

    Neither XML nor TEI specifies default processing for XML whitespace. Examples in the TEI P5 Guidelines, however, presume that such whitespace will be normalized according to conventional rules for normalizing whitespace in mixed-content XML structures. If encoders expect applications to process whitespace otherwise, this should be noted in <encodingDesc>. Unless indicated otherwise by such a statement or by xml:space='preserve', consumers of TEI encodings should normalize XML whitespace.

    For further background and recommendations, see XML Whitespace in the TEI Wiki (http://wiki.tei-c.org/index.php/XML_Whitespace) and the XML specification (http://www.w3.org/TR/REC-xml/#sec-white-space).

    ----

    This change to the reference page may not be necessary if the Guidelines are being changed, though I do still think there is value in a link to the Wiki page, a fuller explication intended more for schema-designers and authors of processing software, the kind of people who would read the reference page and not just the Guidelines. I just don't know if links to the Wiki are acceptable.

     
  • John P. McCaskey

    > produce vs represents: we cant talk about output here.

    But of course it is that very boundary between representation and presentation that is here our subject. And that sentence in particular is telling the reader what will happen on the presentation side of that line. If you are OK with "would get presented as" in the final paragraph then you have to be OK with "would be presented as" in the earlier one.

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2013-01-16

    i take the point.

    i am bringing the prose up in the pre-release so
    we can see it in context

     
  • John P. McCaskey

    The spellings "behavior" and "behaviour" are both used.

    Curly quote marks around one of the xml:space=”preserve” should be straightened.

    I'm disappointed "If it is done, the schema should be customized to record the fact." was removed. It seems wrong for Guidelines to condone use of an element in a way that would violate an element's declaration. "Oh, go ahead, just be careful" seems too mild.

    Should 1.3.1.1.4 now be renamed?

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2013-01-16

    the reason for taking out "the schema should be customized to record
    the fact" was that we don't explain what that actually means, or how to do it,
    so it seemed safer to omit now; we can expand on it in a later release if we can find
    suitable wording.

    this process isn't ideal, editing prose just hours before a release, so my feeling at least
    is to be cautious. it's not the last chance at it ever, after all.

    but others are editing it as I speak.

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2013-01-16

    in the example
    <persName>
    <forename>Edward</forename>
    <forename>George</forename>
    <surname type="linked">Bulwer-Lytton</surname>,
    <roleName>Baron Lytton of
    <placeName>Knebworth</placeName>
    </roleName>
    </persName>
    why do we say "The space before his name has been removed"? surely that whitespace after <persName> before <forename> is as valid as any other in here?

     
  • John P. McCaskey

    Actually no. In the standard algorithm for normalizing whitespace in mixed-content elements, that space would be trimmed.

     
  • John P. McCaskey

    The first text node in an element gets left-trimmed. The last text node in an element gets right-trimmed. A text node that is both first and last, i.e., is the only node in the element, gets left- and right-trimmed.

     
  • John P. McCaskey

    I should have said: If a text node is first (or last) in an element, it gets left- (or right-) trimmed.
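
The normalization and trimming rules described in this thread can be sketched in Python as follows. This is only an illustration under stated assumptions: normalized_text is a hypothetical helper written for this example, it uses the standard xml.etree.ElementTree module, and it is not code from the TEI stylesheets or from any XSL processor.

```python
import re
import xml.etree.ElementTree as ET

WS_RUN = re.compile(r"[ \t\r\n]+")

def normalized_text(elem):
    """Return the text of a mixed-content element with whitespace
    collapsed, left-trimming a text node that is first in its element
    and right-trimming a text node that is last in its element."""
    children = list(elem)
    # elem.text is the text node before the first child, so it is "first".
    text = WS_RUN.sub(" ", elem.text or "").lstrip(" ")
    if not children:
        # It is also the last node when there are no child elements.
        text = text.rstrip(" ")
    parts = [text]
    for index, child in enumerate(children):
        parts.append(normalized_text(child))
        # child.tail is the text node that follows this child.
        tail = WS_RUN.sub(" ", child.tail or "")
        if index == len(children) - 1:
            tail = tail.rstrip(" ")  # last text node in elem
        parts.append(tail)
    return "".join(parts)

PERS_NAME = """<persName>
  <forename>Edward</forename>
  <forename>George</forename>
  <surname type="linked">Bulwer-Lytton</surname>,
  <roleName>Baron Lytton of
    <placeName>Knebworth</placeName>
  </roleName>
</persName>"""

root = ET.fromstring(PERS_NAME)
rendered = normalized_text(root)
print(rendered)  # Edward George Bulwer-Lytton, Baron Lytton of Knebworth

# For contrast: keeping every character, as xml:space="preserve" asks.
preserved = "".join(root.itertext())
print(repr(preserved))
```

The whitespace after <persName> and before the first <forename> is a first text node, so it is trimmed away entirely, while the whitespace between the two <forename> elements is neither first nor last and so survives as a single space.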

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2013-01-16

    no, cancel that, it's still building :-{

     
  • Lou Burnard

    Lou Burnard - 2013-01-26

    Thanks for proposing a revision, which is now in the latest release. Closing this ticket on the assumption the revision is at least not wrong, or not wrong in the same way.

     
  • Lou Burnard

    Lou Burnard - 2013-01-26
    • status: open --> closed-accepted