Text Encoding Initiative / Bugs / #500 Re-open @xml:space

Sebastian Rahtz - 2013-01-15

note that the resource John mentions _is_ part of the standard release, it's called stripspace.xsl.model - see http://www.tei-c.org/Guidelines/P5/

not commenting (yet) on the rest of the issue.

John P. McCaskey - 2013-01-15

Oh, that's excellent, Sebastian. Thanks!

(typo on that page: "fragmeht")

James Cummings - 2013-01-15

Hi John,

Would you be willing to write a version of how you think this particular bit of the Guidelines should read? If so, please post it here. If you happen to be able to do so in the next 18-20 hours or so it might be able to get into the next release of the Guidelines. I'll have to run it past the rest of the Council of course. There is a basic style guide at http://www.tei-c.org/Activities/Council/Working/tcw24.xml

-James

John P. McCaskey - 2013-01-15

In the next 18-20 hours? I think so. I'll give it a shot.

James Cummings - 2013-01-15

Great! Just post it here and I'll point Council to it when you have. (Obviously we reserve the right to rephrase it even more, etc.) But, as you've pointed out, this is a very difficult and thorny issue and we do want to issue clear guidance for TEI users.

-James

John P. McCaskey - 2013-01-15

Here is a draft. I recommend it get its own heading, 1.3.1.1.5 XML Whitespace.

------------------------------------------------------

The global attribute @xml:space provides a mechanism for signaling to users of an XML file how whitespace, that is, consecutive tab (#x09), space (#x20), carriage return (#x0D) and/or linefeed (#x0A) characters, should be treated. There are only two allowed values, “preserve” and “default”. The first indicates that whitespace in a text node—every carriage return, every tab, etc.—should be maintained as is. The second, or just exclusion of the attribute, indicates that whitespace should be handled as appropriate. What is deemed appropriate is left unspecified by the XML Recommendation.
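
As a rough illustration (not part of the proposed Guidelines text), here is how a processor might look up the effective value of xml:space for an element. It is a Python sketch using the standard library's xml.etree.ElementTree. The function name effective_space and the sample document are made up for the example, and it assumes the XML Recommendation's rule that a declared value applies to everything inside the element unless a descendant overrides it.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved xml: prefix under this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def effective_space(root, target):
    """Return 'preserve' or 'default' for target, inheriting from ancestors.

    Returns None if target is not found under root.
    """
    def walk(elem, inherited):
        # An explicit attribute wins; otherwise the ancestor's value applies.
        current = elem.get(XML_SPACE, inherited)
        if elem is target:
            return current
        for child in elem:
            found = walk(child, current)
            if found is not None:
                return found
        return None

    # With no attribute anywhere, the value is treated as "default".
    return walk(root, "default")


# Hypothetical sample document, for illustration only.
doc = ET.fromstring(
    '<p xml:space="preserve"><l>one</l>'
    '<note xml:space="default"><l>two</l></note></p>'
)
first_l = doc[0]
second_l = doc[1][0]
print(effective_space(doc, first_l))   # preserve
print(effective_space(doc, second_l))  # default
```

Note that "default" here only says that the application's own rules apply; it does not say what those rules are.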

These Guidelines do not specify a default behavior, but examples in the Guidelines generally presume one of two behaviors. For an element that can contain only other elements, a so-called structured element, it is presumed that whitespace has no semantic significance and can and indeed will be removed. For example, in a <choice> element, such as
<choice>
  <sic>1724</sic>
  <corr>1728</corr>
</choice>
no non-whitespace characters are allowed between opening of the <choice> tag and opening of the <sic> tags or between the <sic> and <corr> tags, so any space there has no significance and can be treated as if not there at all. A list of such structured elements is included in the TEI release file stripspace.xsl.model, formatted there for use as an <xsl:strip-space> command for XSL stylesheets.
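
For readers who do not work in XSLT, the effect of stripping whitespace from structured elements can be sketched in Python. This is only an approximation of what <xsl:strip-space> does. The STRUCTURED set below is a hypothetical three-element subset (the real list is in stripspace.xsl.model), the function name is invented, and the TEI namespace is left out to keep the example short.

```python
import xml.etree.ElementTree as ET

# The four XML whitespace characters: space, tab, carriage return, linefeed.
XML_WS = " \t\r\n"

# Hypothetical subset for illustration; the full list of structured
# elements is in the TEI release file stripspace.xsl.model.
STRUCTURED = {"choice", "address", "list"}


def strip_structured(elem):
    """Drop whitespace-only text nodes directly inside structured elements."""
    if elem.tag in STRUCTURED:
        # Text between the start tag and the first child.
        if elem.text is not None and elem.text.strip(XML_WS) == "":
            elem.text = None
        # Text following each child (between children, or before the end tag).
        for child in elem:
            if child.tail is not None and child.tail.strip(XML_WS) == "":
                child.tail = None
    for child in elem:
        strip_structured(child)
    return elem


source = """<choice>
  <sic>1724</sic>
  <corr>1728</corr>
</choice>"""
choice = strip_structured(ET.fromstring(source))
print(ET.tostring(choice, encoding="unicode"))
# <choice><sic>1724</sic><corr>1728</corr></choice>
```

Elements not named in the set are left untouched, so whitespace in mixed-content elements survives this step.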

For an element that can contain both text and other elements, so-called mixed-content elements, examples in the Guidelines presume that whitespace will be normalized. This means that all space, carriage return, linefeed, and tab characters are converted into spaces, all consecutive spaces are then deleted and replaced by one space, and then space immediately after a start tag or immediately before an end tag is deleted. The result is that this encoding,
<persName>
  <forename>Edward</forename>
  <forename>George</forename>
  <surname type="linked">Bulwer-Lytton</surname>,
  <roleName>Baron Lytton of
    <placeName>Knebworth</placeName>
  </roleName>
</persName>
will produce “Edward George Bulwer-Lytton, Baron Lytton of Knebworth”. The space before his name has been removed, a space included between his forenames, the comma preserved, the newlines after his name removed, etc.
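
The normalization just described can be sketched in a few lines of Python, again with xml.etree.ElementTree. This is one possible reading of the three steps (convert, collapse, trim at tag boundaries), not a reference implementation. The function names are invented, the two-space indentation in the sample is an assumption, and the sketch ignores xml:space="preserve" entirely.

```python
import re
import xml.etree.ElementTree as ET

# One or more XML whitespace characters (space, tab, CR, LF).
XML_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse(text):
    """Convert every whitespace run to a single space; None becomes ''."""
    return XML_WS_RUN.sub(" ", text or "")


def normalize(elem):
    """Normalize whitespace in elem and its descendants, in place."""
    children = list(elem)
    # elem.text sits immediately after the start tag: trim on the left.
    text = collapse(elem.text).lstrip(" ")
    if not children:
        # With no children it also sits immediately before the end tag.
        text = text.rstrip(" ")
    elem.text = text
    for index, child in enumerate(children):
        normalize(child)
        tail = collapse(child.tail)
        if index == len(children) - 1:
            # The last child's tail sits immediately before elem's end tag.
            tail = tail.rstrip(" ")
        child.tail = tail
    return elem


source = """<persName>
  <forename>Edward</forename>
  <forename>George</forename>
  <surname type="linked">Bulwer-Lytton</surname>,
  <roleName>Baron Lytton of
    <placeName>Knebworth</placeName>
  </roleName>
</persName>"""

rendered = "".join(normalize(ET.fromstring(source)).itertext())
print(rendered)
# Edward George Bulwer-Lytton, Baron Lytton of Knebworth
```

The single space between the two forenames survives because it lies between an end tag and a start tag, where nothing is trimmed.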

The <persName> element is a mixed-content element; punctuation such as the comma in the above example lies outside any of the child elements. The <address> element, on the other hand, is like <list> and <choice>, a structured element. (These three are included in stripspace.xsl.model.) No punctuation is allowed outside the elements within <address> and the presumed processing behavior is that any space between its components will get removed. An application processing an <address> element is then responsible for adding any necessary space or punctuation between the components of the address. Treating a structured element as a mixed-content one, or vice versa, should be done with care. If it is done, the schema should be customized to record the fact.

Preference for a default whitespace processing other than that described above can be indicated in <encodingDesc>. Preference for strict retention of every space, tab, carriage return and linefeed character in a text node can be signaled with @xml:space=”preserve”. Had this been done on the <persName> example above, the man’s name would get presented on five lines, indented, and with a blank line following. The @xml:space=”preserve” attribute is rarely used in TEI documents because whatever could be accomplished by doing so is generally accomplished with less risk and more precision by using native TEI elements such as <l>, <lb>, <space>, and others.
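
To make the "five lines, indented" remark concrete, here is a small Python sketch of what a processor sees if it honors xml:space="preserve" on the <persName> example. The two-space hierarchical indentation is an assumption made for this example, and the variable names are invented.

```python
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# Two-space indentation is an assumption made for this example.
source = """<persName xml:space="preserve">
  <forename>Edward</forename>
  <forename>George</forename>
  <surname type="linked">Bulwer-Lytton</surname>,
  <roleName>Baron Lytton of
    <placeName>Knebworth</placeName>
  </roleName>
</persName>"""

elem = ET.fromstring(source)
assert elem.get(XML_SPACE) == "preserve"

# With "preserve", the text nodes are concatenated exactly as encoded.
raw = "".join(elem.itertext())

# Keep only the lines that carry visible text, indentation included.
text_lines = [line for line in raw.splitlines() if line.strip(" \t")]
for line in text_lines:
    print(repr(line))
# '  Edward'
# '  George'
# '  Bulwer-Lytton,'
# '  Baron Lytton of'
# '    Knebworth'
```

The preserved string also begins with a newline and ends with a whitespace-only line, which is the blank line mentioned above.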

Sebastian Rahtz - 2013-01-16

I have marked this up as normal ODD XML, and append it below. I have now read it three times, and I believe it tells the truth.

<div>
<head>White space</head>
The global attribute <att>xml:space</att> provides a mechanism for signaling to users
of an XML file how whitespace, that is, consecutive tab (#x09), space
(#x20), carriage return (#x0D) and/or linefeed (#x0A) characters, should be
treated. There are only two allowed values, <val>preserve</val> and
<val>default</val>. The first indicates that whitespace in a text node—every
carriage return, every tab, etc.—should be maintained as is. The second,
or just exclusion of the attribute, indicates that whitespace should be
handled as appropriate. What is deemed appropriate is left unspecified by
the XML Recommendation.
These Guidelines do not specify a default behavior, but examples in the
Guidelines generally presume one of two behaviors. For an element that can
contain only other elements, a so-called structured element, it is presumed
that whitespace has no semantic significance and can and indeed will be
removed. For example, in a <gi>choice</gi> element, such as
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<choice>
<sic>1724</sic>
<corr>1728</corr>
</choice>
</egXML>
no non-whitespace characters are allowed between opening of the <gi>choice</gi>
tag and opening of the <gi>sic</gi> tags or between the <gi>sic</gi> and <gi>corr</gi> tags, so
any space there has no significance and can be treated as if not there at all.
A list of such structured elements is included in the TEI release file
stripspace.xsl.model, formatted there for use as an <gi>xsl:strip-space</gi>
command for XSL stylesheets.
For an element that can contain both text and other elements, so-called
mixed-content elements, examples in the Guidelines presume that whitespace
will be normalized. This means that all space, carriage return, linefeed,
and tab characters are converted into spaces, all consecutive spaces are
then deleted and replaced by one space, and then space immediately after a
start tag or immediately before an end tag is deleted. The result is that
this encoding,
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<persName>
<forename>Edward</forename>
<forename>George</forename>
<surname type="linked">Bulwer-Lytton</surname>,
<roleName>Baron Lytton of
<placeName>Knebworth</placeName>
</roleName>
</persName>
</egXML>
represents <q>Edward George Bulwer-Lytton, Baron Lytton of Knebworth</q>.
The space before his name has been removed, a space included between his
forenames, the comma preserved, the newlines after his name removed, etc. 
The <gi>persName</gi> element is a mixed-content element; punctuation such as the
comma in the above example lies outside any of the child elements. The
<gi>address</gi> element, on the other hand, is like <gi>list</gi> and <gi>choice</gi>, a
structured element. (These three are included in stripspace.xsl.model.) No
punctuation is allowed outside the elements within <gi>address</gi> and the
presumed processing behavior is that any space between its components will
get removed. An application processing an <gi>address</gi> element is then
responsible for adding any necessary space or punctuation between the
components of the address. Treating a structured element as a mixed-content
one, or vice versa, should be done with care. If it is done, the schema
should be customized to record the fact.
Preference for a default whitespace processing other than that described
above can be indicated in <gi>encodingDesc</gi>. Preference for strict retention
of every space, tab, carriage return and linefeed character in a text node
can be signaled with <code>xml:space=”preserve”</code>. Had this been done on the
<gi>persName</gi> example above, the man’s name would get presented on five
lines, indented, and with a blank line following. The
<code>xml:space=”preserve”</code> attribute is rarely used in TEI documents because
whatever could be accomplished by doing so is generally accomplished with
less risk and more precision by using native TEI elements such as <gi>l</gi>,
<gi>lb</gi>, <gi>space</gi>, and others.
</div>

Kevin Hawkins - 2013-01-16

Proposed text looks fine to me. In the original ticket, it refers to a new passage that was added recently regarding @xml:space. While I could look for old tickets to see what this might be, I think it would be best if John could state in a new comment on this ticket exactly what text he proposes removing from the Guidelines in addition to adding the new section. Then we can really make sure we're all in agreement.

John P. McCaskey - 2013-01-16

In the example, be sure the XML is hierarchically indented. In the <choice> example, <sic> and <corr> should be indented. In the <persName> example, all the sub-elements should be indented. If you can make the whole block -- the <choice> and the <persName> -- indented as well, all the better. The indents are needed to illustrate points in the text.

In the persName example, you changed "produce" to "represents". The point is that a certain processing will render a certain encoding a certain way. The point is about how a certain representation will be presented. To me "represents" is a pre-rendering concept. How about "will be rendered as" or "will be presented as"? (The second would match "would get presented" used in the final paragraph.)

Add the second comma here: "on the other hand, is, like"

Add this comma: "<gi>address</gi>, and"

Sebastian Rahtz - 2013-01-16

the pretty-printing of <egXML> is automatic, it will do the right thing

produce vs represents: we can't talk about output here. the question is to do
with what you see in the original, and how you are encoding it. but i see what you mean.
it's hard.

John P. McCaskey - 2013-01-16

To answer Kevin:

I propose that in the version of 1.3.1.1.4 Other global attributes, contained in this revision: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ST.html#STEC, my submission here replace the passage beginning "The XML Recommendation defines whitespace as . . ." and ending " . . . is not then permitted to introduce additional spaces," (which is the very end of 1.3.1.1.4).

I also propose that 1.3.1.1.4 be split into one section for @xml:base and one for @xml:space. Sebastian's proposal that the second be headed "White space" is fine.

In the earlier Feature Request ticket, 3554294, I suggested the entry for @xml:space on the reference page for global attributes be changed to this:

----

default: A processor should treat XML whitespace as appropriate.

preserve: A processor should preserve all XML whitespace.

Note:

Use of "preserve" is seldom appropriate in TEI documents. Whatever could be accomplished by using it is usually accomplished with less risk and more precision by using native TEI elements such as <l>, <lb>, <space> and others.

Neither XML nor TEI specifies default processing for XML whitespace. Examples in the TEI P5 Guidelines, however, presume that such whitespace will be normalized according to conventional rules for normalizing whitespace in mixed-content XML structures. If encoders expect applications to process whitespace otherwise, this should be noted in <encodingDesc>. Unless indicated otherwise by such a statement or by xml:space='preserve', consumers of TEI encodings should normalize XML whitespace.

For further background and recommendations, see XML Whitespace in the TEI Wiki (http://wiki.tei-c.org/index.php/XML_Whitespace) and the XML specification (http://www.w3.org/TR/REC-xml/#sec-white-space).

----

This change to the reference page may not be necessary if the Guidelines are being changed, though I do still think there is value in a link to the Wiki page, a fuller explication intended more for schema-designers and authors of processing software, the kind of people who would read the reference page and not just the Guidelines. I just don't know if links to the Wiki are acceptable.

John P. McCaskey - 2013-01-16

> produce vs represents: we can't talk about output here.

But of course it is that very boundary between representation and presentation that is here our subject. And that sentence in particular is telling the reader what will happen on the presentation side of that line. If you are OK with "would get presented as" in the final paragraph then you have to be OK with "would be presented as" in the earlier one.

Sebastian Rahtz - 2013-01-16

i take the point.

i am bringing the prose up in the pre-release so
we can see it in context

James Cummings - 2013-01-16

A candidate version added at: http://teijenkins.hcmc.uvic.ca:8080/job/TEIP5-Documentation/ws/Guidelines-web/en/html/ST.html#index-body.1_div.1_div.3_div.1_div.1_div.5

John P. McCaskey - 2013-01-16

The spellings "behavior" and "behaviour" are both used.

Curly quote marks around one of the xml:space=”preserve” should be straightened.

I'm disappointed "If it is done, the schema should be customized to record the fact." was removed. It seems wrong for the Guidelines to condone use of an element in a way that would violate an element's declaration. "Oh, go ahead, just be careful" seems too mild.

Should 1.3.1.1.4 now be renamed?

Sebastian Rahtz - 2013-01-16

the reason for taking out "the schema should be customized to record
the fact" was that we don't explain what that actually means, or how to do it,
so it seemed safer to omit now; we can expand on it in a later release if we can find
suitable wording.

this process isn't ideal, editing prose just hours before a release, so my feeling at least
is to be cautious. it's not the last chance at it ever, after all.

but others are editing it as I speak.

Sebastian Rahtz - 2013-01-16

in the example
<persName>
<forename>Edward</forename>
<forename>George</forename>
<surname type="linked">Bulwer-Lytton</surname>,
<roleName>Baron Lytton of
<placeName>Knebworth</placeName>
</roleName>
</persName>
why do we say "The space before his name has been removed"? surely that whitespace after <persName> before <forename> is as valid as any other in here?

John P. McCaskey - 2013-01-16

Actually no. In the standard algorithm for normalizing whitespace in mixed-content elements, that space would be trimmed.

John P. McCaskey - 2013-01-16

The first text node in an element gets left-trimmed. The last text node in an element gets right-trimmed. A text node that is both first and last, i.e., is the only node in the element, gets left- and right-trimmed.

John P. McCaskey - 2013-01-16

I should have said: If a text node is first (or last) in an element, it gets left- (or right-) trimmed.
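
Stated as code, the corrected rule might look like the Python sketch below. It models an element's content as a plain list in which strings stand for text nodes and anything else stands for a child element; that representation and the name trim_edges are invented for the illustration.

```python
# The four XML whitespace characters: space, tab, carriage return, linefeed.
XML_WS = " \t\r\n"


def trim_edges(nodes):
    """Apply the edge-trimming rule to the child nodes of one element.

    nodes is a list in document order: str items are text nodes, any
    other item stands for a child element (hypothetical representation).
    """
    result = list(nodes)
    # A text node that is first in the element gets left-trimmed.
    if result and isinstance(result[0], str):
        result[0] = result[0].lstrip(XML_WS)
    # A text node that is last in the element gets right-trimmed.
    if result and isinstance(result[-1], str):
        result[-1] = result[-1].rstrip(XML_WS)
    return result


child = ("element", "forename")  # placeholder for a child element

print(trim_edges(["\n  ", child, " ", child, ",\n"]))
# ['', ('element', 'forename'), ' ', ('element', 'forename'), ',']
print(trim_edges(["  Knebworth\n"]))
# ['Knebworth']
print(trim_edges([child, " tail "]))
# [('element', 'forename'), ' tail']
```

A text node that is the only node in the element is both first and last, so it gets trimmed on both sides without needing a separate case.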

Sebastian Rahtz - 2013-01-16

draft now at http://bits.nsms.ox.ac.uk:8080/jenkins/job/TEIP5/lastSuccessfulBuild/artifact/release/doc/tei-p5-doc/en/html/ST.html#index-body.1_div.1_div.3_div.1_div.1_div.5

Sebastian Rahtz - 2013-01-16

no, cancel that, it's still building :-{

Sebastian Rahtz - 2013-01-17

latest at http://bits.nsms.ox.ac.uk:8080/jenkins/job/TEIP5/lastSuccessfulBuild/artifact/release/doc/tei-p5-doc/en/html/ST.html#STGAxs, now stable

Lou Burnard - 2013-01-26

Thanks for proposing a revision, which is now in the latest release. Closing this ticket on the assumption the revision is at least not wrong, or not wrong in the same way.

Lou Burnard - 2013-01-26

status: open --> closed-accepted

Re-open @xml:space - ID: 3554294
