Re: [Sax-devel] Character Entities vs. General Entities

Regarding the issues raised by Jeff Rafter:

Since the SGML spec (ISO 8879 + TCs) isn't normative for XML, it can't be
used to settle problems with the latter.  However, since XML is derived
from SGML, and valid XML documents are required to be valid SGML
documents, the lack of doubt regarding "intellectual heritage" can still
be invoked to understand "intent".  So, what follows is an SGML view.

There are a number of issues here.

1.  Incorrect Terminology.

"Character entity" is a pernicious neologism.  There is no such thing.
There are character references, and there are entity references.  They are
conceptually and operationally distinct categories, despite the congruence
of the leading and trailing characters in their respective delimiters.

Note that the DOM3 draft cited is confused on this issue of usage when it
says:

 "Within the character data of a document (outside of markup), any
  characters that cannot be represented directly are replaced with
  character references. Occurrences of '<' and '&' are replaced by the
  predefined entities &lt; and &amp;."

suggesting, inter alia, that "predefined entities" and "character
references" are the same thing.  They are not.  The meaning of "replaced
by character references" is that occurrences of '<' and '&' are replaced
by *character references* &#60; and &#38; respectively.  This, in fact, is
what SGML tools such as "sgmlnorm" (of the SP/OpenSP package) do.
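
As a rough illustration of that reading (my own sketch, not what sgmlnorm or
any DOM implementation actually does), the following Python fragment replaces
'<' and '&' with numeric character references and then lets the standard
"xmlcharrefreplace" codec error handler deal with anything the target
encoding cannot represent.  The function name to_char_refs and the default
encoding are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def to_char_refs(text, encoding="ascii"):
    """Serialize text content using character references only (hypothetical helper)."""
    # Replace '&' first, so the ampersands introduced for '<' are not re-escaped.
    escaped = text.replace("&", "&#38;").replace("<", "&#60;")
    # Characters the target encoding cannot represent become &#NNN; references.
    return escaped.encode(encoding, "xmlcharrefreplace").decode(encoding)


original = "a < b & caf\u00e9"
serialized = to_char_refs(original)
print(serialized)

# Feeding the result back through a parser should recover the original text.
reparsed = ET.fromstring("<r>" + serialized + "</r>").text
print(reparsed == original)
```

No entity declarations of any kind are involved; the parser only has to
recognize character references.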

2.  Parsing Context.

Entity declarations occur in the DTD.  The entity text of an entity
declaration is a _parameter literal_, in which character references and
parameter entity references are recognized, but (general) entity
references are not.  However, this replacement text is not final.  When
interpolated into the text of a document, it is still parseable content,
only now in a different parsing context, where as a rule, character
references and general entity references are recognised but parameter
entity references are not.  (Both of these considerations independently
apply, as a matter of fact, to the parsing of default value parameters in
an ATTLIST declaration.)
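
To make the two parsing contexts concrete, here is a small Python sketch of
my own, using the standard library's xml.etree.ElementTree (which sits on top
of expat).  The entity name "e" and the element names "r" and "b" are invented
for the example.  It assumes the parser expands internal general entities
declared in the internal subset, which expat does by default as far as I
know.

```python
import xml.etree.ElementTree as ET

# Case 1: the character reference &#60; is resolved when the declaration is
# read, so the replacement text is "<b/>".  When &e; is interpolated into
# content, that text is parsed again, and "<b/>" is recognized as markup.
doc_markup = '<!DOCTYPE r [<!ENTITY e "&#60;b/>">]><r>&e;</r>'
root_markup = ET.fromstring(doc_markup)
child_tags = [child.tag for child in root_markup]

# Case 2: here only &#38; is resolved at declaration time, leaving the
# replacement text "&#60;b/>".  In content, &#60; is then resolved as a
# character reference, which yields data rather than a start-tag delimiter.
doc_data = '<!DOCTYPE r [<!ENTITY e "&#38;#60;b/>">]><r>&e;</r>'
root_data = ET.fromstring(doc_data)

print(child_tags)             # element children produced by case 1
print(repr(root_data.text))   # character data produced by case 2
```

I would expect the first document to yield a child element "b" and the second
to yield the literal text "<b/>" with no child elements, which is the
difference between the declaration-time and content-time parsing contexts.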

3.  Best Practice.

When generating output that can be expected to be run through another
SGML/XML parser, there are three options for text content (i.e. #PCDATA)
and two for attribute value literals, to handle markup-sensitive characters:

  1.  CDATA marked sections
  2.  Character references
  3.  General entity references

CDATA marked sections obviously do least damage to the original content
(and would be my personal preference when they can be used).  In SGML
systems, (2) has been preferred to (3) for the obvious reason that (until
the WebSGML TC) they required no ancillary declarations of entities.
Preferred but not optimal, however, because in SGML systems character
references have a dependency on the document character set.  But this is
not an issue in XML.

Indeed, precisely because the document character set is fixed for all XML
applications, character references should be the generally preferred
alternative.  They are recognized in all relevant parsing contexts
(whereas entity references may not be).  But I would also give serious
consideration to CDATA marked sections for text content of elements,
especially where issues of "round tripping" might be relevant.
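
For comparison, the three options for text content can be sketched in Python
as follows.  This is only an illustration under my own assumptions: the
helper names as_cdata, as_char_refs and as_entity_refs are invented, and the
sample string is chosen to contain '<', '&' and the sequence "]]>", which a
CDATA marked section cannot contain and so has to be split across two
sections.

```python
import xml.etree.ElementTree as ET


def as_cdata(text):
    # Option 1: CDATA marked section.  A literal "]]>" would end the section
    # early, so close it after "]]" and reopen a new one before ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def as_char_refs(text):
    # Option 2: character references; no declarations needed.
    # '&' is replaced first so the new ampersands are not re-escaped.
    return text.replace("&", "&#38;").replace("<", "&#60;").replace(">", "&#62;")


def as_entity_refs(text):
    # Option 3: general entity references (here the predefined ones).
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


sample = "if (a < b && c[d[0]]> 1)"

for encode in (as_cdata, as_char_refs, as_entity_refs):
    serialized = "<r>" + encode(sample) + "</r>"
    # Each form should hand the same text back after a parse.
    assert ET.fromstring(serialized).text == sample
    print(serialized)
```

Only the last two helpers carry over to attribute value literals (where the
literal delimiter would need the same treatment), since CDATA marked sections
are not recognized there.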

Using entity references, even predefined ones, is IMHO fundamentally
broken practice, based on nothing more than grandfathering uninformed
expectations of HTML tag-soupers.  [Tangential note: entity references
like &lt; and &amp; were never the preferred means in SGML-aware
environments; usual practice in text content was to defeat the "trailing
context" requirement of delimiter recognition with a "null comment"
construct: <<!> and &<!>; the problem rarely arose in attribute value
literals, because stuffing them with arbitrary text, especially of a form
more suitable for text content, was considered poor DTD design and bad
practice.]