Re: [Sax-devel] Character Entities vs. General Entities
From: Arjun R. <ar...@ny...> - 2002-05-29 21:18:37
Regarding the issues raised by Jeff Rafter:

Since the SGML spec (ISO 8879 + TCs) isn't normative for XML, it can't be used to settle problems with the latter. However, since XML is derived from SGML, and valid XML documents are required to be valid SGML documents, the lack of doubt regarding "intellectual heritage" can still be invoked to understand "intent". So, what follows is an SGML view.

There are a number of issues here.

1. Incorrect Terminology.

"Character entity" is a pernicious neologism. There is no such thing. There are character references, and there are entity references. They are conceptually and operationally distinct categories, despite the congruence of the leading and trailing characters in their respective delimiters.

Note that the DOM3 draft cited is confused on this issue of usage when it says:

  "Within the character data of a document (outside of markup), any characters that cannot be represented directly are replaced with character references. Occurrences of '<' and '&' are replaced by the predefined entities &lt; and &amp;."

suggesting, inter alia, that "predefined entities" and "character references" are the same thing. They are not. The meaning of "replaced by character references" is that occurrences of '<' and '&' are replaced by *character references* &#60; and &#38; respectively. This, in fact, is what SGML tools such as "sgmlnorm" (of the SP/OpenSP package) do.

2. Parsing Context.

Entity declarations occur in the DTD. The entity text of an entity declaration is a _parameter literal_, in which character references and parameter entity references are recognized, but (general) entity references are not.

However, this replacement text is not final. When interpolated into the text of a document, it is still parseable content, only now in a different parsing context, where as a rule, character references and general entity references are recognized but parameter entity references are not.
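
To make the two parsing contexts in (2) concrete, here is a minimal sketch in Python. It assumes the standard library's ElementTree parser (built on expat), which expands internal entities declared in the internal subset. The entity names "who" and "greeting" are invented for illustration and come from nowhere in this thread.

```python
import xml.etree.ElementTree as ET

# In the declaration of "greeting", the character references &#60; and &#62;
# are resolved when the entity literal is read, while the general entity
# reference &who; is left alone. The replacement text is therefore:
#     hello &who; <b>!</b>
DOC = """<!DOCTYPE doc [
  <!ENTITY who "world">
  <!ENTITY greeting "hello &who; &#60;b&#62;!&#60;/b&#62;">
]>
<doc>&greeting;</doc>"""

# When &greeting; is referenced in content, the replacement text is parsed
# again: &who; is now recognized, and the '<' produced by the character
# reference starts real markup.
root = ET.fromstring(DOC)

print(repr(root.text))                # character data before the <b> element
print([child.tag for child in root])  # elements created from the entity text
print(repr(root[0].text))             # content of the <b> element
```

The point of the sketch is only that the same entity text is read twice, under two different sets of recognition rules.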
(Both of these considerations independently apply, as a matter of fact, to the parsing of default value parameters in an ATTLIST declaration.)

3. Best Practice.

When generating output that can be expected to be run through another SGML/XML parser, there are three options for text content (i.e. #PCDATA) and two for attribute value literals, to handle markup-sensitive characters:

  1. CDATA marked sections
  2. Character references
  3. General entity references

CDATA marked sections obviously do least damage to the original content (and would be my personal preference when they can be used).

In SGML systems, (2) has been preferred to (3) for the obvious reason that (until the WebSGML TC) they required no ancillary declarations of entities. Preferred but not optimal, however, because in SGML systems character references have a dependency on the document character set.

But this is not an issue in XML. Indeed, precisely because the document character set is fixed for all XML applications, character references should be the generally preferred alternative. They are recognized in all relevant parsing contexts (whereas entity references may not be). But I would also give serious consideration to CDATA marked sections for text content of elements, especially where issues of "round tripping" might be relevant.

Using entity references, even predefined ones, is IMHO fundamentally broken practice, based on nothing more than grandfathering uninformed expectations of HTML tag-soupers.

[Tangential note: entity references like &lt; and &amp; were never the preferred means in SGML-aware environments; usual practice in text content was to defeat the "trailing context" requirement of delimiter recognition with a "null comment" construct: <<!> and &<!>; the problem rarely arose in attribute value literals, because stuffing them with arbitrary text, especially of a form more suitable for text content, was considered poor DTD design and bad practice.]
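
As a rough illustration of the three options in (3), here is a sketch in Python using only the standard library. The helper names (as_cdata, as_char_refs, as_entity_refs, round_trip) are hypothetical, and the round-trip check assumes ElementTree's expat-based parser. The character-reference helper also escapes '>' and '"' so that the same output is usable in content containing "]]>" and in double-quoted attribute value literals.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def as_cdata(text):
    # Option 1: CDATA marked section (text content only). The content may
    # not contain "]]>", so that sequence is split across two sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Option 2: numeric character references; no declarations are needed.
CHAR_REFS = {"&": "&#38;", "<": "&#60;", ">": "&#62;", '"': "&#34;"}

def as_char_refs(text):
    return "".join(CHAR_REFS.get(ch, ch) for ch in text)

def as_entity_refs(text):
    # Option 3: predefined general entity references (&amp; &lt; &gt;).
    return escape(text)

def round_trip(encoded):
    # Parse the encoded text as element content and return what comes back.
    return ET.fromstring("<doc>" + encoded + "</doc>").text

sample = 'if (a < b && c[d[0]]> 1) { say("hi"); }'

for encode in (as_cdata, as_char_refs, as_entity_refs):
    print(encode.__name__, round_trip(encode(sample)) == sample)

# Character references also work unchanged inside an attribute value literal.
attr_doc = '<doc a="' + as_char_refs(sample) + '"/>'
print(ET.fromstring(attr_doc).get("a") == sample)
```

All three encodings parse back to the same text here; they differ in how much of the original content stays readable in the serialized form and in which parsing contexts they are available.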