Agreed. The portions for those who want to work with non-Unicode character sets should simply be deleted, and then any remaining bits about use of non-Unicode characters folded into the discussion thereof in Chapter 5 (WD).
The pointer to this section from vi.2.4 "Entry of Characters" (D4-44) should point to WD instead.
I'm not even sure the bulk of that is correct. Processors don't automatically convert non-Unicode characters to Unicode. They may interpret them as though they were Unicode, but that's not the same thing at all.
I'm not even sure the bulk of that is correct. Processors don't automatically convert non-Unicode characters to Unicode.
is that true? I thought XML processors were obliged to work internally in Unicode, and have to convert non-Unicode character encodings to Unicode (if they support them).
so a character coming in on encoding XYZ has no idea internally where it came from after the initial parse
--
Sebastian Rahtz
Chief Data Architect
University of Oxford IT Services
13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431
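The point above can be illustrated with a sketch (not part of the thread) using Python's standard-library XML parser: the same document stored in two different encodings parses to identical Unicode text, and the original encoding is no longer recoverable from the parsed tree.

```python
# Sketch, assuming Python's stdlib parser stands in for "an XML processor":
# after the initial parse, text is plain Unicode regardless of input encoding.
import xml.etree.ElementTree as ET

# The same document stored in two different encodings.
latin1 = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode("iso-8859-1")
utf8 = '<?xml version="1.0" encoding="UTF-8"?><p>caf\u00e9</p>'.encode("utf-8")

# The byte streams differ...
assert latin1 != utf8

# ...but once parsed, both yield the identical Unicode string; nothing in
# the parsed tree records which encoding the characters came in on.
assert ET.fromstring(latin1).text == ET.fromstring(utf8).text == "caf\u00e9"
```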
As far as I recall, it depends on which XML processor we are talking about. Entity references are visible as such to an XPath processor, but not to an XQuery processor, where they are resolved into characters. The string-length() function is what you use to test this. I don't know about XSLT processors (I don't use them much).
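The string-length test described above can be sketched (outside the thread's XPath/XQuery context) with Python's standard-library parser, which resolves character references at parse time:

```python
# Sketch, not from the thread: checking via string length that a numeric
# character reference is resolved to a single character after parsing.
import xml.etree.ElementTree as ET

doc = b"<p>&#233;</p>"  # numeric character reference for U+00E9
text = ET.fromstring(doc).text

# The reference occupies six characters of markup but parses to one character.
assert text == "\u00e9"
assert len(text) == 1
```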
I'm not sure what part y'all are talking about, but Hugh is correct. An XML document definitionally consists of Unicode characters. They may be stored using any of a variety of encodings (e.g., UTF-8, UTF-16, ISO-10646-UCS-4, ISO-8859-1). But
It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains octet sequences that are not legal in that encoding.
— XML 1.0 2nd edition, 4.3.3
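As a sketch of the quoted rule (using Python's standard-library parser, which is not mentioned in the thread): an entity declared to be in UTF-8 but containing an octet sequence illegal in UTF-8 triggers a fatal parse error.

```python
# Sketch: a lone 0xE9 octet is not a legal UTF-8 sequence, so a document
# declared as UTF-8 containing it must be rejected (XML 1.0, 4.3.3).
import xml.etree.ElementTree as ET

bad = b'<?xml version="1.0" encoding="UTF-8"?><p>\xe9</p>'

try:
    ET.fromstring(bad)
    raised = False
except ET.ParseError:
    raised = True  # the processor reports a fatal error, as required

assert raised
```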
I was clearly wrong in my last posting (https://sourceforge.net/p/tei/bugs/718/#029b), sorry.
Assigning to Syd to recommend improvements on this section.