dated info on XML processors

TEI produces the TEI Guidelines and associated software

Brought to you by: bleekere, ebeshero, hcayless, heb, and 7 others

#718 dated info on XML processors

Milestone: AMBER

Status: open

Owner: Syd Bauman

Labels: None

Priority: 5(default)

Updated: 2015-02-09

Created: 2015-01-21

Creator: Jens Østergaard Petersen

Private: No

CH-LanguagesCharacterSets.xml
/div/div[2]/div[6]/div[1]/head[1]

<head>Non-Unicode Character Sets and XML Processors</head>

Comment: This section is so dated that it is bound to confuse more than enlighten.

Discussion

Syd Bauman - 2015-01-30

Agreed. The portions for those who want to work with non-Unicode character sets should simply be deleted, and then any remaining bits about use of non-Unicode characters folded into the discussion thereof in Chapter 5 (WD).
The pointer to this section from vi.2.4 "Entry of Characters" (D4-44) should point to WD instead.

Hugh A. Cayless - 2015-01-30

I'm not even sure the bulk of that is correct. Processors don't automatically convert non-Unicode characters to Unicode. They may interpret them as though they were Unicode, but that's not the same thing at all.

Sebastian Rahtz - 2015-01-30

On 30 Jan 2015, at 17:56, Hugh A. Cayless hcayless@users.sf.net wrote:

"I'm not even sure the bulk of that is correct. Processors don't automatically convert non-Unicode characters to Unicode."

Is that true? I thought XML processors were obliged to work internally in Unicode, and to convert non-Unicode character encodings to Unicode (if they support them).

So a character that comes in under encoding XYZ carries no record, after the initial parse, of where it came from.

--
Sebastian Rahtz
Chief Data Architect
University of Oxford IT Services
13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431
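
The claim that the source encoding is no longer visible after the initial parse can be checked with a small sketch. It assumes Python's standard-library xml.etree.ElementTree parser, which nobody in this thread mentions; the sample documents and the helper name parse_text are made up for illustration, and other processors may support a different set of encodings.

```python
import xml.etree.ElementTree as ET


def parse_text(xml_bytes: bytes) -> str:
    """Parse an XML document given as raw bytes; return the root element's text."""
    return ET.fromstring(xml_bytes).text


# The same one-element document, serialized in two different encodings.
# "\u00e9" is LATIN SMALL LETTER E WITH ACUTE.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><p>caf\u00e9</p>'.encode("utf-8")

# The stored byte sequences differ (one byte vs. two bytes for the last letter).
bytes_differ = latin1_doc != utf8_doc

# After parsing, both documents yield the same string of Unicode characters.
latin1_text = parse_text(latin1_doc)
utf8_text = parse_text(utf8_doc)

print(bytes_differ)              # True
print(latin1_text == utf8_text)  # True
```

In this library, at least, the parsed text is an ordinary Unicode string in both cases, with nothing left to indicate which encoding the bytes arrived in.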
  

Jens Østergaard Petersen - 2015-01-30

As far as I recall, it depends on which XML processor we are talking about. Entity references are visible as such to an XPath processor, but not to an XQuery processor where they are resolved as characters. String-length is what you use to test this. Don't know about XSLT processors (don't use them a lot).
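
The string-length test suggested here can be tried at the parser stage with a minimal sketch. It assumes Python's standard-library xml.etree.ElementTree rather than any XPath, XQuery, or XSLT engine, so it only shows what one parser hands on to later stages; the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET

# The same character written literally and as a numeric character reference.
literal = ET.fromstring("<p>\u00e9</p>").text
by_reference = ET.fromstring("<p>&#xE9;</p>").text

# A predefined entity reference, for comparison.
ampersand = ET.fromstring("<p>&amp;</p>").text

# If references survived parsing, the lengths would be 6 and 5 rather than 1.
print(len(literal), len(by_reference), len(ampersand))  # 1 1 1
print(literal == by_reference)                          # True
```

With this parser the references are already resolved to single characters by the time the text is available, so a length check cannot tell the two spellings apart.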

Syd Bauman - 2015-01-31

I'm not sure which part y'all are talking about, but Hugh is correct. An XML document by definition consists of Unicode characters. They may be stored using any of a variety of encodings (e.g., UTF-8, UTF-16, ISO-10646-UCS-4, ISO-8859-1). But:

"It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains octet sequences that are not legal in that encoding."

(XML 1.0 2nd edition, 4.3.3)
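
A minimal sketch of how that rule surfaces in practice, assuming Python's standard-library xml.etree.ElementTree (not a processor discussed in this ticket). The helper is_well_formed and both sample documents are hypothetical, and the exact error reported will differ between processors.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if the parser accepts the document, False if it reports an error."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


# Byte 0xE9 is a complete character (e with acute) in ISO-8859-1 ...
declared_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'

# ... but in UTF-8 it opens a multi-byte sequence, and the "<" that follows
# is not a valid continuation byte, so the octet sequence is illegal.
declared_utf8 = b'<?xml version="1.0" encoding="UTF-8"?><p>caf\xe9</p>'

print(is_well_formed(declared_latin1))  # True
print(is_well_formed(declared_utf8))    # False
```

The bytes after the XML declaration are identical in both documents; only the declared encoding decides whether they are legal.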

Jens Østergaard Petersen - 2015-02-05

I was clearly wrong in my last posting (https://sourceforge.net/p/tei/bugs/718/#029b), sorry.

Hugh A. Cayless - 2015-02-09

assigned_to: Syd Bauman

Hugh A. Cayless - 2015-02-09

Assigning to Syd to recommend improvements on this section.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link: