#718 dated info on XML processors

AMBER
open
None
5(default)
2015-02-09
2015-01-21
No

CH-LanguagesCharacterSets.xml
/div/div[2]/div[6]/div[1]/head[1]

<head><!--4.5.1 -->Non-Unicode Character Sets and XML Processors</head>

Comment: This section is so dated that it is bound to confuse more than enlighten.

Discussion

  • Syd Bauman

    Syd Bauman - 2015-01-30

    Agreed. The portions for those who want to work with non-Unicode character sets should simply be deleted, and then any remaining bits about use of non-Unicode characters folded into the discussion thereof in Chapter 5 (WD).
    The pointer to this section from vi.2.4 "Entry of Characters" (D4-44) should point to WD instead.

     
  • Hugh A. Cayless

    Hugh A. Cayless - 2015-01-30

    I'm not even sure the bulk of that is correct. Processors don't automatically convert non-Unicode characters to Unicode. They may interpret them as though they were Unicode, but that's not the same thing at all.

     
    • Sebastian Rahtz

      Sebastian Rahtz - 2015-01-30

      On 30 Jan 2015, at 17:56, Hugh A. Cayless hcayless@users.sf.net wrote:

      I'm not even sure the bulk of that is correct. Processors don't automatically convert non-Unicode characters to Unicode.

      Is that true? I thought XML processors were obliged to work internally in Unicode, and so have to convert non-Unicode character encodings to Unicode (if they support them).

      So a character that comes in in encoding XYZ retains no trace, after the initial parse, of where it came from.

      --
      Sebastian Rahtz
      Chief Data Architect
      University of Oxford IT Services
      13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431
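
The point about the initial parse can be checked with a minimal sketch. It assumes Python's standard-library `xml.etree.ElementTree` (an expat-based parser, chosen only for illustration; nobody in this thread mentions it), and the sample document and variable names are hypothetical. The same text is stored once as ISO-8859-1 and once as UTF-8, and the parsed results are compared.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one element containing a single non-ASCII character.
body = "<p>caf\u00e9</p>"

# Same characters, two different byte-level storage encodings.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body.encode("iso-8859-1")
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?>' + body.encode("utf-8")

# After parsing, both yield ordinary Unicode strings; the source
# encoding is no longer visible in the parsed text.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

print(latin1_doc == utf8_doc)    # the byte streams differ
print(latin1_text == utf8_text)  # the parsed characters do not
```

This only shows the behaviour of one parser on encodings it supports; it is not a statement about every XML processor.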

       
  • Jens Østergaard Petersen

    As far as I recall, it depends on which XML processor we are talking about. Entity references are visible as such to an XPath processor, but not to an XQuery processor, where they are resolved as characters. You can test this with string-length. I don't know about XSLT processors (I don't use them much).

     
  • Syd Bauman

    Syd Bauman - 2015-01-31

    I'm not sure what part y'all are talking about, but Hugh is correct. An XML document by definition consists of Unicode characters. They may be stored using any of a variety of encodings (e.g., UTF-8, UTF-16, ISO-10646-UCS-4, ISO-8859-1). But:

    It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains octet sequences that are not legal in that encoding.

    — XML 1.0 2nd edition, 4.3.3
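
The quoted rule can be observed with a minimal sketch, again assuming Python's standard-library `xml.etree.ElementTree` as the processor (an illustrative choice, not one made in this thread). The helper `parses` and both sample documents are hypothetical. The same bytes are accepted when declared as ISO-8859-1 and rejected when declared as UTF-8, because the lone octet 0xE9 is not a legal UTF-8 sequence.

```python
import xml.etree.ElementTree as ET


def parses(data: bytes) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# 0xE9 is "e with acute" in ISO-8859-1, so this declaration matches the bytes.
declared_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'

# Identical content bytes, but declared as UTF-8, where a lone 0xE9 is illegal.
declared_utf8 = b'<?xml version="1.0" encoding="UTF-8"?><p>caf\xe9</p>'

print(parses(declared_latin1))
print(parses(declared_utf8))
```

The parser does not guess at or silently reinterpret the bytes in the second case; it stops with an error, which is consistent with the "fatal error" wording in the quoted passage.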

     
  • Jens Østergaard Petersen

    I was clearly wrong in my last posting (https://sourceforge.net/p/tei/bugs/718/#029b), sorry.

     
  • Hugh A. Cayless

    Hugh A. Cayless - 2015-02-09
    • assigned_to: Syd Bauman
     
  • Hugh A. Cayless

    Hugh A. Cayless - 2015-02-09

    Assigning to Syd to recommend improvements on this section.

     