
Saxon cannot parse Æ, Ø, Å; error message: illegal HTML character decimal 152

saxon-help
jackbu
2013-01-05
  • jackbu

    jackbu - 2013-01-05

    With windows-1252, all Æ, Ø, Å characters come out garbled.
    With UTF-8 or iso-8859-1, the error message appears.
    !image1 (http://meltwater.vacau.com/s1.png)
    !image2 (http://meltwater.vacau.com/s2.png)

     
  • Michael Kay

    Michael Kay - 2013-01-05

    The message "illegal HTML character decimal 152" indicates that you are trying to write the Unicode character 152 using the HTML serializer. Unicode character 152 is a control character which is not allowed in HTML.

    The cause of this problem is usually that a source document contains characters encoded in Windows CP1252 but is mislabelled as iso-8859-1. Codepoints in the range 128-159 have a different meaning in CP1252 from their meaning in Unicode (and in iso-8859-1, where they are control characters), so correct labelling is important.
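To illustrate the point about the 128-159 range, here is a minimal Python sketch (Python is my choice for illustration; it is not part of the original setup). It decodes the single byte 152 under both labels, using only the standard codecs. The variable names are hypothetical.

```python
import unicodedata

# Byte 152 (0x98) decoded under two different encoding labels.
raw = bytes([152])

as_cp1252 = raw.decode("cp1252")      # printable: U+02DC SMALL TILDE
as_latin1 = raw.decode("iso-8859-1")  # U+0098, a C1 control character

print(unicodedata.name(as_cp1252))      # SMALL TILDE
print(ord(as_latin1))                   # 152
print(unicodedata.category(as_latin1))  # Cc (control)
```

The same byte is harmless when labelled CP1252, but becomes Unicode character 152 when labelled iso-8859-1, which is the character the HTML serializer refuses to write.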

    The incorrect display of accented characters is a slightly different but related problem. It occurs when you output characters in UTF-8 but the recipient (whatever software is displaying the characters, for example a browser) assumes they are iso-8859-1 (or perhaps CP1252).
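The display problem can be reproduced in a few lines. This is a hedged sketch in Python, assuming the three characters from the question are written as UTF-8 and then read back under the wrong label; the variable names are hypothetical.

```python
# The characters from the question, serialized as UTF-8.
text = "ÆØÅ"
utf8_bytes = text.encode("utf-8")  # two bytes per character

# What a recipient sees if it assumes the bytes are CP1252.
shown_as_cp1252 = utf8_bytes.decode("cp1252")
print(shown_as_cp1252)  # Ã†Ã˜Ã…

# What a recipient gets if it assumes iso-8859-1: the second byte of each
# pair lands in the 128-159 control range.
shown_as_latin1 = utf8_bytes.decode("iso-8859-1")
print([ord(ch) for ch in shown_as_latin1])  # [195, 134, 195, 152, 195, 133]
```

Note that the iso-8859-1 reading of Ø contains code point 152, so UTF-8 input mislabelled as iso-8859-1 is another possible route to the same error message.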

    To solve these problems it is important that all XML files have an XML declaration that correctly identifies the encoding, and also that the character encoding is correctly identified in HTTP headers and in the <meta> element of the HTML document.
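One way to spot a mislabelled file before it reaches the serializer is to decode it with the encoding its XML declaration names and look for characters in the 128-159 range. The sketch below is a rough heuristic in Python, not a Saxon feature; `declared_encoding` and `has_c1_controls` are hypothetical helper names, and the regular expression only handles ASCII-compatible encodings without a byte order mark.

```python
import re

def declared_encoding(xml_bytes: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, else the default."""
    match = re.match(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', xml_bytes)
    return match.group(1).decode("ascii") if match else default

def has_c1_controls(xml_bytes: bytes) -> bool:
    """True if decoding with the declared encoding yields code points 128-159."""
    text = xml_bytes.decode(declared_encoding(xml_bytes))
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)

# UTF-8 bytes wrongly labelled as iso-8859-1: suspicious.
mislabelled = '<?xml version="1.0" encoding="iso-8859-1"?><a>Ø</a>'.encode("utf-8")
# UTF-8 bytes correctly labelled: clean.
correct = '<?xml version="1.0" encoding="UTF-8"?><a>Ø</a>'.encode("utf-8")

print(has_c1_controls(mislabelled))  # True
print(has_c1_controls(correct))      # False
```

A `True` result does not prove the label is wrong, since these characters can legitimately appear, but in ordinary text documents it is a strong hint that the declaration and the actual bytes disagree.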

    Since this is not a Saxon-specific problem, if you want further help I would recommend the StackOverflow forums. You will need to provide very complete information about what you are doing, since character miscoding problems can occur almost anywhere in the system where two different software products talk to each other.

    Michael Kay
    Saxonica