Character encoding problem in saxon9

2008-04-08
2012-10-08
  • Hi all!

    I have a strange character encoding problem that occurs only with Saxon9; everything works fine with Saxon8.

    I have a DOM document which is generated in Java (i.e. not read from a file). I use a simple transformation to output it as HTML from a Java servlet. The code works fine and everything is correct when I use Saxon8, but if I swap the jar for the Saxon9 jars in my JBoss lib directory, all non-English letters are output as "?".
    For example:
    Saxon8 output äö
    ->
    Saxon9 output ??

    I don't change any code, I just update the jars. Does anyone have any idea why that is? Does Saxon9 read the encoding from somewhere else? What is the difference between the encoding instructions in 8 and 9? I can't get Saxon9 working any way I try. :-(

    Thanks for help!

    -Pasi

     
    • Michael Kay
      2008-04-08

      There's no obvious reason why a change from saxon8 to saxon9 should trigger this. Sometimes such things happen because your application was unknowingly dependent on some accident of the configuration, such as the order of JAR files in the classpath. I would suggest investigating it from first principles, rather than focusing on what has changed. (Although you could check whether reverting the change makes the problem go away again).

      First look at the actual HTML that is being generated. How are the special characters encoded? What does the <META> element say about the charset? Has the HTML changed since saxon8?

      Then try to establish the media type (MIME type) that is being used by the web server to serve the documents.

      Is the effect browser-dependent? If I recall correctly, Firefox is more inclined to trust what the HTTP message says about character encoding, whereas IE is more inclined to guess it from the actual HTML content (and indeed the file extension!).

      If the HTML is wrong, this is a Saxon problem and I can help you with it. If the HTML looks right but is being incorrectly displayed, then it's a configuration problem and I can't. So the first step is to distinguish these two cases.
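One way to answer "how are the special characters encoded?" is to look at raw bytes rather than at what a browser or editor displays. The following is a minimal sketch, not part of the original thread; the class name `EncodingProbe` and the helper `hex` are hypothetical. It shows the bytes the same character produces under three charsets, including the literal "?" that Java substitutes when a character cannot be represented in the target charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class EncodingProbe {
    /** Returns the bytes of text in the given charset as space-separated hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00e4"; // the letter a with diaeresis
        System.out.println(hex(aUmlaut, StandardCharsets.ISO_8859_1)); // E4
        System.out.println(hex(aUmlaut, StandardCharsets.UTF_8));      // C3 A4
        // US-ASCII cannot represent the character, so getBytes substitutes "?".
        System.out.println(hex(aUmlaut, StandardCharsets.US_ASCII));   // 3F
    }
}
```

A byte 3F in the served HTML where an umlaut was expected suggests the character was lost while encoding on the server, not while rendering in the browser.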

       
      • Thanks for the reply Michael!

        Reverting to Saxon8 makes the problem go away every time. All I do is shut down JBoss, swap the jars in the lib directory, and start JBoss again. I don't even recompile the code. I've done that many times and the effect is consistent: Saxon8 always works and Saxon9 never works.

        Closer examination of the HTML output revealed that my original message was a bit misleading. It seems that Saxon8 actually escapes the umlaut characters (ä/ö).

        Saxon8 output:
        href="/@application_context_root@/Service?SERVICE_ID=30"
        title="">t&auml;&auml;lt&auml;.</a></span><span></span></div><span></span></td>

        same line with Saxon9:
        href="/@application_context_root@/Service?SERVICE_ID=30"
        title="">t??lt?.</a></span><span></span></div><span></span></td>

        Meta element is:
        <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
        for both versions, and the effect occurs with both Firefox and IE.

        Maybe I've been looking in the wrong place... Is there a different default for character escaping between versions 8 and 9?

        Sorry for incomplete first message.

        Thanks again.

        -Pasi

         
        • Michael Kay
          2008-04-08

          You didn't actually say which version of Saxon you were moving from: there were a lot of releases between 8.0 and 8.9. I think it must have been quite an old one. The serialization specification changed at some stage to say that HTML character entities should only be used where the actual character is not present in the target encoding, and Saxon of course changed to match that.

          Are the characters actually encoded in ISO-8859-1, that is, does the encoding of the characters match what the <meta> element claims is the encoding? It looks to me as if they aren't, because if they were, then your copy/paste into a mail message would probably have worked. I suspect they are in UTF-8.
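The question of whether the bytes match the declared charset can be checked mechanically. The sketch below is an illustration only, and the names `CharsetGuess` and `looksLikeUtf8` are hypothetical. It relies on a heuristic: ISO-8859-1 text containing accented letters is almost never a valid UTF-8 byte sequence, so strict UTF-8 decoding separates the two cases fairly reliably. It is not a general charset detector.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class CharsetGuess {
    /** True if the bytes contain non-ASCII data and decode strictly as UTF-8. */
    static boolean looksLikeUtf8(byte[] bytes) {
        boolean hasNonAscii = false;
        for (byte b : bytes) {
            if ((b & 0x80) != 0) {
                hasNonAscii = true;
                break;
            }
        }
        if (!hasNonAscii) {
            return false; // pure ASCII reads the same in both charsets
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // not valid UTF-8, so probably a single-byte charset
        }
    }

    public static void main(String[] args) {
        String word = "t\u00e4\u00e4lt\u00e4";
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));   // true
        System.out.println(looksLikeUtf8(latin1)); // false
        // Reading UTF-8 bytes as ISO-8859-1 turns each umlaut into two characters.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1).length()); // 9
    }
}
```

If the served bytes pass this check while the <meta> element declares ISO-8859-1, the declaration and the actual encoding disagree.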

          What does your <xsl:output> declaration look like?

          Is the <meta> element being generated by Saxon, or is it written directly by your stylesheet?
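The output encoding that <xsl:output> declares can also be set from Java through the standard JAXP output properties. The sketch below is a hedged illustration, not the poster's code: the class `ExplicitEncoding` and the method `serialize` are hypothetical, it uses an identity transformer from whichever JAXP processor is on the classpath (not necessarily Saxon), and it uses the "xml" output method so the result does not depend on how a given processor handles HTML entities.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for illustration only.
class ExplicitEncoding {
    /** Serializes the XML text to bytes, asking the serializer for the given encoding. */
    static byte[] serialize(String xml, String encoding) {
        try {
            // Identity transformation: copies the source to the result unchanged.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // With a byte stream as the target, the serializer does the encoding itself.
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(bytes));
            return bytes.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p>t\u00e4\u00e4lt\u00e4</p>";
        for (String enc : new String[] {"ISO-8859-1", "UTF-8"}) {
            byte[] out = serialize(xml, enc);
            // Decode with the same charset that was requested, to show the round trip.
            System.out.println(enc + ": " + out.length + " bytes, "
                    + new String(out, Charset.forName(enc)));
        }
    }
}
```

Whatever encoding is requested here has to agree with the charset announced in the <meta> element and in the HTTP Content-Type header.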

           
          • You were right. There was a basic problem that had nothing to do with the version change, and I was able to work around it. Saxon8 simply escaped the characters, so the problem went unnoticed before.
            It seems that when I write the transformation result directly to the servlet output stream, the character encoding gets messed up.
            Code snippet:

            // this produces bad char encoding
            javax.xml.transform.Result result = new javax.xml.transform.stream.StreamResult(out); //out is a HttpOutputStream
            transformer.transform(source, result);

            // this works fine and produces nice UTF-8 encoding
            StringWriter strWriter = new StringWriter();
            javax.xml.transform.Result stringresult = new StreamResult(strWriter);
            transformer.transform(source, stringresult);
            out.write(strWriter.toString());

            Thank you for your help, Michael! I managed to fix the problem while trying to answer your questions. Really appreciated!

            -Pasi
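The workaround above moves the character-to-byte conversion out of the serializer and into a Writer. A possible variant, sketched here under assumptions, skips the intermediate string by wrapping the byte stream in an `OutputStreamWriter` with an explicit charset. The class `WriterResult` and the method `transformThroughWriter` are hypothetical, and a `ByteArrayOutputStream` stands in for the servlet output stream, whose real type the thread does not confirm.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for illustration only.
class WriterResult {
    /** Transforms the XML text through a Writer that encodes with the given charset. */
    static byte[] transformThroughWriter(String xml, Charset charset) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Keep the declared encoding in step with what the Writer will apply.
            t.setOutputProperty(OutputKeys.ENCODING, charset.name());
            ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the servlet stream
            Writer writer = new OutputStreamWriter(sink, charset);
            // With a Writer as the target, the Writer does the byte encoding.
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(writer));
            writer.flush(); // make sure buffered characters reach the byte sink
            return sink.toByteArray();
        } catch (TransformerException | IOException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] out = transformThroughWriter("<p>t\u00e4\u00e4lt\u00e4</p>", StandardCharsets.UTF_8);
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```

In a servlet, the same idea would mean setting the response character encoding before obtaining the writer, so that the HTTP header, the writer, and the serializer all agree on one charset.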