Character converts to junk by XSL Transform

2010-08-11
2012-10-08
  • Vinothkumar M R

    Vinothkumar M R - 2010-08-11

    Problem: while converting XHTML with an XSL transformation, a few characters
    are converted to junk. In the code below, my input contains Á and Â.
    After the XSL transformation, Â comes out correctly, whereas Á comes out as
    ??. I need to use the UTF-8 character set.

    Please help me resolve the issue. How can I get all the special characters
    to appear properly in the output?

    Java code (the XSL is pasted below it):

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.OutputStream;
    import java.io.StringReader;
    import java.nio.charset.Charset;

    import javax.xml.parsers.SAXParserFactory;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.transform.stream.StreamSource;

    import net.sf.saxon.s9api.Processor;
    import net.sf.saxon.s9api.SaxonApiException;
    import net.sf.saxon.s9api.Serializer;
    import net.sf.saxon.s9api.XdmNode;
    import net.sf.saxon.s9api.XsltCompiler;
    import net.sf.saxon.s9api.XsltExecutable;
    import net.sf.saxon.s9api.XsltTransformer;

    import org.xml.sax.EntityResolver;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class XHTMLConvertor {

        public static final String XSL = "C:/copy.xsl";

        public static void main(String[] args) {
            String input = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                    + "<html xmlns=\"http://www.w3.org/1999/xhtml\""
                    + " version=\"-//W3C//DTD XHTML 1.1//EN\"><body>"
                    + "special character 193 Á\n"
                    + "194 Â"
                    + "</body> </html>";
            try {
                XdmNode source1 = null;
                System.out.println("INPUT \n" + input);
                Charset csets = Charset.forName("UTF-8");
                Processor proc = new Processor(false);
                XsltCompiler comp = proc.newXsltCompiler();
                XsltExecutable exp = comp.compile(new StreamSource(new File(XSL)));
                net.sf.saxon.s9api.DocumentBuilder docBuilder = proc.newDocumentBuilder();
                source1 = docBuilder.build(getSAXSource(input));
                System.out.println("sax sRC \n" + source1.toString());
                OutputStream byteOutputStream = new ByteArrayOutputStream();
                Serializer out = new Serializer();
                out.setOutputProperty(Serializer.Property.METHOD, "xml");
                out.setOutputProperty(Serializer.Property.INDENT, "no");
                out.setOutputProperty(Serializer.Property.ENCODING, "UTF-8");
                out.setOutputStream(byteOutputStream);
                XsltTransformer trans = exp.load();
                trans.setInitialContextNode(source1);
                trans.setDestination(out);
                trans.transform();
                String response = byteOutputStream.toString();
                System.out.println("response>>> \n" + response);
                String utfStr = new String(byteOutputStream.toString().getBytes(), csets);
                System.out.println("utfStr>>> \n" + utfStr);
            } catch (SaxonApiException e) {
                e.printStackTrace();
            }
        }

        private static SAXSource getSAXSource(String input) {
            XMLReader xmlReader = null;
            try {
                xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
                // Resolve every external entity to empty content
                xmlReader.setEntityResolver(new EntityResolver() {
                    public InputSource resolveEntity(String publicId, String systemId) {
                        InputSource inputSource = new InputSource(new StringReader(""));
                        inputSource.setPublicId(publicId);
                        inputSource.setSystemId(systemId);
                        return inputSource;
                    }
                });
            } catch (Exception se) {
                se.printStackTrace();
            }
            StringReader reader = new StringReader(input);
            InputSource inputSource = new InputSource(reader);
            SAXSource source = new SAXSource(xmlReader, inputSource);
            return source;
        }

    }

    XSL (a stylesheet that just copies the input):

    <xsl:stylesheet version="1.0" xml:lang="UTF-8"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text" encoding="UTF-8"/>
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
    </xsl:stylesheet>

    Thanks
    vinoth

     
  • Vinothkumar M R

    Vinothkumar M R - 2010-08-11

    More info on the problem: I am using saxon9he.jar and running the code on
    Windows XP.

     
  • Michael Kay

    Michael Kay - 2010-08-11

    The characters are not being turned into junk by your XSLT transformation, but
    rather when you call the toString() method on your ByteArrayOutputStream. The
    spec for this method says "Converts the buffer's contents into a string
    decoding bytes using the platform's default character set." The platform's
    default character set is probably iso-8859-1 - the method has no idea that
    your ByteArrayOutputStream actually contains the characters encoded in UTF-8.
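
To make the effect concrete, here is a minimal sketch that uses only the JDK (no Saxon). The class and method names are hypothetical, and ISO-8859-1 is passed explicitly to stand in for a platform default that is not UTF-8; the actual default on a given machine may differ.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {

    // Fills a buffer with the UTF-8 bytes of the text, as a serializer
    // configured for UTF-8 output would.
    public static ByteArrayOutputStream utf8Buffer(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        buffer.write(bytes, 0, bytes.length);
        return buffer;
    }

    // Wrong: decodes UTF-8 bytes as ISO-8859-1 (assumed stand-in for the
    // platform default), so each byte becomes a separate character.
    public static String decodeAsLatin1(ByteArrayOutputStream buffer) {
        return new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    // Right: decodes the bytes with the charset they were written in.
    public static String decodeAsUtf8(ByteArrayOutputStream buffer) {
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C1 (Á) and U+00C2 (Â) each take two bytes in UTF-8.
        ByteArrayOutputStream buffer = utf8Buffer("\u00C1\u00C2");
        System.out.println(buffer.size());                    // 4
        System.out.println(decodeAsLatin1(buffer).length());  // 4
        System.out.println(decodeAsUtf8(buffer).length());    // 2
    }
}
```

The transformation output in the buffer is the same in both cases; only the decoding step differs.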

    Java lets you specify the encoding when you convert a ByteArray to a string.
    You could do that; or you could ask Saxon to write directly to a StringWriter,
    in which case encoding is not an issue.
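
A rough sketch of both options follows. So that it runs without Saxon on the classpath, it uses the JDK's built-in JAXP identity transformer as a stand-in for the Saxon transformation; the class and method names are hypothetical. With Saxon's s9api, the StringWriter route would presumably go through `Serializer.setOutputWriter(...)` in place of `setOutputStream(...)`, assuming the Saxon version in use provides it.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class Utf8OutputDemo {

    // Runs an identity transformation (stand-in for the Saxon stylesheet)
    // and serializes the result as UTF-8 without an XML declaration.
    private static void copy(String xml, StreamResult result) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.transform(new StreamSource(new StringReader(xml)), result);
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    // Option 1: serialize to bytes, then decode with an explicit charset
    // instead of relying on the platform default.
    public static String viaByteStream(String xml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        copy(xml, new StreamResult(bytes));
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    // Option 2: serialize straight to characters, so no byte decoding
    // step is needed at all.
    public static String viaStringWriter(String xml) {
        StringWriter writer = new StringWriter();
        copy(xml, new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) {
        String xml = "<p>193 \u00C1 194 \u00C2</p>";
        System.out.println(viaByteStream(xml).equals(viaStringWriter(xml)));
    }
}
```

Note that printing the resulting string to a console is a separate encoding step: `System.out` uses its own charset, so the characters can still look wrong on screen even when the string itself is correct.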

     
  • Vinothkumar M R

    Vinothkumar M R - 2010-08-12

    Thank you very much for the solution. It worked.

     