
#210 escapeXml ignored by DomSerializer

Milestone: v2.23
Status: closed-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2019-09-04
Created: 2018-08-29
Creator: Code Buddy
Private: No

I've just tried to update from version 2.21 to 2.22 and noticed the following.

It seems that escaping still happens, even though I'm setting escapeXml to false in the constructor of DomSerializer.

The following test passes in 2.21, but fails in 2.22.

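import static org.junit.Assert.assertEquals;

import java.io.StringWriter;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.junit.Test; // JUnit 4 is assumed here
import org.w3c.dom.Document;

public class HtmlCleanerUnitTest
{
    // Passed as the escapeXml argument of the DomSerializer constructor.
    private static final boolean ESCAPE_XML = false;

    // Declares the checked exception instead of catching it, so that a
    // parser configuration problem fails the test rather than being swallowed.
    @Test
    public void test() throws ParserConfigurationException
    {
        final String nonAsciiWord = "hemförsäkring";
        final String html = "<html>"
                + "<body>"
                + "<p>"
                + nonAsciiWord
                + "</p>"
                + "</body>"
                + "</html>";

        // The non-ASCII word is expected to come through unescaped.
        final String expectedOutput =
                "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n"
                + "<html>\n"
                + "    <head/>\n"
                + "    <body>\n"
                + "        <p>" + nonAsciiWord + "</p>\n"
                + "    </body>\n"
                + "</html>\n";

        final TagNode tagNode = new HtmlCleaner().clean(html);
        final CleanerProperties cleanerProperties = new CleanerProperties();

        final Document doc = new DomSerializer(cleanerProperties, ESCAPE_XML).createDOM(tagNode);
        final String actualOutput = documentToString(doc);
        System.out.println("doc: " + actualOutput);

        assertEquals(expectedOutput, actualOutput);
    }

    // Serializes the DOM as indented UTF-8 XML, including the XML declaration.
    public static String documentToString(final Document doc)
    {
        String ret = "";
        final TransformerFactory tf = TransformerFactory.newInstance();
        try
        {
            final Transformer transformer = tf.newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
            final StringWriter stringWriter = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(stringWriter));
            ret = stringWriter.getBuffer().toString();
        }
        catch (TransformerException e)
        {
            System.err.println("Failed to toString document " + e);
        }
        return ret;
    }
}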
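
To rule out the JDK transformer used in documentToString as the source of the escaping, the same kind of serialization can be run on a hand-built DOM with HtmlCleaner left out entirely. This is only a minimal sketch and not part of the original report: the class name TransformerEscapingCheck and the helper serializeParagraph are hypothetical, and it uses nothing but standard JDK APIs. The non-ASCII letters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, for illustration only.
class TransformerEscapingCheck
{
    // Builds a DOM containing <p>text</p> and serializes it as UTF-8 XML
    // without the XML declaration.
    static String serializeParagraph(final String text)
    {
        try
        {
            final Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            final Element p = doc.createElement("p");
            p.appendChild(doc.createTextNode(text));
            doc.appendChild(p);

            final Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

            final StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        }
        catch (Exception e)
        {
            // Wrap checked exceptions so callers need no throws clause.
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(final String[] args)
    {
        // Text node holding the raw characters (hemförsäkring).
        System.out.println(serializeParagraph("hemf\u00f6rs\u00e4kring"));

        // Text node that already holds entity text, as it would if the
        // escaping had happened before the DOM was built.
        System.out.println(serializeParagraph("hemf&ouml;rs&auml;kring"));
    }
}
```

If the first line prints the word unchanged, the transformer is not what escapes it. In the second case the ampersands in the text node are themselves escaped to &amp; on output, which is the kind of symptom to look for when the text handed to the DOM has already been escaped.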

Discussion

  • Numa Schmeder

    Numa Schmeder - 2019-02-06

    I have the same issue: all my documents now contain escaped character entities in place of characters such as é or à, because of escaping when using DomSerializer. The SimpleHtmlSerializer doesn't have this problem.

     
  • Scott Wilson

    Scott Wilson - 2019-08-23

    Looks like an issue introduced when applying a combination of rules. I've modified the way escaping is handled when called from DomSerializer, which fixes the problem above, but I'll keep testing with more combinations to make sure we don't cause issues elsewhere.

     
  • Scott Wilson

    Scott Wilson - 2019-09-04
    • status: open --> closed-fixed
    • Group: v 2.7 --> v2.23
     
