#210 escapeXml ignored by DomSerializer

Milestone: v2.23

Status: closed-fixed

Owner: nobody

Labels: None

Priority: 5

Updated: 2019-09-04

Created: 2018-08-29

Creator: Code Buddy

Private: No

I've just tried to update from version 2.21 to 2.22 and noticed the following.

It seems that escaping still happens even though I'm setting escapeXml to false in the DomSerializer constructor.

The following test passes in 2.21, but fails in 2.22.

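// JUnit 4 imports are assumed; the report does not say which JUnit version was used.
import static org.junit.Assert.assertEquals;

import java.io.StringWriter;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.junit.Test;
import org.w3c.dom.Document;

public class HtmlCleanerUnitTest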
{
private static final boolean ESCAPE_XML = false;

@Test
public void test()
{
    final String nonAsciiWord = "hemförsäkring";
    final String html = "<html>"
            + "<body>"
            + "<p>"
            + nonAsciiWord
            + "</p>"
            + "</body>"
            + "</html>";

    final String expectedOutput =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n"
            + "<html>\n"
            + "    <head/>\n"
            + "    <body>\n"
            + "        <p>" + nonAsciiWord + "</p>\n"
            + "    </body>\n"
            + "</html>\n";

    try
    {
        final TagNode tagNode = new HtmlCleaner().clean(html);
        final CleanerProperties cleanerProperties = new CleanerProperties();

        final Document doc = new DomSerializer(cleanerProperties, ESCAPE_XML).createDOM(tagNode);
        System.out.println("doc: " + documentToString(doc));

        assertEquals(expectedOutput, documentToString(doc));

    }
    catch (ParserConfigurationException | RuntimeException | StackOverflowError e)
    {
        System.err.println("Failed to parse html, reason: " + e);
    }
}

public static String documentToString(
    final Document doc)
{
    String ret = "";
    final TransformerFactory tf = TransformerFactory.newInstance();
    try
    {
        final Transformer transformer = tf.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
        final StringWriter stringWriter = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(stringWriter));
        ret = stringWriter.getBuffer().toString();
    }
    catch (TransformerException e)
    {
        System.err.println("Failed to toString document " + e);
    }
    return ret;
}

}
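
The expected string in the test above also pins down the XML declaration and the indentation produced by the transformer, so a mismatch does not point directly at escaping. A narrower check might look like the sketch below. `EntityCheck` and `unexpectedEntities` are hypothetical names, not part of HtmlCleaner; the helper uses only the JDK (Java 9+ for `Set.of`) and reports every entity reference other than the five predefined XML ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for tests; not part of HtmlCleaner. */
final class EntityCheck {
    // Matches named (&ouml;), decimal (&#246;) and hexadecimal (&#xF6;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // References that well-formed XML legitimately needs for markup characters.
    private static final Set<String> XML_PREDEFINED =
            Set.of("amp", "lt", "gt", "quot", "apos");

    /** Returns every entity reference in the text except the predefined XML ones. */
    static List<String> unexpectedEntities(String serialized) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(serialized);
        while (m.find()) {
            if (!XML_PREDEFINED.contains(m.group(1))) {
                found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints [&ouml;, &#228;]; the &amp; is ignored because XML requires it.
        System.out.println(unexpectedEntities("<p>hemf&ouml;rs&#228;kring &amp; mer</p>"));
    }
}
```

In the test above, an assertion that `EntityCheck.unexpectedEntities(documentToString(doc))` is empty would then fail only when escaping is applied, regardless of how the output is indented.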

Discussion

Numa Schmeder - 2019-02-06

I have the same issue: because of the escaping applied when using DomSerializer, all my documents now contain escaped character entities instead of literal characters such as é or à. The SimpleHtmlSerializer doesn't have this problem.

Scott Wilson - 2019-08-23

Looks like an issue introduced when applying a combination of rules. I've modified the way escaping is handled when called from DomSerializer, which fixes the problem above, but I'll keep testing with more combinations to make sure we don't cause issues elsewhere.
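
As general background on why escaping matters on the DOM path, here is a JDK-only sketch. It does not use HtmlCleaner and does not claim to reproduce what version 2.22 did internally; `DomTextEscapingDemo` and `serializeParagraph` are hypothetical names. It shows that a DOM text node holds raw characters and that the transformer escapes markup characters itself when writing, so text that was already escaped before entering the DOM comes out escaped a second time.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical demo class; uses only the JDK's DOM and transformer APIs. */
final class DomTextEscapingDemo {

    /** Builds a one-element document <p>text</p> and serializes it as UTF-8 XML. */
    static String serializeParagraph(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element p = doc.createElement("p");
            p.appendChild(doc.createTextNode(text)); // stored as raw characters
            doc.appendChild(p);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Failed to serialize document", e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of the compiler's encoding.
        System.out.println(serializeParagraph("hemf\u00f6rs\u00e4kring"));
        // Pre-escaped input: each '&' is escaped again by the transformer.
        System.out.println(serializeParagraph("hemf&ouml;rs&auml;kring"));
    }
}
```

With UTF-8 output the first call keeps the literal ö and ä, while the second yields `hemf&amp;ouml;rs&amp;auml;kring` inside the element.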

Scott Wilson - 2019-09-04

status: open --> closed-fixed

Group: v 2.7 --> v2.23