Menu

#221 Wrong parsing of html entities in case of using recognizeUnicodeChars

v2.25
closed-fixed
nobody
None
5
2021-09-24
2020-05-15
Simon Urli
No

We noticed a wrong behaviour when using the following flags:

cleanerProperties.setDeserializeEntities(true);
cleanerProperties.setRecognizeUnicodeChars(true);
cleanerProperties.setTranslateSpecialEntities(false);

basically if I set the flag for recognizing unicode characters to false, the html entities in tag content are properly parsed and deserialized, but if I set the flag to true then the same entities are not deserialized anymore in tag content. They are still deserialized in the attribute though.
Here's a unit test reproducing the bug:

@Test
    public void parseWithUnicodeChars() throws Exception
    {
        CleanerProperties cleanerProperties = new CleanerProperties();

        cleanerProperties.setDeserializeEntities(true);
        cleanerProperties.setRecognizeUnicodeChars(true);
        cleanerProperties.setTranslateSpecialEntities(false);

        HtmlCleaner cleaner = new HtmlCleaner(cleanerProperties);
        TagNode tagNode = cleaner.clean("<?xml version = \"1.0\"?>"

            + "<div foo=\"&#169;\">&#169;</div>"
            + "<div foo=\"baz&gt;buz\">baz&gt;buz</div>"
            + "<div foo=\"baz&buz\">baz&buz</div>");
        List<? extends TagNode> divList = tagNode.getElementListByName("div", true);
        assertEquals(3, divList.size());

        assertEquals("©", divList.get(0).getText().toString());
        assertEquals("©", divList.get(0).getAttributeByName("foo"));

        assertEquals("baz>buz", divList.get(1).getText().toString());
        assertEquals("baz>buz", divList.get(1).getAttributeByName("foo"));

        assertEquals("baz&buz", divList.get(2).getText().toString());
        assertEquals("baz&buz", divList.get(2).getAttributeByName("foo"));

        DomSerializer domSerializer = new DomSerializer(cleanerProperties, false);
        Document document = domSerializer.createDOM(tagNode);

        NodeList nodeList = document.getElementsByTagName("div");
        assertEquals(3, nodeList.getLength());
        assertEquals("©", nodeList.item(0).getTextContent());
        assertEquals("©", nodeList.item(0).getAttributes().getNamedItem("foo").getTextContent());

        assertEquals("baz>buz", nodeList.item(1).getAttributes().getNamedItem("foo").getTextContent());

        assertEquals("baz&buz", nodeList.item(2).getAttributes().getNamedItem("foo").getTextContent());

        // BUG: This is failing with baz&gt;buz
        assertEquals("baz>buz", nodeList.item(1).getTextContent());
        // BUG: This is failing with baz&amp;buz
        assertEquals("baz&buz", nodeList.item(2).getTextContent());
    }

Discussion

  • Michael Ryan

    Michael Ryan - 2021-04-23

    I just ran into this myself. I believe the behavior change occurred in https://sourceforge.net/p/htmlcleaner/code/521/ (released in 2.22). That change made it so that most HTML entities (e.g., "&atilde;") are now decoded (which is good for me!), but some characters that previously were not encoded are now being encoded, such as & and >.

    Even though escapeXml=false is passed into the DomSerializer constructor, Utils.escapeXml() still ends up getting called because recognizeUnicodeChars=true. The name of the method is a bit misleading as it handles both decoding and encoding entities. There doesn't seem to be a combination of configurations that will make it just do the decoding steps. Maybe need to have DomSerializer pass in the escapeXml boolean value into the escapeXml method and skip any encoding steps if false?

     

    Last edit: Michael Ryan 2021-04-23
  • Scott Wilson

    Scott Wilson - 2021-05-03

    I'll look into this one - its difficult finding a balance with the parsing rules that fits the various use cases!

     
  • Scott Wilson

    Scott Wilson - 2021-05-03
    • status: open --> open-accepted
    • Group: v2.24 --> v2.25
     
  • Scott Wilson

    Scott Wilson - 2021-05-03

    Setting to fix in 2.25

     
  • Scott Wilson

    Scott Wilson - 2021-05-03

    I've committed a change to trunk which solves some of these cases by separating out the calls to Utils.excapeXml() and Utils.deserializeEntities() and the conditions they are called in by DomSerializer as I think some of the logic here was getting tangled up. That fixes Simon's original test case but @Michael if you have another test case I can use that would be great. (Alternatively, you can run the current source code and see if it addresses the issue.)

     
  • Simon Urli

    Simon Urli - 2021-09-24

    Hi @scottwilson we are hitting this bug again, would that be possible for you to release a version with the fix? We tested a snapshot and it seems to indeed fix our test.

     
  • Scott Wilson

    Scott Wilson - 2021-09-24

    No problem Simon, I'll get a release done today.

     
  • Scott Wilson

    Scott Wilson - 2021-09-24
    • status: open-accepted --> closed-fixed
     

Log in to post a comment.

MongoDB Logo MongoDB