HtmlCleaner / Bugs / #203 HC doesn't properly escape < and

Vincent Massol - 2018-06-19

For reference this is out issue on the XWiki side: https://jira.xwiki.org/browse/XCOMMONS-1164

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Hi Vincent,

I've tried to replicate this in 2.22 but I don't get the same output. I see the same output you expect to find in your test, that the entities are preserved in the output attribute.

Scott

@Test
public void escapeMarkupInAttributes() throws IOException {
    HtmlCleaner theCleaner = new HtmlCleaner();
    theCleaner.getProperties().setAllowHtmlInsideAttributes(false);
    theCleaner.getProperties().setOmitXmlDeclaration(true);
    String initial = "<div foo=\"aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee\">content</div>";
    String expected = "<html><head></head><body><div foo=\"aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee\">content</div></body></html>";
    String cleaned = new SimpleHtmlSerializer(theCleaner.getProperties()).getAsString(theCleaner.clean(initial));
    assertEquals(cleaned, expected);
    theCleaner.getProperties().setAllowHtmlInsideAttributes(true);
    cleaned = new SimpleHtmlSerializer(theCleaner.getProperties()).getAsString(theCleaner.clean(initial));
    assertEquals(cleaned, expected);
}

Vincent Massol - 2018-06-19

Hi Scott! You're as fast as usal :)

I would have thought about some XWiki-related thing but for the fact that keeping the same code and just changing the HC version in the POM changes the result so this clearly suggests a change in behavior.

Maybe HC between 2.16 and 2.22 changed some default config values?

Note that "setAllowHtmlInsideAttributes(false)" is the default and we're not setting it (I've set it to test and same result). FTR this is the config we have: https://github.com/xwiki/xwiki-commons/blob/d1657b0da41bc6fdca0cefd6e03b2a3705b17cec/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/DefaultHTMLCleaner.java#L211

Also we're using our own DOM Serializer () so maybe something has changed from what is returned by the clean() method like not escaping attribute values and SimpleHtmlSerializer has also been modified to do the escaping?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

I think it relates to this discussion and the associated change (commit 521):

https://sourceforge.net/p/htmlcleaner/discussion/637246/thread/fc7f2d65/?limit=25#b65b

If you look at the last change to DomSerializer, this should probably be replicated in XWikiDomSerializer:

-                   if (escapeXml && !specialCase) {
-                       content = Utils.escapeXml(content, props, true);
-                   }
+                   if (shouldEscapeOrTranslateEntities() && !specialCase) {
+                       content = Utils.escapeXml(content, props, true);
+                   }



+   private boolean shouldEscapeOrTranslateEntities() {
+       return escapeXml || props.isRecognizeUnicodeChars() || props.isTranslateSpecialEntities();
+   }
+
}

Vincent Massol - 2018-06-19

Thanks a lot Scott for this code/diff :)

I've tried it but it doesn't work though.

I'll need to look at the commit 521 a bit later. Thx

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2018-06-21

Ok I've done more tests:

With HC 2.16 if I check the content of the cleaned node, I can see that the attribute has proper escaped values, i.e. aaa"bbb&ccc>ddd<eee

With HC 2.22, the value is not escaped aaa"bbb&ccc>ddd<eee

If I replace XWikiDOMSerializer by DOMSerializer with HC 2.22, I still get the problem and the test is not passing so the issue is not coming from XWikiDOMSerializer but from the cleaner it seems.

Any idea?

Thanks
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2018-06-21

Yes, HC will deserialize entities in attribute text when creating tagnodes, and the serializer will then apply the appropriate rules for entities depending on how its configured. (This is a change from earlier versions of HC which passed entities in attributes through untouched, and then applied escaping - there was a use case from a user for this change).

So when I inspect the foo attribute for the DIV Tagnode,, I see as you do:

aaa"bbb&ccc>ddd<eee

But when I inspect the attribute value of the DIV element in the Document created by DomSerializer I see:

aaa"bbb&ccc>ddd<eee

However, if I initialize the DomSerializer with escapeXml=false then I see:

aaa"bbb&ccc>ddd<eee

This is because of this line in DomSerializer:

if (escapeXml) { attrValue = Utils.escapeXml(attrValue, props, true); }

So I suspect thats the issue? Is there a reason that escapeXml is turned off in XWiki?

So a fix could be to include this line in XWikiDomSerializer:

attrValue = Utils.escapeXml(attrValue, props, true);

... or to initalise the DomSerializer without setting escapeXml=false.

Does that help?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Yes it helps,, thanks. I hadn't realized we were initializing the XWikiDOMSerialize with escapeXml=false (had forgotten).

I tried putting it to true and indeed it fixes this issue, but breaks 3 other tests (which is why it was off before ;)):

Problem 1

    @Test
    public void verifyEntitiesAreNotBroken()
    {
        assertHTML("<p>&Eacute;</p>", "&Eacute;");
    }

Generates <html><head></head><body>É</body></html> instead of <html><head></head><body>É</body></html>.

Problem 2

    @Test
    public void specialCharacters()
    {
        // TODO: We still have a problem I think in that if there are characters such as "&" or quote in the source
        // text they are not escaped. This is because we have use "false" in DefaultHTMLCleaner here:
        // Document document = new JDomSerializer(this.cleanerProperties, false).createJDom(cleanedNode);
        // See the problem described here: http://sourceforge.net/forum/forum.php?thread_id=2243880&forum_id=637246
        assertHTML("<p>&quot;&amp;**notbold**&lt;notag&gt;&nbsp;</p>",
            "<p>&quot;&amp;**notbold**&lt;notag&gt;&nbsp;</p>");
        assertHTML("<p>\"&amp;</p>", "<p>\"&</p>");
        assertHTML("<p><img src=\"http://host.com/a.gif?a=foo&amp;b=bar\" /></p>",
            "<img src=\"http://host.com/a.gif?a=foo&b=bar\" />");
        assertHTML("<p>&#xA;</p>", "<p>&#xA;</p>");

        // Verify that double quotes are escaped in attribute values
        assertHTML("<p value=\"script:&quot;&quot;\"></p>", "<p value='script:\"\"'");
    }

Generates <html><head></head><body>"&**notbold**<notag> </body></html> instead of <html><head></head><body>"&**notbold**<notag> </body></html>

Problem 3

    @Test
    public void restrictedHtml()
    {
        HTMLCleanerConfiguration configuration = this.cleaner.getDefaultConfiguration();
        Map<String, String> parameters = new HashMap<String, String>();
        parameters.putAll(configuration.getParameters());
        parameters.put("restricted", "true");
        configuration.setParameters(parameters);

        String result = HTMLUtils.toString(this.cleaner.clean(
            new StringReader("<script>alert(\"foo\")</script>"), configuration));
        assertEquals(HEADER_FULL + "<pre>alert(\"foo\")</pre>" + FOOTER, result);

        result = HTMLUtils.toString(this.cleaner.clean(
            new StringReader("<style>p {color:white;}</style>"), configuration));
        assertEquals(HEADER_FULL + "<pre>p {color:white;}</pre>" + FOOTER, result);
    }

Generates <html><head></head><body><pre>alert("foo")</pre></body></html> instead of <html><head></head><body><pre>alert("foo")</pre></body></html>.

Do these tests seem valid to you?

Thx

Vincent Massol - 2018-06-21

And I believe you're confirming that the cleaner has changed behavior since 2.16 and that it's normal and wanted that it doesn't escape attribute values any more, correct?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2018-06-21

Note that if I use DOMSerializer instead of XWikiDOMSerializer I also get these 3 problems (and 2 more failing tests related to CDATA).

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2018-06-21

Yes that's correct - its a change since 2.21.

Some of the other tests you have there relate to preserving entities rather than replacing them with unicode; I think thats something that should be set in cleaner properties rather than by turning off XML escaping.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2018-06-21

... for example, if you want to retain the É entity rather than replace it with the unicode character É, you'd set translateSpecialEntities=false.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Yes thanks for that hint, that fixes Problem 1 (and Problem 2 for some reason).

I now have Problem 3 + a new one:

Problem 4

    @Test
    public void specialCharacters()
    {
        // TODO: We still have a problem I think in that if there are characters such as "&" or quote in the source
        // text they are not escaped. This is because we have use "false" in DefaultHTMLCleaner here:
        // Document document = new JDomSerializer(this.cleanerProperties, false).createJDom(cleanedNode);
        // See the problem described here: http://sourceforge.net/forum/forum.php?thread_id=2243880&forum_id=637246
        assertHTML("<p>&quot;&amp;**notbold**&lt;notag&gt;&nbsp;</p>",
            "<p>&quot;&amp;**notbold**&lt;notag&gt;&nbsp;</p>");
        assertHTML("<p>\"&amp;</p>", "<p>\"&</p>");
        assertHTML("<p><img src=\"http://host.com/a.gif?a=foo&amp;b=bar\" /></p>",
            "<img src=\"http://host.com/a.gif?a=foo&b=bar\" />");
        assertHTML("<p>&#xA;</p>", "<p>&#xA;</p>");

        // Verify that double quotes are escaped in attribute values
        assertHTML("<p value=\"script:&quot;&quot;\"></p>", "<p value='script:\"\"'");
    }

This time it's the second assert that fails with <html><head></head><body>"&</body></html> instead of <html><head></head><body>"&</body></html>

Thanks a lot for your help, much appreciated. I haven't touched this code for a while and I'm a bit rusty on it and I don't remember a lot of things....

Scott Wilson - 2018-06-21

Yes, it gets pretty complicated when you want some entities and not others...

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2018-06-21

AFAIR what we wanted was XHTML so it's normal to escape ampersand (since it's one of the 5 special characters in XML) but the quote doesn't need to be escaped in XHTML.

So are you saying there's no way to escape only what's needed for XHTML and translateSpecialEntities=true will escape more than needed?

Last edit: Vincent Massol 2018-06-21

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2018-06-21

Hmm, SimpleHtmlSerializer is a bit aggressive at replacing characters with their entity equivalents. Its because the same code is used for escaping content in attributes (where we need to be strict) and elements (where convention is to be relatively loose).

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2018-06-21

Its all in the serializer as far as I can tell - the actual TagNode tree has the 'right' information, I think the unwanted behaviour is in the serializer.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2018-06-21

Strictly speaking the serialization is correct with regards to the validity of the output, its just being a little overzealous in its use of ".

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2018-06-27

1) I'm also getting this now:

assertHTML("\n", "
");

whereas before, I had:

assertHTML("
", "
");

(
 is a line feed)

Does that seem normal to you?

2) Should I open an issue for the overzealous quote escape?

Thanks
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2018-06-27

For (1) do you mean that, where before you had a line feed character as a unicode entity, you are now getting a standard line feed character?

For (2) yes I think it needs its own bug report. Its not an error as such, but its not what many users expect from a tool such as HC.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2018-06-27

For (1) do you mean that, where before you had a line feed character as a unicode entity, you are now getting a standard line feed character?

Yes correct.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2018-06-27

When RecognizeUnicodeCharacters is set to True, HC will replace any unicode entities with unicode characters. If set to false, it'll leave the unicode entity references in place. (Earlier releases of HC were less consistent in applying this.)

You may want to check if this causes any issues in further processing, but any XML/XHTML processing should treat the newline character and the newline entity reference as being equivalent.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2018-06-28

I've gone ahead yesterday and applied the HC 2.22 upgrade with the change of tests. I'm crossing fingers now since it's hard to foresee what the behavior changes will lead to.

Thx

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2018-06-28

Fingers crossed here too!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

I have this test for the 2.23 snapshot based on the original bug report, which is now passing:

// See bug #203
@Test
public void parse2() throws Exception
{
    String html = "<div foo=\"aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee\">content</div>";
    String expected = "<div foo=\"aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee\">content</div>";
    final CleanerProperties cleanerProperties = new CleanerProperties();
    final TagNode tagNode = new HtmlCleaner().clean(html);
    cleanerProperties.setOmitHtmlEnvelope(true);
    cleanerProperties.setOmitXmlDeclaration(true);
    String out = new SimpleXmlSerializer(cleanerProperties).getAsString(html);
    assertEquals(expected, out);
}

HC doesn't properly escape < and > characters anymore in attributes

Group

Searches

Help

#203 HC doesn't properly escape < and > characters anymore in attributes

Discussion

Problem 1

Problem 2

Problem 3

Problem 4