Menu

#203 HC doesn't properly escape < and > characters anymore in attributes

v2.23
closed-fixed
nobody
None
5
2019-09-05
2018-06-19
No

Hi, on XWiki we've tried to upgrade from HC 2.16 to 2.22 and had to rollback during the release because of a regression we noticed:

Input:

<div foo="aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee">content</div>

Generates:

<div foo="aaa&quot;bbb&amp;ccc>ddd<eee">content</div>

Whereas we were expecting the following (which is what we get with HC 2.16):

<div foo="aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee">content</div>

Thanks!

Discussion

1 2 > >> (Page 1 of 2)
  • Vincent Massol

    Vincent Massol - 2018-06-19

    For reference this is out issue on the XWiki side: https://jira.xwiki.org/browse/XCOMMONS-1164

     
  • Scott Wilson

    Scott Wilson - 2018-06-19

    Hi Vincent,

    I've tried to replicate this in 2.22 but I don't get the same output. I see the same output you expect to find in your test, that the entities are preserved in the output attribute.

    • Scott
    @Test
    public void escapeMarkupInAttributes() throws IOException {
        HtmlCleaner theCleaner = new HtmlCleaner();
        theCleaner.getProperties().setAllowHtmlInsideAttributes(false);
        theCleaner.getProperties().setOmitXmlDeclaration(true);
        String initial = "<div foo=\"aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee\">content</div>";
        String expected = "<html><head></head><body><div foo=\"aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee\">content</div></body></html>";
        String cleaned = new SimpleHtmlSerializer(theCleaner.getProperties()).getAsString(theCleaner.clean(initial));
        assertEquals(cleaned, expected);
        theCleaner.getProperties().setAllowHtmlInsideAttributes(true);
        cleaned = new SimpleHtmlSerializer(theCleaner.getProperties()).getAsString(theCleaner.clean(initial));
        assertEquals(cleaned, expected);
    }
    
     
  • Vincent Massol

    Vincent Massol - 2018-06-19

    Hi Scott! You're as fast as usal :)

    I would have thought about some XWiki-related thing but for the fact that keeping the same code and just changing the HC version in the POM changes the result so this clearly suggests a change in behavior.

    Maybe HC between 2.16 and 2.22 changed some default config values?

    Note that "setAllowHtmlInsideAttributes(false)" is the default and we're not setting it (I've set it to test and same result). FTR this is the config we have: https://github.com/xwiki/xwiki-commons/blob/d1657b0da41bc6fdca0cefd6e03b2a3705b17cec/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/DefaultHTMLCleaner.java#L211

    Also we're using our own DOM Serializer () so maybe something has changed from what is returned by the clean() method like not escaping attribute values and SimpleHtmlSerializer has also been modified to do the escaping?

     
  • Scott Wilson

    Scott Wilson - 2018-06-19

    I think it relates to this discussion and the associated change (commit 521):

    https://sourceforge.net/p/htmlcleaner/discussion/637246/thread/fc7f2d65/?limit=25#b65b

    If you look at the last change to DomSerializer, this should probably be replicated in XWikiDomSerializer:

    -                   if (escapeXml && !specialCase) {
    -                       content = Utils.escapeXml(content, props, true);
    -                   }
    +                   if (shouldEscapeOrTranslateEntities() && !specialCase) {
    +                       content = Utils.escapeXml(content, props, true);
    +                   }
    
    +   private boolean shouldEscapeOrTranslateEntities() {
    +       return escapeXml || props.isRecognizeUnicodeChars() || props.isTranslateSpecialEntities();
    +   }
    +
    }
    
     
  • Vincent Massol

    Vincent Massol - 2018-06-19

    Thanks a lot Scott for this code/diff :)

    I've tried it but it doesn't work though.

    I'll need to look at the commit 521 a bit later. Thx

     
  • Vincent Massol

    Vincent Massol - 2018-06-21

    Ok I've done more tests:

    • With HC 2.16 if I check the content of the cleaned node, I can see that the attribute has proper escaped values, i.e. aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee
    • With HC 2.22, the value is not escaped aaa"bbb&ccc>ddd<eee
    • If I replace XWikiDOMSerializer by DOMSerializer with HC 2.22, I still get the problem and the test is not passing so the issue is not coming from XWikiDOMSerializer but from the cleaner it seems.

    Any idea?

    Thanks

     
  • Scott Wilson

    Scott Wilson - 2018-06-21

    Yes, HC will deserialize entities in attribute text when creating tagnodes, and the serializer will then apply the appropriate rules for entities depending on how its configured. (This is a change from earlier versions of HC which passed entities in attributes through untouched, and then applied escaping - there was a use case from a user for this change).

    So when I inspect the foo attribute for the DIV Tagnode,, I see as you do:

    aaa"bbb&ccc>ddd<eee

    But when I inspect the attribute value of the DIV element in the Document created by DomSerializer I see:

    aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee

    However, if I initialize the DomSerializer with escapeXml=false then I see:

    aaa"bbb&ccc>ddd<eee

    This is because of this line in DomSerializer:

                            if (escapeXml) {
                                attrValue = Utils.escapeXml(attrValue, props, true);
                            }
    

    So I suspect thats the issue? Is there a reason that escapeXml is turned off in XWiki?

    So a fix could be to include this line in XWikiDomSerializer:

    attrValue = Utils.escapeXml(attrValue, props, true);
    

    ... or to initalise the DomSerializer without setting escapeXml=false.

    Does that help?

     
  • Vincent Massol

    Vincent Massol - 2018-06-21

    Yes it helps,, thanks. I hadn't realized we were initializing the XWikiDOMSerialize with escapeXml=false (had forgotten).

    I tried putting it to true and indeed it fixes this issue, but breaks 3 other tests (which is why it was off before ;)):

    Problem 1

        @Test
        public void verifyEntitiesAreNotBroken()
        {
            assertHTML("<p>&Eacute;</p>", "&Eacute;");
        }
    

    Generates <html><head></head><body><p>É</p></body></html> instead of <html><head></head><body><p>&Eacute;</p></body></html>.

    Problem 2

        @Test
        public void specialCharacters()
        {
            // TODO: We still have a problem I think in that if there are characters such as "&" or quote in the source
            // text they are not escaped. This is because we have use "false" in DefaultHTMLCleaner here:
            // Document document = new JDomSerializer(this.cleanerProperties, false).createJDom(cleanedNode);
            // See the problem described here: http://sourceforge.net/forum/forum.php?thread_id=2243880&forum_id=637246
            assertHTML("<p>&quot;&amp;**notbold**&lt;notag&gt;&nbsp;</p>",
                "<p>&quot;&amp;**notbold**&lt;notag&gt;&nbsp;</p>");
            assertHTML("<p>\"&amp;</p>", "<p>\"&</p>");
            assertHTML("<p><img src=\"http://host.com/a.gif?a=foo&amp;b=bar\" /></p>",
                "<img src=\"http://host.com/a.gif?a=foo&b=bar\" />");
            assertHTML("<p>&#xA;</p>", "<p>&#xA;</p>");
    
            // Verify that double quotes are escaped in attribute values
            assertHTML("<p value=\"script:&quot;&quot;\"></p>", "<p value='script:\"\"'");
        }
    

    Generates <html><head></head><body><p>&quot;&amp;**notbold**&lt;notag&gt; </p></body></html> instead of <html><head></head><body><p>&quot;&amp;**notbold**&lt;notag&gt;&nbsp;</p></body></html>

    Problem 3

        @Test
        public void restrictedHtml()
        {
            HTMLCleanerConfiguration configuration = this.cleaner.getDefaultConfiguration();
            Map<String, String> parameters = new HashMap<String, String>();
            parameters.putAll(configuration.getParameters());
            parameters.put("restricted", "true");
            configuration.setParameters(parameters);
    
            String result = HTMLUtils.toString(this.cleaner.clean(
                new StringReader("<script>alert(\"foo\")</script>"), configuration));
            assertEquals(HEADER_FULL + "<pre>alert(\"foo\")</pre>" + FOOTER, result);
    
            result = HTMLUtils.toString(this.cleaner.clean(
                new StringReader("<style>p {color:white;}</style>"), configuration));
            assertEquals(HEADER_FULL + "<pre>p {color:white;}</pre>" + FOOTER, result);
        }
    

    Generates <html><head></head><body><pre>alert(&quot;foo&quot;)</pre></body></html> instead of <html><head></head><body><pre>alert("foo")</pre></body></html>.

    Do these tests seem valid to you?

    Thx

     
  • Vincent Massol

    Vincent Massol - 2018-06-21

    And I believe you're confirming that the cleaner has changed behavior since 2.16 and that it's normal and wanted that it doesn't escape attribute values any more, correct?

     
  • Vincent Massol

    Vincent Massol - 2018-06-21

    Note that if I use DOMSerializer instead of XWikiDOMSerializer I also get these 3 problems (and 2 more failing tests related to CDATA).

     
  • Scott Wilson

    Scott Wilson - 2018-06-21

    Yes that's correct - its a change since 2.21.

    Some of the other tests you have there relate to preserving entities rather than replacing them with unicode; I think thats something that should be set in cleaner properties rather than by turning off XML escaping.

     
  • Scott Wilson

    Scott Wilson - 2018-06-21

    ... for example, if you want to retain the &Eacute; entity rather than replace it with the unicode character É, you'd set translateSpecialEntities=false.

     
  • Vincent Massol

    Vincent Massol - 2018-06-21

    Yes thanks for that hint, that fixes Problem 1 (and Problem 2 for some reason).

    I now have Problem 3 + a new one:

    Problem 4

        @Test
        public void specialCharacters()
        {
            // TODO: We still have a problem I think in that if there are characters such as "&" or quote in the source
            // text they are not escaped. This is because we have use "false" in DefaultHTMLCleaner here:
            // Document document = new JDomSerializer(this.cleanerProperties, false).createJDom(cleanedNode);
            // See the problem described here: http://sourceforge.net/forum/forum.php?thread_id=2243880&forum_id=637246
            assertHTML("<p>&quot;&amp;**notbold**&lt;notag&gt;&nbsp;</p>",
                "<p>&quot;&amp;**notbold**&lt;notag&gt;&nbsp;</p>");
            assertHTML("<p>\"&amp;</p>", "<p>\"&</p>");
            assertHTML("<p><img src=\"http://host.com/a.gif?a=foo&amp;b=bar\" /></p>",
                "<img src=\"http://host.com/a.gif?a=foo&b=bar\" />");
            assertHTML("<p>&#xA;</p>", "<p>&#xA;</p>");
    
            // Verify that double quotes are escaped in attribute values
            assertHTML("<p value=\"script:&quot;&quot;\"></p>", "<p value='script:\"\"'");
        }
    

    This time it's the second assert that fails with <html><head></head><body><p>&quot;&amp;</p></body></html> instead of <html><head></head><body><p>"&amp;</p></body></html>

    Thanks a lot for your help, much appreciated. I haven't touched this code for a while and I'm a bit rusty on it and I don't remember a lot of things....

     
  • Scott Wilson

    Scott Wilson - 2018-06-21

    Yes, it gets pretty complicated when you want some entities and not others...

     
  • Vincent Massol

    Vincent Massol - 2018-06-21

    AFAIR what we wanted was XHTML so it's normal to escape ampersand (since it's one of the 5 special characters in XML) but the quote doesn't need to be escaped in XHTML.

    So are you saying there's no way to escape only what's needed for XHTML and translateSpecialEntities=true will escape more than needed?

     

    Last edit: Vincent Massol 2018-06-21
  • Scott Wilson

    Scott Wilson - 2018-06-21

    Hmm, SimpleHtmlSerializer is a bit aggressive at replacing characters with their entity equivalents. Its because the same code is used for escaping content in attributes (where we need to be strict) and elements (where convention is to be relatively loose).

     
  • Scott Wilson

    Scott Wilson - 2018-06-21

    Its all in the serializer as far as I can tell - the actual TagNode tree has the 'right' information, I think the unwanted behaviour is in the serializer.

     
  • Scott Wilson

    Scott Wilson - 2018-06-21

    Strictly speaking the serialization is correct with regards to the validity of the output, its just being a little overzealous in its use of &quot;.

     
  • Vincent Massol

    Vincent Massol - 2018-06-27

    1) I'm also getting this now:

    assertHTML("<p>\n</p>", "<p>&#xA;</p>");
    

    whereas before, I had:

    assertHTML("<p>&#xA;</p>", "<p>&#xA;</p>");
    

    (&#xA; is a line feed)

    Does that seem normal to you?

    2) Should I open an issue for the overzealous quote escape?

    Thanks

     
  • Scott Wilson

    Scott Wilson - 2018-06-27

    For (1) do you mean that, where before you had a line feed character as a unicode entity, you are now getting a standard line feed character?

    For (2) yes I think it needs its own bug report. Its not an error as such, but its not what many users expect from a tool such as HC.

     
  • Vincent Massol

    Vincent Massol - 2018-06-27

    For (1) do you mean that, where before you had a line feed character as a unicode entity, you are now getting a standard line feed character?

    Yes correct.

     
  • Scott Wilson

    Scott Wilson - 2018-06-27

    When RecognizeUnicodeCharacters is set to True, HC will replace any unicode entities with unicode characters. If set to false, it'll leave the unicode entity references in place. (Earlier releases of HC were less consistent in applying this.)

    You may want to check if this causes any issues in further processing, but any XML/XHTML processing should treat the newline character and the newline entity reference as being equivalent.

     
  • Vincent Massol

    Vincent Massol - 2018-06-28

    I've gone ahead yesterday and applied the HC 2.22 upgrade with the change of tests. I'm crossing fingers now since it's hard to foresee what the behavior changes will lead to.

    Thx

     
  • Scott Wilson

    Scott Wilson - 2018-06-28

    Fingers crossed here too!

     
  • Scott Wilson

    Scott Wilson - 2019-09-04

    I have this test for the 2.23 snapshot based on the original bug report, which is now passing:

    // See bug #203
    @Test
    public void parse2() throws Exception
    {
        String html = "<div foo=\"aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee\">content</div>";
        String expected = "<div foo=\"aaa&quot;bbb&amp;ccc&gt;ddd&lt;eee\">content</div>";
        final CleanerProperties cleanerProperties = new CleanerProperties();
        final TagNode tagNode = new HtmlCleaner().clean(html);
        cleanerProperties.setOmitHtmlEnvelope(true);
        cleanerProperties.setOmitXmlDeclaration(true);
        String out = new SimpleXmlSerializer(cleanerProperties).getAsString(html);
        assertEquals(expected, out);
    }
    
     
1 2 > >> (Page 1 of 2)

Log in to post a comment.