HtmlCleaner / Bugs / #33 CDATA blocks are not recognized

Vincent Massol - 2009-04-14

See also https://sourceforge.net/tracker/?func=detail&aid=2761963&group_id=183053&atid=903696

One problem is in HtmlTokenizer, in content() which breaks if the "<" char is found. However when inside a CDATA block this shouldn't happen.

Example:

"<script type=\"text/javascript\">\n"
+ "// <![CDATA[\n"
+ "function escapeForXML(origtext) {\n"
+ " return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')\n"
+ " .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');"
+ "}\n"
+ "// ]]>\n"
+ "</script>"

This generates 3 ContentToken but it should generate only one.

Then the Serializer fail since it creates 3 CDATA sections if useCData is true:

} else if (item instanceof ContentToken) {
String nodeName = element.getNodeName();
ContentToken contentToken = (ContentToken) item;
String content = contentToken.getContent();
boolean specialCase =
props.isUseCdataForScriptAndStyle()
&& ("script".equalsIgnoreCase(nodeName) || "style".equalsIgnoreCase(nodeName));
if (escapeXml && !specialCase) {
content = Utils.escapeXml(content, props, true);
}
element.appendChild(specialCase ? document.createCDATASection(content) : document
.createTextNode(content));

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2009-04-14

It'll also fail if useCdata is set to true and there are any "<" char in the script element.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-05-24

It would really be cool if HTMLCleaner could fix CDATA handling in DomSerializer for 2.6 :)

FYI here's the code that makes it work nicely:
https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-05-24

It would be even cooler if someone submitted a patch :)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-05-24

This is what I've done Scott. I've provided the full solution! You just need to replace your DOMSerializer by this one:

https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java

:)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-08-05

Thanks Vincent - and sorry for the long delay in replying :)

We'll definitely look at merging in the improved DomSerializer from XWiki

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-08-05

Thanks Scott. I'm eagerly waiting for it :)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-08-11

assigned_to: Scott Wilson

Group: --> v 2.7
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-08-11

Merge in for 2.7

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-09-24

status: open --> closed-fixed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-09-24

I've modified the tokenizer so that it no longer breaks up content within CDATA sections when it encounters a '<' or '>' - so a content section containing CDATA will end up as a single ContentNode rather than multiple when the enclosed CDATA has those characters in it. I've also merged in the improvements to DomSerializer.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-09-24

woohoo :)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Heheh, don't celebrate too soon :)

I've not merged in one part of the improvements as I'm not entirely sure how they'll play out:

            } else if (item instanceof ContentNode) {
                ContentNode contentNode = (ContentNode) item;
                String content = contentNode.getContent();
                boolean specialCase = dontEscape(element);
                if (escapeXml && !specialCase) {
                    content = Utils.escapeXml(content, props, true);
                }
                element.appendChild( specialCase ? document.createCDATASection(content) : document.createTextNode(content) );

(HC 2.7-SNAPSHOT)

vs.

            } else if (item instanceof ContentNode) {
                ContentNode contentToken = (ContentNode) item;
                bufferedContent.append(contentToken.getContent());

(XWIKI)

Though as content nodes are no longer being split, maybe this isn't such a big deal?

Ok I've just tested it on our project at https://github.com/xwiki/xwiki-commons/tree/master/xwiki-commons-core/xwiki-commons-xml and unfortunately it fails :(

Running org.xwiki.xml.internal.html.DefaultHTMLCleanerTest

scriptAndCData(org.xwiki.xml.internal.html.DefaultHTMLCleanerTest)  Time elapsed: 0.031 sec  <<< ERROR!
org.jdom.IllegalDataException: The data "//<![CDATA[
alert("Hello World")
// ]]>" is not legal for a JDOM CDATA section: CDATA cannot internally contain a CDATA ending delimiter (]]>).
    at org.jdom.CDATA.setText(CDATA.java:121)
    at org.jdom.CDATA.<init>(CDATA.java:96)
    at org.jdom.DefaultJDOMFactory.cdata(DefaultJDOMFactory.java:97)
    at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:373)
    at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:360)
    at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:360)
    at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:360)
    at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:173)
    at org.jdom.input.DOMBuilder.build(DOMBuilder.java:138)
    at org.xwiki.xml.html.HTMLUtils.toString(HTMLUtils.java:194)
    at org.xwiki.xml.html.HTMLUtils.toString(HTMLUtils.java:180)
    at org.xwiki.xml.internal.html.DefaultHTMLCleanerTest.assertHTML(DefaultHTMLCleanerTest.java:279)
    at org.xwiki.xml.internal.html.DefaultHTMLCleanerTest.scriptAndCData(DefaultHTMLCleanerTest.java:187)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.xwiki.test.ComponentManagerRule$1.evaluate(ComponentManagerRule.java:44)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray2(ReflectionUtils.java:208)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:158)
    at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:86)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:95)

This test is:

    /**
     * Verify that scripts are not cleaned and that we can have a CDATA section inside. Also verify CDATA behaviors.
     */
    @Test
    public void scriptAndCData()
    {
        assertHTML("<script type=\"text/javascript\">//<![CDATA[\n//\nalert(\"Hello World\")\n// \n//]]></script>", 
            "<script type=\"text/javascript\">//<![CDATA[\nalert(\"Hello World\")\n// ]]></script>");

        assertHTML("<script type=\"text/javascript\">//<![CDATA[\n\n"
            + "// \n"
            + "function escapeForXML(origtext) {\n"
            + "   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')\n"
            + "       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');"
            + "}\n"
            + "// \n\n//]]>"
            + "</script>", "<script type=\"text/javascript\">\n"
            + "// <![CDATA[\n"
            + "function escapeForXML(origtext) {\n"
            + "   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')\n"
            + "       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');"
            + "}\n"
            + "// ]]>\n"
            + "</script>");

        assertHTML("<script>//<![CDATA[\n<>\n//]]></script>", "<script>&lt;&gt;</script>");
        assertHTML("<script>//<![CDATA[\n<>\n//]]></script>", "<script><></script>");

        // Verify that CDATA not inside SCRIPT or STYLE elements are considered comments in HTML and thus stripped
        // when cleaned.
        assertHTML("<p/>", "<p><![CDATA[&]]></p>");
        assertHTML("<p>&amp;&amp;</p>", "<p>&<![CDATA[&]]>&</p>");
    }

Scott Wilson - 2013-09-24

Ah got it, the content nodes contain CDATA start/end tokens

I've created a CData token class so we can process them separately from regular Content Node instances and treat them differently in the serializers - lets see if that works out better

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-09-24

OK, I've committed the latest changes. I've modified the HtmlTokenizer to recognise CData sections and create a CData instance rather than a ContentNode instance. These can then be output separately from regular ContentNodes, with or without safe start/end tokens. Give it a go, and see how it breaks this time ;)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-09-24

(oh, and thank you very much Vincent for your patience in helping me fix this)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-09-24

status: closed-fixed --> open-accepted
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-09-25

I've retested with your latest changes and it's still not good. Actually it seems you've brought another regression to https://sourceforge.net/tracker/index.php?func=detail&aid=2062318&group_id=183053&atid=903696 which was working before your last changes.

So right now I get the following 2 problems:
* missing DOCTYPE
* removed comments

screenshot of diff: https://www.evernote.com/shard/s119/sh/62453483-dc9e-4dcd-85e6-76e343d88990/b4c98a06f8a68a23eab51b85db97a685

Thanks

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-09-25

OK thats odd; the DOCTYPE is correctly picked up by HtmlTokenizer but it isn't in the serialization output even with OmitDoctypeDeclaration==false.

(Adding more test cases to try and pick up these kind of regressions...)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-09-25

I've added another test for this one - I get slightly different results:

comparison .png

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

This is my test:

    @Test
    public void scriptAndCData()
    {
        assertHTML("<script type=\"text/javascript\">//<![CDATA[\n//\nalert(\"Hello World\")\n// \n//]]></script>", 
            "<script type=\"text/javascript\">//<![CDATA[\nalert(\"Hello World\")\n// ]]></script>");

        assertHTML("<script type=\"text/javascript\">//<![CDATA[\n\n"
            + "// \n"
            + "function escapeForXML(origtext) {\n"
            + "   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')\n"
            + "       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');"
            + "}\n"
            + "// \n\n//]]>"
            + "</script>", "<script type=\"text/javascript\">\n"
            + "// <![CDATA[\n"
            + "function escapeForXML(origtext) {\n"
            + "   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')\n"
            + "       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');"
            + "}\n"
            + "// ]]>\n"
            + "</script>");

        assertHTML("<script>//<![CDATA[\n<>\n//]]></script>", "<script>&lt;&gt;</script>");
        assertHTML("<script>//<![CDATA[\n<>\n//]]></script>", "<script><></script>");

        // Verify that CDATA not inside SCRIPT or STYLE elements are considered comments in HTML and thus stripped
        // when cleaned.
        assertHTML("<p/>", "<p><![CDATA[&]]></p>");
        assertHTML("<p>&amp;&amp;</p>", "<p>&<![CDATA[&]]>&</p>");
    }

and assertHTML is just:

    public static final String HEADER =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n";

    private static final String HEADER_FULL = HEADER + "<html><head></head><body>";

    private static final String FOOTER = "</body></html>\n";
...

    private void assertHTML(String expected, String actual)
    {
        Assert.assertEquals(HEADER_FULL + expected + FOOTER,
            HTMLUtils.toString(this.cleaner.clean(new StringReader(actual))));
    }

~~~~~~

Scott Wilson - 2013-09-25

Ah OK. So in this case you're expecting there to be a default DOCTYPE added where there is none specified in the input?

I've also had a further look into this... part of the problem is that XMLSerializer tries to wrap the whole contents of a <script> tag in CDATA declarations - even if it contains child CDATA. However, nested CDATA is not valid! </p> <p>One solution is to individually wrap ContentNode instances inside <script> and <style> with CDATA start and end tokens, even though this results in some pretty ugly (but valid) results. Another is to try and cleverly collapse it into a single set of CDATA tokens around the whole of the content without any nesting (even where it exists in the input). </p></script>

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-09-25

missing end of sentence in your last comment :)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-09-25

sigh the new SF site (aka Apache Allura) is not without its bugs either

OK, what I was on about was that at present XmlSerializer wraps the content of <script> and <style> in CDATA tokens, even if that includes CDATA content (and CDATA cannot be nested). Anyway, I think I'm making progress at fixing this gnarly problem - thanks for the test cases, they're a big help.</p></script>

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

CDATA blocks are not recognized

Group

Searches

Help

#33 CDATA blocks are not recognized

Discussion