Menu

#33 CDATA blocks are not recognized

v 2.7
closed-fixed
None
5
2013-12-06
2009-03-18
No

HTML Cleaner should recognize CDATA blocks and thus it shouldn't escape any character inside them.

Discussion

1 2 > >> (Page 1 of 2)
  • Vincent Massol

    Vincent Massol - 2009-04-14

    See also https://sourceforge.net/tracker/?func=detail&aid=2761963&group_id=183053&atid=903696

    One problem is in HtmlTokenizer, in content() which breaks if the "<" char is found. However when inside a CDATA block this shouldn't happen.

    Example:

    "<script type=\"text/javascript\">\n"
    + "// <![CDATA[\n"
    + "function escapeForXML(origtext) {\n"
    + " return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')\n"
    + " .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');"
    + "}\n"
    + "// ]]>\n"
    + "</script>"

    This generates 3 ContentToken but it should generate only one.

    Then the Serializer fail since it creates 3 CDATA sections if useCData is true:

    } else if (item instanceof ContentToken) {
    String nodeName = element.getNodeName();
    ContentToken contentToken = (ContentToken) item;
    String content = contentToken.getContent();
    boolean specialCase =
    props.isUseCdataForScriptAndStyle()
    && ("script".equalsIgnoreCase(nodeName) || "style".equalsIgnoreCase(nodeName));
    if (escapeXml && !specialCase) {
    content = Utils.escapeXml(content, props, true);
    }
    element.appendChild(specialCase ? document.createCDATASection(content) : document
    .createTextNode(content));

     
  • Vincent Massol

    Vincent Massol - 2009-04-14

    It'll also fail if useCdata is set to true and there are any "<" char in the script element.

     
  • Scott Wilson

    Scott Wilson - 2013-05-24

    It would be even cooler if someone submitted a patch :)

     
  • Scott Wilson

    Scott Wilson - 2013-08-05

    Thanks Vincent - and sorry for the long delay in replying :)

    We'll definitely look at merging in the improved DomSerializer from XWiki

     
  • Vincent Massol

    Vincent Massol - 2013-08-05

    Thanks Scott. I'm eagerly waiting for it :)

     
  • Scott Wilson

    Scott Wilson - 2013-08-11
    • assigned_to: Scott Wilson
    • Group: --> v 2.7
     
  • Scott Wilson

    Scott Wilson - 2013-08-11

    Merge in for 2.7

     
  • Scott Wilson

    Scott Wilson - 2013-09-24
    • status: open --> closed-fixed
     
  • Scott Wilson

    Scott Wilson - 2013-09-24

    I've modified the tokenizer so that it no longer breaks up content within CDATA sections when it encounters a '<' or '>' - so a content section containing CDATA will end up as a single ContentNode rather than multiple when the enclosed CDATA has those characters in it. I've also merged in the improvements to DomSerializer.

     
  • Vincent Massol

    Vincent Massol - 2013-09-24

    woohoo :)

     
  • Scott Wilson

    Scott Wilson - 2013-09-24

    Heheh, don't celebrate too soon :)

    I've not merged in one part of the improvements as I'm not entirely sure how they'll play out:

                } else if (item instanceof ContentNode) {
                    ContentNode contentNode = (ContentNode) item;
                    String content = contentNode.getContent();
                    boolean specialCase = dontEscape(element);
                    if (escapeXml && !specialCase) {
                        content = Utils.escapeXml(content, props, true);
                    }
                    element.appendChild( specialCase ? document.createCDATASection(content) : document.createTextNode(content) );
    

    (HC 2.7-SNAPSHOT)

    vs.

                } else if (item instanceof ContentNode) {
                    ContentNode contentToken = (ContentNode) item;
                    bufferedContent.append(contentToken.getContent());
    

    (XWIKI)

    Though as content nodes are no longer being split, maybe this isn't such a big deal?

     
  • Vincent Massol

    Vincent Massol - 2013-09-24

    Ok I've just tested it on our project at https://github.com/xwiki/xwiki-commons/tree/master/xwiki-commons-core/xwiki-commons-xml and unfortunately it fails :(

    Running org.xwiki.xml.internal.html.DefaultHTMLCleanerTest
    
    scriptAndCData(org.xwiki.xml.internal.html.DefaultHTMLCleanerTest)  Time elapsed: 0.031 sec  <<< ERROR!
    org.jdom.IllegalDataException: The data "//<![CDATA[
    alert("Hello World")
    // ]]>" is not legal for a JDOM CDATA section: CDATA cannot internally contain a CDATA ending delimiter (]]>).
        at org.jdom.CDATA.setText(CDATA.java:121)
        at org.jdom.CDATA.<init>(CDATA.java:96)
        at org.jdom.DefaultJDOMFactory.cdata(DefaultJDOMFactory.java:97)
        at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:373)
        at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:360)
        at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:360)
        at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:360)
        at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:173)
        at org.jdom.input.DOMBuilder.build(DOMBuilder.java:138)
        at org.xwiki.xml.html.HTMLUtils.toString(HTMLUtils.java:194)
        at org.xwiki.xml.html.HTMLUtils.toString(HTMLUtils.java:180)
        at org.xwiki.xml.internal.html.DefaultHTMLCleanerTest.assertHTML(DefaultHTMLCleanerTest.java:279)
        at org.xwiki.xml.internal.html.DefaultHTMLCleanerTest.scriptAndCData(DefaultHTMLCleanerTest.java:187)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at org.xwiki.test.ComponentManagerRule$1.evaluate(ComponentManagerRule.java:44)
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray2(ReflectionUtils.java:208)
        at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:158)
        at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:86)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:95)
    

    This test is:

        /**
         * Verify that scripts are not cleaned and that we can have a CDATA section inside. Also verify CDATA behaviors.
         */
        @Test
        public void scriptAndCData()
        {
            assertHTML("<script type=\"text/javascript\">//<![CDATA[\n//\nalert(\"Hello World\")\n// \n//]]></script>", 
                "<script type=\"text/javascript\">//<![CDATA[\nalert(\"Hello World\")\n// ]]></script>");
    
            assertHTML("<script type=\"text/javascript\">//<![CDATA[\n\n"
                + "// \n"
                + "function escapeForXML(origtext) {\n"
                + "   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')\n"
                + "       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');"
                + "}\n"
                + "// \n\n//]]>"
                + "</script>", "<script type=\"text/javascript\">\n"
                + "// <![CDATA[\n"
                + "function escapeForXML(origtext) {\n"
                + "   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')\n"
                + "       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');"
                + "}\n"
                + "// ]]>\n"
                + "</script>");
    
            assertHTML("<script>//<![CDATA[\n<>\n//]]></script>", "<script>&lt;&gt;</script>");
            assertHTML("<script>//<![CDATA[\n<>\n//]]></script>", "<script><></script>");
    
            // Verify that CDATA not inside SCRIPT or STYLE elements are considered comments in HTML and thus stripped
            // when cleaned.
            assertHTML("<p/>", "<p><![CDATA[&]]></p>");
            assertHTML("<p>&amp;&amp;</p>", "<p>&<![CDATA[&]]>&</p>");
        }
    
     
  • Scott Wilson

    Scott Wilson - 2013-09-24

    Ah got it, the content nodes contain CDATA start/end tokens

    I've created a CData token class so we can process them separately from regular Content Node instances and treat them differently in the serializers - lets see if that works out better

     
  • Scott Wilson

    Scott Wilson - 2013-09-24

    OK, I've committed the latest changes. I've modified the HtmlTokenizer to recognise CData sections and create a CData instance rather than a ContentNode instance. These can then be output separately from regular ContentNodes, with or without safe start/end tokens. Give it a go, and see how it breaks this time ;)

     
  • Scott Wilson

    Scott Wilson - 2013-09-24

    (oh, and thank you very much Vincent for your patience in helping me fix this)

     
  • Scott Wilson

    Scott Wilson - 2013-09-24
    • status: closed-fixed --> open-accepted
     
  • Scott Wilson

    Scott Wilson - 2013-09-25

    OK thats odd; the DOCTYPE is correctly picked up by HtmlTokenizer but it isn't in the serialization output even with OmitDoctypeDeclaration==false.

    (Adding more test cases to try and pick up these kind of regressions...)

     
  • Scott Wilson

    Scott Wilson - 2013-09-25

    I've added another test for this one - I get slightly different results:

     
  • Vincent Massol

    Vincent Massol - 2013-09-25

    This is my test:

        @Test
        public void scriptAndCData()
        {
            assertHTML("<script type=\"text/javascript\">//<![CDATA[\n//\nalert(\"Hello World\")\n// \n//]]></script>", 
                "<script type=\"text/javascript\">//<![CDATA[\nalert(\"Hello World\")\n// ]]></script>");
    
            assertHTML("<script type=\"text/javascript\">//<![CDATA[\n\n"
                + "// \n"
                + "function escapeForXML(origtext) {\n"
                + "   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')\n"
                + "       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');"
                + "}\n"
                + "// \n\n//]]>"
                + "</script>", "<script type=\"text/javascript\">\n"
                + "// <![CDATA[\n"
                + "function escapeForXML(origtext) {\n"
                + "   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')\n"
                + "       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');"
                + "}\n"
                + "// ]]>\n"
                + "</script>");
    
            assertHTML("<script>//<![CDATA[\n<>\n//]]></script>", "<script>&lt;&gt;</script>");
            assertHTML("<script>//<![CDATA[\n<>\n//]]></script>", "<script><></script>");
    
            // Verify that CDATA not inside SCRIPT or STYLE elements are considered comments in HTML and thus stripped
            // when cleaned.
            assertHTML("<p/>", "<p><![CDATA[&]]></p>");
            assertHTML("<p>&amp;&amp;</p>", "<p>&<![CDATA[&]]>&</p>");
        }
    

    and assertHTML is just:

        public static final String HEADER =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n";
    
        private static final String HEADER_FULL = HEADER + "<html><head></head><body>";
    
        private static final String FOOTER = "</body></html>\n";
    ...
    
        private void assertHTML(String expected, String actual)
        {
            Assert.assertEquals(HEADER_FULL + expected + FOOTER,
                HTMLUtils.toString(this.cleaner.clean(new StringReader(actual))));
        }
    

    ~~~~~~

     
  • Scott Wilson

    Scott Wilson - 2013-09-25

    Ah OK. So in this case you're expecting there to be a default DOCTYPE added where there is none specified in the input?

    I've also had a further look into this... part of the problem is that XMLSerializer tries to wrap the whole contents of a <script> tag in CDATA declarations - even if it contains child CDATA. However, nested CDATA is not valid! </p> <p>One solution is to individually wrap ContentNode instances inside <script> and <style> with CDATA start and end tokens, even though this results in some pretty ugly (but valid) results. Another is to try and cleverly collapse it into a single set of CDATA tokens around the whole of the content without any nesting (even where it exists in the input). </p></script>

     
  • Vincent Massol

    Vincent Massol - 2013-09-25

    missing end of sentence in your last comment :)

     
  • Scott Wilson

    Scott Wilson - 2013-09-25

    sigh the new SF site (aka Apache Allura) is not without its bugs either

    OK, what I was on about was that at present XmlSerializer wraps the content of <script> and <style> in CDATA tokens, even if that includes CDATA content (and CDATA cannot be nested). Anyway, I think I'm making progress at fixing this gnarly problem - thanks for the test cases, they're a big help.</p></script>

     
1 2 > >> (Page 1 of 2)

Log in to post a comment.

MongoDB Logo MongoDB