Menu

#169 Several issues with CDATA blocks

v2.19
closed-fixed
nobody
None
5
2017-02-08
2016-01-22
No

Hi there (Hi Scott! :))

I'm trying to eliminate the need for a custom DomSerializer in the XWiki project, see https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java

When I try to use the DomSerializer from HTML Cleaner, I have the following errors when running the XWiki test suite for CDATA.

Input:

<script type="text/javascript"><![CDATA[
alert("Hello World")
]]></script>

Expected:

<script type="text/javascript">//<![CDATA[
alert("Hello World")
//]]></script>

Actual:

<html><head></head><body><script type="text/javascript"><![CDATA[
alert("Hello World")
]]></script></body></html>

Rationale: Generate a javascript comment in front on the CDATA block so that it works in IE and when serving XHTML under a mimetype of HTML.

Some other test related to this use case:

Input:

<script type="text/javascript">//<![CDATA[
alert("Hello World")
//]]></script>

Expected:

<script type="text/javascript">//<![CDATA[
alert("Hello World")
//]]></script>

Rationale: Verify that // are kept.

Input:

<script type="text/javascript">/*<![CDATA[*/
alert("Hello World")
/*]]>*/</script>

Expected:

<script type="text/javascript">//<![CDATA[
alert("Hello World")
//]]></script>

Rationale: normalize the JS comment

Input:

<script type="text/javascript">
/*<![CDATA[*/
function escapeForXML(origtext) {
   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')
       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;')
       .replace(/\"/g,'&'+'quot;');}
/*]]>*/
</script>

Expected:

<script type="text/javascript">
//<![CDATA[
function escapeForXML(origtext) {
   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')
       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;')
       .replace(/\"/g,'&'+'quot;');}
//]]>
</script>

Rationale: Same as above on a more complex use case.

Input:

"<script>&lt;&gt;</script>

Expected:

<script>//<![CDATA[<>//]]></script>

Actual:

<script><![CDATA[&lt;&gt;]]></script>

Rationale: Same as above, generate JS comments

Input:

<script><></script>

Expected:

<script>//<![CDATA[<>//]]></script>

Actual:

<script><![CDATA[<>]]></script>

Input:

<style type="text/css"><![CDATA[
a { color: red; }
]]></style>

Expected:

<style type="text/css">//<![CDATA[
a { color: red; }
//]]></style>

Actual:

<style type="text/css"><![CDATA[
a { color: red; }
]]></style>

It would be so awesome if HTML Cleaner could support those use cases :)

WDYT? Would you agree to support those use cases?

Note that the implementation is in https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java and you could copy paste that code if you were to agree to support it. However you'd probably want to make it conditional to setting some configuration options.

Thanks a lot

Discussion

<< < 1 2 3 > >> (Page 2 of 3)
  • Scott Wilson

    Scott Wilson - 2017-02-06
    • status: open --> closed-fixed
     
  • Vincent Massol

    Vincent Massol - 2017-02-07

    Awesome Scott! I see you've fixed several issues I had raised thanks a lot for that. Eagerly waiting for 2.19 to be out to test it.

     
  • Vincent Massol

    Vincent Massol - 2017-02-07

    Hi Scott. Thanks for working on those. I've upgraded to HC 2.19 and unfortunately I still have the same problem as reported in the comments:
    https://sourceforge.net/p/htmlcleaner/bugs/169/#c7e0
    https://sourceforge.net/p/htmlcleaner/bugs/169/#6aa0

    Also, AFAICS, bug #180 is quite different.

    Could it be that this issue was closed by mistake?

    Thanks
    -Vincent

    PS: Unfortunately,because of this issue we haven't been able to upgrade HC on the xwiki project since version 2.16 (this is the last issue that worked for us). After 2.16 a regression was introduced that fails our test, leasding to https://sourceforge.net/p/htmlcleaner/bugs/169/#c7e0 (ie removal of script comments).

     
  • Scott Wilson

    Scott Wilson - 2017-02-07

    Hmm, I'm sure those tests are in the regular HC test suite. I'll take another look

     
  • Scott Wilson

    Scott Wilson - 2017-02-07

    OK, so we have 3 XWiki tests potentially failing here:

        assertHTML("<script type=\"text/javascript\">//<![CDATA[\n//\nalert(\"Hello World\")\n\n//]]></script>", 
                "<script type=\"text/javascript\">//<![CDATA[\nalert(\"Hello World\")\n//]]></script>");
    
        assertHTML("<script type=\"text/javascript\">//<![CDATA[\n\nalert(\"Hello World\")\n\n//]]></script>", 
                "<script type=\"text/javascript\"><![CDATA[\nalert(\"Hello World\")\n]]></script>");
    
        assertHTML("<style type=\"text/css\">/*<![CDATA[*/\na { color: red; }\n/*]]>*/</style>", 
                "<style type=\"text/css\"><![CDATA[\na { color: red; }\n]]></style>");
    

    The differences are:

    In test 1:
    1. HC doesn't add any extra newlines before/after CData tokens where they already exist
    2. HC always uses block comment style ("/* */")rather than inline comment style ("//")
    3. HC doesn't always preserve JS comments where they could also be safe-commenting a CData token. Thats because "//<![CDATA[" can be interpreted either as an inline-commented CData token or a JS comment followed by a CDATA token. HC uses the first interpretation.

    In test 2, the only difference is the comments style (2)

    Test 3 passes as-is.

    So I think the thing that could be considered a potential bug is point 3 about interpreting comments before a CData token - the others are more to do with the style of output. I'll have a re-read of the HTML processor docs at W3 and see if that gives me a clue as to which is the correct interpretation!

     

    Last edit: Scott Wilson 2017-02-07
  • Scott Wilson

    Scott Wilson - 2017-02-07

    For really old versions of IE you had to use this monstrosity:

    <!-- // <![CDATA[ -->
    

    However I think that all current (and previous generation) browsers support the /* */ commenting style.

     
  • Scott Wilson

    Scott Wilson - 2017-02-07

    Point 3 is one where I can't find a definitive rule in the XHTML, HTML and XML processing docs, as effectively this is a case where we're trying to use XML/XHTML-safe code when using the wrong content type :)

    So lets see what modern browsers do.

    Our input is:

    <script type="text/javascript">//<![CDATA[
    alert("Hello World")
    //]]></script>
    

    Chrome gives us:

    <script type="text/javascript">//<![CDATA[
    alert("Hello World")
    //]]></script>
    

    Firefox gives us:

    <script type="text/javascript">//<![CDATA[
    alert("Hello World")
    //]]></script>
    

    HtmlCleaner gives us:

    <script type=\"text/javascript\">/*<![CDATA[*/
    alert(\"Hello World\")
    /*]]>*/</script>
    

    XWiki version of HtmlCleaner gives us:

    <script type="text/javascript">//<![CDATA[
    //
    alert("Hello World")
    
    //]]></script>
    

    So, I think not injecting the JS comment into the CDATA section is the correct behaviour.

    Also, not adding extra newlines looks correct.

    On the commenting style, I've used block comments to ensure we don't get a change in the script meaning from misplaced newlines, but it does seem like the consensus is retain the original commenting style where one is given (Chrome and FF also retain block style if thats in the original source).

    So I think the consensus emerging here is that HC is correct on (1) and (3), but should look at altering its behaviour in (2) to try to retain the original commenting style.

     
    • Vincent Massol

      Vincent Massol - 2017-02-08

      So I think the consensus emerging here is that HC is correct on (1) and (3), but should look at altering its behaviour in (2) to try to retain the original commenting style.

      Why do you say that HC seems to be correct for (3)? AFAIU (3) is about "HC doesn't always preserve JS comments where they could also be safe-commenting a CData token. " but the browser tests you've shown above didn't remove JS comments.

      Regarding the extra new lines, I'm checking with Guillaume Delhumeau why he added that in https://github.com/xwiki/xwiki-commons/commit/bb127b08f640598bfc46f0eba4aa0f5359128991#diff-acb682688276ae941d59e302a0d34c12R141 :)

       
      • Scott Wilson

        Scott Wilson - 2017-02-08

        In each of the example cases above the CDATA token is preceded by a comment marker. However, in the XWiki example, there is an additional comment marker added after the CDATA token. This is what I mean by "preserving the JS comment" - the XWiki serializer takes the comment marker before CDATA as being separate from the CDATA token, and therefore ends up with two comment markers in the output, while all the other parsers treat the comment marker as being part of the CDATA token.

         
      • Scott Wilson

        Scott Wilson - 2017-02-08

        Trying to make this a bit clearer - it is difficult!

        We have this input text:

        //<![CDATA[
        

        We can view this semantically as either:

        • a JavaScript Inline Comment followed by a CDATA start token; i.e. two tokens
        • a safe-commented CDATA start token; i.e. a single token

        In the first case, we would move the inline comment token inside the CDATA section as its not valid outside, and then safe-comment the CDATA start token when serializing it. Because of this we effectively end up doubling the comment. This is what XWiki does currently:

        //<![CDATA[
        //
        

        In the second case, we treat the comment marker as just part of the CDATA token, and serialise it out. This is what Chrome, Firefox and HtmlCleaner do. (The only difference is that HtmlCleaner substitutes block-style commenting for inline-style commenting.):

        //<![CDATA[
        

        or

        /*<![CDATA[*/
        
         
        • Vincent Massol

          Vincent Massol - 2017-02-08

          ok so I guess we need to test your input above in IE. I've checked and XWiki supports only IE10+ from now on (http://dev.xwiki.org/xwiki/bin/view/Community/BrowserSupportStrategy).

          Guillaume is testing that.

           
  • Guillaume Delhumeau

    Hi Scott.

    Testing with IE10 (XWiki now doesn't support older versions of IE):

    <script type="text/javascript">//<![CDATA[
    alert("Hello World")
    //]]></script>
    

    See: https://jsfiddle.net/bw7xguhq/

    It works without any error.

    The same jsFiddle seems to work on IE9 (I can see "Hello World"), but actually I cannot see the fiddle content :/

     
    • Scott Wilson

      Scott Wilson - 2017-02-08

      Well thats odd. Still, at least it works...

       
    • Vincent Massol

      Vincent Massol - 2017-02-08
       
      • Guillaume Delhumeau

        Assuming you ask me to use the Web Developer tool of IE10, I get:

        <script type="text/javascript">//<![CDATA[
        alert("Hello World")
        //]]></script>
        
         
  • Vincent Massol

    Vincent Massol - 2017-02-08

    ok Scott, so I've slowly examined all the XWiki unit tests related to CData inside script or style tags and here are the problems I've noticed:

    A) HC removes or doesn't add comments where necessary. HC should add //comments for script tags in front of CDATA (and /*comments for style tags) as otherwise JS or CSS engines will choke on the CDATA element. This can be seen easily on https://jsfiddle.net/r1vkrktz/ (if you run it, you'll see that the "test" link is not in red and if you add the comments as on https://jsfiddle.net/7g64s784/ it works).

    B) HC is overzealous in some cases and adds several CDATA blocks when there should be only one suc block. Example input:

    <script type="text/javascript">//<![CDATA[
    alert("Hello World")
    //]]></script>
    

    HC gives:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html><head></head><body><script type="text/javascript"><![CDATA[//]]><![CDATA[
    alert("Hello World")
    ]]></script></body></html>
    

    Also note the problem of ordering.

    Other input exampl showing the same problem:

    <script type="text/javascript">/*<![CDATA[*/
    function escapeForXML(origtext) {
       return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')        
            .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');}
    /*]]>*/
    </script>
    

    HC gives:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html><head></head><body><script type="text/javascript"><![CDATA[
    ]]><![CDATA[
    function escapeForXML(origtext) {
       return origtext.replace(/\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')
           .replace(/>/g,'&'+'gt;').replace(/'/g,'&'+'apos;').replace(/"/g,'&'+'quot;');}
    ]]>
    </script></body></html>
    

    So in summary there are still 2 outstanding problems for CDATA handling IMO (including the fact that '//' comments should be used for script tags).

    Do you agree? :)

    Thanks

     
  • Scott Wilson

    Scott Wilson - 2017-02-08

    D'oh this is my fault. I've been testing using all the serializers apart from DomSerializer. It all works absolutely fine using HtmlSerializer and JDomSerializer, its just DomSerializer that does this..!

     
  • Scott Wilson

    Scott Wilson - 2017-02-08

    Right, duplicating all CDATA test cases using DomSerializer as well as the other serializers...

     
  • Vincent Massol

    Vincent Massol - 2017-02-08

    Thanks for looking into this Scott!

     
  • Scott Wilson

    Scott Wilson - 2017-02-08

    No problem! Hopefully now I know where the problem is, it shouldn't take a long time to fix.

     
  • Scott Wilson

    Scott Wilson - 2017-02-08

    OK, I've aligned all the serializers using the exact same unit tests for CDATA. If you can try a checkout and build from HEAD that should now give you a different result.

     
  • Vincent Massol

    Vincent Massol - 2017-02-08

    ok I've retested and it's much better :)

    Remaining issues for XWiki:

    • Using //instead of /**/for script tags but this is not very important I think. Although I hope that there aren't any tool parsing results of XWiki rendering that would fai because we would change the comment style.
    • Unescape entities when inside CDATA blocks. From what I understand, you said that I should be able to do this by extending DomSerializer and overriding createSubnodes(). However I've just checked createSubnodes() and it has a lot of code inside. If I need to copy paste all this code in my extension then it's going to be painful to update when you make changes in HC in the future. Would there be a way to have a config option similar to setDeserializeEntities() but only for CDATA content? :) Second best option would be a way for me to extend more easily into DomSerializer to add this behavior.

    Thanks!

     
<< < 1 2 3 > >> (Page 2 of 3)

Log in to post a comment.