Menu

#169 Several issues with CDATA blocks

v2.19
closed-fixed
nobody
None
5
2017-02-08
2016-01-22
No

Hi there (Hi Scott! :))

I'm trying to eliminate the need for a custom DomSerializer in the XWiki project, see https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java

When I try to use the DomSerializer from HTML Cleaner, I have the following errors when running the XWiki test suite for CDATA.

Input:

<script type="text/javascript"><![CDATA[
alert("Hello World")
]]></script>

Expected:

<script type="text/javascript">//<![CDATA[
alert("Hello World")
//]]></script>

Actual:

<html><head></head><body><script type="text/javascript"><![CDATA[
alert("Hello World")
]]></script></body></html>

Rationale: Generate a javascript comment in front on the CDATA block so that it works in IE and when serving XHTML under a mimetype of HTML.

Some other test related to this use case:

Input:

<script type="text/javascript">//<![CDATA[
alert("Hello World")
//]]></script>

Expected:

<script type="text/javascript">//<![CDATA[
alert("Hello World")
//]]></script>

Rationale: Verify that // are kept.

Input:

<script type="text/javascript">/*<![CDATA[*/
alert("Hello World")
/*]]>*/</script>

Expected:

<script type="text/javascript">//<![CDATA[
alert("Hello World")
//]]></script>

Rationale: normalize the JS comment

Input:

<script type="text/javascript">
/*<![CDATA[*/
function escapeForXML(origtext) {
   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')
       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;')
       .replace(/\"/g,'&'+'quot;');}
/*]]>*/
</script>

Expected:

<script type="text/javascript">
//<![CDATA[
function escapeForXML(origtext) {
   return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')
       .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;')
       .replace(/\"/g,'&'+'quot;');}
//]]>
</script>

Rationale: Same as above on a more complex use case.

Input:

"<script>&lt;&gt;</script>

Expected:

<script>//<![CDATA[<>//]]></script>

Actual:

<script><![CDATA[&lt;&gt;]]></script>

Rationale: Same as above, generate JS comments

Input:

<script><></script>

Expected:

<script>//<![CDATA[<>//]]></script>

Actual:

<script><![CDATA[<>]]></script>

Input:

<style type="text/css"><![CDATA[
a { color: red; }
]]></style>

Expected:

<style type="text/css">//<![CDATA[
a { color: red; }
//]]></style>

Actual:

<style type="text/css"><![CDATA[
a { color: red; }
]]></style>

It would be so awesome if HTML Cleaner could support those use cases :)

WDYT? Would you agree to support those use cases?

Note that the implementation is in https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java and you could copy paste that code if you were to agree to support it. However you'd probably want to make it conditional to setting some configuration options.

Thanks a lot

Discussion

1 2 3 > >> (Page 1 of 3)
  • Scott Wilson

    Scott Wilson - 2016-02-01

    Hi Vincent!

    Yes, looks like a reasonable set of use cases; safe-commenting of CDATA is supported in other serializers after all.

     
  • Vincent Massol

    Vincent Massol - 2016-02-01

    cool!

     
  • Scott Wilson

    Scott Wilson - 2016-02-23

    Just had a look at this - the current implementation in HC uses the JDOM2 native CData class to represent any CData nodes. It doesn't inject any comments as this is left to whatever you use to output the Document. That seems like the right kind of behaviour, but it doesn't seem to do what you want?

     
  • Scott Wilson

    Scott Wilson - 2016-02-23

    Right, some more digging...

    Basically yes, if you have CDATA section in the source HTML, HC will generate an org.jdom2.CDATA object for it in the output Document.

    When you then use JDom's XMLOutputter, this will render the content of the object within non-commented CDATA markers.

    My first thought was there had to be an option in XMLOutputter to make those markers commented out for safe use within XHTML. Nope, can't see one.

    My second thought was to override XMLOutputter.outputCData to use safe markers. Nope; XMLOutputter is marked as Final for some reason :(

    So I'm left with the same implementation option as used in XWiki - stick in some Text nodes either side of the CDATA to wrap it in comments. It feels a bit icky to do it this way, but JDOM doesn't seem to have a "safe-CDATA" option.

     
  • Vincent Massol

    Vincent Massol - 2016-02-23

    Thanks for looking into this Scott! Yes, I couldn't find another solution either at the time...

     
  • Scott Wilson

    Scott Wilson - 2016-02-23

    OK, I've checked some changes into Trunk that seem to cover the test cases, apart from this one:

    Input:

    "<script>&lt;&gt;</script>
    

    Expected:

    <script>//<![CDATA[<>//]]></script>
    

    Actual:

    <script>//<![CDATA[&lt;&gt;//]]></script>
    
     
  • Vincent Massol

    Vincent Massol - 2016-02-23

    Regarding this last test case, you want to support it right (it's just not done yet)?

     
  • Scott Wilson

    Scott Wilson - 2016-02-23

    I'm undecided to be honest; it assumes that text content is processed for entity resolution before a CDATA section is created around it.

     
  • Vincent Massol

    Vincent Massol - 2016-02-23

    I probably don't understand the issue. At first glance, if you ask for useCdataFor="script,style" then I don't see why you wouldn't have a CDATA portion inside the the script tag.

     
    • Vincent Massol

      Vincent Massol - 2016-02-23

      Oh I think I understand. The question is not whether to have CDATA or not. It's whether & lt; & gt; is evaluated first or not. Well that's easy. Entities are not valid inside the script block. Just tried in chome and I get an error in the javascript console.

       

      Last edit: Vincent Massol 2016-02-23
  • Scott Wilson

    Scott Wilson - 2016-02-23

    The CDATA section markers (and the comments around them) aren't an issue - its why I would convert &lt; into < in this context

     
  • Vincent Massol

    Vincent Massol - 2016-02-23
     
  • Scott Wilson

    Scott Wilson - 2016-02-24

    There is a general way to do this, which is to use

    properties.setDeserializeEntities(true);
    

    This will deserialize XML entities on input; they'll then be escaped (unless they are in script tags or other special locations) on output.

    May be too blunt an instrument though; I suspect your intention is to only deserialise XML entities within script tags when we apply a CDATA wrapper on output.

    However, even then we can get edge cases like:

    <script>alert("This is an angle bracket represented as entities: &lt; &gt;")</script>
    

    ... which would be rendered incorrectly by applying either of these approaches.

     
  • Vincent Massol

    Vincent Massol - 2016-02-24

    If the original content of the script tag is invalid, I'm not sure there's much you can do. I'd simply surround its content with CDATA when useCdataFor is used.

     
  • Scott Wilson

    Scott Wilson - 2016-02-24

    Well, at the moment all the other tests pass OK; if you need to handle deserialisation of entities within script tags that can still be done without needing to subclass JDomSerializer. I've committed the changes to trunk, so you can test it out if you build a snapshot. I'll get another release out in March.

     
  • Vincent Massol

    Vincent Massol - 2016-03-03

    I've rebuilt and tested trunk without making any modification to xwiki (ie I still have our custom XWikiDOMSerializer). I have one test failing:

    assertHTML("<script type=\"text/javascript\">//<![CDATA[\n//\nalert(\"Hello World\")\n\n//]]></script>", "<script type=\"text/javascript\">//<![CDATA[\nalert(\"Hello World\")\n//]]></script>");
    

    Expected:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html><head></head><body><script type="text/javascript">//<![CDATA[
    //
    alert("Hello World")
    
    //]]></script></body></html>
    

    Actual:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html><head></head><body><script type="text/javascript">//<![CDATA[
    
    alert("Hello World")
    
    //]]></script></body></html>
    

    I'll now remove our XWikiDOMSerializer and see what I get.

     

    Last edit: Vincent Massol 2016-03-03
  • Vincent Massol

    Vincent Massol - 2016-03-03

    Ok after removing our custom XWikiDOMSerializer I get several failing tests. Just to list 2:

    assertHTML("<script type=\"text/javascript\">//<![CDATA[\n\nalert(\"Hello World\")\n\n//]]></script>", "<script type=\"text/javascript\"><![CDATA[\nalert(\"Hello World\")\n]]></script>");
    

    Expected:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html><head></head><body><script type="text/javascript">//<![CDATA[
    
    alert("Hello World")
    
    //]]></script></body></html>
    

    Actual:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html><head></head><body><script type="text/javascript"><![CDATA[
    alert("Hello World")
    ]]></script></body></html>
    

    Other test:

    assertHTMLWithHeadContent("<style type=\"text/css\">/*<![CDATA[*/\na { color: red; }\n/*]]>*/</style>", "<style type=\"text/css\"><![CDATA[\na { color: red; }\n]]></style>");
    

    Expected:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html><head><style type="text/css">/*<![CDATA[*/
    a { color: red; }
    /*]]>*/</style></head><body></body></html>
    

    Actual:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html><head><style type="text/css"><![CDATA[
    a { color: red; }
    ]]></style></head><body></body></html>
    

    Thanks

     
  • Scott Wilson

    Scott Wilson - 2016-03-09

    Thanks Vincent - I'll check those out asap. Hopefully we can get a release out at the end of next week.

     
  • Vincent Massol

    Vincent Massol - 2016-03-14

    Thanks Scott, that would be awesome to have all our CDATA test pass with htmlcleaner. Thanks for your work

     
  • Haadar

    Haadar - 2016-03-16

    i'm not sure if this is related but the following snippet has missing closing cdata, and htmlcleaner takes html past the closing script tag as cdata.
    the expected behaviour is that the script tag has precedence of cdata.
    <script type="text/javascript">
    / <![CDATA[ /
    function completeSearchEngine()
    {
    if (document.getElementById("form_search") && document.getElementById("keywords_field") && document.getElementById("submit_search")) {
    initAutoComplete(document.getElementById("form_search"),document.getElementById("keywords_field"),document.getElementById("submit_search"));
    }
    }
    $(
    function()
    {
    $("#form_search").submit(
    function(event)
    {
    if($("#keywords_field").val() == "")
    {
    event.preventDefault();
    }
    }
    )
    }
    );
    </script>

     
  • Scott Wilson

    Scott Wilson - 2016-08-17
    • Group: v2.16 --> v2.17
     
  • Scott Wilson

    Scott Wilson - 2016-10-19
    • Group: v2.17 --> v2.18
     
  • Scott Wilson

    Scott Wilson - 2017-02-06
    • Group: v2.18 --> v2.19
     
  • Scott Wilson

    Scott Wilson - 2017-02-06

    Fixed that last issue Haadar raised - its covered in bug #180.

     
1 2 3 > >> (Page 1 of 3)

Log in to post a comment.