Hi there (Hi Scott! :))
I'm trying to eliminate the need for a custom DomSerializer in the XWiki project, see https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java
When I try to use the DomSerializer from HTML Cleaner, I have the following errors when running the XWiki test suite for CDATA.
Input:
<script type="text/javascript"><![CDATA[ alert("Hello World") ]]></script>
Expected:
<script type="text/javascript">//<![CDATA[ alert("Hello World") //]]></script>
Actual:
<html><head></head><body><script type="text/javascript"><![CDATA[ alert("Hello World") ]]></script></body></html>
Rationale: Generate a javascript comment in front on the CDATA block so that it works in IE and when serving XHTML under a mimetype of HTML.
Some other test related to this use case:
Input:
<script type="text/javascript">//<![CDATA[ alert("Hello World") //]]></script>
Expected:
<script type="text/javascript">//<![CDATA[ alert("Hello World") //]]></script>
Rationale: Verify that // are kept.
Input:
<script type="text/javascript">/*<![CDATA[*/ alert("Hello World") /*]]>*/</script>
Expected:
<script type="text/javascript">//<![CDATA[ alert("Hello World") //]]></script>
Rationale: normalize the JS comment
Input:
<script type="text/javascript"> /*<![CDATA[*/ function escapeForXML(origtext) { return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;') .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;') .replace(/\"/g,'&'+'quot;');} /*]]>*/ </script>
Expected:
<script type="text/javascript"> //<![CDATA[ function escapeForXML(origtext) { return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;') .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;') .replace(/\"/g,'&'+'quot;');} //]]> </script>
Rationale: Same as above on a more complex use case.
Input:
"<script><></script>
Expected:
<script>//<![CDATA[<>//]]></script>
Actual:
<script><![CDATA[<>]]></script>
Rationale: Same as above, generate JS comments
Input:
<script><></script>
Expected:
<script>//<![CDATA[<>//]]></script>
Actual:
<script><![CDATA[<>]]></script>
Input:
<style type="text/css"><![CDATA[ a { color: red; } ]]></style>
Expected:
<style type="text/css">//<![CDATA[ a { color: red; } //]]></style>
Actual:
<style type="text/css"><![CDATA[ a { color: red; } ]]></style>
It would be so awesome if HTML Cleaner could support those use cases :)
WDYT? Would you agree to support those use cases?
Note that the implementation is in https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java and you could copy paste that code if you were to agree to support it. However you'd probably want to make it conditional to setting some configuration options.
Thanks a lot
Note: XWikiDOMSerialized has now been moved to the org.htmlcleaner package to circumvent https://sourceforge.net/p/htmlcleaner/bugs/167/
New location for source code: https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/htmlcleaner/XWikiDOMSerializer.java
Hi Vincent!
Yes, looks like a reasonable set of use cases; safe-commenting of CDATA is supported in other serializers after all.
cool!
Just had a look at this - the current implementation in HC uses the JDOM2 native CData class to represent any CData nodes. It doesn't inject any comments as this is left to whatever you use to output the Document. That seems like the right kind of behaviour, but it doesn't seem to do what you want?
Right, some more digging...
Basically yes, if you have CDATA section in the source HTML, HC will generate an org.jdom2.CDATA object for it in the output Document.
When you then use JDom's XMLOutputter, this will render the content of the object within non-commented CDATA markers.
My first thought was there had to be an option in XMLOutputter to make those markers commented out for safe use within XHTML. Nope, can't see one.
My second thought was to override XMLOutputter.outputCData to use safe markers. Nope; XMLOutputter is marked as Final for some reason :(
So I'm left with the same implementation option as used in XWiki - stick in some Text nodes either side of the CDATA to wrap it in comments. It feels a bit icky to do it this way, but JDOM doesn't seem to have a "safe-CDATA" option.
Thanks for looking into this Scott! Yes, I couldn't find another solution either at the time...
OK, I've checked some changes into Trunk that seem to cover the test cases, apart from this one:
Input:
Expected:
Actual:
Regarding this last test case, you want to support it right (it's just not done yet)?
I'm undecided to be honest; it assumes that text content is processed for entity resolution before a CDATA section is created around it.
I probably don't understand the issue. At first glance, if you ask for useCdataFor="script,style" then I don't see why you wouldn't have a CDATA portion inside the the script tag.
Oh I think I understand. The question is not whether to have CDATA or not. It's whether & lt; & gt; is evaluated first or not. Well that's easy. Entities are not valid inside the script block. Just tried in chome and I get an error in the javascript console.
Last edit: Vincent Massol 2016-02-23
The CDATA section markers (and the comments around them) aren't an issue - its why I would convert
<
into<
in this contextSee http://stackoverflow.com/a/4227942
There is a general way to do this, which is to use
This will deserialize XML entities on input; they'll then be escaped (unless they are in script tags or other special locations) on output.
May be too blunt an instrument though; I suspect your intention is to only deserialise XML entities within script tags when we apply a CDATA wrapper on output.
However, even then we can get edge cases like:
... which would be rendered incorrectly by applying either of these approaches.
If the original content of the script tag is invalid, I'm not sure there's much you can do. I'd simply surround its content with CDATA when useCdataFor is used.
Well, at the moment all the other tests pass OK; if you need to handle deserialisation of entities within script tags that can still be done without needing to subclass JDomSerializer. I've committed the changes to trunk, so you can test it out if you build a snapshot. I'll get another release out in March.
I've rebuilt and tested trunk without making any modification to xwiki (ie I still have our custom XWikiDOMSerializer). I have one test failing:
Expected:
Actual:
I'll now remove our XWikiDOMSerializer and see what I get.
Last edit: Vincent Massol 2016-03-03
Ok after removing our custom XWikiDOMSerializer I get several failing tests. Just to list 2:
Expected:
Actual:
Other test:
Expected:
Actual:
Thanks
Thanks Vincent - I'll check those out asap. Hopefully we can get a release out at the end of next week.
Thanks Scott, that would be awesome to have all our CDATA test pass with htmlcleaner. Thanks for your work
i'm not sure if this is related but the following snippet has missing closing cdata, and htmlcleaner takes html past the closing script tag as cdata.
the expected behaviour is that the script tag has precedence of cdata.
<script type="text/javascript">
/ <![CDATA[ /
function completeSearchEngine()
{
if (document.getElementById("form_search") && document.getElementById("keywords_field") && document.getElementById("submit_search")) {
initAutoComplete(document.getElementById("form_search"),document.getElementById("keywords_field"),document.getElementById("submit_search"));
}
}
$(
function()
{
$("#form_search").submit(
function(event)
{
if($("#keywords_field").val() == "")
{
event.preventDefault();
}
}
)
}
);
</script>
Fixed that last issue Haadar raised - its covered in bug #180.