Hi there (Hi Scott! :))
I'm trying to eliminate the need for a custom DomSerializer in the XWiki project, see https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java
When I try to use the DomSerializer from HTML Cleaner, I have the following errors when running the XWiki test suite for CDATA.
Input:
<script type="text/javascript"><![CDATA[ alert("Hello World") ]]></script>
Expected:
<script type="text/javascript">//<![CDATA[ alert("Hello World") //]]></script>
Actual:
<html><head></head><body><script type="text/javascript"><![CDATA[ alert("Hello World") ]]></script></body></html>
Rationale: Generate a javascript comment in front on the CDATA block so that it works in IE and when serving XHTML under a mimetype of HTML.
Some other test related to this use case:
Input:
<script type="text/javascript">//<![CDATA[ alert("Hello World") //]]></script>
Expected:
<script type="text/javascript">//<![CDATA[ alert("Hello World") //]]></script>
Rationale: Verify that // are kept.
Input:
<script type="text/javascript">/*<![CDATA[*/ alert("Hello World") /*]]>*/</script>
Expected:
<script type="text/javascript">//<![CDATA[ alert("Hello World") //]]></script>
Rationale: normalize the JS comment
Input:
<script type="text/javascript"> /*<![CDATA[*/ function escapeForXML(origtext) { return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;') .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;') .replace(/\"/g,'&'+'quot;');} /*]]>*/ </script>
Expected:
<script type="text/javascript"> //<![CDATA[ function escapeForXML(origtext) { return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;') .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;') .replace(/\"/g,'&'+'quot;');} //]]> </script>
Rationale: Same as above on a more complex use case.
Input:
"<script><></script>
Expected:
<script>//<![CDATA[<>//]]></script>
Actual:
<script><![CDATA[<>]]></script>
Rationale: Same as above, generate JS comments
Input:
<script><></script>
Expected:
<script>//<![CDATA[<>//]]></script>
Actual:
<script><![CDATA[<>]]></script>
Input:
<style type="text/css"><![CDATA[ a { color: red; } ]]></style>
Expected:
<style type="text/css">//<![CDATA[ a { color: red; } //]]></style>
Actual:
<style type="text/css"><![CDATA[ a { color: red; } ]]></style>
It would be so awesome if HTML Cleaner could support those use cases :)
WDYT? Would you agree to support those use cases?
Note that the implementation is in https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java and you could copy paste that code if you were to agree to support it. However you'd probably want to make it conditional to setting some configuration options.
Thanks a lot
Awesome Scott! I see you've fixed several issues I had raised thanks a lot for that. Eagerly waiting for 2.19 to be out to test it.
Hi Scott. Thanks for working on those. I've upgraded to HC 2.19 and unfortunately I still have the same problem as reported in the comments:
https://sourceforge.net/p/htmlcleaner/bugs/169/#c7e0
https://sourceforge.net/p/htmlcleaner/bugs/169/#6aa0
Also, AFAICS, bug #180 is quite different.
Could it be that this issue was closed by mistake?
Thanks
-Vincent
PS: Unfortunately,because of this issue we haven't been able to upgrade HC on the xwiki project since version 2.16 (this is the last issue that worked for us). After 2.16 a regression was introduced that fails our test, leasding to https://sourceforge.net/p/htmlcleaner/bugs/169/#c7e0 (ie removal of script comments).
Hmm, I'm sure those tests are in the regular HC test suite. I'll take another look
OK, so we have 3 XWiki tests potentially failing here:
The differences are:
In test 1:
1. HC doesn't add any extra newlines before/after CData tokens where they already exist
2. HC always uses block comment style
("/* */")
rather than inline comment style("//")
3. HC doesn't always preserve JS comments where they could also be safe-commenting a CData token. Thats because "//<![CDATA[" can be interpreted either as an inline-commented CData token or a JS comment followed by a CDATA token. HC uses the first interpretation.
In test 2, the only difference is the comments style (2)
Test 3 passes as-is.
So I think the thing that could be considered a potential bug is point 3 about interpreting comments before a CData token - the others are more to do with the style of output. I'll have a re-read of the HTML processor docs at W3 and see if that gives me a clue as to which is the correct interpretation!
Last edit: Scott Wilson 2017-02-07
Hmm the probem is that I really don't remember how I coded the XWikiDOMSerializer back then( https://github.com/xwiki/xwiki-commons/blob/a6976cfccfa4713dcb1ced7ee30cb28475d6f6fc/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/htmlcleaner/XWikiDOMSerializer.java#L57-L57). All I remember is that all was very carefully coded to work for all cases. For example we wanted to support IE.
For exampe if you check this commit in XWiki, you'll see that were using
/**/
but we had to move to use//
to make it work in IE: https://github.com/xwiki/xwiki-commons/commit/bb127b08f640598bfc46f0eba4aa0f5359128991#diff-662b1d8baf4677911842867688c409bfL183 (the issue is https://jira.xwiki.org/browse/XCOMMONS-612).Note that there are more XWiki tests failing but those were the ones using
<script>
(there are also the<style>
ones but it's possible that the issues are the same).I'm also worried that your point 3 (ie not always preserving JS comments) could break IE too but I'm not sure and I don't remember enough what IE expects.
Thanks
Would you also know if what is at https://github.com/xwiki/xwiki-commons/blob/a6976cfccfa4713dcb1ced7ee30cb28475d6f6fc/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/htmlcleaner/XWikiDOMSerializer.java#L265-L265 is now implemented in HC? Thanks
Its different as HC has a dedicated class for CData nodes, and the code for outputting a CData token is part of that class rather than in each serializer.
However I've made all the methods in DomSerializer either public or protected so if there's a specific bit of behaviour you need to change you only need to override createSubnodes().
For really old versions of IE you had to use this monstrosity:
However I think that all current (and previous generation) browsers support the
/* */
commenting style.Point 3 is one where I can't find a definitive rule in the XHTML, HTML and XML processing docs, as effectively this is a case where we're trying to use XML/XHTML-safe code when using the wrong content type :)
So lets see what modern browsers do.
Our input is:
Chrome gives us:
Firefox gives us:
HtmlCleaner gives us:
XWiki version of HtmlCleaner gives us:
So, I think not injecting the JS comment into the CDATA section is the correct behaviour.
Also, not adding extra newlines looks correct.
On the commenting style, I've used block comments to ensure we don't get a change in the script meaning from misplaced newlines, but it does seem like the consensus is retain the original commenting style where one is given (Chrome and FF also retain block style if thats in the original source).
So I think the consensus emerging here is that HC is correct on (1) and (3), but should look at altering its behaviour in (2) to try to retain the original commenting style.
Why do you say that HC seems to be correct for (3)? AFAIU (3) is about "HC doesn't always preserve JS comments where they could also be safe-commenting a CData token. " but the browser tests you've shown above didn't remove JS comments.
Regarding the extra new lines, I'm checking with Guillaume Delhumeau why he added that in https://github.com/xwiki/xwiki-commons/commit/bb127b08f640598bfc46f0eba4aa0f5359128991#diff-acb682688276ae941d59e302a0d34c12R141 :)
In each of the example cases above the CDATA token is preceded by a comment marker. However, in the XWiki example, there is an additional comment marker added after the CDATA token. This is what I mean by "preserving the JS comment" - the XWiki serializer takes the comment marker before CDATA as being separate from the CDATA token, and therefore ends up with two comment markers in the output, while all the other parsers treat the comment marker as being part of the CDATA token.
Trying to make this a bit clearer - it is difficult!
We have this input text:
We can view this semantically as either:
In the first case, we would move the inline comment token inside the CDATA section as its not valid outside, and then safe-comment the CDATA start token when serializing it. Because of this we effectively end up doubling the comment. This is what XWiki does currently:
In the second case, we treat the comment marker as just part of the CDATA token, and serialise it out. This is what Chrome, Firefox and HtmlCleaner do. (The only difference is that HtmlCleaner substitutes block-style commenting for inline-style commenting.):
or
/*<![CDATA[*/
ok so I guess we need to test your input above in IE. I've checked and XWiki supports only IE10+ from now on (http://dev.xwiki.org/xwiki/bin/view/Community/BrowserSupportStrategy).
Guillaume is testing that.
Hi Scott.
Testing with IE10 (XWiki now doesn't support older versions of IE):
See: https://jsfiddle.net/bw7xguhq/
It works without any error.
The same jsFiddle seems to work on IE9 (I can see "Hello World"), but actually I cannot see the fiddle content :/
Well thats odd. Still, at least it works...
Guillaume, what does IE10 gives (see https://sourceforge.net/p/htmlcleaner/bugs/169/?limit=25&page=1#ac39)?
Assuming you ask me to use the Web Developer tool of IE10, I get:
ok Scott, so I've slowly examined all the XWiki unit tests related to CData inside script or style tags and here are the problems I've noticed:
A) HC removes or doesn't add comments where necessary. HC should add
//
comments for script tags in front of CDATA (and/*
comments for style tags) as otherwise JS or CSS engines will choke on the CDATA element. This can be seen easily on https://jsfiddle.net/r1vkrktz/ (if you run it, you'll see that the "test" link is not in red and if you add the comments as on https://jsfiddle.net/7g64s784/ it works).B) HC is overzealous in some cases and adds several CDATA blocks when there should be only one suc block. Example input:
HC gives:
Also note the problem of ordering.
Other input exampl showing the same problem:
HC gives:
So in summary there are still 2 outstanding problems for CDATA handling IMO (including the fact that '//' comments should be used for script tags).
Do you agree? :)
Thanks
D'oh this is my fault. I've been testing using all the serializers apart from DomSerializer. It all works absolutely fine using HtmlSerializer and JDomSerializer, its just DomSerializer that does this..!
Right, duplicating all CDATA test cases using DomSerializer as well as the other serializers...
Thanks for looking into this Scott!
No problem! Hopefully now I know where the problem is, it shouldn't take a long time to fix.
OK, I've aligned all the serializers using the exact same unit tests for CDATA. If you can try a checkout and build from HEAD that should now give you a different result.
ok I've retested and it's much better :)
Remaining issues for XWiki:
//
instead of/**/
for script tags but this is not very important I think. Although I hope that there aren't any tool parsing results of XWiki rendering that would fai because we would change the comment style.createSubnodes()
. However I've just checkedcreateSubnodes()
and it has a lot of code inside. If I need to copy paste all this code in my extension then it's going to be painful to update when you make changes in HC in the future. Would there be a way to have a config option similar tosetDeserializeEntities()
but only for CDATA content? :) Second best option would be a way for me to extend more easily into DomSerializer to add this behavior.Thanks!