I've modified the tokenizer so that it no longer breaks up content within CDATA sections when it encounters a '<' or '>' - so a content section containing CDATA will end up as a single ContentNode rather than multiple when the enclosed CDATA has those characters in it. I've also merged in the improvements to DomSerializer.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Running org.xwiki.xml.internal.html.DefaultHTMLCleanerTestscriptAndCData(org.xwiki.xml.internal.html.DefaultHTMLCleanerTest) Time elapsed: 0.031 sec <<< ERROR!org.jdom.IllegalDataException: The data "//<![CDATA[alert("Hello World")// ]]>" is not legal for a JDOM CDATA section: CDATA cannot internally contain a CDATA ending delimiter (]]>). at org.jdom.CDATA.setText(CDATA.java:121) at org.jdom.CDATA.<init>(CDATA.java:96) at org.jdom.DefaultJDOMFactory.cdata(DefaultJDOMFactory.java:97) at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:373) at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:360) at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:360) at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:360) at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:173) at org.jdom.input.DOMBuilder.build(DOMBuilder.java:138) at org.xwiki.xml.html.HTMLUtils.toString(HTMLUtils.java:194) at org.xwiki.xml.html.HTMLUtils.toString(HTMLUtils.java:180) at org.xwiki.xml.internal.html.DefaultHTMLCleanerTest.assertHTML(DefaultHTMLCleanerTest.java:279) at org.xwiki.xml.internal.html.DefaultHTMLCleanerTest.scriptAndCData(DefaultHTMLCleanerTest.java:187) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.xwiki.test.ComponentManagerRule$1.evaluate(ComponentManagerRule.java:44) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray2(ReflectionUtils.java:208) at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:158) at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:86) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:95)
Ah got it, the content nodes contain CDATA start/end tokens
I've created a CData token class so we can process them separately from regular Content Node instances and treat them differently in the serializers - lets see if that works out better
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
OK, I've committed the latest changes. I've modified the HtmlTokenizer to recognise CData sections and create a CData instance rather than a ContentNode instance. These can then be output separately from regular ContentNodes, with or without safe start/end tokens. Give it a go, and see how it breaks this time ;)
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Ah OK. So in this case you're expecting there to be a default DOCTYPE added where there is none specified in the input?
I've also had a further look into this... part of the problem is that XMLSerializer tries to wrap the whole contents of a <script> tag in CDATA declarations - even if it contains child CDATA. However, nested CDATA is not valid! </p>
<p>One solution is to individually wrap ContentNode instances inside <script> and <style> with CDATA start and end tokens, even though this results in some pretty ugly (but valid) results. Another is to try and cleverly collapse it into a single set of CDATA tokens around the whole of the content without any nesting (even where it exists in the input). </p></script>
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
sigh the new SF site (aka Apache Allura) is not without its bugs either
OK, what I was on about was that at present XmlSerializer wraps the content of <script> and <style> in CDATA tokens, even if that includes CDATA content (and CDATA cannot be nested). Anyway, I think I'm making progress at fixing this gnarly problem - thanks for the test cases, they're a big help.</p></script>
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
See also https://sourceforge.net/tracker/?func=detail&aid=2761963&group_id=183053&atid=903696
One problem is in HtmlTokenizer, in content() which breaks if the "<" char is found. However when inside a CDATA block this shouldn't happen.
Example:
"<script type=\"text/javascript\">\n"
+ "// <![CDATA[\n"
+ "function escapeForXML(origtext) {\n"
+ " return origtext.replace(/\\&/g,'&'+'amp;').replace(/</g,'&'+'lt;')\n"
+ " .replace(/>/g,'&'+'gt;').replace(/\'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');"
+ "}\n"
+ "// ]]>\n"
+ "</script>"
This generates 3 ContentToken but it should generate only one.
Then the Serializer fail since it creates 3 CDATA sections if useCData is true:
} else if (item instanceof ContentToken) {
String nodeName = element.getNodeName();
ContentToken contentToken = (ContentToken) item;
String content = contentToken.getContent();
boolean specialCase =
props.isUseCdataForScriptAndStyle()
&& ("script".equalsIgnoreCase(nodeName) || "style".equalsIgnoreCase(nodeName));
if (escapeXml && !specialCase) {
content = Utils.escapeXml(content, props, true);
}
element.appendChild(specialCase ? document.createCDATASection(content) : document
.createTextNode(content));
It'll also fail if useCdata is set to true and there are any "<" char in the script element.
It would really be cool if HTMLCleaner could fix CDATA handling in DomSerializer for 2.6 :)
FYI here's the code that makes it work nicely:
https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java
It would be even cooler if someone submitted a patch :)
This is what I've done Scott. I've provided the full solution! You just need to replace your DOMSerializer by this one:
https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java
:)
Thanks Vincent - and sorry for the long delay in replying :)
We'll definitely look at merging in the improved DomSerializer from XWiki
Thanks Scott. I'm eagerly waiting for it :)
Merge in for 2.7
I've modified the tokenizer so that it no longer breaks up content within CDATA sections when it encounters a '<' or '>' - so a content section containing CDATA will end up as a single ContentNode rather than multiple when the enclosed CDATA has those characters in it. I've also merged in the improvements to DomSerializer.
woohoo :)
Heheh, don't celebrate too soon :)
I've not merged in one part of the improvements as I'm not entirely sure how they'll play out:
(HC 2.7-SNAPSHOT)
vs.
(XWIKI)
Though as content nodes are no longer being split, maybe this isn't such a big deal?
Ok I've just tested it on our project at https://github.com/xwiki/xwiki-commons/tree/master/xwiki-commons-core/xwiki-commons-xml and unfortunately it fails :(
This test is:
Ah got it, the content nodes contain CDATA start/end tokens
I've created a CData token class so we can process them separately from regular Content Node instances and treat them differently in the serializers - lets see if that works out better
OK, I've committed the latest changes. I've modified the HtmlTokenizer to recognise CData sections and create a CData instance rather than a ContentNode instance. These can then be output separately from regular ContentNodes, with or without safe start/end tokens. Give it a go, and see how it breaks this time ;)
(oh, and thank you very much Vincent for your patience in helping me fix this)
I've retested with your latest changes and it's still not good. Actually it seems you've brought another regression to https://sourceforge.net/tracker/index.php?func=detail&aid=2062318&group_id=183053&atid=903696 which was working before your last changes.
So right now I get the following 2 problems:
* missing DOCTYPE
* removed comments
screenshot of diff: https://www.evernote.com/shard/s119/sh/62453483-dc9e-4dcd-85e6-76e343d88990/b4c98a06f8a68a23eab51b85db97a685
Thanks
OK thats odd; the DOCTYPE is correctly picked up by HtmlTokenizer but it isn't in the serialization output even with OmitDoctypeDeclaration==false.
(Adding more test cases to try and pick up these kind of regressions...)
I've added another test for this one - I get slightly different results:
This is my test:
and assertHTML is just:
~~~~~~
Ah OK. So in this case you're expecting there to be a default DOCTYPE added where there is none specified in the input?
I've also had a further look into this... part of the problem is that XMLSerializer tries to wrap the whole contents of a <script> tag in CDATA declarations - even if it contains child CDATA. However, nested CDATA is not valid! </p> <p>One solution is to individually wrap ContentNode instances inside <script> and <style> with CDATA start and end tokens, even though this results in some pretty ugly (but valid) results. Another is to try and cleverly collapse it into a single set of CDATA tokens around the whole of the content without any nesting (even where it exists in the input). </p></script>
missing end of sentence in your last comment :)
sigh the new SF site (aka Apache Allura) is not without its bugs either
OK, what I was on about was that at present XmlSerializer wraps the content of <script> and <style> in CDATA tokens, even if that includes CDATA content (and CDATA cannot be nested). Anyway, I think I'm making progress at fixing this gnarly problem - thanks for the test cases, they're a big help.</p></script>