Hi, on XWiki we've tried to upgrade from HC 2.16 to 2.22 and had to rollback during the release because of a regression we noticed:
Input:
<div foo="aaa"bbb&ccc>ddd<eee">content</div>
Generates:
<div foo="aaa"bbb&ccc>ddd<eee">content</div>
Whereas we were expecting the following (which is what we get with HC 2.16):
<div foo="aaa"bbb&ccc>ddd<eee">content</div>
Thanks!
For reference this is out issue on the XWiki side: https://jira.xwiki.org/browse/XCOMMONS-1164
Hi Vincent,
I've tried to replicate this in 2.22 but I don't get the same output. I see the same output you expect to find in your test, that the entities are preserved in the output attribute.
Hi Scott! You're as fast as usal :)
I would have thought about some XWiki-related thing but for the fact that keeping the same code and just changing the HC version in the POM changes the result so this clearly suggests a change in behavior.
Maybe HC between 2.16 and 2.22 changed some default config values?
Note that "setAllowHtmlInsideAttributes(false)" is the default and we're not setting it (I've set it to test and same result). FTR this is the config we have: https://github.com/xwiki/xwiki-commons/blob/d1657b0da41bc6fdca0cefd6e03b2a3705b17cec/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/DefaultHTMLCleaner.java#L211
Also we're using our own DOM Serializer () so maybe something has changed from what is returned by the clean() method like not escaping attribute values and SimpleHtmlSerializer has also been modified to do the escaping?
I think it relates to this discussion and the associated change (commit 521):
https://sourceforge.net/p/htmlcleaner/discussion/637246/thread/fc7f2d65/?limit=25#b65b
If you look at the last change to DomSerializer, this should probably be replicated in XWikiDomSerializer:
Thanks a lot Scott for this code/diff :)
I've tried it but it doesn't work though.
I'll need to look at the commit 521 a bit later. Thx
Ok I've done more tests:
aaa"bbb&ccc>ddd<eeeaaa"bbb&ccc>ddd<eeeAny idea?
Thanks
Yes, HC will deserialize entities in attribute text when creating tagnodes, and the serializer will then apply the appropriate rules for entities depending on how its configured. (This is a change from earlier versions of HC which passed entities in attributes through untouched, and then applied escaping - there was a use case from a user for this change).
So when I inspect the foo attribute for the DIV Tagnode,, I see as you do:
aaa"bbb&ccc>ddd<eeeBut when I inspect the attribute value of the DIV element in the Document created by DomSerializer I see:
aaa"bbb&ccc>ddd<eeeHowever, if I initialize the DomSerializer with escapeXml=false then I see:
aaa"bbb&ccc>ddd<eeeThis is because of this line in DomSerializer:
So I suspect thats the issue? Is there a reason that escapeXml is turned off in XWiki?
So a fix could be to include this line in XWikiDomSerializer:
... or to initalise the DomSerializer without setting escapeXml=false.
Does that help?
Yes it helps,, thanks. I hadn't realized we were initializing the XWikiDOMSerialize with escapeXml=false (had forgotten).
I tried putting it to true and indeed it fixes this issue, but breaks 3 other tests (which is why it was off before ;)):
Problem 1
Generates
<html><head></head><body><p>É</p></body></html>instead of<html><head></head><body><p>É</p></body></html>.Problem 2
Generates
<html><head></head><body><p>"&**notbold**<notag> </p></body></html>instead of<html><head></head><body><p>"&**notbold**<notag> </p></body></html>Problem 3
Generates
<html><head></head><body><pre>alert("foo")</pre></body></html>instead of<html><head></head><body><pre>alert("foo")</pre></body></html>.Do these tests seem valid to you?
Thx
And I believe you're confirming that the cleaner has changed behavior since 2.16 and that it's normal and wanted that it doesn't escape attribute values any more, correct?
Note that if I use DOMSerializer instead of XWikiDOMSerializer I also get these 3 problems (and 2 more failing tests related to CDATA).
Yes that's correct - its a change since 2.21.
Some of the other tests you have there relate to preserving entities rather than replacing them with unicode; I think thats something that should be set in cleaner properties rather than by turning off XML escaping.
... for example, if you want to retain the
Éentity rather than replace it with the unicode character É, you'd settranslateSpecialEntities=false.Yes thanks for that hint, that fixes Problem 1 (and Problem 2 for some reason).
I now have Problem 3 + a new one:
Problem 4
This time it's the second assert that fails with
<html><head></head><body><p>"&</p></body></html>instead of<html><head></head><body><p>"&</p></body></html>Thanks a lot for your help, much appreciated. I haven't touched this code for a while and I'm a bit rusty on it and I don't remember a lot of things....
Yes, it gets pretty complicated when you want some entities and not others...
AFAIR what we wanted was XHTML so it's normal to escape ampersand (since it's one of the 5 special characters in XML) but the quote doesn't need to be escaped in XHTML.
So are you saying there's no way to escape only what's needed for XHTML and translateSpecialEntities=true will escape more than needed?
Last edit: Vincent Massol 2018-06-21
Hmm, SimpleHtmlSerializer is a bit aggressive at replacing characters with their entity equivalents. Its because the same code is used for escaping content in attributes (where we need to be strict) and elements (where convention is to be relatively loose).
Its all in the serializer as far as I can tell - the actual TagNode tree has the 'right' information, I think the unwanted behaviour is in the serializer.
Strictly speaking the serialization is correct with regards to the validity of the output, its just being a little overzealous in its use of
".1) I'm also getting this now:
whereas before, I had:
(

is a line feed)Does that seem normal to you?
2) Should I open an issue for the overzealous quote escape?
Thanks
For (1) do you mean that, where before you had a line feed character as a unicode entity, you are now getting a standard line feed character?
For (2) yes I think it needs its own bug report. Its not an error as such, but its not what many users expect from a tool such as HC.
Yes correct.
When RecognizeUnicodeCharacters is set to True, HC will replace any unicode entities with unicode characters. If set to false, it'll leave the unicode entity references in place. (Earlier releases of HC were less consistent in applying this.)
You may want to check if this causes any issues in further processing, but any XML/XHTML processing should treat the newline character and the newline entity reference as being equivalent.
I've gone ahead yesterday and applied the HC 2.22 upgrade with the change of tests. I'm crossing fingers now since it's hard to foresee what the behavior changes will lead to.
Thx
Fingers crossed here too!
I have this test for the 2.23 snapshot based on the original bug report, which is now passing: