Many HTML entities and properties are preserved, but at least all entities representing uppercase accented characters are escaped as if they were not recognized as entities.
For example
É
ends up cleaned as
É
which is really not nice.
What is weird is that I can see Eacute listed in the SpecialEntities class.
I meant
For example
<p>&Eacute;</p>
ends up cleaned as
<p>&amp;Eacute;</p>
Hi Thomas,
Making the various character encoding and escaping options easier to use has been on the "to-do" list for some time, as it's not particularly clear how you get the desired outcome.
It basically depends on (1) the options for the cleaner and (2) what kind of serializer you are using. (This is important as special entities are very different between XML and HTML.)
If you use the following settings:
Then using your example, with SimpleHtmlSerializer, the output is the same as the input - the entity reference is preserved with no escaping.
However, if you set:
Then the HtmlSerializer output also ends up with the entity treated as text, and the ampersand escaped. You usually don't want to do that!
For the XmlSerializer, you only have the choice of either having the entity replaced with its Unicode character or escaping the ampersand, as the entity reference isn't valid in XML.
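To make the three outcomes above concrete, here is a minimal, self-contained sketch in plain Java. It is not HtmlCleaner's code and does not use its API: the class EntityHandlingSketch, the Mode enum and the two-entry entity table are all hypothetical names invented for illustration. It only shows what "preserved", "ampersand escaped" and "replaced with Unicode" mean for a named entity in text content.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of HtmlCleaner.
final class EntityHandlingSketch {

    // The three outcomes discussed in the thread.
    enum Mode { PRESERVE, ESCAPE_AMPERSAND, TO_UNICODE }

    // Tiny assumed entity table; a real cleaner knows the full HTML set.
    private static final Map<String, Integer> KNOWN =
            Map.of("Eacute", 0xC9, "eacute", 0xE9);

    // Matches named references such as &Eacute; (numeric ones are ignored here).
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String process(String text, Mode mode) {
        StringBuilder out = new StringBuilder();
        Matcher m = ENTITY.matcher(text);
        int last = 0;
        while (m.find()) {
            // Ampersands that are not part of a named reference are always escaped.
            out.append(text.substring(last, m.start()).replace("&", "&amp;"));
            String name = m.group(1);
            Integer codePoint = KNOWN.get(name);
            if (codePoint == null || mode == Mode.ESCAPE_AMPERSAND) {
                // Unrecognized entity, or entity treated as plain text.
                out.append("&amp;").append(name).append(';');
            } else if (mode == Mode.TO_UNICODE) {
                // Replace the reference with the character it names.
                out.appendCodePoint(codePoint);
            } else {
                // PRESERVE: emit the reference unchanged.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(text.substring(last).replace("&", "&amp;"));
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "&Eacute;t&eacute;";
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": " + process(input, mode));
        }
    }
}
```

In this sketch an HTML-style serializer would correspond to PRESERVE (or ESCAPE_AMPERSAND when the entity is treated as text), while an XML-style serializer can only use ESCAPE_AMPERSAND or TO_UNICODE.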
Hope that makes sense!
You can see how the cleaner is configured to get this behavior at https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/DefaultHTMLCleaner.java#L209
Neither setAdvancedXmlEscape nor setTranslateSpecialEntities is called.
What makes this look a lot like a bug to me is that, whatever the configuration, it does not make much sense to have &eacute; properly pass through and not &Eacute;.
Hmm, actually forget that: I'm looking at the serializer code, and it might come from a weird hack it's doing.
Yep I confirm you can close this issue. Sorry for the noise...