Some HTML entities get "broken"

Brought to you by: patmoore, scottwilson, vnikic

#194 Some HTML entities get "broken"

Milestone: v2.21

Status: closed

Owner: nobody

Labels: None

Priority: 5

Updated: 2018-04-24

Created: 2017-10-05

Creator: Thomas Mortagne

Private: No

Many HTML entities and properties are preserved, but at least all entities representing uppercase accented characters are escaped as if they were not recognized as entities.

For example

É

ends up cleaned as

É

which is really not nice.

What is weird is that I can see Eacute listed in the SpecialEntities class.

Discussion

Thomas Mortagne - 2017-10-05

I meant

For example

<p>&Eacute;</p>

ends up cleaned as

<p>&amp;Eacute;</p>


Scott Wilson - 2017-10-05

Hi Thomas,

Making the various character encoding and escaping options easier to use has been on the "to-do" list for some time, as it's not particularly clear how to get the desired outcome.

It basically depends on (1) the options for the cleaner and (2) what kind of serializer you are using. (This is important as special entities are very different between XML and HTML.)

If you use the following settings:

setAdvancedXmlEscape(true); setTranslateSpecialEntities(false);

Then using your example, with SimpleHtmlSerializer, the output is the same as the input - the entity reference is preserved with no escaping.

However, if you set:

setAdvancedXmlEscape(false); setTranslateSpecialEntities(false);

Then the HtmlSerializer output also ends up with the entity treated as text, and the ampersand escaped. You usually don't want to do that!

For the XmlSerializer, you only have the choice of either having the entity replaced with its Unicode character or escaping the ampersand, as the entity reference isn't valid in XML.
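To make the three outcomes above concrete, here is a rough, library-independent sketch. It is not HtmlCleaner's code and does not use its API: the class `EntityEscapeSketch`, the `Mode` enum, and the two-entry entity table are all hypothetical names chosen for illustration, and the real serializers handle far more cases.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of HtmlCleaner.
class EntityEscapeSketch {
    // Tiny stand-in table; HTML defines many more named entities.
    private static final Map<String, Integer> KNOWN =
            Map.of("Eacute", 0xC9, "eacute", 0xE9);

    // An ampersand, optionally followed by "name;".
    private static final Pattern AMP =
            Pattern.compile("&(?:([A-Za-z][A-Za-z0-9]*);)?");

    enum Mode {
        PRESERVE,          // keep a recognized entity reference as-is
        ESCAPE_AMPERSAND,  // treat the entity as text: "&" becomes "&amp;"
        TO_UNICODE         // replace a recognized entity with its character
    }

    static String process(String text, Mode mode) {
        Matcher m = AMP.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            boolean recognized = name != null && KNOWN.containsKey(name);
            String replacement;
            if (recognized && mode == Mode.PRESERVE) {
                replacement = m.group();
            } else if (recognized && mode == Mode.TO_UNICODE) {
                replacement = new String(Character.toChars(KNOWN.get(name)));
            } else {
                // Unrecognized reference, bare "&", or escaping requested.
                replacement = "&amp;" + m.group().substring(1);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "<p>&Eacute;</p>";
        System.out.println(process(input, Mode.PRESERVE));         // <p>&Eacute;</p>
        System.out.println(process(input, Mode.ESCAPE_AMPERSAND)); // <p>&amp;Eacute;</p>
        System.out.println(process(input, Mode.TO_UNICODE));       // <p>\u00C9</p> (the character itself)
    }
}
```

The second mode is the one that produces the "broken" output reported in this ticket; the first corresponds to what you normally want for HTML output, and the other two are the only choices that yield valid XML.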

Hope that makes sense!

Thomas Mortagne - 2017-10-05

You can see how the cleaner is configured to get this behavior on https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/DefaultHTMLCleaner.java#L209

Neither setAdvancedXmlEscape nor setTranslateSpecialEntities is called.

What makes this look a lot like a bug to me is that, whatever the configuration, it does not make much sense for &eacute; to pass through properly but not &Eacute;.


Thomas Mortagne - 2017-10-05

Hmm, actually forget that. I'm looking at the serializer code, and it might come from a weird hack it's doing.


Thomas Mortagne - 2017-10-05

Yep, I confirm you can close this issue. Sorry for the noise...


Scott Wilson - 2018-04-24

status: open --> closed
