
#194 Some HTML entities get "broken"

v2.21
closed
nobody
None
5
2018-04-24
2017-10-05
No

Many HTML entities and properties are preserved, but at least all entities representing uppercase accented characters are escaped, as if they were not recognized as entities.

For example

É

ends up cleaned as

É

which is really not nice.

What is weird is that I can see Eacute listed in the SpecialEntities class.

Discussion

  • Thomas Mortagne

    Thomas Mortagne - 2017-10-05

    I meant

    For example

    <p>&Eacute;</p>

    ends up cleaned as

    <p>&amp;Eacute;</p>

     
  • Scott Wilson

    Scott Wilson - 2017-10-05

    Hi Thomas,

    Making the various character encoding and escaping options easier to use has been on the "to-do" list for some time, as it's not particularly clear how you get the desired outcome.

    It basically depends on (1) the options for the cleaner and (2) what kind of serializer you are using. (This is important as special entities are very different between XML and HTML.)

    If you use the following settings:

        setAdvancedXmlEscape(true);
        setTranslateSpecialEntities(false);
    

    Then using your example, with SimpleHtmlSerializer, the output is the same as the input - the entity reference is preserved with no escaping.

    However, if you set:

        setAdvancedXmlEscape(false);
        setTranslateSpecialEntities(false);
    

    Then the HtmlSerializer output also ends up with the entity treated as text, and the ampersand escaped. You usually don't want to do that!

    For the XmlSerializer, you only have the choice of either having the entity replaced with its Unicode character or escaping the ampersand, as the entity reference isn't valid in XML.

    Hope that makes sense!

     
  • Thomas Mortagne

    Thomas Mortagne - 2017-10-05

    Hmm, actually forget that. I'm looking at the serializer code and it might come from a weird hack it's doing.

     
  • Thomas Mortagne

    Thomas Mortagne - 2017-10-05

    Yep, I confirm you can close this issue. Sorry for the noise...

     
  • Scott Wilson

    Scott Wilson - 2018-04-24
    • status: open --> closed
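
For readers trying to picture the three outcomes described in the discussion above, here is a minimal, self-contained sketch of the idea. It is not HtmlCleaner's code and does not call its API: the class name, the `Mode` enum, and the tiny entity table are all hypothetical, chosen only for illustration. Roughly, `PRESERVE_ENTITIES` mimics the result reported for `setAdvancedXmlEscape(true)` with an HTML serializer, `ESCAPE_ALL` mimics the `&amp;Eacute;` result, and `TRANSLATE_TO_UNICODE` mimics the XML-side option of replacing the entity with its character.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a toy model of the escaping choices, not the library's implementation.
final class EntityEscapeSketch {

    // Hypothetical mode names; HtmlCleaner exposes these choices through its own properties.
    enum Mode { PRESERVE_ENTITIES, ESCAPE_ALL, TRANSLATE_TO_UNICODE }

    // Tiny assumed table; a real one would cover every HTML named entity.
    private static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("Eacute", "\u00C9");
        KNOWN.put("eacute", "\u00E9");
        KNOWN.put("Agrave", "\u00C0");
    }

    // An ampersand, optionally followed by something shaped like "name;".
    private static final Pattern AMP = Pattern.compile("&(?:([A-Za-z][A-Za-z0-9]*);)?");

    static String escapeText(String text, Mode mode) {
        Matcher m = AMP.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            boolean known = name != null && KNOWN.containsKey(name);
            String replacement;
            if (known && mode == Mode.PRESERVE_ENTITIES) {
                replacement = m.group();                 // keep the reference untouched
            } else if (known && mode == Mode.TRANSLATE_TO_UNICODE) {
                replacement = KNOWN.get(name);           // swap in the actual character
            } else {
                // Unrecognized reference, bare ampersand, or ESCAPE_ALL: escape the '&' only.
                replacement = "&amp;" + m.group().substring(1);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeText("&Eacute;", Mode.PRESERVE_ENTITIES)); // &Eacute;
        System.out.println(escapeText("&Eacute;", Mode.ESCAPE_ALL));        // &amp;Eacute;
        // Prints the single character U+00C9 (rendering depends on console encoding).
        System.out.println(escapeText("&Eacute;", Mode.TRANSLATE_TO_UNICODE));
    }
}
```

The point of the sketch is that the same text node can legitimately come out three different ways, so an escaped `&amp;Eacute;` is a configuration outcome rather than a sign that the entity is unknown.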
     
