HtmlCleaner / Bugs / #118 HTML always translating special entities

Scott Wilson - 2014-05-19

Thanks Seanster,

I suspect it was related to #107. I've added a "don't mess with HTML entities when outputting HTML" branch to the Utils.escapeXml logic that seems to fix this issue.

(I'll close the issue when I also fix the missing command line options and help text)

Scott Wilson - 2014-05-19

status: open --> open-accepted

Group: v 2.8 --> v 2.9

Seanster - 2014-05-21

I should mention this scenario for apostrophes in case it's been missed.

original first, cleaned second:

<img class="displayS" onmouseover="display(this, '/images/site_images/product_usage.jpg');displayCaption(this, ' ')" src="/images/site_images/product_usage.jpg" alt=""/>
<img class="displayS" onmouseover="display(this, '/images/site_images/product_usage.jpg');displayCaption(this, ' ')" src="/images/site_images/product_usage.jpg" alt="" />

I've never figured out how to use anything other than the jar file. I'll try to run the current code and see what happens.

Seanster - 2014-05-21

(btw the content in the previous post was from v2.7)

Finally figured out building my own jar file. Here's what I get from v2.9 (as of now):

<img class="displayS" onmouseover="display(this, '/images/products.jpg');displayCaption(this, 'Â ')" src="/images/products.jpg" alt="" />

In addition to the apostrophe problem you can see the nonbreakable space is still converted and output literally.
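
A stray "Â" next to a space is the usual signature of a UTF-8 nonbreaking space being displayed as Latin-1. The following is only a plain-Java sketch of that effect, with no HtmlCleaner involved; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class NbspBytesDemo {
    // Hex dump of a string's UTF-8 encoding, bytes separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Re-read the UTF-8 bytes as ISO-8859-1, as a viewer with the wrong charset would.
    static String misreadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // a literal nonbreaking space
        System.out.println(utf8Hex(nbsp));          // the two bytes of the encoded character
        System.out.println(misreadAsLatin1(nbsp));  // shows as A-circumflex plus a space-like character
    }
}
```

If the serializer writes the literal character and the result is then viewed with the wrong charset, this is roughly what ends up on screen.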

Here's my command line options:

cmd_options = "outputtype=htmlcompact advancedxmlescape=true specialentities=false unicodechars=false omitcomments=true omitxmldecl=true omitdoctypedecl=true omithtmlenvelope=true useemptyelementtags=true"

This is my unit test - as you can see I set AdvancedXmlEscape to False:

// Assumes a test class with an HtmlCleaner field named `cleaner`.
@Test
public void entities() throws IOException {
    String input = "<html><head></head><body><p>&nbsp;&pound;</p></body></html>";
    // Expected output is identical to the input: named entities must survive untouched.
    String expected = "<html><head></head><body><p>&nbsp;&pound;</p></body></html>";
    cleaner.getProperties().setAdvancedXmlEscape(false);
    cleaner.getProperties().setAddNewlineToHeadAndBody(false);

    // Clean the markup, then serialize it back to HTML.
    TagNode cleaned = cleaner.clean(input);
    Serializer ser = new SimpleHtmlSerializer(cleaner.getProperties());
    StringWriter writer = new StringWriter();
    ser.serialize(cleaned, writer);
    assertEquals(expected, writer.toString());
}

Scott Wilson - 2014-05-22

On the apostrophes issue, the ' character and &apos; are equivalent in this context, but &apos; is generally considered the more correct way of doing it.

Seanster - 2014-05-22

Setting AdvancedXmlEscape to false seems to be the opposite of what I want to happen. If I turn it off then I get my nonbreaking space left intact (not sure why) but now there are &amp;'s all over the place:

raw first, cleaned second:

’
&#8217;

<param name="flashvars" value="file=/resources/videos/Ohms_Law/Ohms-Law.flv&autostart=True" />
<param name="flashvars" value="file=/resources/videos/Ohms_Law/Ohms-Law.flv&autostart=True" />

<div class="videoDiv" id="inline387_b" onclick="openVideo('#inline387_b', '#inline_example387', '668', '640', 'Demonstration of the Ohm\'s Law.');">
<div class="videoDiv" id="inline387_b" onclick="openVideo('#inline387_b', '#inline_example387', '668', '640', 'Demonstration of the Ohm\'s Law.');">

Am I misunderstanding the options? Let me know if it is me who is wrong and I'll just hack the source code for myself. I've been using v2.2 nearly forever because it would generally leave this stuff alone but it has other drawbacks.

Scott Wilson - 2014-05-23

Ah, I see your point - I've made another tweak to the method, give it a go now.

Seanster - 2014-05-23

With AdvancedXmlEscape true:

raw      cleaned
         [c2 a0]

With AdvancedXmlEscape false:

raw      cleaned
         &#160;

I'm looking at the escapeXml source trying to understand the logic and it appears as though it's not designed to do what I want it to do. I want it to recognize unicode so it can leave it alone (why I keep wanting AdvancedXmlEscape true). That doesn't seem to be an option in the code path.

The special 'else' case you added for html doesn't do the extra step of using convertToUnicode to find out if it's a special entity and leave it alone. I think that's why I'm getting the ampersands escaped.

I also think it would be cool if it could convert the escaped numeric special entities to html codes.

raw      cleaned
&#8217;  &rsquo;

I'll fiddle with the code tonight and see if I can get this to happen.

Seanster - 2014-05-24

I was successful with my changes. I'd like to keep this bug on topic, so I'll post those details in the forums, which are probably more appropriate for a discussion.

Scott Wilson - 2014-08-13

Group: v 2.9 --> v 2.10

Scott Wilson - 2014-10-31

Group: v 2.10 --> v 2.11

Seanster - 2014-10-31

Thanks for keeping this open. I have to go back and diff my edits and then re-learn how that horrible function operates so I can narrow down the relevant changes to submit back to you. I did get very good html out of it and I want to get that back into the codebase asap.

Seanster - 2014-11-26

Ok just pulled the latest svn version 377 and merged my older html changes and created a nice clean patch file as attached to this ticket. Just watch out for CRLF conversion, SVN doesn't seem set up for it and I have to strip CR's myself like a hillbilly.

In SpecialEntities.java I simply commented out the "apos" entry. I didn't want apostrophes touched in any way and this was the quickest way to do it. Unfortunately whoever added it there had the same mentality haha.

I don't have enough experience with the source code to be qualified to track down the special cases where the html serializer needs this gone. This will have to be done in order to support both html and xml output.

In Utils.java I modified the escapeXml function. I jump in when there's an ampersand found and isHtmlOutput is enabled.

If it's a special entity we leave it in place. We should skip the pointer ahead to the end of the entity before looping back to process input but that line is commented out. I'm not sure exactly how much needs to be skipped. I'm sure you'd know so maybe you can fix that.

If the second character is a hash # then we get the entity code and convert it to its common html name. I added a special function "convert_To_Entity_Name" just for this purpose. This is not optional as long as isHtmlOutput is enabled. It should be optional.

Otherwise we convert the ampersand to &amp;.

After all that, there's the situation where you check to see if the character itself is a unicode special entity. In this situation I added an html specific result:

isDomCreation ? code.getHtmlString() : code.getEscapedValue()

In the usual case, like a unicode nonbreaking space character, it is converted to something like "&nbsp;"

I think the whole "convert_To_Entity_Name" function was just a copy of convertToUnicode and bits and pieces stolen from wherever. I have no memory of writing it but it seems to work.
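
For readers without the patch at hand, the three ampersand cases described above can be sketched roughly as follows. This is not the code from the patch or from Utils.escapeXml: the class name, the method name and the tiny entity table are hypothetical stand-ins. Leaving an unknown numeric reference unchanged is an assumption as well, since the post does not say what happens in that case.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmpersandBranchSketch {
    // Illustrative table only; the real list lives in HtmlCleaner's SpecialEntities.
    static final Map<Integer, String> NAME_BY_CODE =
            Map.of(160, "nbsp", 163, "pound", 8211, "ndash", 8217, "rsquo");

    // A decimal numeric reference (group 1) or a named reference (group 2).
    static final Pattern REFERENCE =
            Pattern.compile("&(?:#(\\d{1,7})|([A-Za-z][A-Za-z0-9]*));");

    static String escapeAmpersands(String in) {
        StringBuilder out = new StringBuilder();
        Matcher m = REFERENCE.matcher(in);
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            // Does a complete reference start exactly at this ampersand?
            boolean isReference = m.region(i, in.length()).lookingAt();
            if (isReference && m.group(1) != null) {
                // Numeric reference: swap in the entity name when the code is known.
                String name = NAME_BY_CODE.get(Integer.parseInt(m.group(1)));
                out.append(name != null ? "&" + name + ";" : m.group());
                i = m.end(); // skip past the whole reference
            } else if (isReference && NAME_BY_CODE.containsValue(m.group(2))) {
                // Known named entity: leave it in place.
                out.append(m.group());
                i = m.end();
            } else {
                // Anything else is a bare ampersand and gets escaped.
                out.append("&amp;");
                i++;
            }
        }
        return out.toString();
    }
}
```

Jumping to `m.end()` after a recognised reference is one way to do the "skip the pointer ahead" step; the sketch simply moves past the closing semicolon.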

htmlcleaner_118.patch

Scott Wilson - 2015-01-19

Eclipse really, really hates your patch file :)

How about you just attach Utils.java and I'll take it from there?

Scott Wilson - 2015-05-12

Group: v 2.11 --> v2.12

Scott Wilson - 2015-05-15

Group: v2.12 --> v2.13

Martin Denham - 2015-05-26

Hi,
From the above it seems like there is no way to prevent DomSerializer from changing special entities like ä

Is that still the case or is there a work around?

Thanks
Martin

Scott Wilson - 2015-05-26

Hi Martin,

Seanster provided a patch to try and improve the special entities conversion, but I haven't been able to apply it as it's not in a format Eclipse likes. If you can apply it to your local build and it fixes your problem, then please do export the patch and I'll try and get it into a release.

Martin Denham - 2015-05-27

I had to do a quick fix for this because it affected production code. I mangled entities before processing so they were not recognised and then unmangled them after processing.
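
The post does not say how the mangling was done. One possible shape for such a workaround, sketched in plain Java, is below. The class name, the marker character and the `protect` helper are all hypothetical, and the `processor` argument stands in for whatever cleaning and serializing step is being wrapped.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

public class EntityShield {
    // Hypothetical marker from the Unicode private use area; it must not occur in real input.
    private static final String MARK = "\uE000";

    // Named, decimal and hexadecimal references, e.g. &auml; &#228; &#xE4;
    private static final Pattern ENTITY = Pattern.compile("&(#?[A-Za-z0-9]+;)");

    // Replace the leading '&' of every reference so it is no longer recognised as an entity.
    static String mangle(String html) {
        return ENTITY.matcher(html).replaceAll(MARK + "$1");
    }

    // Put the ampersands back after processing.
    static String unmangle(String html) {
        return html.replace(MARK, "&");
    }

    // Run any processing step with the entities hidden from it.
    static String protect(String html, UnaryOperator<String> processor) {
        return unmangle(processor.apply(mangle(html)));
    }

    public static void main(String[] args) {
        // Stand-in processor that escapes every ampersand it can see.
        UnaryOperator<String> escapeAll = s -> s.replace("&", "&amp;");
        System.out.println(protect("a & b &auml;", escapeAll));
    }
}
```

This relies on the wrapped step passing the marker character through untouched, which is worth checking against the serializer settings in use.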

Scott Wilson - 2015-07-01

Group: v2.13 --> v2.14

Scott Wilson - 2015-08-24

Group: v2.14 --> v2.15

Scott Wilson - 2015-10-01

Group: v2.15 --> v2.16

Seanster - 2015-10-07

Hi. Here's my latest patch against the latest 2.16-snapshot r436. Sorry for the delay, I didn't get the message that the patch was not working until far too late. I hope it will work this time. With 5 revisions passing by it's quite a challenge to dig in there and re-learn how this all works.

I strongly suspect the svn end-of-line configuration is incorrect on the server side. My tests suggest the server is thinking it should be CR when it's actually CR-LF. I can't get my linux svn client to convert correctly.

The files I am checking out all have CR-LF line endings. I managed to leave those intact for my latest edits. This patch file definitely contains CR-LF line endings. I tested the patch and it worked fine:

svn patch patchfile

This patch affects ONLY HTML serializers and does 4 main things:

1. If an html character entity such as &nbsp; or &shy; is encountered, just leave it alone.

2. If an NCR is found that is in the entity list, convert it to the html entity name,
   e.g. &#8211; becomes &ndash;

3. If an apostrophe character is found, leave it as-is; don't convert it to the entity &apos;
   (I know this is debatable but I really neeeeed it like that or it wreaks havoc)

4. Any unicode character matched in the entity list is converted to the html entity name,
   e.g. an actual nonbreaking space is converted to &nbsp; and a quote is converted to &quot;

I created a new function called convert_To_Entity_Name that is nearly a copy of convertToUnicode.
It could be simplified because it is only called in the case of &# discovered in the input stream, but I left it able to handle other cases if I needed to do more later.

There is some logic that is irrelevant because isDomCreation is always false for HTML, and transSpecialEntitiesToNCR is not yet exposed to the command line.

It may not be perfect for everyone using HTML serializers but I believe it's going in the right direction.
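
As a rough illustration of the apostrophe and unicode-character items in the list above, a character-level pass might look like the sketch below. It is not the patched Utils.escapeXml: the class name, the method name and the four-entry table are assumptions made for the example.

```java
import java.util.Map;

public class HtmlCharEscapeSketch {
    // Small illustrative table; not HtmlCleaner's actual entity list.
    static final Map<Character, String> NAME_BY_CHAR =
            Map.of('\u00A0', "nbsp", '"', "quot", '\u2013', "ndash", '\u2019', "rsquo");

    static String escapeChars(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String name = NAME_BY_CHAR.get(c);
            if (c == '\'') {
                // Apostrophes stay literal and are never turned into &apos;
                out.append(c);
            } else if (name != null) {
                // A character found in the table is written as its named entity.
                out.append('&').append(name).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeChars("Ohm's\u00A0Law"));
    }
}
```

The explicit apostrophe branch is redundant here because the table has no apostrophe entry, but it makes the intent visible if the table is ever replaced by a fuller list that includes "apos".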

Last edit: Seanster 2015-10-07

patch-htmlcleaner-2.16-snapshot-r436-html-bug118.diff

Scott Wilson - 2015-10-23

Patch applied! It'll be in release 2.16
