Hi. Now that the HTML Serializers are back I've tried them out but I can't get them to just leave special entities alone.
Input: or £
Output if (translate special entities) true:   £
Output if (translate special entities) false: (space) or £
There's no scenario where I can get it to output or £
None of the other command line options make any difference.
Related to ticket #107 maybe? escapeXml() needs adjustment? I don't have any experience with Version 2.7 yet (first version with the HTML Serializers back on the command line) but for this example it works as expected.
Also related,
transSpecialEntitiesToNCR is not exposed on the command line but should have been brought back when the html serializers were reinstated. The command line help text should list the html serializers too.
Thanks Seanster,
I suspect it was related to #107. I've added a "don't mess with HTML entities when outputting HTML" branch to the Utils.escapeXml logic that seems to fix this issue.
(I'll close the issue when I also fix the missing command line options and help text)
I should mention this scenario for apostrophes in case it's been missed.
original first, cleaned second:
I've never figured out how to use anything other than the jar file. I'll try to run the current code and see what happens.
(btw the content in the previous post was from v2.7)
Finally figured out building my own jar file. Here's what I get from v2.9 (as of now):
In addition to the apostrophe problem you can see the nonbreakable space is still converted and output literally.
Here's my command line options:
cmd_options = "outputtype=htmlcompact advancedxmlescape=true specialentities=false unicodechars=false omitcomments=true omitxmldecl=true omitdoctypedecl=true omithtmlenvelope=true useemptyelementtags=true"
This is my unit test - as you can see I set AdvancedXmlEscape to False:
On the apostrophes issue, the literal ' character and its entity form are equivalent in this context, but the entity is generally considered the more correct way of doing it.
Setting AdvancedXmlEscape to false seems to be the opposite of what I want to happen. If I turn it off then I get my nonbreaking space left intact (not sure why) but now there are &amp;'s all over the place:
Am I misunderstanding the options? Let me know if it is me who is wrong and I'll just hack the source code for myself. I've been using v2.2 nearly forever because it would generally leave this stuff alone but it has other drawbacks.
Ah, I see your point. I've made another tweak to the method, give it a go now.
With AdvancedXmlEscape true:
raw
cleaned
With AdvancedXmlEscape false:
I'm looking at the escapeXml source trying to understand the logic and it appears as though it's not designed to do what I want it to do. I want it to recognize unicode so it can leave it alone (why I keep wanting AdvancedXmlEscape true). That doesn't seem to be an option in the code path.
The special 'else' case you added for html doesn't do the extra step of using convertToUnicode to find out if it's a special entity and leave it alone. I think that's why I'm getting the ampersands escaped.
I also think it would be cool if it could convert the escaped numeric special entities to html codes.
I'll fiddle with the code tonight and see if I can get this to happen.
I was successful with my changes. I'd like to keep this bug on topic, so I'll post those details in the forums, which are probably more appropriate for a discussion.
Thanks for keeping this open. I have to go back and diff my edits and then re-learn how that horrible function operates so I can narrow down the relevant changes to submit back to you. I did get very good html out of it and I want to get that back into the codebase asap.
Ok just pulled the latest svn version 377 and merged my older html changes and created a nice clean patch file as attached to this ticket. Just watch out for CRLF conversion, SVN doesn't seem set up for it and I have to strip CR's myself like a hillbilly.
In SpecialEntities.java I simply commented out the "apos" entry. I didn't want apostrophes touched in any way and this was the quickest way to do it. Unfortunately whoever added it there had the same mentality haha.
I don't have enough experience with the source code to be qualified to track down the special cases where the html serializer needs this gone. This will have to be done in order to support both html and xml output.
In Utils.java I modified the escapeXml function. I jump in when there's an ampersand found and isHtmlOutput is enabled.
If it's a special entity we leave it in place. We should skip the pointer ahead to the end of the entity before looping back to process input but that line is commented out. I'm not sure exactly how much needs to be skipped. I'm sure you'd know so maybe you can fix that.
If the second character is a hash # then we get the entity code and convert it to its common html name. I added a special function "convert_To_Entity_Name" just for this purpose. This is not optional as long as isHtmlOutput is enabled. It should be optional.
Otherwise we convert the ampersand to &amp;.
After all that, there's the situation where you check to see if the character itself is a unicode special entity. In this situation I added an html specific result:
I think the whole "convert_To_Entity_Name" function was just a copy of convertToUnicode and bits and pieces stolen from wherever. I have no memory of writing it but it seems to work.
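The ampersand handling described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the patch or from Utils.escapeXml: the class name AmpersandSketch, the method escapeAmpersands, and the tiny entity table are all assumptions made up for the example (the real SpecialEntities list is much larger). Unlike the patch as described, the sketch does skip ahead to the end of a recognised entity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; names and table contents are assumptions.
final class AmpersandSketch {
    private static final Map<String, Integer> NAME_TO_CODE = new HashMap<>();
    private static final Map<Integer, String> CODE_TO_NAME = new HashMap<>();

    private static void add(String name, int code) {
        NAME_TO_CODE.put(name, code);
        CODE_TO_NAME.put(code, name);
    }

    static {
        // A handful of entities, enough to show the three branches.
        add("nbsp", 160);
        add("pound", 163);
        add("ndash", 8211);
        add("quot", 34);
        add("amp", 38);
    }

    // Returns the code point for "160" or "xA0", or null if not numeric.
    private static Integer parseCode(String s) {
        if (s.matches("[0-9]+")) return Integer.parseInt(s);
        if (s.matches("[xX][0-9a-fA-F]+")) return Integer.parseInt(s.substring(1), 16);
        return null;
    }

    static String escapeAmpersands(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            int semi = input.indexOf(';', i);
            // Only consider short candidates so stray ampersands are not misread.
            if (semi > i + 1 && semi - i <= 10) {
                String body = input.substring(i + 1, semi);
                if (NAME_TO_CODE.containsKey(body)) {
                    // Branch 1: known named entity, leave it in place.
                    out.append('&').append(body).append(';');
                    i = semi + 1; // skip to the end of the entity
                    continue;
                }
                Integer code = body.startsWith("#") ? parseCode(body.substring(1)) : null;
                if (code != null) {
                    // Branch 2: numeric reference, use the name if we know one.
                    String name = CODE_TO_NAME.get(code);
                    out.append('&').append(name != null ? name : body).append(';');
                    i = semi + 1;
                    continue;
                }
            }
            // Branch 3: a bare ampersand gets escaped.
            out.append("&amp;");
            i++;
        }
        return out.toString();
    }
}
```

With this table, escapeAmpersands("&#160;") gives "&nbsp;", an existing "&pound;" passes through unchanged, and "AT&T" becomes "AT&amp;T". Making the numeric-to-name step optional would just mean guarding branch 2 with a flag.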
Eclipse really, really hates your patch file :)
How about you just attach Utils.java and I'll take it from there?
Hi,
From the above it seems like there is no way to prevent DomSerializer from changing special entities like ä. Is that still the case or is there a workaround?
Thanks
Martin
Hi Martin,
Seanster provided a patch to try and improve the special entities conversion, but I haven't been able to apply it as it's not in a format Eclipse likes. If you can apply it to your local build and it fixes your problem, then please do export the patch and I'll try and get it into a release.
I had to do a quick fix for this because it affected production code. I mangled entities before processing so they were not recognised and then unmangled them after processing.
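For anyone needing the same stopgap, a mangle/unmangle workaround along these lines might look like the sketch below. It is only a guess at the approach, not the poster's actual code: the class EntityShield, its method names, and the choice of a private-use character as the placeholder are all assumptions, and the cleaner is passed in as a plain function so the example stays self-contained.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical sketch only; not taken from the thread or from HtmlCleaner.
final class EntityShield {
    // Placeholder assumed never to occur in real input (private-use code point).
    private static final String MARK = "\uE000";

    // Matches named, decimal and hexadecimal character references.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Replace the leading '&' of each entity so the cleaner no longer sees one.
    static String mangle(String html) {
        return ENTITY.matcher(html).replaceAll(MARK + "$1;");
    }

    // Put the '&' back after cleaning.
    static String unmangle(String html) {
        return html.replace(MARK, "&");
    }

    // Run any cleaning step with the entities hidden from it.
    static String process(String html, UnaryOperator<String> cleaner) {
        return unmangle(cleaner.apply(mangle(html)));
    }
}
```

In real use the cleaner argument would wrap the HtmlCleaner clean-and-serialize call. The obvious weakness is that the placeholder must never appear in the input, so this is a stopgap rather than a fix.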
Hi. Here's my latest patch against the latest 2.16-snapshot r436. Sorry for the delay, I didn't get the message that the patch was not working until far too late. I hope it will work this time. With 5 revisions passing by it's quite a challenge to dig in there and re-learn how this all works.
I strongly suspect the svn end-of-line configuration is incorrect on the server side. My tests suggest the server is thinking it should be CR when it's actually CR-LF. I can't get my linux svn client to convert correctly.
The files I am checking out all have CR-LF line endings. I managed to leave those intact for my latest edits. This patch file definitely contains CR-LF line endings. I tested the patch and it worked fine:
svn patch patchfile
This patch affects ONLY HTML serializers and does 4 main things:
i.e. – becomes &ndash;
(I know this is debatable but I really neeeeed it like that or it wreaks havoc)
i.e. an actual nonbreaking space is converted to &nbsp;, a quote is converted to &quot;
I created a new function called convert_To_Entity_Name that is nearly a copy of convertToUnicode.
It could be simplified because it is only called in the case of &# discovered in the input stream, but I left it able to handle other cases if I needed to do more later.
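As a rough illustration of the literal-character side of this (a real nonbreaking space or quote in the text being written out as a named entity), here is a small standalone sketch. It is not the patch's convert_To_Entity_Name: the class LiteralEntitySketch, the method encodeLiterals, and the four-entry table are assumptions chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; the real entity table is much larger.
final class LiteralEntitySketch {
    private static final Map<Integer, String> CODE_TO_NAME = new HashMap<>();

    static {
        CODE_TO_NAME.put(160, "nbsp");   // nonbreaking space
        CODE_TO_NAME.put(34, "quot");    // double quote
        CODE_TO_NAME.put(163, "pound");  // pound sign
        CODE_TO_NAME.put(8211, "ndash"); // en dash
    }

    // Writes each known special character as &name; and copies the rest.
    static String encodeLiterals(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String name = CODE_TO_NAME.get(cp);
            if (name != null) {
                out.append('&').append(name).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

Iterating by code point rather than by char keeps characters outside the Basic Multilingual Plane intact, which matters if the table is ever extended beyond these few entries.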
There is some logic that is irrelevant because isDomCreation is always false for HTML, and transSpecialEntitiesToNCR is not yet exposed to the command line.
It may not be perfect for everyone using HTML serializers but I believe it's going in the right direction.
Last edit: Seanster 2015-10-07
Patch applied! It'll be in release 2.16