#118 HTML always translating special entities

v2.16
closed-fixed
nobody
None
5
2021-04-20
2014-05-17
Seanster
No

Hi. Now that the HTML Serializers are back I've tried them out but I can't get them to just leave special entities alone.

Input: &nbsp; or &pound;

Output if (translate special entities) true:   £
Output if (translate special entities) false: (space) or £

There's no scenario where I can get it to output &nbsp; or &pound;
None of the other command line options make any difference.

Related to ticket #107 maybe? escapeXml() needs adjustment? I don't have any experience with Version 2.7 yet (first version with the HTML Serializers back on the command line) but for this example it works as expected.

Also related,
transSpecialEntitiesToNCR is not exposed on the command line but should have been brought back when the html serializers were reinstated. The command line help text should list the html serializers too.

Discussion

1 2 > >> (Page 1 of 2)
  • Scott Wilson

    Scott Wilson - 2014-05-19

    Thanks Seanster,

    I suspect it was related to #107. I've added a "don't mess with HTML entities when outputting HTML" branch to the Utils.escapeXml logic that seems to fix this issue.

    (I'll close the issue when I also fix the missing command line options and help text)

     
  • Scott Wilson

    Scott Wilson - 2014-05-19
    • status: open --> open-accepted
    • Group: v 2.8 --> v 2.9
     
  • Seanster

    Seanster - 2014-05-21

    I should mention this scenario for apostrophes in case it's been missed.

    original first, cleaned second:

    <img class="displayS" onmouseover="display(this, '/images/site_images/product_usage.jpg');displayCaption(this, '&nbsp;')" src="/images/site_images/product_usage.jpg" alt=""/>
    <img class="displayS" onmouseover="display(this, &apos;/images/site_images/product_usage.jpg&apos;);displayCaption(this, &apos;&nbsp;&apos;)" src="/images/site_images/product_usage.jpg" alt="" />
    

    I've never figured out how to use anything other than the jar file. I'll try to run the current code and see what happens.

     
  • Seanster

    Seanster - 2014-05-21

    (btw the content in the previous post was from v2.7)

    Finally figured out building my own jar file. Here's what I get from v2.9 (as of now):

    <img class="displayS" onmouseover="display(this, &apos;/images/products.jpg&apos;);displayCaption(this, &apos; &apos;)" src="/images/products.jpg" alt="" />
    

    In addition to the apostrophe problem you can see the nonbreakable space is still converted and output literally.

    Here's my command line options:

    cmd_options = "outputtype=htmlcompact advancedxmlescape=true specialentities=false unicodechars=false omitcomments=true omitxmldecl=true omitdoctypedecl=true omithtmlenvelope=true useemptyelementtags=true"

     
  • Scott Wilson

    Scott Wilson - 2014-05-22

    This is my unit test - as you can see I set AdvancedXmlEscape to False:

    @Test
    public void entities() throws IOException {
        // Named entities should pass through the HTML serializer unchanged.
        String input = "<html><head></head><body><p>&nbsp;&pound;</p></body></html>";
        String expected = "<html><head></head><body><p>&nbsp;&pound;</p></body></html>";
        cleaner.getProperties().setAdvancedXmlEscape(false);
        cleaner.getProperties().setAddNewlineToHeadAndBody(false);
    
        TagNode cleaned = cleaner.clean(input);
        Serializer ser = new SimpleHtmlSerializer(cleaner.getProperties());
        StringWriter writer = new StringWriter();
        ser.serialize(cleaned, writer);
        // Compare the serialized output against the expected markup.
        assertEquals(expected, writer.toString());
    }
    
     
  • Scott Wilson

    Scott Wilson - 2014-05-22

    On the apostrophes issue, the ' character and &apos; are equivalent in this context, but &apos; is generally considered the more correct way of doing it.

     
  • Seanster

    Seanster - 2014-05-22

    Setting AdvancedXmlEscape to false seems to be the opposite of what I want to happen. If I turn it off then I get my nonbreaking space left intact (not sure why) but now there are &amp;'s all over the place:

    raw first
    cleaned second
    
    &#8217;
    &amp;#8217;
    
    <param name="flashvars" value="file=/resources/videos/Ohms_Law/Ohms-Law.flv&autostart=True" />
    <param name="flashvars" value="file=/resources/videos/Ohms_Law/Ohms-Law.flv&amp;autostart=True" />
    
    <div class="videoDiv" id="inline387_b" onclick="openVideo('#inline387_b', '#inline_example387', '668', '640', 'Demonstration of the Ohm\'s Law.');">
    <div class="videoDiv" id="inline387_b" onclick="openVideo(&apos;#inline387_b&apos;, &apos;#inline_example387&apos;, &apos;668&apos;, &apos;640&apos;, &apos;Demonstration of the Ohm\&apos;s Law.&apos;);">
    

    Am I misunderstanding the options? Let me know if it is me who is wrong and I'll just hack the source code for myself. I've been using v2.2 nearly forever because it would generally leave this stuff alone but it has other drawbacks.

     
  • Scott Wilson

    Scott Wilson - 2014-05-23

    Ah, I see your point - I've made another tweak to the method; give it a go now.

     
  • Seanster

    Seanster - 2014-05-23

    With AdvancedXmlEscape true:
    raw
    cleaned
     
     

    &#160;
    [c2 a0]
    

    With AdvancedXmlEscape false:

    raw
    cleaned
    &nbsp;
    &nbsp;
    
    &#160;
    &amp;#160;
    

    I'm looking at the escapeXml source trying to understand the logic and it appears as though it's not designed to do what I want it to do. I want it to recognize unicode so it can leave it alone (why I keep wanting AdvancedXmlEscape true). That doesn't seem to be an option in the code path.

    The special 'else' case you added for html doesn't do the extra step of using convertToUnicode to find out if it's a special entity and leave it alone. I think that's why I'm getting the ampersands escaped.
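    The double escaping described here (&#8217; coming out as &amp;#8217;) is what you get when every ampersand is escaped unconditionally. A minimal sketch of the alternative, escaping only ampersands that do not start a well-formed entity reference, might look like the following. It is not the HtmlCleaner code: the class and method names are hypothetical, and it treats any well-formed name as an entity rather than checking it against the real entity list.

```java
import java.util.regex.Pattern;

public class AmpersandEscaper {
    // An ampersand that is NOT followed by a named, decimal, or hex
    // reference body and a closing semicolon.
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    /** Escapes bare ampersands and leaves existing references untouched. */
    public static String escapeBareAmpersands(String text) {
        return BARE_AMP.matcher(text).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // The two references are kept; only the bare ampersand is escaped.
        System.out.println(escapeBareAmpersands("&#8217; &nbsp; a&b"));
    }
}
```

    With this approach a query-string ampersand such as the one in the flashvars value above is still escaped, because "&autostart=" is not a reference.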

    I also think it would be cool if it could convert the escaped numeric special entities to html codes.

    raw
    cleaned
    &nbsp;
    &nbsp;
    
    &#160;
    &nbsp;
    
    &#8217;
    &rsquo;
    

    I'll fiddle with the code tonight and see if I can get this to happen.

     
  • Seanster

    Seanster - 2014-05-24

    I was successful with my changes. I'd like to keep this bug on topic, so I'll post those details in the forums, which may be more appropriate for a discussion.

     
  • Scott Wilson

    Scott Wilson - 2014-08-13
    • Group: v 2.9 --> v 2.10
     
  • Scott Wilson

    Scott Wilson - 2014-10-31
    • Group: v 2.10 --> v 2.11
     
  • Seanster

    Seanster - 2014-10-31

    Thanks for keeping this open. I have to go back and diff my edits and then re-learn how that horrible function operates so I can narrow down the relevant changes to submit back to you. I did get very good html out of it and I want to get that back into the codebase asap.

     
  • Seanster

    Seanster - 2014-11-26

    Ok just pulled the latest svn version 377 and merged my older html changes and created a nice clean patch file as attached to this ticket. Just watch out for CRLF conversion, SVN doesn't seem set up for it and I have to strip CR's myself like a hillbilly.

    In SpecialEntities.java I simply commented out the "apos" entry. I didn't want apostrophes touched in any way and this was the quickest way to do it. Unfortunately whoever added it there had the same mentality haha.

    I don't have enough experience with the source code to be qualified to track down the special cases where the html serializer needs this gone. This will have to be done in order to support both html and xml output.

    In Utils.java I modified the escapeXml function. I jump in when there's an ampersand found and isHtmlOutput is enabled.

    If it's a special entity we leave it in place. We should skip the pointer ahead to the end of the entity before looping back to process input but that line is commented out. I'm not sure exactly how much needs to be skipped. I'm sure you'd know so maybe you can fix that.

    If the second character is a hash # then we get the entity code and convert it to its common html name. I added a special function "convert_To_Entity_Name" just for this purpose. This is not optional as long as isHtmlOutput is enabled. It should be optional.

    Otherwise we convert the ampersand to &amp;.

    After all that, there's the situation where you check to see if the character itself is a unicode special entity. In this situation I added an html specific result:

    isDomCreation? code.getHtmlString() : code.getEscapedValue()
    
    In the usual case like a unicode nonbreaking space character, it is converted to something like "&nbsp;"
    

    I think the whole "convert_To_Entity_Name" function was just a copy of convertToUnicode and bits and pieces stolen from wherever. I have no memory of writing it but it seems to work.

     
  • Scott Wilson

    Scott Wilson - 2015-01-19

    Eclipse really, really hates your patch file :)

    How about you just attach Utils.java and I'll take it from there?

     
  • Scott Wilson

    Scott Wilson - 2015-05-12
    • Group: v 2.11 --> v2.12
     
  • Scott Wilson

    Scott Wilson - 2015-05-15
    • Group: v2.12 --> v2.13
     
  • Martin Denham

    Martin Denham - 2015-05-26

    Hi,
    From the above it seems like there is no way to prevent DomSerializer from changing special entities like &auml;

    Is that still the case or is there a work around?

    Thanks
    Martin

     
  • Scott Wilson

    Scott Wilson - 2015-05-26

    Hi Martin,

    Seanster provided a patch to try and improve the special entities conversion, but I haven't been able to apply it as it's not in a format Eclipse likes. If you can apply it to your local build and it fixes your problem, then please do export the patch and I'll try and get it into a release.

     
  • Martin Denham

    Martin Denham - 2015-05-27

    I had to do a quick fix for this because it affected production code. I mangled entities before processing so they were not recognised and then unmangled them after processing.
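    The post does not say how the entities were mangled, so the following is only one possible shape for such a workaround. The marker string, class name, and method names are all hypothetical, and the marker is assumed never to occur in the real input.

```java
import java.util.regex.Pattern;

public class EntityShield {
    // Hypothetical marker; it must not appear anywhere in the real input.
    private static final String MARK = "@@ENT@@";

    // Named, decimal, and hex references, e.g. &auml; &#228; &#xE4;
    private static final Pattern ENTITY = Pattern.compile(
            "&([A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    /** Replaces the leading '&' of each reference so a cleaner will not see an entity. */
    public static String mangle(String html) {
        return ENTITY.matcher(html).replaceAll(MARK + "$1;");
    }

    /** Restores the references after processing. */
    public static String unmangle(String html) {
        return html.replace(MARK, "&");
    }
}
```

    Usage would be along the lines of unmangle(process(mangle(html))), where process stands for whatever cleaning and serializing step is being worked around.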

     
  • Scott Wilson

    Scott Wilson - 2015-07-01
    • Group: v2.13 --> v2.14
     
  • Scott Wilson

    Scott Wilson - 2015-08-24
    • Group: v2.14 --> v2.15
     
  • Scott Wilson

    Scott Wilson - 2015-10-01
    • Group: v2.15 --> v2.16
     
  • Seanster

    Seanster - 2015-10-07

    Hi. Here's my latest patch against the latest 2.16-snapshot r436. Sorry for the delay, I didn't get the message that the patch was not working until far too late. I hope it will work this time. With 5 revisions passing by it's quite a challenge to dig in there and re-learn how this all works.

    I strongly suspect the svn end-of-line configuration is incorrect on the server side. My tests suggest the server is thinking it should be CR when it's actually CR-LF. I can't get my linux svn client to convert correctly.

    The files I am checking out all have CR-LF line endings. I managed to leave those intact for my latest edits. This patch file definitely contains CR-LF line endings. I tested the patch and it worked fine:

    svn patch patchfile

    This patch affects ONLY HTML serializers and does 4 main things:

    1. if an html character entity is encountered like &nbsp or &shy just leave it alone.
    2. if an NCR is found that is in the entity list, convert it to the html entity name
      ie. &#8211; becomes &ndash
    3. if an apostrophe character is found, leave it as-is, don't convert it to the entity &apos
      (I know this is debatable but I really neeeeed it like that or it wreaks havoc)
    4. any unicode character matched in the entity list is converted to the html entity name
      ie. an actual nonbreaking space is converted to &nbsp, a quote is converted to &quot
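    The four rules above can be sketched as a single pass over the text. This is not the patch itself: the class name and the small entity table are hypothetical, only decimal references are recognised, and escaping of < and > is left out to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlEntityPass {
    // Tiny illustrative table, not the full HTML entity list.
    private static final Map<Integer, String> NAME_BY_CODE = new HashMap<>();
    static {
        NAME_BY_CODE.put(34, "quot");
        NAME_BY_CODE.put(160, "nbsp");
        NAME_BY_CODE.put(173, "shy");
        NAME_BY_CODE.put(8211, "ndash");
    }

    // A reference: &name; (group 1) or &#digits; (group 2).
    private static final Pattern REFERENCE = Pattern.compile(
            "&(?:([A-Za-z][A-Za-z0-9]*)|#([0-9]{1,7}));");

    public static String escape(String text) {
        StringBuilder out = new StringBuilder();
        Matcher ref = REFERENCE.matcher(text);
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&' && ref.region(i, text.length()).lookingAt()) {
                String digits = ref.group(2);
                String name = (digits == null) ? null
                        : NAME_BY_CODE.get(Integer.parseInt(digits));
                if (name != null) {
                    out.append('&').append(name).append(';'); // rule 2: NCR to name
                } else {
                    out.append(ref.group());                   // rule 1: leave as-is
                }
                i = ref.end();                                 // skip past the reference
                continue;
            }
            if (c == '&') {
                out.append("&amp;");                           // bare ampersand
            } else if (c == '\'') {
                out.append(c);                                 // rule 3: apostrophe untouched
            } else if (NAME_BY_CODE.containsKey((int) c)) {
                out.append('&').append(NAME_BY_CODE.get((int) c)).append(';'); // rule 4
            } else {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }
}
```

    Setting i to ref.end() is one way to skip the pointer ahead to the end of a reference before looping back to process more input.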

    I created a new function called convert_To_Entity_Name that is nearly a copy of convertToUnicode.
    It could be simplified because it is only called in the case of &# discovered in the input stream, but I left it able to handle other cases if I needed to do more later.

    There is some logic that is irrelevant because isDomCreation is always false for HTML, and transSpecialEntitiesToNCR is not yet exposed to the command line.

    It may not be perfect for everyone using HTML serializers but I believe it's going in the right direction.

     

    Last edit: Seanster 2015-10-07
  • Scott Wilson

    Scott Wilson - 2015-10-23

    Patch applied! It'll be in release 2.16

     