#91 Renderer class joins words in consecutive anchor tags text

Milestone: General
Status: unread
Owner: nobody
Labels: None
Priority: 5
Updated: 2020-10-15
Created: 2020-10-13
Private: No

We have been using the Renderer class to pull out the visible text from users' web pages so that we can spell check the extracted words.

We have noticed that it joins words together when two anchor elements sit next to each other on the same line with no whitespace between them, e.g.
<a href='foo.com'>some</a><a href='foo.com'>text</a>

This causes Renderer.toString() to return "sometext" instead of "some text".
Is it possible to output these as separate words, since that is how they would visibly appear on the web page (after applying any CSS that styles the text)?

Here is some code that shows the issue:

public void testIssue()
{
    final String html =
            "<html>\n" +
            "  <body>\n" +
            "    <a href='foo.com'>some</a><a href='foo.com'>text</a>\n" +        
            "  </body>\n" +
            "</html>";

    final Source source = new Source(html);                        
    final Renderer renderer = new Renderer(source);

    String visiblePage = renderer.toString();

    System.out.println(visiblePage);
    assertTrue(visiblePage.contains("some text"));
}

Discussion

  • Martin Jericho

    Martin Jericho - 2020-10-13

    Hi Daniel,

    It is very uncommon for CSS to style a hyperlink so that whitespace is added around it. Two consecutive A elements without whitespace in between will almost always render as a single word in a browser. The behaviour of the Renderer class is therefore correct, and in your example above your end user probably does want to know about the missing space between the two words when running the spell checker.

    Note that the current release version 3.4 contains a bug relating to the rendering of hyperlinks. Ironically, the bug adds a space after every rendered hyperlink. Are you already using the 3.5-DEV version with this bug fixed? It is available here:
    http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip

    If you're using version 3.5-dev and you want to ensure there is a space after every rendered hyperlink, you could configure the renderer using the following command:
    renderer.setHyperlinkContentDelimiters(null," ");
    This would ensure the behaviour you're looking for, but the rendered output would not be consistent with browser rendering.

    If you need to customise the renderer further for the purposes of your spell checker, you might consider creating a copy of the Renderer class source code so you can tweak it any way you like. It contains an inner class called A_ElementHandler which gives you full control of how hyperlinks are rendered.

    Cheers
    Martin

     
  • Daniel Gonzalez

    Daniel Gonzalez - 2020-10-15

    Hi Martin,
    Thanks for your response.

    The renderer.setHyperlinkContentDelimiters(null, " ") approach would suit us.
    However, it doesn't seem to work for me. We are using Jericho-html-3.5-dev-3.

    The following code, with the setHyperlinkContentDelimiters call added, still concatenates the words "some" and "text":

    public void testIssue()
    {
        final String html =
                "<html>\n" +
                "  <body>\n" +
                "    <a href='foo.com'>some</a><a href='foo.com'>text</a>\n" +
                "  </body>\n" +
                "</html>";

        final Source source = new Source(html);
        final Renderer renderer = new Renderer(source);
        renderer.setHyperlinkContentDelimiters(null, " ");
        String visiblePage = renderer.toString();

        System.out.println(visiblePage);
        assertTrue(visiblePage.contains("some text"));
    }

     
  • Martin Jericho

    Martin Jericho - 2020-10-15

    Hi Daniel,

    In your example you're using relative URLs (foo.com), which the default Renderer class doesn't render at all. This is documented in the Renderer.renderHyperlinkURL(String) method.

    Because the URL isn't rendered, the content inside the hyperlink is just rendered as normal text. Configuration options such as setHyperlinkContentDelimiters only affect the output when the URL is rendered.

    I think your only option is going to be to create a copy of the Renderer class and customise the A_ElementHandler, as mentioned before. You're trying to achieve something that isn't consistent with normal HTML rendering, so it's not something the library accommodates.

    Again, I suggest you reconsider whether your spell checker should in fact flag "sometext" as a spelling error, as that's how your example would render in a browser.

    Cheers
    Martin

     
  • Martin Jericho

    Martin Jericho - 2020-10-15

    The HTML should reflect the fact that these links are on separate lines. CSS should not be used to change the meaning of the content, only to style it. Correct HTML would wrap these links in some sort of block element such as li or div, or ideally the HTML5 menuitem element. It's an important aspect of making the web content accessible.

     
  • Daniel Gonzalez

    Daniel Gonzalez - 2020-10-15

    Ok thanks, I understand and agree with your point.

    I'll give some thought to how to handle this for customers who have this issue, and decide whether to fix it up silently for them by customising the Renderer class or to leave it as is, since that is what an unstyled representation of their page would look like.

    Thanks for your help

    Danny
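
For anyone weighing up the "fix it up silently" option, here is a rough sketch of one way to do it without copying the Renderer class: pre-process the HTML so that a space is inserted between a closing </a> and an immediately following <a> before the string is handed to Source. This is not a Jericho feature and was not proposed in the thread. The class name AnchorSpacer and the method separateAdjacentAnchors are hypothetical, and the regex only covers the simple adjacent-anchor case from the example above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jericho: separates directly adjacent anchors.
final class AnchorSpacer {

    // A closing </a> that is immediately followed by an opening <a ...> or <a> tag.
    // The lookahead requires whitespace or '>' after "<a" so tags such as <abbr> are ignored.
    private static final Pattern ADJACENT_ANCHORS =
            Pattern.compile("(</a\\s*>)(?=<a[\\s>])", Pattern.CASE_INSENSITIVE);

    // Returns the HTML with a single space inserted between directly adjacent anchors.
    static String separateAdjacentAnchors(String html) {
        return ADJACENT_ANCHORS.matcher(html).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        String html = "<a href='foo.com'>some</a><a href='foo.com'>text</a>";
        // prints: <a href='foo.com'>some</a> <a href='foo.com'>text</a>
        System.out.println(separateAdjacentAnchors(html));
    }
}
```

The result could then be passed to new Source(...) as in the test above. As Martin points out, the rendered text would then no longer match what a browser displays, so a genuinely missing space between two links would go unreported by the spell checker.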

     
