Renderer class joins words in consecutive anchor tags text
We have been using the Renderer class to extract the visible text from users' web pages so that we can spell-check the extracted words.
We have noticed that it joins words when two anchor elements sit next to each other on the same line, e.g.
<a href='foo.com'>some</a><a href='foo.com'>text</a>
This causes Renderer.toString() to return "sometext" instead of "some text".
Is it possible to output these as separate words, since that is how they would appear on the web page (after any CSS styling is applied)?
Here is some code to show the issue:
public void testIssue() {
    final String html =
        "<html>\n" +
        " <body>\n" +
        " <a href='foo.com'>some</a><a href='foo.com'>text</a>\n" +
        " </body>\n" +
        "</html>";
    final Source source = new Source(html);
    final Renderer renderer = new Renderer(source);
    String visiblePage = renderer.toString();
    System.out.println(visiblePage);
    assertTrue(visiblePage.contains("some text"));
}
Hi Daniel,
It is very uncommon for CSS to style a hyperlink so that whitespace is added around it. Two consecutive A elements with no whitespace between them will almost always render as a single word in a browser. The behaviour of the Renderer class is therefore correct, and in your example above your end user probably does want the spell checker to report the missing space between the two words.
Note that the current release version 3.4 contains a bug relating to the rendering of hyperlinks. Ironically, the bug adds a space after every rendered hyperlink. Are you already using the 3.5-DEV version with this bug fixed? It is available here:
http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip
If you're using version 3.5-dev and you want to ensure there is a space after every rendered hyperlink, you could configure the renderer using the following command:
renderer.setHyperlinkContentDelimiters(null," ");
This would ensure the behaviour you're looking for, but the rendered output would not be consistent with browser rendering.
If you need to customise the renderer further for the purposes of your spell checker, you might consider creating a copy of the Renderer class source code so you can tweak it any way you like. It contains an inner class called A_ElementHandler which gives you full control of how hyperlinks are rendered.
Cheers
Martin
Hi Martin,
Thanks for your response.
The renderer.setHyperlinkContentDelimiters(null," ") method would work for us.
However, this doesn't seem to work for me. We have jericho-html-3.5-dev-3.
The following code, which adds the setHyperlinkContentDelimiters call above, still concatenates the words "some" and "text":
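public void testIssue()
{
    final String html =
        "<html>\n" +
        " <body>\n" +
        " <a href='foo.com'>some</a><a href='foo.com'>text</a>\n" +
        " </body>\n" +
        "</html>";
    final Source source = new Source(html);
    final Renderer renderer = new Renderer(source);
    renderer.setHyperlinkContentDelimiters(null, " ");
    System.out.println(renderer.toString());
}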
Hi Daniel,
In your example you're using relative URLs (foo.com), which the default Renderer class doesn't render at all. This is documented in the Renderer.renderHyperlinkURL(String) method.
Because the URL isn't rendered, the content inside the hyperlink is just rendered as normal text. Configuration options such as setHyperlinkContentDelimiters only affect the output when the URL is rendered.
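To illustrate the relative/absolute distinction, here is a minimal sketch that uses only java.net.URI from the standard library. It is not the check Jericho performs internally; the HrefCheck class and its isAbsolute helper are hypothetical names used for illustration. It shows that an href such as foo.com carries no scheme and therefore parses as a relative reference.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Jericho: approximates "is this href absolute?"
class HrefCheck {

    /** Returns true when the href has a scheme such as http: or https:. */
    static boolean isAbsolute(String href) {
        try {
            return new URI(href.trim()).isAbsolute();
        } catch (URISyntaxException e) {
            // Assumption: treat hrefs that cannot be parsed as not absolute.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAbsolute("foo.com"));         // false: no scheme, so relative
        System.out.println(isAbsolute("http://foo.com/")); // true: has the http scheme
    }
}
```

Under this reading, changing the test hrefs to something like http://foo.com/ would be one way to see whether the delimiter setting then takes effect.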
I think your only option is going to be to create a copy of the Renderer class and customise the A_ElementHandler, as mentioned before. You're trying to achieve something that isn't consistent with normal HTML rendering, so it's not something the library accommodates.
Again, I suggest you reconsider whether your spell checker should in fact flag "sometext" as a spelling error, as that's how your example would render in a browser.
Cheers
Martin
Hi Martin,
The customer who reported this issue gave this web page as an example that showed this issue:
https://leadsforward.com/generating-the-best-solar-leads-before-the-end-of-2019/
The Renderer class joins the words "Do" and "Lead", which are on different lines in the visible page.
I've attached a screenshot that shows the visible page and the source code below.
Thanks
Danny
The HTML should reflect the fact that these links are on separate lines. CSS should not be used to change the meaning of the content, only to style it. Correct HTML would wrap these links in some sort of block element such as li or div, or ideally the HTML5 menuitem element. It's an important aspect of making the web content accessible.
Ok thanks, I understand and agree with your point.
I'll give some thought to how to handle this for customers who have the issue, and decide whether to fix it up silently for them by customising the Renderer class or to leave it as is, since that is what an unstyled representation of their page would look like.
Thanks for your help
Danny
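If the silent fix-up route is chosen, one possible alternative to copying the Renderer class is to pre-process the HTML before handing it to Source, inserting a space between directly adjacent anchor elements. The sketch below is a regex-based approximation and not part of the Jericho API; AnchorSpacer and separateAdjacentAnchors are hypothetical names. A regex will not cope with every markup variation, for example anchors separated only by a comment.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper, not part of Jericho.
class AnchorSpacer {

    // A closing </a> that is immediately followed by another opening <a ...> tag.
    // The lookahead requires whitespace or '>' after "<a" so tags like <abbr> are ignored.
    private static final Pattern ADJACENT_ANCHORS =
        Pattern.compile("(</a\\s*>)(?=<a[\\s>])", Pattern.CASE_INSENSITIVE);

    /** Inserts a single space between anchor elements that touch each other. */
    static String separateAdjacentAnchors(String html) {
        return ADJACENT_ANCHORS.matcher(html).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        String html = "<a href='foo.com'>some</a><a href='foo.com'>text</a>";
        System.out.println(separateAdjacentAnchors(html));
        // prints: <a href='foo.com'>some</a> <a href='foo.com'>text</a>
    }
}
```

The result could then be passed to new Source(...) as in the test earlier in the thread. Note that this deliberately departs from how a browser renders the unstyled markup, so it hides the missing space from the spell checker rather than reporting it.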