I've just noticed an issue where multiple title tags are stripped. Here's an example to illustrate:
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    public class HTMLCleanerIssue
    {
        private static final String HTML = "<html>"
            + "<head>"
            + "<title>title 1</title>"
            + "<title>title 2</title>"
            + "</head>"
            + "<body>"
            + "<h1>h1 1</h1>"
            + "<h1>h2 2</h1>"
            + "</body>"
            + "</html>";

        public static void main(
            final String[] args)
        {
            final TagNode tagNode = new HtmlCleaner().clean(HTML);
            print(tagNode);
        }

        private static void print(
            final TagNode tagNode)
        {
            System.out.println(tagNode);
            tagNode.getChildTagList().forEach(HTMLCleanerIssue::print);
        }
    }
This prints:
html
head
title
body
h1
h1
I'd expected it to print:
html
head
title
title
body
h1
h1
Thanks!
Hi CB!
Only one title tag is allowed according to the HTML5 spec:
https://www.w3.org/TR/html5/document-metadata.html#the-title-element
... and the XHTML spec:
https://www.w3.org/TR/xhtml1/dtds.html#dtdentry_xhtml1-strict.dtd_title
Which is why you only get one title in the output.
(There used to be an SEO practice of using multiple title tags to trick Google, but it hasn't worked in years.)
Hi Scott,
Thanks for this. I had a feeling this might be intentional. For our use case, we're running XPath functions on HTML to spot things like the number of title tags being > 1, so we do get some unexpected results as it stands. Thanks for clarifying the behaviour!
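For illustration, the kind of check described here could look roughly like the sketch below. It assumes the markup is already well-formed XML, so the JDK's own parser and XPath engine can read it without HtmlCleaner. The class name TitleCountSketch and the method countTitles are made up for this example and are not part of any library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TitleCountSketch {
    // Same shape as the sample above: two title elements in the head.
    static final String XHTML = "<html><head>"
        + "<title>title 1</title>"
        + "<title>title 2</title>"
        + "</head><body><h1>h1 1</h1></body></html>";

    // Hypothetical helper: counts title elements with the XPath count() function.
    // Assumes the input is well-formed XML; real-world HTML often is not.
    static int countTitles(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Double count = (Double) xpath.evaluate(
                "count(//title)", doc, XPathConstants.NUMBER);
            return count.intValue();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        int titles = countTitles(XHTML);
        System.out.println("title count: " + titles); // prints "title count: 2"
        if (titles > 1) {
            System.out.println("warning: more than one title element");
        }
    }
}
```

Run against the raw markup, the count is 2. Run against a tree that has already been cleaned with the default settings, the same expression would presumably report 1, since the output above shows only one title surviving.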
If you did want to change the behaviour, it's defined in Html5TagProvider:
The true in the TagInfo constructor is for the unique property, which tells HC to only allow one instance of the tag. If you were to change that to false, then you could have multiple title tags.
Last edit: Scott Wilson 2017-05-03
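To make the effect of a unique flag concrete, here is a toy model. It is not HtmlCleaner code and does not use its API: the class UniqueFlagSketch and the method applyUnique are hypothetical. The choice to keep the first occurrence of a unique tag is also an assumption, since the thread does not say which occurrence HtmlCleaner keeps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueFlagSketch {
    // Toy model only: drops repeated occurrences of any tag marked as unique.
    // Assumption: the first occurrence is the one that is kept.
    static List<String> applyUnique(List<String> tags, Set<String> uniqueTags) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String tag : tags) {
            boolean isUnique = uniqueTags.contains(tag);
            // seen.add returns false when the tag was already recorded
            if (isUnique && !seen.add(tag)) {
                continue;
            }
            kept.add(tag);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("title", "title", "h1", "h1");

        // title marked unique: only one title survives
        Set<String> uniqueOn = new HashSet<>(Arrays.asList("title"));
        System.out.println(applyUnique(tags, uniqueOn)); // [title, h1, h1]

        // nothing marked unique: both titles are kept
        System.out.println(applyUnique(tags, new HashSet<>())); // [title, title, h1, h1]
    }
}
```

With title in the unique set, the result matches the shape of the output reported at the top of the thread. With the flag off, both titles come through, which is the behaviour the XPath check needs.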
Great stuff, thanks Scott, we'll do that!