
#187 Multiple title tags removed when cleaning

Milestone: v2.20
Status: closed-wont-fix
Owner: nobody
Labels: None
Priority: 5
Updated: 2017-05-11
Created: 2017-05-03
Creator: Code Buddy
Private: No

I've just noticed an issue where multiple title tags are stripped when cleaning. Here's an example to illustrate:

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class HTMLCleanerIssue
{
    private static final String HTML = "<html>"
            + "<head>"
            + "<title>title 1</title>"
            + "<title>title 2</title>"
            + "</head>"
            + "<body>"
            + "<h1>h1 1</h1>"
            + "<h1>h2 2</h1>"
            + "</body>"
            + "</html>";

    public static void main(
        final String[] args) 
    {
        final TagNode tagNode = new HtmlCleaner().clean(HTML);
        print(tagNode);
    }

    private static void print(
        final TagNode tagNode)
    {
        System.out.println(tagNode);
        tagNode.getChildTagList().forEach(HTMLCleanerIssue::print);
    }
}

This prints:

html
head
title
body
h1
h1

I'd expected it to print:

html
head
title
title
body
h1
h1

Thanks!

Discussion

  • Scott Wilson

    Scott Wilson - 2017-05-03

    Hi CB!

    Only one title tag is allowed according to the HTML5 spec:

    https://www.w3.org/TR/html5/document-metadata.html#the-title-element

    ... and the XHTML spec:

    https://www.w3.org/TR/xhtml1/dtds.html#dtdentry_xhtml1-strict.dtd_title

    Which is why you only get one title in the output.

    (There used to be an SEO practice of using multiple title tags to trick Google, but it hasn't worked in years.)

     
    • Code Buddy

      Code Buddy - 2017-05-03

      Hi Scott,

      Thanks for this. I had a feeling this might be intentional. For our use case, we run XPath functions on HTML to spot things like a page having more than one title tag, so we do get some unexpected results as it stands. Thanks for clarifying the behaviour!
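
The kind of check described above can be sketched with the JDK's built-in XPath support. This is only an illustration, not the poster's actual code: it assumes the markup is already well-formed XML (so it can be parsed without HtmlCleaner), and the class and method names `TitleCount` and `countTitles` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class TitleCount {

    // Hypothetical helper: counts <title> elements in a well-formed XML/XHTML string.
    static int countTitles(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // count(...) yields an XPath number, which the JDK returns as a Double.
            Double count = (Double) xpath.evaluate(
                    "count(//title)",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NUMBER);
            return count.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Input could not be evaluated", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head>"
                + "<title>title 1</title>"
                + "<title>title 2</title>"
                + "</head><body><h1>h1 1</h1></body></html>";
        // A count greater than 1 flags a page with duplicate titles.
        System.out.println(countTitles(xhtml));
    }
}
```

Run against the uncleaned markup, the duplicate is still visible. Once the cleaner has dropped the second title, the same expression would see only one.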

       
  • Scott Wilson

    Scott Wilson - 2017-05-03

    If you did want to change the behaviour, it's defined in Html5TagProvider:

    tagInfo = new TagInfo("title", ContentType.text, BelongsTo.HEAD, false,
            true, false, CloseTag.required, Display.none);
    

    The true in the TagInfo constructor is for the unique property, which tells HC to only allow one instance of the tag. If you were to change that to false, then you could have multiple title tags.
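
To make the effect of that flag concrete, here is a toy model in plain Java. It is not HtmlCleaner's implementation: `UniqueTagModel` and `filter` are hypothetical names, and the choice to keep the first occurrence is an assumption, since the thread does not say which duplicate the cleaner retains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UniqueTagModel {

    // Toy model only: drops repeat occurrences of tags whose flag is true.
    // Assumption: the first occurrence is the one that survives.
    static List<String> filter(List<String> tags, Map<String, Boolean> uniqueFlags) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String tag : tags) {
            boolean unique = uniqueFlags.getOrDefault(tag, false);
            // Set.add returns false when the tag was already recorded.
            if (unique && !seen.add(tag)) {
                continue;
            }
            kept.add(tag);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("html", "head", "title", "title", "body", "h1", "h1");

        // unique = true for title: the second title is dropped.
        System.out.println(filter(tags, Collections.singletonMap("title", true)));

        // unique = false for title: both titles are kept.
        System.out.println(filter(tags, Collections.singletonMap("title", false)));
    }
}
```

With the flag set to true, the model reproduces the tag sequence from the original report. With it set to false, the model reproduces the sequence the reporter expected.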

     

    Last edit: Scott Wilson 2017-05-03
    • Code Buddy

      Code Buddy - 2017-05-03

      Great stuff, thanks Scott, we'll do that!

       
  • Scott Wilson

    Scott Wilson - 2017-05-11
    • status: open --> closed-wont-fix
     
