#153 NullPointerException when <!DOCTYPE> doesn't contain a qualifiedName

Milestone: v2.16

Status: closed-fixed

Owner: nobody

Labels: None

Priority: 5

Updated: 2015-10-23

Created: 2015-10-01

Creator: Code Buddy

Private: No

Found in 2.14 and checked against 2.15. The following test case produces a NullPointerException:

public void testWithInvalidDocType()
{
    final String HTML = "<!DOCTYPE>";
    final TagNode tagNode = new HtmlCleaner().clean(HTML);
    final CleanerProperties cleanerProperties = new CleanerProperties();
    try
    {
        new DomSerializer(cleanerProperties).createDOM(tagNode);
    }
    catch (ParserConfigurationException e)
    {
        e.printStackTrace();
    }
}

The code in DomSerializer::createDOM() checks that the docType of the root node is not null:

if (rootNode.getDocType() != null){

But it does not check its contents, so in the above case qualifiedName is null at this point:

String qualifiedName = rootNode.getDocType().getPart1();

It is then passed off to CoreDOMImplementationImpl::createDocumentType() and then to checkQName(), which does:

    int index = qname.indexOf(':');

on the null qname.

Here's the stack trace:

java.lang.NullPointerException
at com.sun.org.apache.xerces.internal.dom.CoreDOMImplementationImpl.checkQName(CoreDOMImplementationImpl.java:176)
at com.sun.org.apache.xerces.internal.dom.CoreDOMImplementationImpl.createDocumentType(CoreDOMImplementationImpl.java:171)
at org.htmlcleaner.DomSerializer.createDOM(DomSerializer.java:100)
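
The trace suggests the failure does not depend on HtmlCleaner itself: any call to createDocumentType() with a null qualified name should end up in the same checkQName() code. Below is a minimal sketch using only the JDK's DOM API. The class and method names are made up for illustration, and the exception you actually see depends on the DOM implementation your JDK bundles, so the sketch reports what was thrown rather than assuming an NPE.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;

public class NullQNameRepro {

    /**
     * Calls createDocumentType with a null qualified name and reports the outcome:
     * the class name of the runtime exception thrown, or "none" if the call succeeded.
     */
    static String tryNullQualifiedName() {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .getDOMImplementation();
            // Same shape of call as in the stack trace, with the qualified name missing.
            impl.createDocumentType(null, null, null);
            return "none";
        } catch (ParserConfigurationException e) {
            // No usable JAXP parser; unrelated to the bug being illustrated.
            throw new IllegalStateException("No JAXP parser available", e);
        } catch (RuntimeException e) {
            return e.getClass().getName();
        }
    }

    public static void main(String[] args) {
        System.out.println("createDocumentType(null, ...) -> " + tryNullQualifiedName());
    }
}
```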

If you have any queries, I'm happy to provide more info - thanks!

Discussion

Scott Wilson - 2015-10-01

Thanks for spotting that one CB!

Scott Wilson - 2015-10-01

Hmm, there are two ways of handling this.

1. Just make the QName "html" where it is null:

if (qualifiedName == null) qualifiedName = "html";

2. Only create the DocumentType where the DOCTYPE is valid:

rootNode.getDocType() != null && rootNode.getDocType().isValid()

Option 1 is the smallest change to fix the issue, but Option 2 feels better - we shouldn't be trying to use invalid DOCTYPEs in creating a DOM.
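
To make the difference between the two options concrete, here is a rough sketch against the plain JDK DOM API rather than the real DomSerializer code. Everything named here (DoctypeGuardSketch and its methods) is hypothetical, and "valid" is approximated as "has a non-null name", which is only an assumption about what isValid() would check.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

public class DoctypeGuardSketch {

    /** Option 1: substitute "html" when the DOCTYPE carries no qualified name. */
    static String qualifiedNameOrDefault(String qualifiedName) {
        return qualifiedName == null ? "html" : qualifiedName;
    }

    /** Option 1 applied: the document always gets a DocumentType, defaulting its name. */
    static Document createWithFallbackName(String qualifiedName) {
        DOMImplementation impl = domImplementation();
        DocumentType docType =
                impl.createDocumentType(qualifiedNameOrDefault(qualifiedName), null, null);
        return impl.createDocument(null, "html", docType);
    }

    /**
     * Option 2 applied: only create a DocumentType when the DOCTYPE looks usable.
     * Assumption: a non-null name stands in for a real validity check.
     */
    static Document createSkippingInvalid(String qualifiedName) {
        DOMImplementation impl = domImplementation();
        DocumentType docType = qualifiedName == null
                ? null
                : impl.createDocumentType(qualifiedName, null, null);
        return impl.createDocument(null, "html", docType);
    }

    private static DOMImplementation domImplementation() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().getDOMImplementation();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No JAXP parser available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(createWithFallbackName(null).getDoctype().getName());
        System.out.println(createSkippingInvalid(null).getDoctype());
    }
}
```

With Option 1 the resulting DOM claims a doctype the input never declared; with Option 2 the DOM simply has no doctype node, which is closer to what the input said.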

Scott Wilson - 2015-10-23

status: open --> closed-fixed

Group: v 2.7 --> v2.16

Scott Wilson - 2015-10-23

I've applied the simple fix for now; in future I think it would be good to correct invalid doctypes wherever possible.
