Here's a test case that exposes the issue:
@Test
public void testCrash()
{
    final String HTML = "<html xmlns=\"foo\">"
        + "<head>"
        + "</head>"
        + "<body>"
        + "<table>"
        + "<tr>"
        + "<td>"
        + "<BR />"
        + "</td>"
        + "</tr>"
        + "</table>"
        + "<div>"
        + "</div>"
        + "</body>"
        + "</html>";
    new HtmlCleaner().clean(HTML);
}
Found in 2.14 and also present in 2.15. On the face of it this looks a lot like bug 133, but it is subtly different, as that test case still passes.
Playing around with the HTML, it seems you need a few things to trigger this: the namespace, the uppercase BR tag and the div.
Shout if you need any more info!
Good catch CB! I'll see if I can track down the cause of it.
That's great, thanks Scott!
Right, well first off, as we're declaring quite clearly "This isn't HTML, it's FOO", a lot of the usual rules on tags get suspended, so everything in the doc is a plain old XML tag. Something weird then happens with the self-closing BR tag not actually closing: if the BR isn't written as a self-closing tag, we're all good still.
So what I think happens is this. We get to the self-closing BR tag. We're using HTML5 rules, so there is no such thing as a self-closing tag, and we treat it as an open tag. Next we process the DIV tag. We check the last open tag, BR, and find it doesn't accept any children, so we look for the last valid open HTML tag. That doesn't exist: no open HTML tag allows child elements, because there are no HTML tags inside the BODY.
I can easily add a null check to the saveToLastOpenTag() method, which handles the NPE, but then the DIV vanishes as it can't be placed anywhere else.
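To make the failure mode concrete, here is a rough toy model of the lookup described above. It is not HtmlCleaner's code: OpenTagSketch, OpenTag, lastOpenAcceptingChildren() and place() are names made up for illustration, and the real saveToLastOpenTag() may well look quite different. It only shows why a null check stops the NPE but leaves the DIV with nowhere to go.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: none of these names come from HtmlCleaner.
class OpenTagSketch {
    // One entry per tag that is still considered open.
    static final class OpenTag {
        final String name;
        final boolean acceptsChildren;

        OpenTag(String name, boolean acceptsChildren) {
            this.name = name;
            this.acceptsChildren = acceptsChildren;
        }
    }

    private final Deque<OpenTag> open = new ArrayDeque<>();

    void open(String name, boolean acceptsChildren) {
        open.push(new OpenTag(name, acceptsChildren));
    }

    // Most recently opened tag that can take children, or null if there is none.
    OpenTag lastOpenAcceptingChildren() {
        for (OpenTag tag : open) { // iterates from the most recent push
            if (tag.acceptsChildren) {
                return tag;
            }
        }
        return null;
    }

    // Returns the name of the parent chosen for the new tag,
    // or null when the tag cannot be placed anywhere.
    String place(String name) {
        OpenTag parent = lastOpenAcceptingChildren();
        if (parent == null) {
            // The guard: without it, parent.name below would throw an NPE.
            // With it, the caller has to drop the tag.
            return null;
        }
        return parent.name;
    }

    public static void main(String[] args) {
        OpenTagSketch sketch = new OpenTagSketch();
        sketch.open("BR", false);                // BR wrongly left open
        System.out.println(sketch.place("div")); // prints "null"
    }
}
```

In this model the guard turns a crash into a silently dropped element, which is the "DIV vanishes" behaviour mentioned above.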
What we would expect is for the self-closing tag to be handled appropriately.
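As a sketch of what "handled appropriately" could mean, the toy below feeds the BODY contents from the test case through a very simplified open-tag stack, once treating "<BR />" as an ordinary open tag and once treating it as opened and closed in one go. SelfClosingSketch and openAfter() are hypothetical names, the tag matching is deliberately naive, and none of it reflects how HtmlCleaner actually tokenizes or how any real fix is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: not HtmlCleaner's parser.
class SelfClosingSketch {
    // Returns the tags still open after feeding the given tag tokens,
    // most recently opened first.
    static Deque<String> openAfter(String[] tags, boolean honourSelfClosing) {
        Deque<String> open = new ArrayDeque<>();
        for (String tag : tags) {
            // Strip '<', '/', '>' and whitespace to get the bare name.
            String name = tag.replaceAll("[</>\\s]", "");
            if (tag.startsWith("</")) {
                open.removeFirstOccurrence(name); // close the matching open tag
            } else if (tag.endsWith("/>") && honourSelfClosing) {
                // Opened and closed at once: never lands on the stack.
            } else {
                open.push(name);
            }
        }
        return open;
    }

    public static void main(String[] args) {
        String[] tags = {
            "<body>", "<table>", "<tr>", "<td>", "<BR />",
            "</td>", "</tr>", "</table>"
        };
        // Last open tag at the point where the DIV would arrive:
        System.out.println(openAfter(tags, false).peek()); // prints "BR"
        System.out.println(openAfter(tags, true).peek());  // prints "body"
    }
}
```

With the self-closing form honoured, the BR is no longer the last open tag when the DIV arrives, so the lookup that fails in the crash case never has to skip past it.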
Just updated to 2.16 and this seems to be fixed, certainly for my test cases. I got a mention in the release notes, so I assume this one just needs to be marked as closed :-)