I've just encountered some real-world HTML that results in htmlcleaner never returning from a call to clean(). If I fire up jvisualvm I can see the memory usage going up and up until it runs out of memory. I've boiled the HTML down to the smallest snippet I can get that recreates the issue and put it in a little CLI program that should make it easy to reproduce:
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class HtmlCleanerIssue
{
    private static final String HTML = "<html>"
            + "<body>"
            + "<UL>"
            + "<LI>"
            + "<A href=about-inma.cfm>"
            + "<figure class=sub-nav-avatar>"
            + "<svg width=\"200\" height=\"200\" viewBox=\"0 0 200 200\" xmlns=\"http://www.w3.org/2000/svg\">"
            + "<TITLE>about</TITLE>"
            + "<desc>Created with Sketch.</desc>"
            + "<g fill=\"none\" fill-rule=\"evenodd\"><g>"
            + "<g transform=\"translate(-67 -180) translate(67 180)\">"
            + "<circle class=\"icon-bg\" fill=\"#2D92FF\" cx=\"100\" cy=\"100\" r=\"100\"/>"
            + "<g transform=\"translate(50.877 50.877)\">"
            + "<circle stroke=\"#fff\" stroke-width=\"4\" cx=\"48.581\" cy=\"48.999\" r=\"48.246\"/>"
            + "<path d=\"M49.803 38.96h-2.926v31.12h2.926V38.96zm1.13-10.05c0-1.62-1.197-2.594-2.66-2.594-1.462 0-2.66.973-2.66 2.594 0 1.488 1.198 2.53 2.66 2.53 1.463 0 2.66-1.042 2.66-2.53z\" fill=\"#fff\"/>"
            + "</g>"
            + "</g>"
            + "</g>"
            + "</g>"
            + "</svg>"
            + "</figure>"
            + "<SPAN class=sub-nav-title>About INMA</SPAN> </A> </LI>"
            + "</UL>"
            + "</body>"
            + "</html>";

    public static void main(String[] args) {
        System.out.println("Cleaning....");
        // With the affected version this call never returns and memory grows until OOM.
        final TagNode tagNode = new HtmlCleaner().clean(HTML);
        System.out.println("done!" + tagNode);
    }
}
If I've missed anything just let me know!
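Since clean() never returns here, it may help to run the reproduction under a time limit so a test run fails fast instead of eating the heap. Below is a rough sketch using only java.util.concurrent; TimeoutGuard and callWithTimeout are hypothetical names made up for this example and are not part of HtmlCleaner.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for this thread; not part of HtmlCleaner.
final class TimeoutGuard {
    private TimeoutGuard() {}

    /**
     * Runs the task on a daemon thread and waits at most timeoutMillis.
     * Returns Optional.empty() if the task did not finish in time
     * (a task that returns null also yields Optional.empty()).
     */
    static <T> Optional<T> callWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timeout-guard");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Interruption only stops the task if it checks for interrupts;
            // a tight loop will keep running on the daemon thread.
            future.cancel(true);
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In the program above, the call would then look something like TimeoutGuard.callWithTimeout(() -> new HtmlCleaner().clean(HTML), 10000), with an empty result treated as "hung". Note that this only stops the caller from waiting; it does not free the memory the runaway thread has already allocated.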
Thanks for the report CB,
I've narrowed down the case that causes this to:
So some combination of these tags is causing a problem.
Great stuff, cheers Scott!
Turns out this was caused by inconsistent handling of upper/lower case when using a mix of namespaces and potentially conflicting tags (svg TITLE vs. html title). I've checked a fix into trunk if you'd like to give it a whirl.
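For anyone curious what this kind of bug looks like in general, here is a small self-contained sketch of the pitfall. It is not the actual patch and does not use any HtmlCleaner code; TagNameMatcher, its methods and the tag set are all invented for illustration. The idea is simply that if one comparison is case-sensitive and another is not, an upper-case svg TITLE can be mistaken for the html title.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; none of these names come from HtmlCleaner.
final class TagNameMatcher {
    static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Tags with a special meaning in plain HTML, stored in lower case.
    private static final Set<String> HTML_HEAD_TAGS =
            new HashSet<>(Arrays.asList("title", "meta", "link"));

    private TagNameMatcher() {}

    /** Normalizes a tag name once so every later comparison sees the same form. */
    static String normalize(String tagName) {
        return tagName.toLowerCase(Locale.ROOT);
    }

    /**
     * Inconsistent version: the namespace exemption compares case-sensitively,
     * while the HTML lookup is case-insensitive, so svg "TITLE" is misclassified.
     */
    static boolean isHtmlHeadTagInconsistent(String tagName, String namespaceUri) {
        boolean foreign = SVG_NS.equals(namespaceUri) && tagName.equals("title");
        return !foreign && HTML_HEAD_TAGS.contains(normalize(tagName));
    }

    /** Consistent version: both checks use the same normalized name. */
    static boolean isHtmlHeadTagConsistent(String tagName, String namespaceUri) {
        String name = normalize(tagName);
        boolean foreign = SVG_NS.equals(namespaceUri) && name.equals("title");
        return !foreign && HTML_HEAD_TAGS.contains(name);
    }
}
```

With the inconsistent check, an svg TITLE is reported as an HTML head tag, whereas the consistent check treats TITLE and title the same way inside the svg namespace.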
Tested - works great, thanks!