#183 Real world html causes clean() to eat all available memory

Milestone: v2.19
Status: closed-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2017-02-13
Created: 2017-02-02
Creator: Code Buddy
Private: No

I've just encountered some real-world HTML that causes HtmlCleaner never to return from a call to clean(). If I fire up jvisualvm I can see the memory usage climbing steadily until the JVM runs out of memory. I've boiled the HTML down to the smallest snippet I could find that still reproduces the issue and put it in a little CLI program that should make it easy to reproduce:

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class HtmlCleanerIssue
{
    private static final String HTML = "<html>"
            + "<body>"
            + "<UL>"
            + "<LI>"
            + "<A href=about-inma.cfm>"
            + "<figure class=sub-nav-avatar>"
            + "<svg width=\"200\" height=\"200\" viewBox=\"0 0 200 200\" xmlns=\"http://www.w3.org/2000/svg\">"
            + "<TITLE>about</TITLE>"
            + "<desc>Created with Sketch.</desc>"
            + "<g fill=\"none\" fill-rule=\"evenodd\"><g>"
            + "<g transform=\"translate(-67 -180) translate(67 180)\">"
            + "<circle class=\"icon-bg\" fill=\"#2D92FF\" cx=\"100\" cy=\"100\" r=\"100\"/>"
            + "<g transform=\"translate(50.877 50.877)\">"
            + "<circle stroke=\"#fff\" stroke-width=\"4\" cx=\"48.581\" cy=\"48.999\" r=\"48.246\"/>"
            + "<path d=\"M49.803 38.96h-2.926v31.12h2.926V38.96zm1.13-10.05c0-1.62-1.197-2.594-2.66-2.594-1.462 0-2.66.973-2.66 2.594 0 1.488 1.198 2.53 2.66 2.53 1.463 0 2.66-1.042 2.66-2.53z\" fill=\"#fff\"/>"
            + "</g>"
            + "</g>"
            + "</g>"
            + "</g>"
            + "</svg>"
            + "</figure>"
            + "<SPAN class=sub-nav-title>About INMA</SPAN> </A> </LI>"
            + "</UL>"
            + "</body>"
            + "</html>";

    public static void main(String[] args) {
        System.out.println("Cleaning....");
        final TagNode tagNode = new HtmlCleaner().clean(HTML);
        System.out.println("done!" + tagNode);
    }
}

If I've missed anything just let me know!

Discussion

  • Scott Wilson

    Scott Wilson - 2017-02-02

    Thanks for the report CB,

    I've narrowed down the case that causes this to:

        String html = ""
                + "<svg xmlns=\"http://www.w3.org/2000/svg\">"
                + "<TITLE>about</TITLE>"
                + "</svg>"
                + "<SPAN>About INMA</SPAN>";

    So some combination of these tags is causing a problem.

     
    • Code Buddy

      Code Buddy - 2017-02-02

      Great stuff, cheers Scott!

       
  • Scott Wilson

    Scott Wilson - 2017-02-02

    It turns out this was caused by inconsistent handling of upper and lower case when a document mixes namespaces with potentially conflicting tag names (SVG TITLE vs. HTML title). I've checked a fix into trunk if you'd like to give it a whirl.
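
    To illustrate the kind of mismatch described above, here is a minimal sketch of how two code paths can disagree about the same tag when only one of them normalizes case. It is purely hypothetical: the class `TagCaseSketch`, the `KNOWN_TAGS` map and both lookup methods are invented for illustration and are not HtmlCleaner's actual code or data structures.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class TagCaseSketch {
    // Hypothetical registry keyed by lower-case tag name (an assumption,
    // not HtmlCleaner's real data structure).
    private static final Map<String, String> KNOWN_TAGS = new HashMap<>();
    static {
        KNOWN_TAGS.put("title", "html head element");
        KNOWN_TAGS.put("span", "html inline element");
    }

    // Inconsistent path: looks the name up exactly as written in the markup.
    public static boolean isKnownRaw(String tagName) {
        return KNOWN_TAGS.containsKey(tagName);
    }

    // Consistent path: normalizes case before every lookup.
    public static boolean isKnownNormalized(String tagName) {
        return KNOWN_TAGS.containsKey(tagName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isKnownRaw("TITLE"));        // prints false
        System.out.println(isKnownNormalized("TITLE")); // prints true
    }
}
```

    If one part of a parser treats `TITLE` as known and another does not, the two can keep handing the same node back and forth, which is one plausible way for a cleanup loop to stop making progress.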

     
  • Scott Wilson

    Scott Wilson - 2017-02-06
    • status: open --> closed-fixed
     
  • Scott Wilson

    Scott Wilson - 2017-02-06
    • Group: v2.18 --> v2.19
     
  • Code Buddy

    Code Buddy - 2017-02-13

    Tested - works great, thanks!
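
    For anyone re-testing a hang like this one, a timeout guard keeps the check from running until memory is exhausted. The sketch below uses only `java.util.concurrent`; the class `TimeoutGuard` and the method `finishesWithin` are hypothetical names chosen for this example and are not part of HtmlCleaner.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutGuard {
    /** Runs the task on a daemon worker thread; returns true if it terminated within the limit. */
    public static boolean finishesWithin(Callable<?> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timeout-guard");
            t.setDaemon(true); // a runaway task must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: only helps if the task reacts to interruption
            return false;
        } catch (ExecutionException e) {
            return true; // the task terminated, although with an exception or error
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the caller's interrupt flag
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A quick task finishes in time; a sleeping task stands in for a call that never returns.
        System.out.println(finishesWithin(() -> "ok", 1000));
        System.out.println(finishesWithin(() -> { Thread.sleep(5000); return null; }, 100));
    }
}
```

    With HtmlCleaner on the classpath, the task could be `() -> new HtmlCleaner().clean(html)` using the snippet from the report. Note that a task that fails with an exception or an OutOfMemoryError counts as terminated here, so this only detects calls that are still running when the limit expires.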

     
