I just ran into this myself. I believe the behavior change occurred in https://sourceforge.net/p/htmlcleaner/code/521/ (released in 2.22). That change made it so that most HTML entities (e.g., "ã") are now decoded (which is good for me!), but some characters that previously were not encoded are now being encoded, such as & and >. Even though escapeXml=false is passed into the DomSerializer constructor, Utils.escapeXml() still ends up getting called because recognizeUnicodeChars=true. The name of...
Here's the HTML (once again eaten by sourceforge): <time><b><li>
Infinite loop on <time><b><li>
Oops, sourceforge ate part of the markup. Here's the minimal test case: <html xmlns="x"><ul><a>
And the stacktrace:
java.lang.NullPointerException
    at org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:1097)
    at org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:103)
    at org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:546)
    at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:480)
    at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:461)
    at org.htmlcleaner.HtmlCleaner.clean...
NullPointerException in HtmlCleaner.makeTree
Unclosed CDATA can cause ArrayIndexOutOfBoundsException