Hi guys,
A bit of real-world HTML is causing a crash. Here's the stack trace:
java.lang.IllegalArgumentException
at java.lang.Character.toChars(Character.java:5172)
at org.htmlcleaner.Utils.convertToUnicode(Utils.java:364)
at org.htmlcleaner.Utils.escapeXml(Utils.java:156)
at org.htmlcleaner.Utils.isEmptyString(Utils.java:480)
at org.htmlcleaner.ContentNode.<init>(ContentNode.java:53)
at org.htmlcleaner.HtmlTokenizer.addSavedAsContent(HtmlTokenizer.java:328)
at org.htmlcleaner.HtmlTokenizer.content(HtmlTokenizer.java:815)
at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:461)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:371)
And a test case that reproduces the issue:
public void testUnicodeIssue()
{
    final String HTML = "<html>"
        + "<body>Brine�s."
        + "</body>"
        + "</html>";
    try
    {
        final TagNode tagNode = new HtmlCleaner().clean(HTML);
        final CleanerProperties cleanerProperties = new CleanerProperties();
        new DomSerializer(cleanerProperties).createDOM(tagNode);
    }
    catch (IllegalArgumentException e)
    {
        e.printStackTrace();
        fail();
    }
    catch (ParserConfigurationException e)
    {
        e.printStackTrace();
    }
}
I'm using HtmlCleaner 2.16, btw!
Fixed - the Unicode parser now handles invalid code points more gracefully.
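
For anyone stuck on an older version, here is a minimal sketch of the kind of guard that avoids this exception. It is not the actual HtmlCleaner patch: the class name `SafeCodePoints`, the method `toStringOrFallback`, and the choice of U+FFFD as the fallback are all assumptions made for illustration. The only facts it relies on are that `Character.toChars` throws `IllegalArgumentException` for values outside the valid code point range, and that `Character.isValidCodePoint` can check for that beforehand.

```java
// Hypothetical helper, not part of HtmlCleaner.
final class SafeCodePoints {

    // Fallback for invalid values (assumed here; the real fix may do something else).
    static final String REPLACEMENT = "\uFFFD";

    private SafeCodePoints() {
    }

    // Converts the numeric value of a character reference to a String.
    // Values outside 0..0x10FFFF would make Character.toChars throw
    // IllegalArgumentException, so they are mapped to REPLACEMENT instead.
    static String toStringOrFallback(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return REPLACEMENT;
        }
        return new String(Character.toChars(codePoint));
    }
}
```

With this in place, `SafeCodePoints.toStringOrFallback(0x110000)` returns the replacement character rather than throwing, while valid values such as `0x2019` are converted as usual.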