#176 Crash: IllegalArgumentException in convertToUnicode

Milestone: v2.17

Status: closed-fixed

Owner: nobody

Labels: None

Priority: 5

Updated: 2016-08-17

Created: 2016-07-28

Creator: Code Buddy

Private: No

Hi guys,

A bit of real-world HTML is causing a crash. Here's the stack trace:

java.lang.IllegalArgumentException
    at java.lang.Character.toChars(Character.java:5172)
    at org.htmlcleaner.Utils.convertToUnicode(Utils.java:364)
    at org.htmlcleaner.Utils.escapeXml(Utils.java:156)
    at org.htmlcleaner.Utils.isEmptyString(Utils.java:480)
    at org.htmlcleaner.ContentNode.<init>(ContentNode.java:53)
    at org.htmlcleaner.HtmlTokenizer.addSavedAsContent(HtmlTokenizer.java:328)
    at org.htmlcleaner.HtmlTokenizer.content(HtmlTokenizer.java:815)
    at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486)
    at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:461)
    at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:371)

And a test case that reproduces the issue:

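    // Needs HtmlCleaner, TagNode, CleanerProperties and DomSerializer from org.htmlcleaner,
    // plus javax.xml.parsers.ParserConfigurationException.
    public void testUnicodeIssue()
    {
        // 2013266066 is far above Character.MAX_CODE_POINT (0x10FFFF = 1114111).
        final String HTML = "<html>"
                + "<body>Brine&#2013266066;s."
                + "</body>"
                + "</html>";
        try
        {
            final TagNode tagNode = new HtmlCleaner().clean(HTML);
            final CleanerProperties cleanerProperties = new CleanerProperties();
            new DomSerializer(cleanerProperties).createDOM(tagNode);
        }
        catch (IllegalArgumentException e)
        {
            // This is the crash being reported.
            e.printStackTrace();
            fail();
        }
        catch (ParserConfigurationException e)
        {
            // Unrelated to this bug, so only logged.
            e.printStackTrace();
        }
    }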
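
For context, the top frame of the trace can be reproduced without HtmlCleaner at all. The snippet below is a minimal sketch using only the JDK; the class name `CodePointCheck` and the helper `toCharsAccepts` are made up for illustration and are not part of HtmlCleaner. It shows that the number in the entity is not a valid Unicode code point and that `Character.toChars` rejects it with an `IllegalArgumentException`.

```java
class CodePointCheck {
    /** Returns true if Character.toChars accepts the value, false if it throws. */
    static boolean toCharsAccepts(int codePoint) {
        try {
            Character.toChars(codePoint);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown for anything outside 0..Character.MAX_CODE_POINT.
            return false;
        }
    }

    public static void main(String[] args) {
        int fromEntity = 2013266066; // decimal value inside &#2013266066;
        System.out.println(Character.MAX_CODE_POINT);               // 1114111
        System.out.println(Character.isValidCodePoint(fromEntity)); // false
        System.out.println(toCharsAccepts(fromEntity));             // false
    }
}
```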

Discussion

Code Buddy - 2016-07-28

Using 2.16 btw!


Scott Wilson - 2016-08-17

status: open --> closed-fixed

Group: v 2.7 --> v2.17

Scott Wilson - 2016-08-17

Fixed - the Unicode parser now handles invalid code points more gracefully.
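
The thread does not show the actual change, so the following is only a sketch of one way a numeric entity could be decoded without throwing. The class `NumericEntityDecoder`, its `decode` method, and the choice of U+FFFD as the fallback are all assumptions for illustration, not HtmlCleaner's code or behaviour.

```java
class NumericEntityDecoder {
    // Assumption: substitute the Unicode replacement character for bad input.
    static final String REPLACEMENT = "\uFFFD";

    /**
     * Decodes the digits of a numeric character reference
     * (e.g. "8217" with radix 10, or "2019" with radix 16).
     */
    static String decode(String digits, int radix) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            // Check the range first so Character.toChars cannot throw.
            if (!Character.isValidCodePoint(codePoint)) {
                return REPLACEMENT;
            }
            return new String(Character.toChars(codePoint));
        } catch (NumberFormatException e) {
            // Not a number, or too large to fit in an int.
            return REPLACEMENT;
        }
    }
}
```

With this sketch, `decode("2013266066", 10)` yields the replacement character instead of an exception, while `decode("8217", 10)` yields a right single quotation mark.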
