I'm reporting this against 1.9.19 (the bug tracker doesn't offer that as an option).
We had some HTML that includes the following string:
(Yes, sadly, there are systems out there that generate URLs like this.)
When parsing this, NekoHTML attempts to treat � as an entity reference, even though it is not terminated with a semicolon. It then gets an exception when trying to interpret the reference:
at java.lang.Character.toChars(Character.java:2584)
at org.cyberneko.html.HTMLScanner.appendChar(HTMLScanner.java:1675)
at org.cyberneko.html.HTMLScanner.scanEntityRef(HTMLScanner.java:1384)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2049)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:167)
I would suggest two changes:
(1) By default, don't interpret this as an entity reference unless it ends with a semicolon.
(2) If an entity reference does not resolve to a valid code point, catch the IllegalArgumentException from Character.toChars, log the error, and carry on - preferably leaving the original string intact, but stripping it out altogether would be okay too. Either way would be more useful than propagating the IllegalArgumentException.
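To make the two suggestions concrete, here is a rough, standalone sketch of the behaviour I have in mind. It is not NekoHTML code: the class name LenientEntityDecoder and the method decodeNumericRef are hypothetical, it only handles decimal references, and it logs to System.err purely for illustration.

```java
// Hypothetical helper, not part of NekoHTML; illustrates the suggested behaviour only.
public final class LenientEntityDecoder {

    private LenientEntityDecoder() {}

    /**
     * Decodes a decimal numeric character reference such as "&#65;".
     * Returns the input unchanged if it is not terminated with a semicolon
     * or does not name a valid Unicode code point.
     */
    public static String decodeNumericRef(String ref) {
        // Suggestion (1): only treat the text as a reference when it has
        // the "&#" prefix, at least one digit, and a terminating semicolon.
        if (!ref.matches("&#[0-9]+;")) {
            return ref;
        }
        String digits = ref.substring(2, ref.length() - 1);
        try {
            int codePoint = Integer.parseInt(digits);
            // Character.toChars throws IllegalArgumentException when the
            // value is not a valid Unicode code point.
            return new String(Character.toChars(codePoint));
        } catch (IllegalArgumentException e) {
            // Suggestion (2): log and carry on, leaving the original text intact.
            // NumberFormatException (digit overflow) is a subclass of
            // IllegalArgumentException, so it is handled here as well.
            System.err.println("Ignoring invalid character reference: " + ref);
            return ref;
        }
    }
}
```

With this approach, an unterminated reference or one with an out-of-range value passes through as literal text instead of aborting the parse.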