Here is a unit test that reproduces the unexpected behavior:
    @Test
    public void testInvalidCodePoint() throws Exception {
        String html = "<html><body><a href='http://localhost?param1=value1&#5274124'>Hello World!</a></body></html>";
        InputSource url = new InputSource(new ByteArrayInputStream(html.getBytes()));
        DOMParser parser = new DOMParser();
        parser.parse(url);
    }
The stack trace is the following:
    java.lang.IllegalArgumentException
        at java.lang.Character.toChars(Character.java:4982)
        at org.cyberneko.html.HTMLScanner.appendChar(HTMLScanner.java:1685)
        at org.cyberneko.html.HTMLScanner.access$1000(HTMLScanner.java:98)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:3027)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2851)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2700)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2110)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
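The issue seems to be that the scanner interprets "&#5274124" in the attribute value as a numeric character reference and tries to convert 5274124 to a code point. In this case, however, "&" is the parameter separator and "#5274124" is the fragment id.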
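To illustrate why the stack trace ends in Character.toChars: 5274124 is larger than Character.MAX_CODE_POINT (0x10FFFF), so the JDK rejects it as a code point. The sketch below is not NekoHTML code; it only reproduces the exception in isolation and shows one possible guard. The class name CodePointCheck and the helper toCharsOrNull are hypothetical names chosen for this example.

```java
public class CodePointCheck {

    // Hypothetical helper: returns the UTF-16 chars for a numeric character
    // reference value, or null when the value is not a valid Unicode code point.
    public static char[] toCharsOrNull(int value) {
        if (!Character.isValidCodePoint(value)) {
            return null;
        }
        return Character.toChars(value);
    }

    public static void main(String[] args) {
        int value = 5274124; // the number that follows "&#" in the reported href

        // The value lies above the largest Unicode code point (0x10FFFF).
        System.out.println(value > Character.MAX_CODE_POINT); // prints true

        // Calling toChars directly throws, as in the reported stack trace.
        try {
            Character.toChars(value);
        } catch (IllegalArgumentException e) {
            System.out.println("toChars rejected " + value);
        }

        // With the guard, the invalid value is reported instead of throwing.
        System.out.println(toCharsOrNull(value) == null); // prints true
        System.out.println(new String(toCharsOrNull(65))); // prints A
    }
}
```

A parser that applied a check of this kind could fall back to treating the text literally when the number is out of range, which would keep "&" and "#5274124" as plain characters of the URL. Whether that is the right fix for HTMLScanner is an assumption, not something confirmed here.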