Here is a unit test that reproduces the unexpected behavior:
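@Test
public void testInvalidCodePoint() throws Exception {
    // "&#5274124" looks like a numeric character reference, but 5274124 is not a valid code point
    String html = "<html><body><a href='http://localhost?param1=value1&#5274124'>Hello World!</a></body></html>";
    InputSource url = new InputSource(new ByteArrayInputStream(html.getBytes()));
    DOMParser parser = new DOMParser();
    // Throws IllegalArgumentException while scanning the href attribute
    parser.parse(url);
}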
The stack trace is as follows:
java.lang.IllegalArgumentException
at java.lang.Character.toChars(Character.java:4982)
at org.cyberneko.html.HTMLScanner.appendChar(HTMLScanner.java:1685)
at org.cyberneko.html.HTMLScanner.access$1000(HTMLScanner.java:98)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:3027)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2851)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2700)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2110)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
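The issue seems to be that "&#5274124" is interpreted as a numeric character reference, i.e. as a code point, whereas in this case "&" is the parameter separator and "#5274124" is the fragment id. Since 5274124 is larger than the maximum Unicode code point (0x10FFFF, i.e. 1114111), Character.toChars rejects it with an IllegalArgumentException.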
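The failing call can be reproduced without the parser. The sketch below is only an illustration of the underlying JDK behavior, not the fix that was applied in the library; the class name CodePointGuard and the helper toCharsOrNull are hypothetical and only show one way a caller could guard against out-of-range values before calling Character.toChars.

```java
public class CodePointGuard {

    /**
     * Hypothetical helper: returns the UTF-16 chars for a numeric reference
     * value, or null if the value is not a valid Unicode code point.
     */
    public static char[] toCharsOrNull(int value) {
        if (!Character.isValidCodePoint(value)) {
            return null;
        }
        return Character.toChars(value);
    }

    /** Returns true if Character.toChars rejects the given value. */
    public static boolean throwsOnToChars(int value) {
        try {
            Character.toChars(value);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int value = 5274124; // taken from "&#5274124" in the test URL

        System.out.println(value > Character.MAX_CODE_POINT); // true
        System.out.println(throwsOnToChars(value));           // true
        System.out.println(toCharsOrNull(value) == null);     // true
        System.out.println(new String(toCharsOrNull(65)));    // A
    }
}
```

With such a check in place, an out-of-range value like 5274124 could be left in the attribute as literal text instead of aborting the whole parse. Whether that is the desired handling is an assumption here.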
Fixed in https://github.com/HtmlUnit/htmlunit-neko