#157 IllegalArgumentException when parsing "a" tags whose href contains a fragment right after the "&" parameter separator

Milestone: 1.9.15
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-01-27
Created: 2015-01-18
Private: No

Here is a unit test that reproduces the unexpected behavior:

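import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.cyberneko.html.parsers.DOMParser;
import org.junit.Test;
import org.xml.sax.InputSource;

@Test
public void testInvalidCodePoint() throws Exception {
    // The href contains "&#5274124": "&" separates parameters and "#5274124" is the fragment id.
    String html = "<html><body><a href='http://localhost?param1=value1&#5274124'>Hello World!</a></body></html>";
    InputSource source = new InputSource(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));

    DOMParser parser = new DOMParser();
    parser.parse(source); // throws IllegalArgumentException
}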

The stack trace is as follows:

java.lang.IllegalArgumentException
    at java.lang.Character.toChars(Character.java:4982)
    at org.cyberneko.html.HTMLScanner.appendChar(HTMLScanner.java:1685)
    at org.cyberneko.html.HTMLScanner.access$1000(HTMLScanner.java:98)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:3027)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2851)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2700)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2110)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

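The issue seems to be that "&#5274124" is interpreted as a numeric character reference. Its value, 5274124, is larger than the highest Unicode code point (0x10FFFF = 1114111), so Character.toChars rejects it. In this case, however, "&" is the parameter separator and "#5274124" is the fragment id.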
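
To illustrate the failure mode, here is a minimal, self-contained sketch that uses only the JDK. It shows that Character.toChars rejects 5274124 and how a range check could fall back to the raw text. The class CodePointGuard and the helper appendCodePoint are hypothetical names invented for this example; this is not the actual code of HTMLScanner.appendChar and not necessarily how the library should be fixed.

```java
public class CodePointGuard {

    /**
     * Hypothetical helper: appends the character for a numeric reference,
     * or the raw source text when the value is not a valid Unicode code point.
     */
    static void appendCodePoint(StringBuilder out, int value, String rawText) {
        if (Character.isValidCodePoint(value)) {
            out.appendCodePoint(value);
        } else {
            // Out of range (above 0x10FFFF or negative): keep the text as written.
            out.append(rawText);
        }
    }

    /** Returns true when Character.toChars rejects the given value. */
    static boolean toCharsThrows(int value) {
        try {
            Character.toChars(value);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int value = 5274124; // the number that follows "&#" in the href

        System.out.println("valid code point: " + Character.isValidCodePoint(value));
        System.out.println("toChars throws:   " + toCharsThrows(value));

        StringBuilder href = new StringBuilder("http://localhost?param1=value1");
        appendCodePoint(href, value, "&#5274124");
        System.out.println(href);
    }
}
```

With a guard of this kind, the attribute value would keep "&#5274124" verbatim instead of aborting the parse. Whether that is the right behavior for the parser is a design decision for the maintainers.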
