#153 &# incorrectly interpreted as entity, leads to IllegalArgumentException

1.9.15
closed-fixed
None
5
2014-02-13
2013-12-13
danrabe
No

I'm reporting this against 1.9.19 (the bug tracker doesn't offer that as an option).

We had some HTML that includes the following string:
http://example.com/jive3/thread.jspa?messageID=6507741&#6507741

(Yes, sadly, there are systems out there that generate URLs like this.)

When parsing this, NekoHTML treats &#6507741 as a numeric entity reference, even though it is not terminated with a semicolon. The value 6507741 is larger than the maximum Unicode code point (0x10FFFF, i.e. 1114111), so converting it to characters throws an exception:
Exception: java.lang.IllegalArgumentException
Message: null
    at java.lang.Character.toChars(Character.java:2584)
    at org.cyberneko.html.HTMLScanner.appendChar(HTMLScanner.java:1675)
    at org.cyberneko.html.HTMLScanner.scanEntityRef(HTMLScanner.java:1384)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2049)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:167)
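
The root cause at the top of the trace can be reproduced without NekoHTML. The sketch below uses only the JDK; the class and method names (InvalidCodePointDemo, toCharsRejects) are made up for illustration and are not part of NekoHTML.

```java
final class InvalidCodePointDemo {
    // The numeric value taken from the "&#6507741" fragment in the report.
    static final int REPORTED_VALUE = 6507741;

    /** Returns true if Character.toChars rejects the given value. */
    static boolean toCharsRejects(int value) {
        try {
            Character.toChars(value);
            return false;
        } catch (IllegalArgumentException e) {
            // Thrown for any value that is not a valid Unicode code point.
            return true;
        }
    }

    public static void main(String[] args) {
        // Valid code points lie in the range 0 .. Character.MAX_CODE_POINT (0x10FFFF).
        System.out.println("max code point:  " + Character.MAX_CODE_POINT);
        System.out.println("value is valid:  " + Character.isValidCodePoint(REPORTED_VALUE));
        System.out.println("toChars rejects: " + toCharsRejects(REPORTED_VALUE));
    }
}
```

Checking Character.isValidCodePoint before calling Character.toChars avoids the exception entirely.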

I would suggest two changes:
(1) By default, don't interpret this as an entity reference unless it ends with a semicolon.
(2) If a numeric entity does not resolve to a valid code point, catch the IllegalArgumentException from Character.toChars, log the error, and carry on - preferably leaving the original string intact, but stripping it out altogether would be okay too. Either way would be more useful than propagating the IllegalArgumentException.
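
As a rough illustration of both suggestions together, here is a minimal standalone sketch. It is not NekoHTML code: LenientNumericRefs and its methods are hypothetical names, it handles only decimal references (not &#x... or named entities), and it validates the value up front instead of catching the exception.

```java
final class LenientNumericRefs {
    /**
     * Decodes decimal references of the form "&#digits;".
     * A reference without a terminating semicolon, or whose value is not a
     * valid Unicode code point, is copied through unchanged.
     */
    static String decode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            if (!input.startsWith("&#", i)) {
                out.append(input.charAt(i++));
                continue;
            }
            // Find the end of the digit run that follows "&#".
            int j = i + 2;
            while (j < input.length() && isAsciiDigit(input.charAt(j))) {
                j++;
            }
            boolean hasDigits = j > i + 2;
            boolean terminated = j < input.length() && input.charAt(j) == ';';
            int codePoint = (hasDigits && terminated) ? parseCodePoint(input, i + 2, j) : -1;
            if (codePoint >= 0) {
                out.appendCodePoint(codePoint);
                i = j + 1; // skip past the semicolon
            } else {
                // Suggestion (1) or (2) applies: leave the original text intact.
                out.append(input, i, j);
                i = j;
            }
        }
        return out.toString();
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns the parsed value, or -1 if it is not a valid code point. */
    private static int parseCodePoint(String s, int start, int end) {
        long value = 0;
        for (int k = start; k < end; k++) {
            value = value * 10 + (s.charAt(k) - '0');
            if (value > Character.MAX_CODE_POINT) {
                return -1; // stop early; this also prevents overflow
            }
        }
        return (int) value;
    }
}
```

With this sketch, the URL from the report passes through unchanged, while a well-formed reference such as &#65; is decoded to A.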

Discussion

  • Marc Guillemot - 2014-02-13
    • status: open --> closed-fixed
    • assigned_to: Marc Guillemot
     
  • Marc Guillemot - 2014-02-13

    In fact this has been fixed by commit 335:
    https://sourceforge.net/p/nekohtml/code/335

    Numeric entities whose value is not a valid code point are now replaced by the replacement character � (U+FFFD).
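
    The behaviour described above can be approximated as follows. This is only a sketch of the idea, not the code from commit 335; ReplacementCharFallback and charsFor are hypothetical names.

```java
final class ReplacementCharFallback {
    // U+FFFD REPLACEMENT CHARACTER, used in place of undecodable input.
    static final char REPLACEMENT = '\uFFFD';

    /** Returns the text for a numeric reference value, or U+FFFD if the value is invalid. */
    static String charsFor(int value) {
        if (!Character.isValidCodePoint(value)) {
            return String.valueOf(REPLACEMENT);
        }
        // Safe: toChars only throws for invalid code points, which were excluded above.
        return new String(Character.toChars(value));
    }
}
```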

     
