
#124 ClassCastException

Group: v 2.9
Status: closed-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-08-13
Created: 2014-08-13
Creator: Haadar
Private: No

When cleaning URLs from slacker.com, such as http://www.slacker.com/album/bread/the-best-of-bread, HtmlCleaner throws:
Exception in thread "main" java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.htmlcleaner.BaseToken
at org.htmlcleaner.TagNode.addChildren(TagNode.java:429)
at org.htmlcleaner.TagNode.addChild(TagNode.java:408)
at org.htmlcleaner.HtmlCleaner.createDocumentNodes(HtmlCleaner.java:957)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:433)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:338)
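
For readers unfamiliar with this kind of failure, here is a generic sketch of how an "ArrayList cannot be cast to ..." exception can arise when a child list unexpectedly contains another list. The `Token` and `NestedListCastDemo` classes are hypothetical stand-ins written for illustration; this is not HtmlCleaner's code and does not claim to show the actual cause inside `TagNode.addChildren`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token type; not HtmlCleaner's BaseToken.
class Token {
    final String name;
    Token(String name) { this.name = name; }
}

class NestedListCastDemo {
    // Casts each element to Token, as code walking an untyped child list might do.
    static int countTokens(List<?> children) {
        int count = 0;
        for (Object child : children) {
            Token token = (Token) child; // throws ClassCastException if child is a List
            if (token != null) count++;
        }
        return count;
    }

    // Returns true when walking the list triggers a ClassCastException.
    static boolean failsWithClassCast(List<?> children) {
        try {
            countTokens(children);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Object> flat = new ArrayList<>();
        flat.add(new Token("meta"));

        List<Object> nested = new ArrayList<>();
        nested.add(new Token("head"));
        nested.add(new ArrayList<Object>()); // a list where a token was expected

        System.out.println(failsWithClassCast(flat));
        System.out.println(failsWithClassCast(nested));
    }
}
```

The first list holds only tokens and is walked without error; the second holds a nested list, so the cast fails in the same way the trace above reports.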

Discussion

  • Haadar

    Haadar - 2014-08-13

    In trunk the above URL works OK, but cleaning fails with a similar exception on
    http://www.bajarmusicamp3.org/video/NvR60Wg9R7Q/bon-jovi-bed-of-roses.html
    Exception in thread "main" java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.htmlcleaner.TagNode
    at org.htmlcleaner.HtmlCleaner.saveToLastOpenTag(HtmlCleaner.java:627)
    at org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:891)
    at org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:103)
    at org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:551)
    at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486)
    at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:424)
    at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:334)
    at sknil.utils.Internet$.fixHTML(Internet.scala:160)
    at sknil.utils.Internet$$anonfun$12.apply(Internet.scala:389)
    at sknil.utils.Internet$$anonfun$12.apply(Internet.scala:389)
    at scala.Option.map(Option.scala:145)
    at sknil.utils.Internet$.fetchUrl(Internet.scala:389)
    at sknil.utils.Internet$.main(Internet.scala:714)
    at sknil.utils.Internet.main(Internet.scala)

     
  • Scott Wilson

    Scott Wilson - 2014-08-13

    Thanks for the report, Haadar. I've narrowed this down to a minimal test case that reproduces the problem:

    <html xmlns="http://w3.org/1999/xhtml">
    <head>
    <META name="blah"> 
    <meta name="blah"/> 
    </head>
    <body>
    </body>
    </html>
    

    The problem seems to be that the page declares itself to be an XML document, and then includes an unknown tag ("META"), which is unclosed.

    If the namespace declaration is correct, e.g.:

    <html xmlns="http://www.w3.org/1999/xhtml">
    

    (Note the "www"!) then the error also doesn't arise.

    So basically it's a combination of problems that HtmlCleaner isn't able to handle:

    1. Incorrect namespace URL for XHTML
    2. An XHTML element with the wrong case
    3. An unclosed meta tag in XHTML

    The simplest fix I have for this is to be more forgiving when identifying the XHTML namespace. If I just change line 650 of HtmlCleaner.java to:

    if (ns.equals("http://www.w3.org/1999/xhtml") || ns.equals("http://w3.org/1999/xhtml")) return false;
    

    ... then everything seems OK.
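
    The lenient comparison described above can be sketched on its own as follows. This is a minimal illustration, assuming a hypothetical helper class `XhtmlNamespaces` with a method `isXhtml`; neither name exists in HtmlCleaner, and the real check sits inline in HtmlCleaner.java.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of HtmlCleaner: it only illustrates the
// lenient namespace comparison described in the post.
class XhtmlNamespaces {
    // The correct XHTML namespace plus the misspelling without "www".
    private static final Set<String> ACCEPTED = new HashSet<>(Arrays.asList(
            "http://www.w3.org/1999/xhtml",
            "http://w3.org/1999/xhtml"));

    // Returns true when the declared namespace should be treated as XHTML.
    static boolean isXhtml(String ns) {
        return ns != null && ACCEPTED.contains(ns.trim());
    }

    public static void main(String[] args) {
        System.out.println(isXhtml("http://www.w3.org/1999/xhtml"));
        System.out.println(isXhtml("http://w3.org/1999/xhtml"));
        System.out.println(isXhtml("http://www.w3.org/2000/svg"));
    }
}
```

    Keeping the accepted spellings in one set would make it easy to tolerate further variants later, at the cost of treating technically incorrect declarations as XHTML.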

    What do you think, does this seem a reasonable solution?

     
  • Scott Wilson

    Scott Wilson - 2014-08-13
    • status: open --> closed-fixed
    • Group: v 2.8 --> v 2.9
     
  • Scott Wilson

    Scott Wilson - 2014-08-13

    Fix applied.

     
