When cleaning URLs from slacker.com, such as http://www.slacker.com/album/bread/the-best-of-bread, I get:
Exception in thread "main" java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.htmlcleaner.BaseToken
at org.htmlcleaner.TagNode.addChildren(TagNode.java:429)
at org.htmlcleaner.TagNode.addChild(TagNode.java:408)
at org.htmlcleaner.HtmlCleaner.createDocumentNodes(HtmlCleaner.java:957)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:433)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:338)
In trunk the above URL works OK, but cleaning fails with a similar exception on
http://www.bajarmusicamp3.org/video/NvR60Wg9R7Q/bon-jovi-bed-of-roses.html
Exception in thread "main" java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.htmlcleaner.TagNode
at org.htmlcleaner.HtmlCleaner.saveToLastOpenTag(HtmlCleaner.java:627)
at org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:891)
at org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:103)
at org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:551)
at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:424)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:334)
at sknil.utils.Internet$.fixHTML(Internet.scala:160)
at sknil.utils.Internet$$anonfun$12.apply(Internet.scala:389)
at sknil.utils.Internet$$anonfun$12.apply(Internet.scala:389)
at scala.Option.map(Option.scala:145)
at sknil.utils.Internet$.fetchUrl(Internet.scala:389)
at sknil.utils.Internet$.main(Internet.scala:714)
at sknil.utils.Internet.main(Internet.scala)
Thanks for the report, Haadar. I've narrowed this down to a minimal test case that causes the problem:
The problem seems to be that the page declares itself to be an XML document, and then includes an unknown tag ("META"), which is unclosed.
If the namespace declaration is correct, e.g.:
(Note the "www"!) then the error also doesn't arise.
So basically it's a combination of problems that HtmlCleaner isn't able to handle:
The simplest fix I have for this is to be more forgiving with identifying the XHTML namespace. If I just change the following line in HtmlCleaner.java:
... then everything seems OK.
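To illustrate what "more forgiving" could mean here, below is a minimal standalone sketch. It is not the actual change made in HtmlCleaner.java. The class name XhtmlNamespaceCheck and the method isXhtmlNamespace are hypothetical, and it assumes the page's declaration differs from the official XHTML namespace URI only in small ways such as a missing "www.", letter case, or a trailing slash.

```java
import java.util.Locale;

// Hypothetical helper, NOT part of HtmlCleaner: a sketch of a lenient
// comparison against the XHTML namespace URI.
final class XhtmlNamespaceCheck {

    // The official XHTML namespace URI.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private XhtmlNamespaceCheck() {
    }

    // Returns true if the declared URI matches the XHTML namespace after
    // ignoring case, the URL scheme, a leading "www." and a trailing slash.
    static boolean isXhtmlNamespace(String declaredUri) {
        return declaredUri != null
                && normalize(declaredUri).equals(normalize(XHTML_NS));
    }

    // Reduces a URI to a comparable form, e.g. "w3.org/1999/xhtml".
    private static String normalize(String uri) {
        String s = uri.trim().toLowerCase(Locale.ROOT);
        if (s.startsWith("http://")) {
            s = s.substring("http://".length());
        } else if (s.startsWith("https://")) {
            s = s.substring("https://".length());
        }
        if (s.startsWith("www.")) {
            s = s.substring("www.".length());
        }
        if (s.endsWith("/")) {
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }
}
```

With a check along these lines, a declaration that omits the "www" would still be treated as XHTML, while unrelated namespaces (SVG, for instance) would not. Strictly speaking, namespace URIs are meant to be compared character for character, so this leniency is a deliberate trade-off in favour of handling real-world pages.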
What do you think, does this seem a reasonable solution?
Fix applied.