Starting in the 2.19 release, an unclosed CDATA tag can cause an ArrayIndexOutOfBoundsException.
java.lang.ArrayIndexOutOfBoundsException: -1000
at org.htmlcleaner.HtmlTokenizer.startsWith(HtmlTokenizer.java:175)
at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:442)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:461)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:371)
Minimal test case:
<script><![CDATA[xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx That is "<script><![CDATA[" followed by some padding to get it up to at least 1024 characters. </script>
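For anyone who wants to regenerate the input rather than copy the long line, here is a minimal sketch in Java that builds it as described: the "<script><![CDATA[" prefix, then padding up to at least 1024 characters. The class name `UnclosedCdataInput`, the `build` helper and the choice to append the trailing "</script>" (as in the sample above) are illustrative assumptions, not part of HtmlCleaner.

```java
public class UnclosedCdataInput {
    // Prefix from the report: a script element that opens a CDATA section and never closes it.
    static final String PREFIX = "<script><![CDATA[";

    // Assumption: the sample in the report ends with a closing script tag, so one is appended here.
    static final String SUFFIX = "</script>";

    // Returns PREFIX, then 'x' padding until at least minLength characters, then SUFFIX.
    static String build(int minLength) {
        StringBuilder sb = new StringBuilder(PREFIX);
        while (sb.length() < minLength) {
            sb.append('x');
        }
        sb.append(SUFFIX);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = build(1024);
        System.out.println("input length: " + html.length());
        // With HtmlCleaner 2.19 on the classpath, the report says the exception is thrown by:
        //   new org.htmlcleaner.HtmlCleaner().clean(html);
    }
}
```

The string is deliberately built without any "]]>" so the CDATA section stays unclosed.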
Thanks Michael - confirmed bug. I'm checking in a fix for it now.
Hi Scott, the exception disappeared. Thanks for the fix.
However, can you have a look at this page (attached)? As far as I can see, an unclosed CDATA section at the beginning of the document makes HtmlCleaner remove almost all of the valuable content.
I'm using the DomSerializer output of HtmlCleaner v2.21. Is there a way to make CDATA parsing a bit more intuitive, so that in such an unbalanced case it stops where it "should"?
Hi Papp!
I can't see an unclosed CDATA in that page; there is however a script tag that may be to blame for some odd behaviour - it has a type of "text/x-ab-test" and contains HTML tags including a nested script tag:
With XMLSerializer, we have this output:
With DomSerializer, we have this output:
So I think there is a problem here with DomSerializer.
D'oh no that was just my debug code!
Hi Scott,
I made a terrible mistake: the file I attached previously was a version I had been experimenting with. Sorry about that.
I have now attached the original, but to be safe you can also find it online: http://www.jpost.com/Breaking-News/Putin-congratulates-Frances-Macron-urging-for-united-efforts-amid-terror-threats-490105
Here is the offending CDATA section, whose closing tag lacks the '>':
If I switch to XML syntax in my IDE, it highlights the rest of the document as part of the CDATA section.
But you are right, there are other errors in this page.
The full story is that I was playing with version 2.19 and got the same exception that Michael originally posted. I gave v2.21 a try but still could not get the main content.
Last edit: Győző Papp 2017-05-13
I have to say this is a really weird problem!
OK, I think I vaguely know what is happening now - it's caused by the content of the unclosed CDATA tag exceeding the token buffer size (1024), so winding back to close the CDATA tag puts it in an odd location.
I've created a new bug for this - it's #189 - follow it there for more updates
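To make the buffer explanation above a little more concrete, here is a toy sketch in Java of how winding back to a remembered position can land outside a fixed-size window once more than 1024 characters have been consumed. This is not HtmlCleaner's actual code: the class `RewindSketch`, the method `indexInWindow` and the sliding-window model are all assumptions made purely for illustration; only the buffer size of 1024 comes from the comment above.

```java
public class RewindSketch {
    // Buffer size mentioned in the thread; everything else in this class is hypothetical.
    static final int BUFFER_SIZE = 1024;

    // Models a window that holds only the last BUFFER_SIZE characters read.
    // `mark` is the absolute offset where the CDATA section started,
    // `consumed` is the total number of characters read while looking for "]]>".
    // Returns the index of `mark` relative to the start of the current window.
    static int indexInWindow(int mark, int consumed) {
        int windowStart = Math.max(0, consumed - BUFFER_SIZE);
        return mark - windowStart;
    }

    public static void main(String[] args) {
        int mark = 17; // just after "<script><![CDATA["

        // Short content: the marked position is still inside the window.
        System.out.println(indexInWindow(mark, 1000));

        // Long unclosed content: the window has slid past the mark,
        // so the relative index becomes negative.
        System.out.println(indexInWindow(mark, 3000));
    }
}
```

In this model, a tokenizer that rewinds without checking for a negative index would either throw or resume from the wrong place, which is the kind of symptom described in this thread. The real cause is tracked in bug #189.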