#185 Unclosed CDATA can cause ArrayIndexOutOfBoundsException

Milestone: v2.20
Status: closed-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2017-05-14
Created: 2017-04-14
Private: No

Starting in the 2.19 release, an unclosed CDATA tag can cause an ArrayIndexOutOfBoundsException.

java.lang.ArrayIndexOutOfBoundsException: -1000
at org.htmlcleaner.HtmlTokenizer.startsWith(HtmlTokenizer.java:175)
at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:442)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:461)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:371)

Minimal test case:

<script><![CDATA[xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx That is "<script><![CDATA[" followed by some padding to get it up to at least 1024 characters. </script>
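
The padded input described above can also be generated rather than pasted. The following is only a sketch: the class and method names are hypothetical, and the commented-out call assumes HtmlCleaner is on the classpath and that `HtmlCleaner.clean(String)` is the entry point, as the stack trace suggests.

```java
class UnclosedCdataTestCase {
    // Opening of the reported test case: a script element whose CDATA section is never closed.
    static final String PREFIX = "<script><![CDATA[";

    // Returns PREFIX followed by enough 'x' padding to reach at least minLength characters.
    static String build(int minLength) {
        StringBuilder sb = new StringBuilder(PREFIX);
        while (sb.length() < minLength) {
            sb.append('x');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = build(1024);
        System.out.println("input length: " + input.length());
        // Assumption: with HtmlCleaner 2.19 on the classpath, the failure would be reached via
        // new org.htmlcleaner.HtmlCleaner().clean(input);
    }
}
```

Generating the padding makes it easy to try lengths just below and just above 1024 when checking a fix.
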
1 Attachments

Discussion

  • Scott Wilson

    Scott Wilson - 2017-04-14
    • status: open --> open-accepted
     
  • Scott Wilson

    Scott Wilson - 2017-04-14

    Thanks Michael - confirmed bug. I'm checking in a fix for it now.

     
  • Scott Wilson

    Scott Wilson - 2017-04-14
    • status: open-accepted --> closed-fixed
     
  • Győző Papp

    Győző Papp - 2017-05-12

    Hi Scott, the exception disappeared. Thanks for the fix.

    However, could you have a look at this page (attached)? As far as I can see, an unclosed CDATA section at the beginning of the document makes HtmlCleaner remove almost all of the valuable content.

    I'm using the DomSerializer output of HtmlCleaner v2.21. Is there a way to make CDATA parsing a bit more intuitive so that, in such an unbalanced case, it stops where it "should"?

     
  • Scott Wilson

    Scott Wilson - 2017-05-13

    Hi Papp!

    I can't see an unclosed CDATA in that page; there is however a script tag that may be to blame for some odd behaviour - it has a type of "text/x-ab-test" and contains HTML tags including a nested script tag:

    <script type="text/x-ab-test">
    <!-- /6943/JPost_2017/Desktop/All_Regular_Ad_Units/Article_970x250_1_Top -->
    <div id='div-gpt-ad-1478598873018-0'>
    <script>
    googletag.cmd.push(function() { googletag.display('div-gpt-ad-1478598873018-0'); });
    </xscript>
    </div>
    </script>
    

    With XMLSerializer, we have this output:

    <script type="text/x-ab-test">/*<![CDATA[*/
    <!-- /6943/JPost_2017/Desktop/All_Regular_Ad_Units/Article_970x250_1_Top -->
    <div id='div-gpt-ad-1478598873018-0'>
    <script>
    googletag.cmd.push(function() { googletag.display('div-gpt-ad-1478598873018-0'); });
    </xscript>
    </div>
    /*]]>*/</script>
    

    With DomSerializer, we have this output:

    <script type="text/x-ab-test">/**/
    
    <div id='div-gpt-ad-1478598873018-0'>
    <script>
    googletag.cmd.push(function() { googletag.display('div-gpt-ad-1478598873018-0'); });
    </xscript>
    </div>
    /*<!-- /6943/JPost_2017/Desktop/All_Regular_Ad_Units/Article_970x250_1_Top -->*/</script>
    

    So I think there is a problem here with DomSerializer.

     
  • Scott Wilson

    Scott Wilson - 2017-05-13

    D'oh, no, that was just my debug code!

     
  • Győző Papp

    Győző Papp - 2017-05-13

    Hi Scott,

    I made a terrible mistake: I previously attached the version I had been playing around with. Sorry about that.

    I have now attached the original, but to be safe, you can also find it online: http://www.jpost.com/Breaking-News/Putin-congratulates-Frances-Macron-urging-for-united-efforts-amid-terror-threats-490105

    Here is the offending CDATA section, whose closing marker lacks the '>':

    //<![CDATA[
    (function(){
    var b,c=window.deployads_ab_pct=10;b=Math.random()>c/100;var
    f=location.search.match(/[?&]deployads-ab=([^&]+)/);f&&2===f.length&&(b="pub"===f[1]);
    b&&(window.deployads=[],window.deployads.push=function(){var
    a=document.querySelectorAll('script[type\x3d"text/x-ab-test"]:not([data-processed])');if(a&&0<a.length){var a=a[0],e=a.innerHTML.replace(/xscript/g,"script");if("complete"!==document.readyState)document.write(e),a.setAttribute("data-processed","true");else{var d=a.parentElement;a.isProxyNode&&a.proxiedNode&&(a=a.proxiedNode);d.removeChild(a);d.innerHTML+=e;(window.adsbygoogle||[]).push({})}}return window.deployads.length}); window.deployads_disabled=b;})();
    //]]
    </script>
    

    If I switch to XML syntax in my IDE, it highlights the rest of the document as part of the CDATA section.
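
    That highlighting matches strict XML rules, where a CDATA section ends only at the literal "]]>". A minimal sketch of such a strict scan follows. It is an illustration only, not HtmlCleaner's tokenizer, and the class and method names are hypothetical.

```java
class CdataScanner {
    static final String OPEN = "<![CDATA[";
    static final String CLOSE = "]]>";

    // Returns the index just past the first CDATA section that opens at or after 'from',
    // -1 if no section opens, or doc.length() if a section opens but is never terminated.
    static int cdataEnd(String doc, int from) {
        int open = doc.indexOf(OPEN, from);
        if (open < 0) {
            return -1;
        }
        int close = doc.indexOf(CLOSE, open + OPEN.length());
        return close < 0 ? doc.length() : close + CLOSE.length();
    }

    public static void main(String[] args) {
        String closed = "<script>//<![CDATA[\nvar a=1;\n//]]>\n</script><p>content</p>";
        String broken = "<script>//<![CDATA[\nvar a=1;\n//]]\n</script><p>content</p>";
        // A properly terminated section ends before the rest of the page.
        System.out.println(cdataEnd(closed, 0) < closed.length());
        // With only "]]" present, a strict scan swallows everything up to the end of the input.
        System.out.println(cdataEnd(broken, 0) == broken.length());
    }
}
```

    A more lenient parser could instead stop at the enclosing closing script tag, which is the "intuitive" behaviour asked about earlier.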

    But you are right, there are other errors in this page.

    The full story is that I was playing with version 2.19 and got the same exception that Michael originally posted. I tried v2.21 but still could not get the main content.

     

    Last edit: Győző Papp 2017-05-13
  • Scott Wilson

    Scott Wilson - 2017-05-13

    I have to say this is a really weird problem!

     
  • Scott Wilson

    Scott Wilson - 2017-05-13

    OK, I think I vaguely know what is happening now - it's caused by the content of the unclosed CDATA tag exceeding the token buffer size (1024), so winding back to close the CDATA tag puts it in an odd location.

     
  • Scott Wilson

    Scott Wilson - 2017-05-14

    I've created a new bug for this - it's #189 - follow it there for more updates.

     
