<script> embedded in <body> p...

Help
Mark B
2012-08-13
2013-04-27
  • Mark B

    Mark B - 2012-08-13

    I am finding that the </span> tag in the following <script>, which is within the <body> tags of my HTML document, can cause the <script> to end prematurely.

    contrived example:

      <script>

        x = function() {
          var elem = document.getElementById('element1');
          if (elem) elem.innerHTML = 'hi <span>there</span> …';
        }

        y = function() {
        }

      </script>

      <div onclick='x()' id='element1'>before replacement</div>

    After parsing the HTML and then transforming it back into HTML, I get something like this (whitespace may be off):

      <script>

        x = function() {
          var elem = document.getElementById('element1');
          if (elem) elem.innerHTML = 'hi <span>there</script>… y = function() { }<div onclick='x()' id='element1'>before replacement</div>

     
  • Mark B

    Mark B - 2012-08-13

    Below is an example that I just ran through HTMLParser.

    When HTMLParser encounters the <script> tag, shouldn't it be looking for the </script> tag closure, not just *any* tag closure?

    INPUT
    >>>>>
    <html>
      <body>
        <script>
          x = function() {
            var elem = document.getElementById('element1');
            if (elem) elem.innerHTML = 'hi <span>there</span> you!';
          }
        </script>

        <div id='element1'>before replacement</div>

      </body>
    </html>

    <<<<<

    OUTPUT
    >>>>>
    <html>
      <body>
        <script>
          x = function() {
            var elem = document.getElementById('element1');
            if (elem) elem.innerHTML = 'hi <span>there</script></span> you!';
          }    </script>

        <div id='element1'>before replacement</div>

      </body>
    </html>

    <<<<<

     
  • Derrick Oswald

    Derrick Oswald - 2012-08-13

    To handle broken HTML like this, you need to set the member STRICT to false, as described in org.htmlparser.scanners.ScriptScanner.java:

    Strict parsing of CDATA flag. If this flag is set true, the parsing of script is performed without regard to quotes. This means that erroneous script such as:
         document.write("</script>");
    will be parsed in strict accordance with appendix B.3.2 Specifying non-HTML data of the HTML 4.01 Specification and hence will be split into two or more nodes. Correct javascript would escape the ETAGO:
         document.write("<\/script>");

    If true, CDATA parsing will stop at the first ETAGO ("</") no matter whether it is quoted or not. If false, balanced quotes (either single or double) will shield an ETAGO. Because of the possibility of quotes within single or multiline comments, these are also parsed. In most cases, users prefer non-strict handling since there is so much broken script out in the wild.
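
    The difference between the two modes can be illustrated with a small, self-contained sketch. This is not the library's code: the class CdataEndSketch and the method findEtago are hypothetical names, and the sketch ignores the comment handling mentioned above. It only shows how balanced quotes can shield an ETAGO when the strict flag is false.

```java
/** Simplified illustration only; not the HTMLParser implementation. */
final class CdataEndSketch {

    /**
     * Returns the index of the "</" (ETAGO) that ends the script content,
     * or -1 if there is none.
     * strict = true:  stop at the first "</", quoted or not.
     * strict = false: ignore any "</" inside single or double quotes.
     * Comments are NOT handled in this sketch.
     */
    static int findEtago(String script, boolean strict) {
        char quote = 0; // 0 means "not inside a quoted string"
        for (int i = 0; i < script.length(); i++) {
            char c = script.charAt(i);
            if (!strict && quote != 0) {
                if (c == '\\') {
                    i++;        // skip the escaped character
                } else if (c == quote) {
                    quote = 0;  // matching closing quote
                }
                continue;       // everything inside quotes is shielded
            }
            if (!strict && (c == '"' || c == '\'')) {
                quote = c;      // opening quote
                continue;
            }
            if (c == '<' && i + 1 < script.length() && script.charAt(i + 1) == '/') {
                return i;       // unshielded ETAGO
            }
        }
        return -1;
    }
}
```

    With the script from the example in this thread, the strict call stops at the "</span>" inside the string literal, while the non-strict call skips it and stops at "</script>".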

     
  • Mark B

    Mark B - 2012-08-13

    Thank you so much, Derrick.

    Of course, the situation is far more complicated than the example I gave.  The data that goes into the innerHTML is actually generated in Java, and the Java code doesn't know the context.  It doesn't know the text will be used in JavaScript.  It's just generating a snippet of HTML to be used wherever.

    But your suggestion does work, just the same.

    Thanks, again.
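
    For cases where the generating code does know that a snippet will end up inside a JavaScript string literal in an inline script, the escaping described in the quoted documentation could be applied on the Java side. The following is only a sketch under that assumption: InlineScriptEscaper and escapeEtago are hypothetical names, and the helper does nothing about quotes or newlines in the snippet.

```java
/** Hypothetical helper; assumes the result is placed inside a JavaScript string literal. */
final class InlineScriptEscaper {

    /**
     * Rewrites every "</" as "<\/". Inside a JavaScript string literal,
     * "\/" evaluates to a plain "/", so the string value is unchanged,
     * but the script text no longer contains an ETAGO.
     */
    static String escapeEtago(String htmlSnippet) {
        return htmlSnippet.replace("</", "<\\/");
    }
}
```

    Applied to the snippet from the example, the closing </span> becomes <\/span>, which a strict parser no longer treats as the end of the script.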

     
  • Bruno Rezende Laranjeira

    I was trying to parse some pages found on the web and, as expected, they contain lots of malformed HTML.
    One issue I noticed involves the handling of JavaScript comments. For example, in the following:

    <script>
    var a = 3;
    //comment about random stuff </div>
    </script>
    

    The parser ends the script tag when it finds "</d", whether it is quoted or not. The problem is caused by the line comment, which starts with the double slash.

    If you open the same code with a browser, you can see that it only closes the script tag where it is intended to.
    The browser would only behave the same way if the "</div>" was replaced by "</script>".

    I'm not sure whether this is a bug, but I fixed it by passing an extra parameter to the parseCDATA method that tells it which kind of tag is calling it (script or style), so that the tag is closed only when the corresponding closing tag is found.
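
    The idea behind that fix can be sketched independently of the library. The code below is not the actual patch to parseCDATA: MatchingEndTagSketch and findEndTag are hypothetical names, and the sketch only shows the matching rule, that is, skipping every "</" that is not followed by the name of the tag being scanned.

```java
/** Simplified illustration of the matching rule; not the actual parseCDATA change. */
final class MatchingEndTagSketch {

    /**
     * Returns the index of the first "</" that is followed by tagName
     * (compared case-insensitively) and then by '>', '/' or whitespace.
     * Returns -1 if no such end tag exists.
     */
    static int findEndTag(String content, String tagName) {
        String needle = "</" + tagName;
        int from = 0;
        while (true) {
            int i = content.indexOf("</", from);
            if (i < 0) {
                return -1;
            }
            int end = i + needle.length();
            if (content.regionMatches(true, i, needle, 0, needle.length())
                    && end < content.length()) {
                char next = content.charAt(end);
                // Require a delimiter so that e.g. "</scripts>" does not match "script".
                if (next == '>' || next == '/' || Character.isWhitespace(next)) {
                    return i;
                }
            }
            from = i + 2; // some other end tag such as "</div>": keep scanning
        }
    }
}
```

    For the script above, a call with "script" as the tag name skips the "</div>" in the comment and returns the position of "</script>".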

     
