<script> tag issue

Help
Ramonus
2006-05-29
2013-04-27
  • Ramonus

    Ramonus - 2006-05-29

    Hi,

    I tried the latest version (1.6-20060527). I know that this version fixes an issue with quotes and script tags (tracker ID 1457371). Great! However, I'm now facing another script issue with the following piece of HTML code:

    <script>
        navbar = "</A><A>";
        document.write("This line of code is parsed as text which is not part of the script tag");
    </script>

    The problem is that the second line of the script is parsed as text rather than as part of the script tag. I'm using the parser to convert HTML to plain text, so the JavaScript code ends up in the plain-text output.

    I realize that the forward slash in the </A> tag is invalid inside a script; it should be preceded by a backslash (<\/A>). Still, it would be nice if the parser were tolerant of invalid code.

    Does anybody know how I can avoid this behaviour, or is this a bug?

    Regards,

    Ramon

     
    • Derrick Oswald

      Derrick Oswald - 2006-05-29

      As mentioned in the bug report:
      The default for ScriptScanner.STRICT was set to true. If you want the older, more lax, script parsing, set it to false with code like:
        org.htmlparser.scanners.ScriptScanner.STRICT = false;
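
To make the difference between the two settings concrete, here is a minimal, self-contained sketch. It is a toy model and not the library's code: the class and method names (ScriptEndDemo, strictEnd, laxEnd) are hypothetical, and the "lax" variant is simplified to "stop only at a literal </script", which is an assumption about the intent rather than a description of what the older parsing actually does. The strict variant follows the HTML 4.01 rule that script content ends at the first "</" followed by a letter.

```java
// Toy model only: illustrates the two end-of-script rules, not the library's implementation.
class ScriptEndDemo {

    /** Strict: content ends at the first "</" that is followed by a letter. */
    static int strictEnd(String html, int from) {
        for (int i = from; i + 2 < html.length(); i++) {
            if (html.charAt(i) == '<' && html.charAt(i + 1) == '/'
                    && Character.isLetter(html.charAt(i + 2))) {
                return i;
            }
        }
        return html.length();
    }

    /** Lax (simplified assumption): content ends only at a literal "</script", ignoring case. */
    static int laxEnd(String html, int from) {
        int i = html.toLowerCase(java.util.Locale.ROOT).indexOf("</script", from);
        return i < 0 ? html.length() : i;
    }

    public static void main(String[] args) {
        String html = "<script>navbar = \"</A><A>\"; document.write(\"x\");</script>";
        int start = "<script>".length();
        // Strict stops inside the string literal, so the rest leaks out as text.
        System.out.println(html.substring(start, strictEnd(html, start)));
        // Lax keeps the whole script body together.
        System.out.println(html.substring(start, laxEnd(html, start)));
    }
}
```

With the sample from the first post, the strict rule cuts the script off at the </A> inside the string, which is why the document.write line shows up as plain text.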

       
      • Ramonus

        Ramonus - 2006-05-29

        So I guess it's not possible to handle both situations so that the text belonging to the script is detected as a whole.
        Too bad. I'm very pleased with the parser for extracting content from HTML documents. I filter out the script and stylesheet tags because they contain no useful content. Is there a way to ignore all of the text between script/style tags?

        Ramon

         
        • Derrick Oswald

          Derrick Oswald - 2006-05-29

          This has been requested for more than two years. See RFE #886862, "parse ecmascript":
          http://sourceforge.net/tracker/index.php?func=detail&aid=886862&group_id=24399&atid=381402

          It's a tough problem because there is so much rubbish out there in the wild. Unless the program can say "that's JavaScript" or "that doesn't look like JavaScript", it's almost impossible to come up with a correct parse in all cases. After all, the JavaScript in a page doesn't have to run or even be correct.

          That being said, most browsers do a better job of handling the errant code than the parser does, and my guess is that this is because they actually try to interpret the JavaScript. That would be nice to have if it didn't slow down the parser too much.
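
For the narrower goal of simply ignoring everything between script/style tags, one possible workaround is to strip those blocks from the raw HTML before it reaches the parser. The sketch below uses only java.util.regex and is not part of the HTML Parser API; the class name ScriptStyleStripper and its strip method are hypothetical. It treats a block as ending at the first matching </script> or </style>, so it would still be fooled by a script that contains a literal "</script>" in a string, or by a ">" inside an attribute value of the opening tag.

```java
import java.util.regex.Pattern;

// Pre-processing sketch: removes <script>...</script> and <style>...</style> blocks.
class ScriptStyleStripper {

    // (?is): ignore case and let '.' match line breaks. The reluctant .*? stops at the
    // first close tag, and \1 requires the close tag to name the same element.
    private static final Pattern BLOCK =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    static String strip(String html) {
        return BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hi</p><script>\nnavbar = \"</A><A>\";\n"
                + "document.write(\"x\");\n</script><style>p{}</style><p>Bye</p>";
        System.out.println(strip(html));
    }
}
```

Because the pattern only looks for the matching close tag, an unescaped </A> inside the script does not end the block early, which sidesteps the case discussed in this thread.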

           
    • Ramonus

      Ramonus - 2006-05-30

      I can imagine this is a tough problem. For now I'll go with the STRICT = true variant. In that case all relevant text will be parsed, and I will have to accept the unwanted JavaScript code in the output. Anyway, thanks for your replies and for the hard work you have invested in this project.

      Ramon

       
