#267 Incorrect parsing of <Script> when it contains HTML

v2.0
open
nobody
5
2008-12-18
2008-12-18
No

Try parsing:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title></title>
</head>
<body>
<!-- This has obviously been edited -->
<script language="JavaScript">ImgW("http://cdn1.zone.msn.com/imagesiconhomeus_16_16_.gif", 1, "Newsletter", "<a class='ImgLnkTblI' href='http://g.msn.com/NEWS?intgid'>", "</a>");</script>
</body>
</html>

-----
It is parsed as follows (as obtained from toHtml(true)):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">\ <html>
<body>
<script language="JavaScript">ImgW("http://cdn1.zone.msn.com/imagesiconhomeus_16_16_.gif", 1, "Newsletter", "<a class='ImgLnkTblI' href='http://g.msn.com/NEWS?intgid'>", "</script></a>");</script>
</body>
</html>

It's trying to parse "</a>" as a new node, which is incorrect since it's part of the script text.
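
For context, here is a minimal sketch of the rule that seems to be at work. HTML 4.01 (Appendix B.3.2) lets the content of a script element end at the first "</" followed by a letter, regardless of whether that sequence sits inside a quoted string. The class and method names below are hypothetical, and this is not the library's actual code; it only shows why a scan that ignores quotes stops at "</a>".

```java
class StrictCdataScan {
    /**
     * Returns the index of the first "</" that is followed by an ASCII
     * letter, at or after 'from', or -1 if there is none. Quotes are
     * deliberately ignored, as in a strict HTML 4.01 reading.
     */
    static int firstEtago(String text, int from) {
        for (int i = from; i + 2 < text.length(); i++) {
            char c = text.charAt(i + 2);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (text.charAt(i) == '<' && text.charAt(i + 1) == '/' && letter) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Shortened version of the script body from the report.
        String body = "ImgW(\"x.gif\", \"<a href='u'>\", \"</a>\");</script>";
        int end = firstEtago(body, 0);
        // The scan stops inside the string literal, at "</a>".
        System.out.println(body.substring(end, end + 4));
    }
}
```

Under this rule the script text is cut off in the middle of the last string argument, which matches the misplaced end tag in the output above.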

Discussion

  • Scott Montgomerie

    I figured out a workaround and can submit a patch if anybody is interested. It involves a simple change in Lexer.parseCDATA(). Basically, you need to keep track of quotes even if "quotesmart" isn't true, so that an end tag enclosed in quotes isn't taken as a poorly formed HTML end tag.

     
  • Scott Montgomerie

    Patch

     
  • Scott Montgomerie

    File Added: patch.txt

     
  • Scott Montgomerie

    Added a patch.

     
  • thushara wijeratna

    I'm interested in the patch. I need to extract text that is outside script blocks, and this bug sometimes hands me blocks of script as text. It seems to stop parsing when it encounters a </b> inside a quoted string in the script block.

     
  • thushara wijeratna

    I tried this patch and it works for me. Thanks.

     
  • thushara wijeratna

    Setting ScriptScanner.STRICT = false gives better results. This sets the "quotesmart" variable in the parsing code.
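
The quote tracking discussed in this thread can be sketched roughly as follows. This is not the attached patch and not the library's Lexer.parseCDATA(); the class and method names are hypothetical. As a simplification, it only looks for the closing script tag and only understands single- and double-quoted strings with backslash escapes, so it does not handle JavaScript comments or regular-expression literals.

```java
class QuoteAwareScriptScan {
    /**
     * Returns the index of the first "</script" (case-insensitive) that is
     * not inside a single- or double-quoted string, or -1 if there is none.
     */
    static int findScriptEnd(String text, int from) {
        char quote = 0;                       // 0 means "not inside a string"
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;                      // skip the escaped character
                } else if (c == quote) {
                    quote = 0;                // closing quote
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                    // opening quote
            } else if (text.regionMatches(true, i, "</script", 0, 8)) {
                return i;                     // end tag outside any string
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Shortened version of the script body from the report.
        String body = "ImgW(\"<a href='u'>\", \"</a>\");</script>";
        int end = findScriptEnd(body, 0);
        // Prints the whole call, including the quoted "</a>".
        System.out.println(body.substring(0, end));
    }
}
```

With this kind of scan the quoted "</a>" stays part of the script text, and only the real closing tag ends the block.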

     
