Try parsing:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title></title>
</head>
<body>
<!-- This has obviously been edited -->
<script language="JavaScript">ImgW("http://cdn1.zone.msn.com/imagesiconhomeus_16_16_.gif", 1, "Newsletter", "<a class='ImgLnkTblI' href='http://g.msn.com/NEWS?intgid'>", "</a>");</script>
</body>
</html>
-----
It parses this as (output obtained from toHtml(true)):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<body>
<script language="JavaScript">ImgW("http://cdn1.zone.msn.com/imagesiconhomeus_16_16_.gif", 1, "Newsletter", "<a class='ImgLnkTblI' href='http://g.msn.com/NEWS?intgid'>", "</script></a>");</script>
</body>
</html>
It's trying to parse "</a>" as a new node, which is incorrect since it's part of the script text.
I figured out a workaround and can submit a patch if anybody is interested. It involves a simple change in Lexer.parseCDATA(): keep track of quotes even when "quotesmart" isn't true, so that an end tag enclosed in quotes is not taken as a poorly formed HTML end tag.
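For anyone who wants the idea without opening the patch, here is a rough standalone sketch of quote-aware scanning. It is not the attached patch and not the library's Lexer.parseCDATA() code; the class name QuoteAwareScan and the method findEndTag are made up for illustration. It assumes script strings use single or double quotes with backslash escapes, and it ignores JavaScript comments and regex literals, so an apostrophe inside a comment would confuse it.

```java
final class QuoteAwareScan {
    /**
     * Returns the index of the first "</" followed by a letter that lies
     * outside a quoted string, or -1 if there is none.
     * Hypothetical helper, not part of the HTML Parser API.
     */
    static int findEndTag(String script, int start) {
        char quote = 0; // 0 = not inside a string; otherwise the opening quote character
        for (int i = start; i < script.length(); i++) {
            char c = script.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == quote) {
                    quote = 0; // closing quote found
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // entering a quoted string
            } else if (c == '<'
                    && i + 2 < script.length()
                    && script.charAt(i + 1) == '/'
                    && Character.isLetter(script.charAt(i + 2))) {
                return i; // an end tag outside any string
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String body = "ImgW(\"x.gif\", \"<a href='y'>\", \"</a>\");</script>";
        int end = findEndTag(body, 0);
        // Prints the script text up to, but not including, the real end tag.
        System.out.println(body.substring(0, end));
    }
}
```

With this approach the quoted "</a>" is skipped and only the real </script> ends the script text.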
Added a patch (file added: patch.txt).
I'm interested in the patch. I need to extract text that is outside script blocks, and this bug sometimes hands me blocks of script as text. It seems to stop parsing when it encounters a </b> inside a quoted string in the script block.
I tried this patch and it works for me. Thanks.
Setting ScriptScanner.STRICT = false gives better results; this sets the "quotesmart" variable in the parsing code.