HTML Parser / Bugs / #267 Incorrect parsing of <Script> when it contains HTML

#267 Incorrect parsing of <Script> when it contains HTML

Milestone: v2.0

Status: open

Owner: nobody

Labels: Scanner Bug (53)

Priority: 5

Updated: 2008-12-18

Created: 2008-12-18

Creator: Scott Montgomerie

Private: No

Try parsing:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title></title>
</head>
<body>

<script language="JavaScript">ImgW("http://cdn1.zone.msn.com/imagesiconhomeus_16_16_.gif", 1, "Newsletter", "<a class='ImgLnkTblI' href='http://g.msn.com/NEWS?intgid'>", "</a>");</script>
</body>
</html>

-----
The parser produces the following (as obtained from toHtml(true)):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<body>
<script language="JavaScript">ImgW("http://cdn1.zone.msn.com/imagesiconhomeus_16_16_.gif", 1, "Newsletter", "<a class='ImgLnkTblI' href='http://g.msn.com/NEWS?intgid'>", "</script></a>");</script>
</body>
</html>

It treats "</a>" as the start of a new node and ends the script there (note the inserted "</script>" in the output), which is incorrect since "</a>" is part of a quoted string in the script text.

Discussion

Scott Montgomerie - 2008-12-18

I figured out a workaround and can submit a patch if anybody is interested. It involves a simple change in Lexer.parseCDATA(). Basically, you need to keep track of quotes even when "quotesmart" isn't true, so that an end tag enclosed in quotes is not taken for a poorly formed HTML end tag.
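
The quote-tracking idea can be sketched roughly as follows. This is a standalone illustration, not the attached patch and not the library's Lexer.parseCDATA(); the class name ScriptEndFinder, the method findScriptEnd, and its quoteSmart parameter are hypothetical. It assumes the input is the text that follows the opening script tag, and it ignores JavaScript comments and regular-expression literals.

```java
// Hypothetical helper, for illustration only (not part of HTML Parser).
final class ScriptEndFinder {

    /**
     * Returns the index at which the script content ends, or text.length()
     * if no terminator is found.
     *
     * quoteSmart == false: stop at the first "</" followed by a letter,
     *                      even inside a string literal (the reported behavior).
     * quoteSmart == true:  ignore "</" while inside a quoted string.
     */
    static int findScriptEnd(String text, boolean quoteSmart) {
        char quote = 0; // 0 = outside a string literal, else the opening quote character
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quoteSmart && quote != 0) {
                if (c == '\\') {
                    i++;       // skip the escaped character
                } else if (c == quote) {
                    quote = 0; // closing quote found
                }
                continue;      // nothing inside a string can end the script
            }
            if (quoteSmart && (c == '"' || c == '\'')) {
                quote = c;     // entering a string literal
                continue;
            }
            // End-tag open delimiter: "</" followed by a letter.
            if (c == '<' && i + 2 < text.length()
                    && text.charAt(i + 1) == '/'
                    && Character.isLetter(text.charAt(i + 2))) {
                return i;
            }
        }
        return text.length();
    }
}
```

With the script from this report, calling findScriptEnd(text, false) stops at the quoted "</a>", while findScriptEnd(text, true) skips it and stops at the real "</script>".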

Scott Montgomerie - 2008-12-18

Patch

patch.txt

Scott Montgomerie - 2008-12-18

File Added: patch.txt

Scott Montgomerie - 2008-12-18

Added a patch.

thushara wijeratna - 2009-12-01

I'm interested in the patch. I need to extract text that is outside script blocks, and this bug sometimes hands me blocks of script as text. It seems to stop parsing the script when it encounters a </b> inside a quoted string in the script block.

thushara wijeratna - 2009-12-01

I tried this patch and it works for me. Thanks.

thushara wijeratna - 2009-12-04

Setting ScriptScanner.STRICT = false gives better results. This sets the "quotesmart" variable in the parsing code.
