parse gets confused inside script tags

#869 parse gets confused inside script tags

Status: open

Owner: Andreas Kupries

Labels: htmlparse (25)

Priority: 5

Updated: 2006-11-30

Created: 2006-11-30

Creator: David Scott Cargo

Private: No

The following example shows how the parser in htmlparse gets confused by the JavaScript expression i<w..., incorrectly recognizing it as a tag.

<HTML>
<HEAD>
<TITLE>Demonstrate Bug</TITLE>
<SCRIPT LANGUAGE="JavaScript">
function resetTheForm()
{
var i;

// Make sure the forms object exists
if(window.document.forms)
{
// Reset each form
for(i=0; i<window.document.forms.length; i++)
window.document.forms[i].reset();
}
}
</SCRIPT>
</HEAD>
<BODY>
<P>Nothing interesting here; move along.</P>
</BODY>
</HTML>

Running it through the debugCallback will show the incorrect parsing.
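
For readers who want to see the failure mode without a Tcl installation, here is a minimal Python sketch. It assumes a deliberately naive tag pattern (`NAIVE_TAG`, a hypothetical stand-in, not htmlparse's actual recognizer) that treats any `<` followed by a letter as the start of a tag. The names `NAIVE_TAG` and `naive_tags` are illustrative only.

```python
import re

# Hypothetical, deliberately naive tag pattern: "<", an optional "/", a name,
# then everything up to the next ">". This is NOT htmlparse's real pattern.
NAIVE_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")


def naive_tags(html):
    """Return (slash, lowercased name) for everything that looks like a tag."""
    return [(m.group(1), m.group(2).lower()) for m in NAIVE_TAG.finditer(html)]


# Shortened version of the document from the report.
sample = (
    '<SCRIPT LANGUAGE="JavaScript">\n'
    'for(i=0; i<window.document.forms.length; i++)\n'
    'window.document.forms[i].reset();\n'
    '</SCRIPT>\n'
    '<P>Nothing interesting here; move along.</P>\n'
)

for slash, name in naive_tags(sample):
    print(slash + name)
```

With this pattern, `i<window...` is reported as a tag named `window`, and because the bogus tag runs to the next `>`, it also swallows the real `</SCRIPT>` end tag.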

Discussion

Michael Schlenker - 2006-12-01

Logged In: YES
user_id=302287
Originator: NO

Wrong. The above is incorrect HTML. The JavaScript is incorrect; a < always has to be quoted as &lt; unless it is inside a CDATA section.

David Scott Cargo - 2006-12-01

Logged In: YES
user_id=69099
Originator: YES

I don't see anything in http://www.w3.org/TR/html4/interact/scripts.html that would indicate that. Can you provide a reference? None of the examples that I have seen on the web follow such a convention. (The CDATA section is deprecated, according to the W3 page above.)

Michael Schlenker - 2006-12-01

Logged In: YES
user_id=302287
Originator: NO

You're right, it's valid, but there are some problems lurking in that area; see:
http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data

It works because the DTD for HTML4 declares <script> as having a CDATA value, but it may blow up in apps that don't fully respect the DTD. So it's dangerous ground...

David Scott Cargo - 2006-12-01

Logged In: YES
user_id=69099
Originator: YES

I was able to work around the problem by adding spaces around the < in the files I had access to (which would not always be the case). Pragmatically, I don't know what the fix should be, but handling of SCRIPT might need to be more special than it is.

David Scott Cargo - 2006-12-02

Logged In: YES
user_id=69099
Originator: YES

Looking at your reference, it's pretty clear that the script tag cannot end until a </[a-zA-Z] is found. Looks like the recognizer needs to be a bit more discriminating.
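
A rough Python sketch of what "more discriminating" could look like, under the rule described above: after an opening SCRIPT tag, skip ahead to the first `</` followed by a letter and treat everything in between as plain text. The names `TAG`, `ETAGO`, and `tokenize` are hypothetical, and this is not a patch for htmlparse, only an illustration of the idea.

```python
import re

# Simplified tag pattern, for illustration only.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")
# End of raw script content: "</" followed by a letter.
ETAGO = re.compile(r"</[A-Za-z]")


def tokenize(html):
    """Return ("tag", slash + name) and ("text", data) tokens.

    After an opening SCRIPT tag, everything up to the next "</" followed by
    a letter becomes a single text token; no tags are looked for inside it.
    """
    tokens = []
    pos = 0
    while pos < len(html):
        m = TAG.search(html, pos)
        if m is None:
            tokens.append(("text", html[pos:]))
            break
        if m.start() > pos:
            tokens.append(("text", html[pos:m.start()]))
        slash, name = m.group(1), m.group(2).lower()
        tokens.append(("tag", slash + name))
        pos = m.end()
        if name == "script" and not slash:
            end = ETAGO.search(html, pos)
            stop = end.start() if end else len(html)
            if stop > pos:
                tokens.append(("text", html[pos:stop]))
            pos = stop
    return tokens


sample = (
    '<SCRIPT LANGUAGE="JavaScript">\n'
    'for(i=0; i<window.document.forms.length; i++)\n'
    'window.document.forms[i].reset();\n'
    '</SCRIPT>\n'
    '<P>Nothing interesting here; move along.</P>\n'
)

print([value for kind, value in tokenize(sample) if kind == "tag"])
```

On this sample the only tags reported are `script`, `/script`, `p`, and `/p`; the `i<window...` expression stays inside the script's text token.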

David Scott Cargo - 2008-02-15

Logged In: YES
user_id=69099
Originator: YES

This problem is now more urgent than it had been before. I'm working with JavaScript that is writing the HTML as it runs, including adding new <SCRIPT>...</SCRIPT> code inside of JavaScript document.write statements. htmlparse is getting seriously confused inside the JavaScript, and the resulting nodes are truly mucking up my parse tree. Some nodes aren't even at the right level any more, which makes my post processing lose them, which is very bad news for my application.
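
The document.write case is harder than the i<w case. Under the HTML 4 note referenced earlier in this thread, script content ends at the first `</` followed by a letter, so a literal `</SCRIPT>` inside a JavaScript string would itself end the element. That note suggests escaping such sequences, for example as `<\/`. The Python sketch below only illustrates where that rule would cut the content; `ETAGO` and `script_content_end` are hypothetical names, and whether the files in question use the escaped form is not known.

```python
import re

# "</" followed by a letter ends raw script content under the HTML 4 rule.
ETAGO = re.compile(r"</[A-Za-z]")


def script_content_end(html, start):
    """Return the index at which script content beginning at `start` ends."""
    m = ETAGO.search(html, start)
    return m.start() if m else len(html)


# Hypothetical inputs: the same statement with and without "<\/" escaping.
unescaped = '<SCRIPT>document.write("<SCRIPT>f()</SCRIPT>");</SCRIPT>'
escaped = r'<SCRIPT>document.write("<SCRIPT>f()<\/SCRIPT>");</SCRIPT>'

start = len("<SCRIPT>")
print(repr(unescaped[start:script_content_end(unescaped, start)]))
print(repr(escaped[start:script_content_end(escaped, start)]))
```

In the unescaped form the content is cut off in the middle of the string literal; in the escaped form it runs to the outer end tag.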
