tcllib

Tracker: Bugs

5 parse gets confused inside script tags - ID: 1606391
Last Update: Comment added ( escargo )

The following example shows how the parser in htmlparse gets confused by
the JavaScript expression i<w..., incorrectly recognizing it as a tag.

<HTML>
<HEAD>
<TITLE>Demonstrate Bug</TITLE>
<SCRIPT LANGUAGE="JavaScript">
function resetTheForm()
{
var i;

// Make sure the forms object exists
if(window.document.forms)
{
// Reset each form
for(i=0; i<window.document.forms.length; i++)
window.document.forms[i].reset();
}
}
</SCRIPT>
</HEAD>
<BODY>
<P>Nothing interesting here; move along.</P>
</BODY>
</HTML>

(Running it through the debugCallback will show the incorrect parsing.)
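
To make the failure mode concrete, here is a minimal sketch of how a naive recognizer can misread the loop condition as a tag. The pattern and the helper name naiveTagNames are assumptions made for illustration; this is not the pattern htmlparse actually uses.

```javascript
// Hypothetical naive recognizer: anything shaped like "<name ...>" is a tag.
// Illustration only; NOT the pattern htmlparse actually uses.
const naiveTag = /<\/?([A-Za-z][^\s>\/]*)[^>]*>/g;

// Collect the "tag names" the naive pattern finds, in document order.
function naiveTagNames(html) {
  return Array.from(html.matchAll(naiveTag), (m) => m[1]);
}

// A trimmed-down version of the example from the report.
const sample = [
  '<SCRIPT LANGUAGE="JavaScript">',
  'for(i=0; i<window.document.forms.length; i++)',
  'window.document.forms[i].reset();',
  '</SCRIPT>',
  '<P>Nothing interesting here; move along.</P>',
].join('\n');

const names = naiveTagNames(sample);
console.log(names);
```

With this pattern the second "tag" begins at the < in i<window and runs to the next >, so it swallows the real </SCRIPT> end tag along with the rest of the loop.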


Submitted: David Scott Cargo ( escargo ) - 2006-11-30 22:25
Priority: 5
Status: Open
Resolution: None
Assigned: Andreas Kupries
Category: htmlparse
Group: None
Visibility: Public


Comments ( 6 )

Date: 2008-02-15 21:19
Sender: escargo


This problem is now more urgent than it had been before. I'm working with
JavaScript that is writing the HTML as it runs, including adding new
<SCRIPT>...</SCRIPT> code inside of JavaScript document.write statements.
htmlparse is getting seriously confused inside the JavaScript, and the
resulting nodes are truly mucking up my parse tree. Some nodes aren't even
at the right level any more, which makes my post processing lose them,
which is very bad news for my application.


Date: 2006-12-02 03:27
Sender: escargo


Looking at your reference, it's pretty clear that the script tag cannot
end until a </[a-zA-Z] is found. Looks like the recognizer needs to be a
bit more discriminating.
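
The rule described in this comment can be sketched as follows. The function name scriptContentEnd is hypothetical, and the code shows only the end-of-content test (the first "</" followed by an ASCII letter), not a proposed patch to htmlparse.

```javascript
// Return the index where SCRIPT content ends: the first "</" that is
// followed by an ASCII letter, searching from `start`. Returns -1 if none.
function scriptContentEnd(text, start = 0) {
  const endTagOpen = /<\/[A-Za-z]/g;
  endTagOpen.lastIndex = start; // begin the search at `start`
  const m = endTagOpen.exec(text);
  return m ? m.index : -1;
}

const scriptBody =
  'for(i=0; i<window.document.forms.length; i++)\n' +
  'window.document.forms[i].reset();\n' +
  '</SCRIPT><P>after</P>';

const endIndex = scriptContentEnd(scriptBody);
console.log(scriptBody.slice(endIndex, endIndex + 9)); // </SCRIPT>
```

Under this test the < in i<window is skipped because it is not followed by /, so the script content runs up to the real end tag.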


Date: 2006-12-01 19:56
Sender: escargo


I was able to work around the problem by adding spaces around the < in the
files I had access to (which would not always be the case). Pragmatically,
I don't know what the fix should be, but handling of SCRIPT might need to
be more special than it is.


Date: 2006-12-01 19:35
Sender: mic42


You're right, it's valid, but there are some problems lurking in that area,
see:
http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data

It works because the DTD for HTML4 declares <script> as having a CDATA
value, but may blow up in apps that don't fully respect the DTD. So it's
dangerous ground...
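
One of the problems that appendix describes is an end-tag-like sequence inside a script string, which is relevant when JavaScript writes markup with document.write. The sketch below uses the appendix's own EM example; the helper name firstEndTagOpen is hypothetical and only applies the "</" plus letter test.

```javascript
// Position of the first "</" followed by an ASCII letter, or -1 if none.
function firstEndTagOpen(text) {
  return text.search(/<\/[A-Za-z]/);
}

// Unescaped: the "</EM" inside the string literal looks like an end tag.
const unescaped = 'document.write("<EM>This won\'t work</EM>")';

// Escaped as the appendix recommends: the text contains "<\/EM>", so "<"
// is followed by "\" rather than "/".
const escaped = 'document.write("<EM>This will work<\\/EM>")';

console.log(firstEndTagOpen(unescaped)); // index of "</EM"
console.log(firstEndTagOpen(escaped));   // -1
```

The same escaping applies to a </SCRIPT> written from inside a script; a page that leaves it unescaped ends the enclosing SCRIPT element early under the HTML4 rule.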


Date: 2006-12-01 19:23
Sender: escargo


I don't see anything in http://www.w3.org/TR/html4/interact/scripts.html
that would indicate that. Can you provide a reference? None of the examples
that I have seen on the web follow such a convention. (The CDATA section is
deprecated, according to the W3 page above.)


Date: 2006-12-01 18:19
Sender: mic42


Wrong. The above is incorrect HTML. The JavaScript is incorrect: a < always
has to be quoted as &lt; unless it is inside a CDATA section.


Attached File

No Files Currently Attached

Change

No changes have been made to this artifact.