The following example shows how the parser in htmlparse gets confused by the JavaScript expression i<w..., incorrectly recognizing it as a tag.
<HTML>
<HEAD>
<TITLE>Demonstrate Bug</TITLE>
<SCRIPT LANGUAGE="JavaScript">
function resetTheForm()
{
    var i;
    // Make sure the forms object exists
    if(window.document.forms)
    {
        // Reset each form
        for(i=0; i<window.document.forms.length; i++)
            window.document.forms[i].reset();
    }
}
</SCRIPT>
</HEAD>
<BODY>
<P>Nothing interesting here; move along.</P>
</BODY>
</HTML>
Running it through the debugCallback will show the incorrect parsing.
Logged In: YES
user_id=302287
Originator: NO
Wrong. The above is incorrect HTML. The JavaScript is incorrect: a < always has to be quoted as &lt; unless it is inside a CDATA section.
Logged In: YES
user_id=69099
Originator: YES
I don't see anything in http://www.w3.org/TR/html4/interact/scripts.html that would indicate that. Can you provide a reference? None of the examples that I have seen on the web follow such a convention. (The CDATA section is deprecated, according to the W3 page above.)
Logged In: YES
user_id=302287
Originator: NO
You're right, it's valid, but there are some problems lurking in that area; see:
http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data
It works because the DTD for HTML4 declares <script> as having a CDATA value, but it may blow up in apps that don't fully respect the DTD. So it's dangerous ground...
Logged In: YES
user_id=69099
Originator: YES
I was able to work around the problem by adding spaces around the < in the files I had access to (which would not always be the case). Pragmatically, I don't know what the fix should be, but handling of SCRIPT might need to be more special than it is.
Logged In: YES
user_id=69099
Originator: YES
Looking at your reference, it's pretty clear that the script tag cannot end until a </[a-zA-Z] is found. Looks like the recognizer needs to be a bit more discriminating.
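To make the rule concrete, here is a rough sketch in Python (not htmlparse's Tcl code, and not a proposed patch) of a recognizer that treats everything after a SCRIPT start tag as raw data until the first `</` followed by a letter. The names `split_script`, `SCRIPT_OPEN` and `END_OF_CDATA` are made up for illustration, and the start-tag pattern is deliberately simplistic (it assumes no `>` inside attribute values).

```python
import re

# Rule from the HTML 4 notes referenced above: SCRIPT content is CDATA and
# ends at the first "</" that is followed by a letter.
END_OF_CDATA = re.compile(r"</[a-zA-Z]")

# Simplified start-tag matcher (assumption: no ">" inside attribute values).
SCRIPT_OPEN = re.compile(r"<script\b[^>]*>", re.IGNORECASE)


def split_script(html):
    """Split html around the first SCRIPT element.

    Returns (before, script_body, rest), where rest begins at the "</x"
    that terminates the script data, or None if there is no SCRIPT tag.
    """
    start = SCRIPT_OPEN.search(html)
    if start is None:
        return None
    end = END_OF_CDATA.search(html, start.end())
    body_end = end.start() if end else len(html)
    return html[:start.start()], html[start.end():body_end], html[body_end:]


sample = ('<SCRIPT LANGUAGE="JavaScript">'
          'for(i=0; i<window.document.forms.length; i++) f();'
          '</SCRIPT><P>x</P>')
before, body, rest = split_script(sample)
print(repr(body))
print(repr(rest))
```

With this rule the `i<window` in the loop condition stays inside the script body, because `<w` is not `</` plus a letter; only `</SCRIPT>` ends the data.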
Logged In: YES
user_id=69099
Originator: YES
This problem is now more urgent than it had been before. I'm working with JavaScript that writes the HTML as it runs, including adding new <SCRIPT>...</SCRIPT> code inside of JavaScript document.write statements. htmlparse is getting seriously confused inside the JavaScript, and the resulting nodes are truly mucking up my parse tree. Some nodes aren't even at the right level any more, which makes my post-processing lose them, which is very bad news for my application.
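One caveat for the document.write case, going by the same W3C notes page: under the strict `</` + letter rule, a literal `</SCRIPT>` (or any end tag) inside a JavaScript string already ends the script data, and the notes suggest writing `<\/` in the script to hide it. The Python sketch below only illustrates that rule on two made-up strings; `script_content` is a hypothetical helper, not part of htmlparse.

```python
import re

# Same terminator as in the HTML 4 notes: "</" followed by a letter.
END_OF_CDATA = re.compile(r"</[a-zA-Z]")


def script_content(text):
    """Return the part of text that counts as script data, i.e. everything
    before the first "</" + letter (or all of text if there is none)."""
    end = END_OF_CDATA.search(text)
    return text[:end.start()] if end else text


# Text as it would appear right after an outer <SCRIPT> start tag.
unescaped = 'document.write("<SCRIPT>f()</SCRIPT>");</SCRIPT>'
escaped = r'document.write("<SCRIPT>f()<\/SCRIPT>");</SCRIPT>'

print(repr(script_content(unescaped)))
print(repr(script_content(escaped)))
```

The unescaped string is cut off right after `f()`, in the middle of the document.write call, while the escaped one survives up to the real end tag. So even a more discriminating recognizer would only help with pages whose scripts escape the inner end tags; for pages that don't (and that I can't edit), something more lenient would presumably be needed.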