#869 parse gets confused inside script tags

open
htmlparse (25)
5
2006-11-30
2006-11-30
No

The following example shows how the parser in htmlparse gets confused by the JavaScript expression i<w..., incorrectly recognizing it as a tag.

<HTML>
<HEAD>
<TITLE>Demonstrate Bug</TITLE>
<SCRIPT LANGUAGE="JavaScript">
function resetTheForm()
{
var i;

// Make sure the forms object exists
if(window.document.forms)
{
// Reset each form
for(i=0; i<window.document.forms.length; i++)
window.document.forms[i].reset();
}
}
</SCRIPT>
</HEAD>
<BODY>
<P>Nothing interesting here; move along.</P>
</BODY>
</HTML>

Running it through the debugCallback will show the incorrect parsing.
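To illustrate the failure mode, here is a minimal Python sketch of a naive tag recognizer. It is not htmlparse's actual code (htmlparse is written in Tcl); the pattern `NAIVE_TAG` and the helper `naive_tags` are hypothetical names used only for this illustration. It assumes a recognizer that treats any `<` followed by a letter as the start of a tag and reads up to the next `>`.

```python
import re

# Hypothetical naive recognizer (an assumption, not htmlparse's logic):
# "<", an optional "/", a letter, then everything up to the next ">".
NAIVE_TAG = re.compile(r"<(/?)([A-Za-z][^\s>]*)[^>]*>")

# Shortened version of the example from the report.
SAMPLE = """<SCRIPT LANGUAGE="JavaScript">
for(i=0; i<window.document.forms.length; i++)
window.document.forms[i].reset();
</SCRIPT>
<P>Nothing interesting here; move along.</P>"""


def naive_tags(html):
    """Return (is_closing, name) pairs for every match, in document order."""
    return [(bool(m.group(1)), m.group(2)) for m in NAIVE_TAG.finditer(html)]


for closing, name in naive_tags(SAMPLE):
    print(("/" if closing else "") + name)
```

With this pattern, the JavaScript expression is reported as a tag named `window.document.forms.length;`, and the real `</SCRIPT>` end tag is swallowed because it supplies the first `>` that closes the bogus tag.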

Discussion

  • Logged In: YES
    user_id=302287
    Originator: NO

    Wrong. The above is incorrect HTML. The JavaScript is incorrect; a < always has to be escaped as &lt; unless it is inside a CDATA section.

     
  • Logged In: YES
    user_id=69099
    Originator: YES

    I don't see anything in http://www.w3.org/TR/html4/interact/scripts.html that would indicate that. Can you provide a reference? None of the examples that I have seen on the web follow such a convention. (The CDATA section is deprecated, according to the W3 page above.)

     
  • Logged In: YES
    user_id=302287
    Originator: NO

    You're right, it's valid, but there are some problems lurking in that area, see:
    http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data

    It works because the DTD for HTML4 declares <script> as having a CDATA value, but may blow up in apps that don't fully respect the DTD. So it's dangerous ground...

     
  • Logged In: YES
    user_id=69099
    Originator: YES

    I was able to work around the problem by adding spaces around the < in the files I had access to (which would not always be the case). Pragmatically, I don't know what the fix should be, but handling of SCRIPT might need to be more special than it is.

     
  • Logged In: YES
    user_id=69099
    Originator: YES

    Looking at your reference, it's pretty clear that the content of the script element cannot end until a </[a-zA-Z] is found. Looks like the recognizer needs to be a bit more discriminating.

     
  • Logged In: YES
    user_id=69099
    Originator: YES

    This problem is now more urgent than it had been before. I'm working with JavaScript that is writing the HTML as it runs, including adding new <SCRIPT>...</SCRIPT> code inside of JavaScript document.write statements. htmlparse is getting seriously confused inside the JavaScript, and the resulting nodes are truly mucking up my parse tree. Some nodes aren't even at the right level any more, which makes my post processing lose them, which is very bad news for my application.
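
The rule discussed in the thread (script content runs until the first `</` followed by a letter) can be sketched in Python as follows. This is only an illustration under that assumption, not a patch for htmlparse, which is written in Tcl; `OPEN_SCRIPT`, `END_OF_CDATA`, and `script_body` are hypothetical names.

```python
import re

# Start tag of a SCRIPT element, matched case-insensitively.
OPEN_SCRIPT = re.compile(r"<script\b[^>]*>", re.IGNORECASE)

# End of the script data as described in the thread:
# "</" immediately followed by a letter.
END_OF_CDATA = re.compile(r"</(?=[A-Za-z])")


def script_body(html):
    """Return (body, rest) for the first SCRIPT element, or None if absent.

    body is the raw script text; rest starts at the "</" that ended it.
    """
    start = OPEN_SCRIPT.search(html)
    if start is None:
        return None
    end = END_OF_CDATA.search(html, start.end())
    body_end = end.start() if end else len(html)
    return html[start.end():body_end], html[body_end:]


# A bare "<" inside the script no longer ends the script data.
print(script_body("<SCRIPT>for(i=0; i<n; i++) f(i);</SCRIPT><P>x</P>"))

# An unescaped end tag inside document.write still ends it early ...
print(script_body('<SCRIPT>document.write("<B>hi</B>");</SCRIPT>'))

# ... while the escaped form "<\/" does not.
print(script_body('<SCRIPT>document.write("<B>hi<\\/B>");</SCRIPT>'))
```

Under this rule a comparison such as `i<n` stays inside the script body, but a literal `</SCRIPT>` or `</B>` inside a `document.write` string still terminates the element. The HTML 4 notes linked earlier in the thread suggest escaping such sequences as `<\/` for that reason.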