Re: [htdig-dev] Possible Parser Bug (was Re: [htdig] reading htdig -vvv output)

According to Geoff Hutchison:
> On Friday, March 8, 2002, at 05:20  PM, Jim Cole wrote:
> > It does look like there is a problem with the parser. If a '<'
> > occurs in a script element, it appears that the parser becomes
> > somewhat confused with regard to the remaining document content.
> > For example
> 
> Yes, this sounds like a bug to me. Actually, the <script> sections and 
> probably other sections as well should be simply skipped by the parser. 
> Right now the code does this:
> 
> >         case 29:        // "script"
> >             noindex |= TAGscript;
> >             nofollow |= TAGscript;
> >             break;
> 
> In short, the parser doesn't *index* the bits inside <script></script> 
> tags, but it does *look* at them. So it hit that "<" character and 
> figured it was a new tag.
> 
> I would think that we want to treat <script> and probably <style> 
> sections like comments--find the ending tag and completely ignore 
> everything inside.

I think your assessment of the problem and your proposed solution are
both bang-on.  Everything between the <script> and </script> tags should
be stripped out entirely and never scanned for HTML tags.
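
To make the idea concrete, here is a minimal standalone sketch of that
approach.  It is not ht://Dig's parser code: findNoCase() and
stripElement() are hypothetical helpers, and it assumes the whole
document is already in a std::string.  It locates the opening tag, jumps
straight to the matching end tag, and never examines what lies between.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of ht://Dig): case-insensitive search for
// `needle` in `hay`, starting at offset `from`. Returns npos if not found.
static std::string::size_type findNoCase(const std::string &hay,
                                         const std::string &needle,
                                         std::string::size_type from)
{
    if (from > hay.size())
        return std::string::npos;
    auto it = std::search(hay.begin() + from, hay.end(),
                          needle.begin(), needle.end(),
                          [](char a, char b) {
                              return std::tolower((unsigned char)a) ==
                                     std::tolower((unsigned char)b);
                          });
    return it == hay.end() ? std::string::npos
                           : (std::string::size_type)(it - hay.begin());
}

// Hypothetical helper: return a copy of `html` with every <tag ...>...</tag>
// element removed. The element's content is skipped, never parsed, so a
// stray '<' inside it cannot be mistaken for the start of a new tag.
std::string stripElement(const std::string &html, const std::string &tag)
{
    assert(!tag.empty());
    const std::string open = "<" + tag;
    const std::string close = "</" + tag;
    std::string out;
    std::string::size_type pos = 0;

    for (;;) {
        std::string::size_type start = findNoCase(html, open, pos);
        if (start == std::string::npos) {
            out.append(html, pos, std::string::npos);  // no more elements
            break;
        }
        std::string::size_type after = start + open.size();
        // Require a tag-name boundary, so "<scripts" is not treated as "<script".
        if (after < html.size() && html[after] != '>' && html[after] != '/' &&
            !std::isspace((unsigned char)html[after])) {
            out.append(html, pos, after - pos);
            pos = after;
            continue;
        }
        out.append(html, pos, start - pos);            // keep text before the element
        std::string::size_type end = findNoCase(html, close, after);
        if (end == std::string::npos)
            break;                                     // unterminated: drop the rest
        std::string::size_type gt = html.find('>', end);
        if (gt == std::string::npos)
            break;                                     // malformed end tag: drop the rest
        pos = gt + 1;                                  // resume after the end tag
    }
    return out;
}
```

Calling stripElement(doc, "script") and then stripElement(doc, "style")
before tag parsing would treat both kinds of section the way comments
are treated.  Dropping the rest of the document when the end tag is
missing is just one possible choice for the sketch.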

Of course, you can avoid this problem in your HTML if you properly put
inline JavaScript code inside an HTML comment.  E.g.:

<script>
<!--

JavaScript code here

// -->
</script>

I'm amazed at how frequently people/programs fail to do this.  It's
what you're supposed to do to avoid problems with non-JavaScript-aware
web clients.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930