Re: [htdig-dev] Possible Parser Bug (was Re: [htdig] reading htdig -vvv output)

On Friday, March 8, 2002, at 05:20  PM, Jim Cole wrote:

> It does look like there is a problem with the parser. If a '<'
> occurs in a script element, it appears that the parser becomes
> somewhat confused with regard to the remaining document content.
> For example

Yes, this sounds like a bug to me. The parser should simply skip <script>
sections, and probably some other sections as well.
Right now the code does this:

>         case 29:        // "script"
>             noindex |= TAGscript;
>             nofollow |= TAGscript;
>             break;

In short, the parser doesn't *index* the bits inside <script></script> 
tags, but it does *look* at them. So it hit that "<" character and 
figured it was a new tag.

I would think that we want to treat <script> and probably <style> 
sections like comments--find the ending tag and completely ignore 
everything inside.
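
Something along these lines is what I have in mind. This is only a rough sketch, not the actual ht://Dig code: it uses std::string rather than our own String class, and the names skip_raw_element and matches_at are made up for illustration. The idea is that once the opening tag has been seen, the parser jumps straight past the closing tag without interpreting any '<' in between.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ht://Dig): case-insensitive test for
// whether `word` occurs in `text` starting at offset `pos`.
static bool matches_at(const std::string& text, std::size_t pos,
                       const std::string& word)
{
    if (pos > text.size() || word.size() > text.size() - pos)
        return false;
    for (std::size_t i = 0; i < word.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(text[pos + i])) !=
            std::tolower(static_cast<unsigned char>(word[i])))
            return false;
    }
    return true;
}

// Hypothetical helper: `pos` is the offset just past an opening tag such as
// <script ...> or <style ...>. Returns the offset just past the matching
// closing tag, or html.size() if the element is never closed. Nothing
// between the two tags is interpreted, so a stray '<' is harmless.
std::size_t skip_raw_element(const std::string& html, std::size_t pos,
                             const std::string& name)
{
    assert(pos <= html.size());
    const std::string closer = "</" + name;
    for (std::size_t i = pos; i < html.size(); ++i) {
        if (html[i] != '<' || !matches_at(html, i, closer))
            continue;
        std::size_t after = i + closer.size();
        // Require '>' or whitespace after the name so that, for example,
        // "</scripts" is not mistaken for "</script".
        if (after < html.size() && html[after] != '>' &&
            !std::isspace(static_cast<unsigned char>(html[after])))
            continue;
        std::size_t end = html.find('>', after);
        return end == std::string::npos ? html.size() : end + 1;
    }
    return html.size();
}
```

The "script" case in the tag switch would then advance the parse position to the returned offset, much as comments are skipped, instead of only setting the noindex and nofollow flags.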

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/