From: Jim C. <li...@yg...> - 2003-06-29 07:45:39
On Friday, June 27, 2003, at 02:00 PM, Patrick Robinson wrote:

> I just installed htdig-3.2.0b4-20030622, and discovered that it's not
> correctly handling Disallow: patterns from my robots.txt file. (I'm
> hoping this is the correct list to post this!)
>
> I have these lines in my robots.txt:
> User-agent: *
> Disallow: /WebObjects/
>
> In my config file, I do NOT exclude /cgi-bin/ via exclude_urls.
> However, when I run rundig -vvv, it tells me that URLs like the
> following are rejected due to being "forbidden by server robots.txt":
> href: http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah

I am seeing the same behavior in the current CVS code. As currently
implemented, URLs are checked for any occurrence of the disallow string,
without regard to where it appears in the URL.

> This shouldn't happen. It should only be rejecting URLs *starting*
> with "/WebObjects/" (at least, that's my interpretation of what I read
> at http://www.robotstxt.org/wc/norobots.html).

I agree that this behavior does not match what the standard specifies.

> I never had this problem in 3.1.6. Has something changed?

I believe some of the related code changed with the introduction of the
new regex support. As it stands, the code compares the disallow pattern
against the full URL rather than just the path, and it does not anchor
the comparison to the start of the path (see the sketch below for what
an anchored, path-only check looks like).

In case you want to give it a try, I am attaching a patch that seems to
correct the behavior of the robots code. I won't claim to have any deep
insight into this part of the code, so no guarantees and all of that.

Jim
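
For illustration only (this is not the attached patch, which is not
reproduced here, nor the actual htdig code), a minimal sketch of the
anchored, path-only Disallow check described above. The function name
and structure are hypothetical; it assumes the path component has
already been extracted from the full URL:

#include <string>
#include <vector>
#include <iostream>

// Return true if the URL path is blocked by any Disallow prefix.
// Per http://www.robotstxt.org/wc/norobots.html, a rule matches only
// when it is a prefix of the path, not when it occurs anywhere in the
// full URL.
static bool disallowed(const std::string &path,
                       const std::vector<std::string> &disallows)
{
    for (const std::string &rule : disallows)
    {
        if (rule.empty())
            continue;  // an empty "Disallow:" permits everything
        if (path.compare(0, rule.size(), rule) == 0)
            return true;  // rule matches at the start of the path
    }
    return false;
}

int main()
{
    std::vector<std::string> disallows = { "/WebObjects/" };

    // Blocked: the path begins with the disallowed prefix.
    std::cout << disallowed("/WebObjects/app", disallows) << "\n";          // 1

    // Not blocked: the prefix occurs later in the path, as in the
    // /cgi-bin/WebObjects/... URLs from the report above.
    std::cout << disallowed("/cgi-bin/WebObjects/blah", disallows) << "\n"; // 0
}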