From: Jim C. <li...@yg...> - 2003-06-29 07:45:39
On Friday, June 27, 2003, at 02:00 PM, Patrick Robinson wrote:

> I just installed htdig-3.2.0b4-20030622, and discovered that it's not
> correctly handling Disallow: patterns from my robots.txt file. (I'm
> hoping this is the correct list to post this!)
>
> I have these lines in my robots.txt:
> User-agent: *
> Disallow: /WebObjects/
>
> In my config file, I do NOT exclude /cgi-bin/ via exclude_urls.
> However, when I run rundig -vvv, it tells me that URLs like the
> following are rejected due to being "forbidden by server robots.txt":
> href: http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah

I am seeing the same behavior in the current CVS code. As currently
implemented, URLs are checked for any occurrence of the disallow string,
without regard to where it appears in the URL.

> This shouldn't happen. It should only be rejecting URLs *starting*
> with "/WebObjects/" (at least, that's my interpretation of what I read
> at http://www.robotstxt.org/wc/norobots.html).

I agree that this behavior does not match what the standard specifies.

> I never had this problem in 3.1.6. Has something changed?

I believe some of the related code changed with the introduction of the
new regex support. As it stands, the code compares the disallow pattern
against the full URL rather than just the path, and it does not anchor
the comparison to the start of the path (see the sketch below for what
an anchored, path-only check looks like).

In case you want to give it a try, I am attaching a patch that seems to
correct the behavior of the robots code. I won't claim to have any deep
insight into this part of the code, so no guarantees and all of that.

Jim
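
For illustration only (this is not the attached patch, which is not
reproduced here, nor the actual htdig code), a minimal sketch of the
anchored, path-only Disallow check described above. The function name
and structure are hypothetical; it assumes the path component has
already been extracted from the full URL:

#include <string>
#include <vector>
#include <iostream>

// Return true if the URL path is blocked by any Disallow prefix.
// Per http://www.robotstxt.org/wc/norobots.html, a rule matches only
// when it is a prefix of the path, not when it occurs anywhere in the
// full URL.
static bool disallowed(const std::string &path,
                       const std::vector<std::string> &disallows)
{
    for (const std::string &rule : disallows)
    {
        if (rule.empty())
            continue;  // an empty "Disallow:" permits everything
        if (path.compare(0, rule.size(), rule) == 0)
            return true;  // rule matches at the start of the path
    }
    return false;
}

int main()
{
    std::vector<std::string> disallows = { "/WebObjects/" };

    // Blocked: the path begins with the disallowed prefix.
    std::cout << disallowed("/WebObjects/app", disallows) << "\n";          // 1

    // Not blocked: the prefix occurs later in the path, as in the
    // /cgi-bin/WebObjects/... URLs from the report above.
    std::cout << disallowed("/cgi-bin/WebObjects/blah", disallows) << "\n"; // 0
}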