From: Patrick R. <pg...@vt...> - 2003-06-27 20:00:20
Hi folks, I just installed htdig-3.2.0b4-20030622, and discovered that it's not correctly handling Disallow: patterns from my robots.txt file. (I'm hoping this is the correct list to post this!)

I have these lines in my robots.txt:

User-agent: *
Disallow: /WebObjects/

In my config file, I do NOT exclude /cgi-bin/ via exclude_urls. However, when I run rundig -vvv, it tells me that URLs like the following are rejected as "forbidden by server robots.txt":

href: http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah

This shouldn't happen. It should only be rejecting URLs *starting* with "/WebObjects/" (at least, that's my interpretation of what I read at http://www.robotstxt.org/wc/norobots.html). If I remove the "Disallow: /WebObjects/" line from robots.txt and rerun rundig, those URLs are indexed. I never had this problem in 3.1.6. Has something changed?

--
Patrick Robinson
AHNR Info Technology, Virginia Tech
pg...@vt...
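For reference, the anchored-prefix behavior Patrick expects from the standard can be sketched in a few lines. This is illustrative C++ only, not htdig code; the function name is made up:

```cpp
#include <cassert>
#include <string>

// Illustrative only: under the robots.txt exclusion standard,
// "Disallow: /WebObjects/" forbids a URL when its *path* begins with
// the rule value -- a prefix test, not a substring search anywhere
// in the URL.
static bool path_disallowed(const std::string &path, const std::string &rule)
{
    return path.compare(0, rule.size(), rule) == 0;
}
```

Read this way, /WebObjects/foo is blocked but /cgi-bin/WebObjects/foo is not, which is exactly the behavior Patrick reports losing.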
From: Jim C. <li...@yg...> - 2003-06-29 07:45:39
Attachments:
robots.patch
On Friday, June 27, 2003, at 02:00 PM, Patrick Robinson wrote:

> I just installed htdig-3.2.0b4-20030622, and discovered that it's not
> correctly handling Disallow: patterns from my robots.txt file. (I'm
> hoping this is the correct list to post this!)
>
> I have these lines in my robots.txt:
> User-agent: *
> Disallow: /WebObjects/
>
> In my config file, I do NOT exclude /cgi-bin/ via exclude_urls.
> However, when I run rundig -vvv, it tells me that URLs like the
> following are rejected as "forbidden by server robots.txt":
> href: http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah

I am seeing the same behavior in the current CVS code. As currently implemented, URLs are checked for any occurrence of the disallow string, without regard to its location within the URL.

> This shouldn't happen. It should only be rejecting URLs *starting*
> with "/WebObjects/" (at least, that's my interpretation of what I read
> at http://www.robotstxt.org/wc/norobots.html).

I agree that this behavior does not match what the standard specifies.

> I never had this problem in 3.1.6. Has something changed?

I believe some of the related code changed with the introduction of the new regex support. As it currently stands, the code compares the disallow value against the full URL rather than just the path, and it does not anchor the comparison.

In case you want to give it a try, I am attaching a patch that seems to correct the behavior of the robots code. I won't claim to have any deep insight into this part of the code, so no guarantees and all of that.

Jim
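The path-only comparison Jim describes first requires splitting the path out of the full URL. A minimal sketch of that step, in illustrative C++ (this is not the attached patch; htdig's own URL class already provides an accessor for the path, and the helper name below is made up):

```cpp
#include <cassert>
#include <string>

// Illustrative sketch: return only the path portion of an absolute
// URL, so a Disallow rule can be anchored against it with a prefix
// test. htdig's URL class has its own accessor for this; this helper
// exists only for exposition.
static std::string url_path(const std::string &url)
{
    std::string::size_type scheme = url.find("://");
    if (scheme == std::string::npos)
        return url;                               // already a bare path
    std::string::size_type slash = url.find('/', scheme + 3);
    return (slash == std::string::npos) ? std::string("/")
                                        : url.substr(slash);
}
```

With the path isolated, "Disallow: /WebObjects/" no longer matches /cgi-bin/WebObjects/... because the comparison starts at the beginning of the path.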
From: Gilles D. <gr...@sc...> - 2003-07-08 22:30:41
According to Jim Cole:
> On Friday, June 27, 2003, at 02:00 PM, Patrick Robinson wrote:
> > I just installed htdig-3.2.0b4-20030622, and discovered that it's not
> > correctly handling Disallow: patterns from my robots.txt file. (I'm
> > hoping this is the correct list to post this!)
> >
> > I have these lines in my robots.txt:
> > User-agent: *
> > Disallow: /WebObjects/
> >
> > In my config file, I do NOT exclude /cgi-bin/ via exclude_urls.
> > However, when I run rundig -vvv, it tells me that URLs like the
> > following are rejected as "forbidden by server robots.txt":
> > href: http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah
>
> I am seeing the same behavior in the current CVS code. As currently
> implemented, URLs are checked for any occurrence of the disallow
> string, without regard to its location within the URL.
>
> > This shouldn't happen. It should only be rejecting URLs *starting*
> > with "/WebObjects/" (at least, that's my interpretation of what I read
> > at http://www.robotstxt.org/wc/norobots.html).
>
> I agree that this behavior does not match what the standard specifies.

Correct. This has been reported before, and possible solutions were discussed, but nobody followed through with implementing one.

> > I never had this problem in 3.1.6. Has something changed?
>
> I believe some of the related code changed with the introduction of
> the new regex support. As it currently stands, the code compares the
> disallow value against the full URL rather than just the path, and it
> does not anchor the comparison.

Correct again. Either anchoring the comparison, or going back to using StringMatch instead of Regex, would solve it, but in either case you must be sure you're always comparing against only the path portion of the URL, not the full URL as the 3.2 code does now.

> In case you want to give it a try, I am attaching a patch that seems
> to correct the behavior of the robots code. I won't claim to have any
> deep insight into this part of the code, so no guarantees and all of
> that.

The problem with that patch is that it seems to miss the case of IsDisallowed() called from Server::push(): there it would end up checking the full URL against the anchored path patterns, so you'd never get a match. Unless the tests in Retriever::IsValidURL() pre-screen all cases before attempting a push(), I think some disallowed URLs could slip through the cracks.

A more self-contained fix is below. It sidesteps the whole issue by building a regex pattern that can match the whole URL, so minimal code changes are needed. I don't know how efficient this ends up being, though. I also haven't tested it beyond making sure the full pattern works in egrep, so please test this patch carefully before using it. I'll await feedback before committing it.

--- htdig/Server.cc.orig	2003-06-24 15:40:11.000000000 -0500
+++ htdig/Server.cc	2003-07-08 17:16:18.000000000 -0500
@@ -316,9 +316,13 @@ void Server::robotstxt(Document &doc)
 	    if (*rest)
 	    {
 		if (pattern.length())
-		    pattern << '|' << rest;
-		else
-		    pattern = rest;
+		    pattern << '|';
+		while (*rest)
+		{
+		    if (strchr("^.[$()|*+?{\\", *rest))
+			pattern << '\\';
+		    pattern << *rest++;
+		}
 	    }
 	}
     //
@@ -332,7 +336,9 @@ void Server::robotstxt(Document &doc)
 
     if (debug > 1)
 	cout << "Pattern: " << pattern << endl;
 
-    _disallow.set(pattern, config->Boolean("case_sensitive"));
+    String fullpatt = "^[^:]*://[^/]*(";
+    fullpatt << pattern << ')';
+    _disallow.set(fullpatt, config->Boolean("case_sensitive"));
 }

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
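The approach in Gilles's patch can be exercised outside htdig with a standalone sketch. The code below uses std::regex in place of htdig's Regex and String classes, and the function names are illustrative; only the metacharacter set and the "^[^:]*://[^/]*(...)" wrapper come from the patch itself:

```cpp
#include <cassert>
#include <cstring>
#include <regex>
#include <string>
#include <vector>

// Escape regex metacharacters in one Disallow value, using the same
// character set as the patch, so a literal path like "/print.cgi"
// cannot be misread as a regex.
static std::string escape_rule(const std::string &rule)
{
    std::string out;
    for (char c : rule)
    {
        if (std::strchr("^.[$()|*+?{\\", c))
            out += '\\';
        out += c;
    }
    return out;
}

// Join the escaped rules with '|' and wrap them so the pattern skips
// over "scheme://host" and anchors each alternative at the start of
// the path -- the full-URL pattern the patch builds.
static std::regex build_disallow(const std::vector<std::string> &rules)
{
    std::string pattern;
    for (const std::string &r : rules)
    {
        if (!pattern.empty())
            pattern += '|';
        pattern += escape_rule(r);
    }
    return std::regex("^[^:]*://[^/]*(" + pattern + ")");
}
```

Because the host part is matched by "[^/]*", the first "/WebObjects/" alternative can only match immediately after the host, so a URL whose path merely contains the string later (like /cgi-bin/WebObjects/...) is no longer rejected.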