From: Patrick R. <pg...@vt...> - 2003-06-27 20:00:20
Hi folks, I just installed htdig-3.2.0b4-20030622, and discovered that it's not correctly handling Disallow: patterns from my robots.txt file. (I'm hoping this is the correct list to post this!)

I have these lines in my robots.txt:

User-agent: *
Disallow: /WebObjects/

In my config file, I do NOT exclude /cgi-bin/ via exclude_urls. However, when I run rundig -vvv, it tells me that URLs like the following are rejected as "forbidden by server robots.txt":

href: http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah

This shouldn't happen. It should only be rejecting URLs *starting* with "/WebObjects/" (at least, that's my interpretation of what I read at http://www.robotstxt.org/wc/norobots.html). If I remove the "Disallow: /WebObjects/" line from robots.txt and rerun rundig, those URLs are indexed. I never had this problem in 3.1.6. Has something changed?

--
Patrick Robinson
AHNR Info Technology, Virginia Tech
pg...@vt...
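For reference, the anchored-prefix behavior Patrick expects from the standard can be sketched in a few lines. This is illustrative C++ only, not htdig code; the function name is made up:

```cpp
#include <cassert>
#include <string>

// Illustrative only: under the robots.txt exclusion standard,
// "Disallow: /WebObjects/" forbids a URL when its *path* begins with
// the rule value -- a prefix test, not a substring search anywhere
// in the URL.
static bool path_disallowed(const std::string &path, const std::string &rule)
{
    return path.compare(0, rule.size(), rule) == 0;
}
```

Read this way, /WebObjects/foo is blocked but /cgi-bin/WebObjects/foo is not, which is exactly the behavior Patrick reports losing.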
From: Jim C. <li...@yg...> - 2003-06-29 07:45:39
Attachments:
robots.patch
On Friday, June 27, 2003, at 02:00 PM, Patrick Robinson wrote:

> I just installed htdig-3.2.0b4-20030622, and discovered that it's not
> correctly handling Disallow: patterns from my robots.txt file. (I'm
> hoping this is the correct list to post this!)
>
> I have these lines in my robots.txt:
> User-agent: *
> Disallow: /WebObjects/
>
> In my config file, I do NOT exclude /cgi-bin/ via exclude_urls.
> However, when I run rundig -vvv, it tells me that URLs like the
> following are rejected as "forbidden by server robots.txt":
> href: http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah

I am seeing the same behavior in the current CVS code. As currently implemented, URLs are checked for any occurrence of the disallow string, without regard to its location within the URL.

> This shouldn't happen. It should only be rejecting URLs *starting*
> with "/WebObjects/" (at least, that's my interpretation of what I read
> at http://www.robotstxt.org/wc/norobots.html).

I agree that this behavior does not match what the standard specifies.

> I never had this problem in 3.1.6. Has something changed?

I believe some of the related code changed with the introduction of the new regex support. As it currently stands, the code compares the disallow value against the full URL rather than just the path, and it does not anchor the comparison.

In case you want to give it a try, I am attaching a patch that seems to correct the behavior of the robots code. I won't claim to have any deep insight into this part of the code, so no guarantees and all of that.

Jim
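The path-only comparison Jim describes first requires splitting the path out of the full URL. A minimal sketch of that step, in illustrative C++ (this is not the attached patch; htdig's own URL class already provides an accessor for the path, and the helper name below is made up):

```cpp
#include <cassert>
#include <string>

// Illustrative sketch: return only the path portion of an absolute
// URL, so a Disallow rule can be anchored against it with a prefix
// test. htdig's URL class has its own accessor for this; this helper
// exists only for exposition.
static std::string url_path(const std::string &url)
{
    std::string::size_type scheme = url.find("://");
    if (scheme == std::string::npos)
        return url;                               // already a bare path
    std::string::size_type slash = url.find('/', scheme + 3);
    return (slash == std::string::npos) ? std::string("/")
                                        : url.substr(slash);
}
```

With the path isolated, "Disallow: /WebObjects/" no longer matches /cgi-bin/WebObjects/... because the comparison starts at the beginning of the path.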
From: Gilles D. <gr...@sc...> - 2003-07-08 22:30:41
According to Jim Cole:
> On Friday, June 27, 2003, at 02:00 PM, Patrick Robinson wrote:
> > I just installed htdig-3.2.0b4-20030622, and discovered that it's not
> > correctly handling Disallow: patterns from my robots.txt file. (I'm
> > hoping this is the correct list to post this!)
> >
> > I have these lines in my robots.txt:
> > User-agent: *
> > Disallow: /WebObjects/
> >
> > In my config file, I do NOT exclude /cgi-bin/ via exclude_urls.
> > However, when I run rundig -vvv, it tells me that URLs like the
> > following are rejected as "forbidden by server robots.txt":
> > href: http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah
>
> I am seeing the same behavior in the current CVS code. As currently
> implemented, URLs are checked for any occurrence of the disallow
> string, without regard to its location within the URL.
>
> > This shouldn't happen. It should only be rejecting URLs *starting*
> > with "/WebObjects/" (at least, that's my interpretation of what I read
> > at http://www.robotstxt.org/wc/norobots.html).
>
> I agree that this behavior does not match what the standard specifies.

Correct. This has been reported before, and possible solutions were discussed, but nobody followed through with implementing one.

> > I never had this problem in 3.1.6. Has something changed?
>
> I believe some of the related code changed with the introduction of
> the new regex support. As it currently stands, the code compares the
> disallow value against the full URL rather than just the path, and it
> does not anchor the comparison.

Correct again. Either anchoring the comparison, or going back to using StringMatch instead of Regex, would solve it, but in either case you must be sure you're always comparing against only the path portion of the URL, not the full URL as the 3.2 code does now.

> In case you want to give it a try, I am attaching a patch that seems
> to correct the behavior of the robots code. I won't claim to have any
> deep insight into this part of the code, so no guarantees and all of
> that.

The problem with that patch is that it seems to miss the case of IsDisallowed() called from Server::push(): there it would end up checking the full URL against the anchored path patterns, so you'd never get a match. Unless the tests in Retriever::IsValidURL() pre-screen all cases before attempting a push(), I think some disallowed URLs could slip through the cracks.

A more self-contained fix is below. It sidesteps the whole issue by building a regex pattern that can match the whole URL, so minimal code changes are needed. I don't know how efficient this ends up being, though. I also haven't tested it beyond making sure the full pattern works in egrep, so please test this patch carefully before using it. I'll await feedback before committing it.

--- htdig/Server.cc.orig	2003-06-24 15:40:11.000000000 -0500
+++ htdig/Server.cc	2003-07-08 17:16:18.000000000 -0500
@@ -316,9 +316,13 @@ void Server::robotstxt(Document &doc)
 	    if (*rest)
 	    {
 		if (pattern.length())
-		    pattern << '|' << rest;
-		else
-		    pattern = rest;
+		    pattern << '|';
+		while (*rest)
+		{
+		    if (strchr("^.[$()|*+?{\\", *rest))
+			pattern << '\\';
+		    pattern << *rest++;
+		}
 	    }
 	}
     //
@@ -332,7 +336,9 @@ void Server::robotstxt(Document &doc)
 
     if (debug > 1)
 	cout << "Pattern: " << pattern << endl;
 
-    _disallow.set(pattern, config->Boolean("case_sensitive"));
+    String fullpatt = "^[^:]*://[^/]*(";
+    fullpatt << pattern << ')';
+    _disallow.set(fullpatt, config->Boolean("case_sensitive"));
 }

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
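The approach in Gilles's patch can be exercised outside htdig with a standalone sketch. The code below uses std::regex in place of htdig's Regex and String classes, and the function names are illustrative; only the metacharacter set and the "^[^:]*://[^/]*(...)" wrapper come from the patch itself:

```cpp
#include <cassert>
#include <cstring>
#include <regex>
#include <string>
#include <vector>

// Escape regex metacharacters in one Disallow value, using the same
// character set as the patch, so a literal path like "/print.cgi"
// cannot be misread as a regex.
static std::string escape_rule(const std::string &rule)
{
    std::string out;
    for (char c : rule)
    {
        if (std::strchr("^.[$()|*+?{\\", c))
            out += '\\';
        out += c;
    }
    return out;
}

// Join the escaped rules with '|' and wrap them so the pattern skips
// over "scheme://host" and anchors each alternative at the start of
// the path -- the full-URL pattern the patch builds.
static std::regex build_disallow(const std::vector<std::string> &rules)
{
    std::string pattern;
    for (const std::string &r : rules)
    {
        if (!pattern.empty())
            pattern += '|';
        pattern += escape_rule(r);
    }
    return std::regex("^[^:]*://[^/]*(" + pattern + ")");
}
```

Because the host part is matched by "[^/]*", the first "/WebObjects/" alternative can only match immediately after the host, so a URL whose path merely contains the string later (like /cgi-bin/WebObjects/...) is no longer rejected.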