From: Martin <mar...@un...> - 2002-10-22 16:08:24
Since it was bounced, and after resending I got no replies for 10 days,
I'm trying to post it to the dev@ list...

----- Forwarded message from Martin Mačok <mar...@un...> -----

Date: Thu, 10 Oct 2002 09:27:13 +0200
From: Martin Mačok <mar...@un...>
To: htd...@ht...
Subject: robots.txt URL matching (OK in 3.1.x, bad in 3.2.0b)

Hi,

I've (probably) found a bug (with a little help from wwwoffle author
"Andrew M. Bishop" <amb(at)gedanken.demon.co.uk>) in ht://Dig
3.2.0b4-072201 (from the Mandrake package) in robots.txt URL matching.

When you disallow "/foo", htdig then rejects "/bar/foo", but according
to http://www.robotstxt.org/wc/norobots.html it should reject only URLs
_starting_ with (not just containing) the disallowed string.

I found it with the wwwoffle cache indexing scripts. htdig 3.1.x worked
well, but after upgrading to 3.2.0b4-072201 it broke. The cached pages
are under the "/search/index" directory and "/index" is disallowed.

You can see that 3.2.0b rejects "/search/index" in the debug output:

-------------------
Robots.txt line: Disallow: /index
Found 'disallow' line: /index
Pattern: /control|/configuration|/refresh|/monitor|/index
[...]
pushing http://localhost:8080/search/start3.html
+href: http://localhost:8080/search/index/ (The WWWOFFLE searchable index of all cached web pages)
Rejected: forbidden by server robots.txt!
-------------------

I'm sorry for not sending a patch; I'm offline now and don't have the
sources on my hdd (and dialup is expensive here during the day), but I
think it should be trivial to fix.

Thanks a lot and have a nice day

-- 
Martin Mačok                 http://underground.cz/
mar...@un...                 http://Xtrmntr.org/ORBman/
Reclaim your rights! - http://www.digitalspeech.org/

----- End forwarded message -----

-- 
Martin Mačok
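
To illustrate the report above, here is a minimal Python sketch of the two matching behaviours. It is not ht://Dig's code (which I don't have at hand); the function names `is_disallowed_prefix` and `is_disallowed_substring` are hypothetical, and the substring variant is only my assumption about what 3.2.0b4 effectively does, based on the debug output. The Disallow values are the ones shown in the "Pattern:" line.

```python
from urllib.parse import urlsplit

# Disallow values taken from the "Pattern:" line in the debug output.
DISALLOWED = ["/control", "/configuration", "/refresh", "/monitor", "/index"]


def is_disallowed_prefix(url, disallowed=DISALLOWED):
    """Prefix match: reject only if the URL path starts with a rule."""
    path = urlsplit(url).path or "/"
    return any(path.startswith(rule) for rule in disallowed)


def is_disallowed_substring(url, disallowed=DISALLOWED):
    """Substring match: assumed model of the behaviour seen in 3.2.0b4."""
    path = urlsplit(url).path or "/"
    return any(rule in path for rule in disallowed)


url = "http://localhost:8080/search/index/"
print(is_disallowed_prefix(url))     # False: path does not start with /index
print(is_disallowed_substring(url))  # True: path merely contains /index
print(is_disallowed_prefix("http://localhost:8080/index/"))  # True
```

With prefix matching, "/search/index/" is allowed and only paths that begin with "/index" are rejected, which is the behaviour the robots.txt document linked above describes.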