From: Reini U. <ru...@x-...> - 2002-10-21 12:19:06
Johnny L. Wales schrieb:
> I was looking around the sourceforge page and noticed that there's an open
> task to write a robots.txt file which will prevent a few pages from being
> indexed.
>
> Maybe instead, we should include tags like this on pages we don't want
> indexed:
>   <META NAME="ROBOTS" CONTENT="NOINDEX">
>
> And, if you want the robot to stop following links on this page, you add
> this to it:
>   <META NAME="ROBOTS" CONTENT="NOFOLLOW">
>
> That should get everything you need to do done, right?

We already use the robots meta tag. The problem is that some robots ignore
these tags, and robots.txt as well, so the only solution is to block them
outright. Ward's wiki uses a timeout; my first patch was based on
$REMOTE_HOST and $HTTP_USER_AGENT. I had this:

  $badrobots = array ('gw01.webtop.com',
                      '202.102.65.191',
                      '202.111.8.102',
                      // '202.39.29.102',  HTTrack 2.0x
                      // '212.182.4.121',  HTTrack 2.0x
                      '61.132.57.226',
                      'lgdx06atm.lg.ehu.es', // reported falsely as Mozilla
                      );
  $badagentsre = '/(WebZIP)|(Teleport Pro)|(Googlebot)|(DigExt)|(FAST-WebCrawler)|(Wget)|(Mercator-1.2)|(HTTrack)|(Openfind)/';
  // good robots: FAST-WebCrawler, TridentSpider3

This should be an optional configuration item.

--
Reini Urban    http://xarch.tu-graz.ac.at/home/rurban/
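
For reference, a minimal sketch of how such a blocklist could be enforced at
the top of the request, before any page rendering. The variable names follow
the config above; the fallback to $REMOTE_ADDR, the 403 response, and the
exact hook point are assumptions for illustration, not the actual patch:

  // Deny blacklisted hosts/IPs and user agents before serving anything.
  // Relies on register_globals providing $REMOTE_HOST, $REMOTE_ADDR and
  // $HTTP_USER_AGENT, as was usual in PHP installations of this era.
  // $REMOTE_HOST is only set when the web server does reverse DNS lookups,
  // hence the fallback to the raw IP address.
  $host  = !empty($REMOTE_HOST) ? $REMOTE_HOST : $REMOTE_ADDR;
  $agent = $HTTP_USER_AGENT;
  if (in_array($host, $badrobots)
      || in_array($REMOTE_ADDR, $badrobots)
      || preg_match($badagentsre, $agent)) {
      header("HTTP/1.0 403 Forbidden");
      print "Robots are not allowed to spider this wiki.\n";
      exit;
  }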