Johnny L. Wales wrote:
> I was looking around the sourceforge page and noticed that there's an open
> task to write a robots.txt file which will prevent a few pages from being
> indexed.
>
> Maybe instead, we should include tags like this on pages we don't want
> indexed:
> <META NAME="ROBOTS" CONTENT="NOINDEX">
>
> And, if you want the robot to stop following links on this page, you add
> this to it:
> <META NAME="ROBOTS" CONTENT="NOFOLLOW">
>
> That should get everything you need to do done, right?
We already use the robots meta tag. The problem is that some robots
ignore these tags and robots.txt as well, so the only real solution is to
block them outright. Ward's wiki uses a timeout; my first patch was based
on $REMOTE_HOST and $HTTP_USER_AGENT.
I had this:
$badrobots = array ('gw01.webtop.com',
'202.102.65.191',
'202.111.8.102',
// '202.39.29.102', HTTrack 2.0x
// '212.182.4.121' HTTrack 2.0x
'61.132.57.226',
'lgdx06atm.lg.ehu.es', // reported falsely as Mozilla
);
$badagentsre = '/(WebZIP)|(Teleport Pro)|(Googlebot)|(DigExt)|(FAST-WebCrawler)|(Wget)|(Mercator-1.2)|(HTTrack)|(Openfind)/';
// good robots: FAST-WebCrawler, TridentSpider3
This should be an optional configuration item.
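
For illustration, the check could be wired up roughly like this (just a
sketch, not the actual patch; I use $_SERVER plus a gethostbyaddr()
fallback here instead of the old register_globals variables
$REMOTE_HOST/$HTTP_USER_AGENT that the patch works on):

$remote_addr = $_SERVER['REMOTE_ADDR'];
$remote_host = isset($_SERVER['REMOTE_HOST'])
    ? $_SERVER['REMOTE_HOST'] : gethostbyaddr($remote_addr);
$user_agent  = isset($_SERVER['HTTP_USER_AGENT'])
    ? $_SERVER['HTTP_USER_AGENT'] : '';

// refuse the request if the host/IP is blacklisted
// or the user agent matches the bad-agents regexp
if (in_array($remote_host, $badrobots)
    || in_array($remote_addr, $badrobots)
    || ($user_agent && preg_match($badagentsre, $user_agent))) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
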
--
Reini Urban
http://xarch.tu-graz.ac.at/home/rurban/