From: Reini U. <ru...@x-...> - 2002-10-21 12:19:06
Johnny L. Wales schrieb:
> I was looking around the sourceforge page and noticed that there's an open
> task to write a robots.txt file which will prevent a few pages from being
> indexed.
>
> Maybe instead, we should include tags like this on pages we don't want
> indexed:
>   <META NAME="ROBOTS" CONTENT="NOINDEX">
>
> And, if you want the robot to stop following links on this page, you add
> this to it:
>   <META NAME="ROBOTS" CONTENT="NOFOLLOW">
>
> That should get everything you need to do done, right?

We already use the robots meta tag. The problem is that some robots ignore
these tags, and robots.txt as well, so the only solution is to block them
outright. Ward's wiki uses a timeout; my first patch was based on
$REMOTE_HOST and $HTTP_USER_AGENT. I had this:

  $badrobots = array ('gw01.webtop.com',
                      '202.102.65.191',
                      '202.111.8.102',
                      // '202.39.29.102',  HTTrack 2.0x
                      // '212.182.4.121',  HTTrack 2.0x
                      '61.132.57.226',
                      'lgdx06atm.lg.ehu.es', // reported falsely as Mozilla
                      );
  $badagentsre = '/(WebZIP)|(Teleport Pro)|(Googlebot)|(DigExt)|(FAST-WebCrawler)|(Wget)|(Mercator-1.2)|(HTTrack)|(Openfind)/';
  // good robots: FAST-WebCrawler, TridentSpider3

This should be an optional configuration item.

--
Reini Urban    http://xarch.tu-graz.ac.at/home/rurban/
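
For reference, a minimal sketch of how such a blocklist could be enforced at
the top of the request, before any page rendering. The variable names follow
the config above; the fallback to $REMOTE_ADDR, the 403 response, and the
exact hook point are assumptions for illustration, not the actual patch:

  // Deny blacklisted hosts/IPs and user agents before serving anything.
  // Relies on register_globals providing $REMOTE_HOST, $REMOTE_ADDR and
  // $HTTP_USER_AGENT, as was usual in PHP installations of this era.
  // $REMOTE_HOST is only set when the web server does reverse DNS lookups,
  // hence the fallback to the raw IP address.
  $host  = !empty($REMOTE_HOST) ? $REMOTE_HOST : $REMOTE_ADDR;
  $agent = $HTTP_USER_AGENT;
  if (in_array($host, $badrobots)
      || in_array($REMOTE_ADDR, $badrobots)
      || preg_match($badagentsre, $agent)) {
      header("HTTP/1.0 403 Forbidden");
      print "Robots are not allowed to spider this wiki.\n";
      exit;
  }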