From: Keith R. <ke...@ka...> - 2009-09-06 15:42:27
On Sun, 6 Sep 2009, ed...@ed... wrote:

> To: php...@li...
> From: ed...@ed...
> Subject: Re: [PHPEclipse-devel] Trac issues once again
>
> On Sun, 06 Sep 2009 07:32:23 +0100, Lester Caine <le...@ls...> wrote:
>> ed...@ed... wrote:
>>> The msnbot has decided to index phpeclipse.com website. This is killing
>>> the server and causing trac to consume all memory in the box. I have
>>> given the box 3Gig of RAM, it worked for a while but now is being a
>>> problem again. I have a few thoughts on solutions for this. (i did
>>> enable a memory check we created some time ago to check the system
>>> memory and if it fell below a level to kill all apache processes).
>>>
>>> 1. Put another server up and load balance the two. I would move trac
>>> from using SQLite to PGSql so i could get both servers to read the
>>> database, however this may be possible using sqlite and an nfs or iscsi
>>> share. Input welcomed.
>>>
>>> 2. Move Trac website back to SF. They support trac now. Don't know how
>>> much work this would be, and don't know if this is the correct solution.
>>> We may just be moving the problem around.
>>>
>>> 3. Move away from trac to some other tool that knows how to manage it's
>>> memory. I have no idea on this one.
>>>
>>> 4. Just hope our little script works and leave it as is.
>>>
>>> That's all i have. Just let me know what you think would be a good
>>> solution for this problem. Also the httpd processes consume huge amounts
>>> of CPU. I could give the box another CPU. I don't know how much that
>>> would help. It would help, just how much of a help would it be? Input is
>>> wanted.
>>
>> Or just disable msnbot via robots.txt?
>>
>> Half the trouble seems to be that these search engines trawl EVERYTHING,
>> so every historic link with the result that every single update has a
>> search link created.
>> At one point I had 60 search engines all trawling my server, once one of
>> the big boys lists you, all of the others join the band wagon :( So
>> although it would possibly be nice that a search on one of them produces
>> a result - is the cost of supporting them in computer power and
>> bandwidth worth it?
>>
>> My local machines simply has a deny all robots.txt, while the ones at
>> 1&1 have just a few select ones allowed nowadays - and nofollow helps on
>> and historic stuff
>> http://blog.searchenginewatch.com/050118-204728
>
> I have added a robots.txt file. It has the following content
> User-agent: *
> Disallow: /browser
> Disallow: /log
> Disallow: /changeset
> Disallow: /report
> Disallow: /newticket
> Disallow: /search
>
> I also had to add the folloring to the conf file for phpeclipse domain
> <Location "/robots.txt">
> SetHandler None
> </Location>
>
> This is after the mod_python stuff.
>
> I checked this internal before i allowed the firewall to
> pass traffic back to the site, and it worked.
>
> However i see the bot getting the robots.txt file
> 65.55.209.70 - - [06/Sep/2009:09:36:52 -0500] "GET /robots.txt HTTP/1.0" 200 126 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
> Now i won't see that host come around again, but i get the others
> 65.55.106.162 - - [06/Sep/2009:09:51:47 -0500] "GET /browser/trunk/net.sourceforge.phpeclipse/src/net/sourceforge/phpeclipse/phpeditor/php/HTMLCompletionProcessor.java?rev=117 HTTP/1.0" 200 37604 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
> 65.55.106.159 - - [06/Sep/2009:09:51:48 -0500] "GET /browser/trunk/net.sourceforge.phpeclipse.phpmanual.htmlparser/build.properties?rev=1582 HTTP/1.0" 200 13347 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
> 65.55.106.182 - - [06/Sep/2009:09:51:49 -0500] "GET /browser/trunk/net.sourceforge.phpeclipse.debug.core?rev=1441 HTTP/1.0" 200 21769 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
>
> i guess it will take some time for them to get the robots.txt file. Until
> then phpeclipse.com will be slow or down completely.
>
> However if i have configured the robots.txt or apache setup wrong let me
> know.

You mentioned blocking msnbot at the firewall, which is a good idea.
There is a way to limit requests with iptables that should work. It
limits new connections to a given port to a set rate over time. Note
that the plain limit match used here counts connections from all
addresses together; a per-address limit needs the hashlimit or recent
match instead.

This is a general rule I wrote to stop my site from being DoSed:

    #------------------------------------------------------#
    # chain to limit the average number of new Apache connections
    # to 5 per second, i.e. an average of 50 new connections
    # every 10 seconds.
    # the limit kicks in once the limit-burst of 10 is used up!
    #------------------------------------------------------#
    iptables -N throttle_port_80
    iptables -I throttle_port_80 -i eth0 -p tcp --syn --dport 80 \
        -m limit --limit 5/second --limit-burst 10 -j ACCEPT
    # The chain only takes effect once something jumps to it, and the
    # excess has to be dropped; adjust the placement to your own ruleset.
    iptables -A throttle_port_80 -j DROP
    iptables -I INPUT -i eth0 -p tcp --syn --dport 80 -j throttle_port_80

See http://www.debian-administration.org/articles/187 for an intro to
limiting connections with iptables. For more info you can also search
Google for: iptables limit

I guess that restricting msnbot with a limit-based firewall rule would
be your first line of defence. This should take the load off Apache
while still letting normal users in.

Blocking msnbot with Apache rules puts the load back onto Apache, since
it still has to accept and answer every request, so that is probably
not the best choice. However, if you want to stop Apache from serving
certain parts of your site to msnbot as well, use a directory container
like:

    <Directory /full/path/to/phpeclipse-site/directory/to/block>
        Options None
        Order allow,deny
        Allow from all
        Deny from MSNBOT
    </Directory>

Where MSNBOT is the IP address(es) used by msnbot. With
"Order allow,deny" the Deny lines are evaluated last, so they override
"Allow from all"; with "Order deny,allow" the Allow would win and
msnbot would still get in. Please see
http://httpd.apache.org/docs/2.0/mod/mod_access.html#deny
for full details of how to use the Deny directive.

HTH

Keith Roberts

-----------------------------------------------------------------
Websites:
http://www.php-debuggers.net
http://www.karsites.net
http://www.raised-from-the-dead.org.uk

All email addresses are challenge-response protected with
TMDA [http://tmda.net]
-----------------------------------------------------------------
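
Addendum on the firewall rule: the limit match in the throttle chain applies one shared rate to all clients. To throttle each source address separately, one option is the hashlimit match. The following is only a minimal sketch under stated assumptions: the chain name, the hashlimit table name, the eth0 interface and the 5/second and 10 figures are illustrative, and where the jump belongs in INPUT depends on the existing ruleset. By default it is a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: per-source-IP throttling of new port-80 connections.
# IPT defaults to a dry run that prints the commands instead of running
# them; set IPT=iptables (as root) to apply them for real.
IPT="${IPT:-echo iptables}"
CHAIN="throttle_http_per_ip"    # hypothetical chain name

per_ip_throttle() {
    # $IPT is left unquoted on purpose so "echo iptables" splits in two.
    # Create the chain.
    $IPT -N "$CHAIN"
    # Accept up to 5 new connections per second from each source
    # address, with an initial burst of 10 (assumed values).
    $IPT -A "$CHAIN" -m hashlimit --hashlimit-name http_per_ip \
        --hashlimit-mode srcip --hashlimit-upto 5/second \
        --hashlimit-burst 10 -j ACCEPT
    # Anything over the per-address limit is dropped.
    $IPT -A "$CHAIN" -j DROP
    # Last step: route new port-80 connections on eth0 through the chain.
    $IPT -I INPUT -i eth0 -p tcp --syn --dport 80 -j "$CHAIN"
}

per_ip_throttle
```

Running the script unchanged shows the four commands so they can be reviewed first; running it as root with IPT=iptables in the environment applies them. Because the limit is tracked per address, a crawler hitting the site from one IP runs into its own limit without using up the allowance of other visitors.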