From: Lowell H. <lha...@vi...> - 2001-07-05 18:50:08
I got a complaint today from an admin of cnn.com because my crawlers were smacking the site at 1200k/sec for 5 minutes. This kinda brings up the problem of randomizing again, unfortunately. During a 1-hour period it's almost guaranteed that cnn.com, news.com, encyclopedia.com, and encyklopedia.pl (or however it's spelled) will each get hit with a few thousand crawls in a row. On sites where a CGI actually pulls the URL out of a database, each request returning a >20k file, that beats the site up pretty good.

While limiting my bandwidth and number of crawlers would help a little bit, what happens when each of a hundred clients gets assigned a list of cnn.com URLs and each hits 2 a second? That's ~200 hits/second to the site, minimum.

I've also had a couple of complaints from people because the user agent is Wget and the crawler doesn't follow robots.txt... and one guy who was going to report the crawl to GIAC and CERT, thinking it was a new worm/virus. I run 5 crawlers on 4 different lines where I own the IPs, so the complaints all come to me... but for other people on their DSL or what have you, the complaint goes to their ISP, and many ISPs will cut you off if any complaints of abuse are filed.

Lowell
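
A minimal sketch of the kind of host-aware scheduling being described here, assuming a simple Python crawler loop: interleave the URL batch across hosts so no single site gets thousands of hits in a row, and wait out a per-host delay before each fetch. The function names and the 10-second figure are illustrative, not taken from the actual crawler.

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    MIN_HOST_DELAY = 10.0   # seconds between hits to the same host (illustrative value)

    def interleave_by_host(urls):
        """Round-robin across hosts so no single site sees a long run of requests."""
        buckets = defaultdict(deque)
        for url in urls:
            buckets[urlparse(url).netloc].append(url)
        ordered = []
        while buckets:
            for host in list(buckets):
                ordered.append(buckets[host].popleft())
                if not buckets[host]:
                    del buckets[host]
        return ordered

    last_hit = {}   # host -> time of the most recent request to it

    def polite_fetch(url, fetch):
        """Sleep out the remaining per-host delay, then call fetch(url)."""
        host = urlparse(url).netloc
        wait = MIN_HOST_DELAY - (time.time() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()
        return fetch(url)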
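
On the robots.txt and user-agent side, a sketch along these lines (using Python's standard urllib.robotparser; the agent string and the cache are made up for illustration) would at least keep the crawl identifiable and out of disallowed areas:

    from urllib import robotparser, request
    from urllib.parse import urlparse

    USER_AGENT = "ExampleCrawler/0.1 (+http://example.org/crawler-info)"   # hypothetical agent string

    _robots = {}   # host -> RobotFileParser, or None if robots.txt was unreachable

    def allowed(url):
        """True only if the host's robots.txt permits this URL for our agent."""
        host = urlparse(url).netloc
        if host not in _robots:
            rp = robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            try:
                rp.read()
            except OSError:
                rp = None   # couldn't fetch robots.txt; err on the side of not crawling
            _robots[host] = rp
        rp = _robots[host]
        return rp is not None and rp.can_fetch(USER_AGENT, url)

    def fetch(url):
        """Fetch with an identifying User-Agent instead of the default/wget one."""
        req = request.Request(url, headers={"User-Agent": USER_AGENT})
        return request.urlopen(req).read()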