From: Jeff S. <jsq...@ls...> - 2001-07-16 22:37:55
On Mon, 16 Jul 2001, Lowell Hamilton wrote:

> Another solution, but a difficult one, would be to reorganize the
> tables on the master, separating out the hostname and path, and set up
> the scheduler to limit the number of URLs from each hostname that can
> be scheduled in a certain period.  That would eliminate the problems,
> and also allow better results to be returned in the future (i.e., you
> could generate reports like # of URLs for a domain, total hostnames,
> etc.).  There are probably better ways to do it too ... as soon as
> someone gets a full database dump from google we'll know how <smirk>

Hear, hear.

More importantly, though, it probably needs to limit the number of URLs
on a given web server crawled by each grub client in a specific time
period.  So perhaps each grubber can crawl max(5% of all known URLs on
that site, 100 URLs) from a given web server in a 24 hour period.  I
made up these specific numbers, but you get the idea -- use some kind of
maximum metric that each grub client will crawl in a given period of
time.

This allows an entire web site to be crawled in that period of time --
so you can still get fairly accurate, up-to-date stats -- but each URL
on the site will only be crawled *once* (max) per time period, and by
potentially many different crawlers, so that no one grub client gets
identified as a DoS agent.

That, paired with a minimum-time-before-recrawling metric (say, each URL
doesn't need to be re-crawled for at least 2 days, or perhaps something
more intelligent, such as URLs that don't change for a while getting
progressively longer periods between re-crawls, etc.), would go a long
way toward ensuring that people aren't penalized for running the grub
client by receiving cease-and-desist notices.

{+} Jeff Squyres
{+} sq...@cs...
{+} Perpetual Obsessive Notre Dame Student Craving Utter Madness
{+} "I came to ND for 4 years and ended up staying for a decade"
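
[Editor's note: a minimal Python sketch of the scheduling policy proposed
above.  This is not actual grub code; the names (per_client_quota,
HostBudget, next_recrawl_interval) are hypothetical, and the constants are
just the made-up numbers from the mail (5% / 100 URLs / 24 hours / 2 days).]

import time

# Made-up knobs from the mail above -- not real grub configuration.
QUOTA_FRACTION = 0.05         # each client may get at most 5% of a site's known URLs...
QUOTA_FLOOR = 100             # ...or 100 URLs, whichever is larger
QUOTA_WINDOW = 24 * 3600      # per 24-hour window
MIN_RECRAWL = 2 * 24 * 3600   # no URL re-crawled for at least 2 days
MAX_RECRAWL = 30 * 24 * 3600  # arbitrary cap on the progressive backoff

def per_client_quota(known_urls_on_site: int) -> int:
    """Max URLs one grub client may be handed for one site per window."""
    return max(int(QUOTA_FRACTION * known_urls_on_site), QUOTA_FLOOR)

class HostBudget:
    """Tracks one client's crawl count against one host in the current window."""
    def __init__(self):
        self.window_start = time.time()
        self.crawled = 0

    def allow(self, known_urls_on_site: int) -> bool:
        """True if the scheduler may hand this client another URL for the host."""
        now = time.time()
        if now - self.window_start >= QUOTA_WINDOW:
            self.window_start, self.crawled = now, 0   # start a new 24h window
        if self.crawled >= per_client_quota(known_urls_on_site):
            return False                               # quota exhausted this window
        self.crawled += 1
        return True

def next_recrawl_interval(previous_interval: float, changed: bool) -> float:
    """Progressive backoff: URLs that haven't changed wait longer each time;
    URLs that did change drop back to the 2-day minimum."""
    if changed:
        return MIN_RECRAWL
    return min(previous_interval * 2, MAX_RECRAWL)

The idea would be that the master checks something like HostBudget.allow()
before handing a client another URL on a given host, and uses
next_recrawl_interval() to decide when a URL goes back into the queue at all.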