From: Jeff S. <jsq...@ls...> - 2001-07-16 22:37:55
On Mon, 16 Jul 2001, Lowell Hamilton wrote:
> Another solution, but a more difficult one, would be to reorganize
> the tables on the master, separating out the hostname and path, and
> set up the scheduler to limit the number of URLs per hostname that
> can be scheduled in a certain period. That would eliminate the
> problems, and also allow better results to be returned in the future
> (i.e. you could generate reports like # of URLs per domain, total
> hostnames, etc.). There are probably better ways to do it too ... as
> soon as someone gets a full database dump from Google we'll know how
> <smirk>
Hear, hear.
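To make Lowell's suggestion concrete, here's a rough sketch of the
hostname/path split and the kind of per-domain report it enables (the
module use and names here are just illustrative, not the actual grub
schema):

    # Rough sketch: split each stored URL into (hostname, path) so the
    # scheduler can group and limit work per hostname.
    from urllib.parse import urlsplit
    from collections import defaultdict

    def split_url(url):
        parts = urlsplit(url)
        return parts.hostname or "", parts.path or "/"

    def per_hostname_counts(urls):
        # Enables reports like "# of URLs per domain" or "total hostnames".
        counts = defaultdict(int)
        for url in urls:
            host, _path = split_url(url)
            counts[host] += 1
        return counts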
More importantly, though, it probably needs to limit the number of URLs
on a given web server that each grub client crawls in a specific time
period. So perhaps each grubber can crawl max(5% of all known URLs on
that site, 100 URLs) from a given web server in a 24-hour period. I made
up these specific numbers, but you get the idea -- use some kind of
maximum metric for how much each grub client will crawl in a given
period of time.
This allows an entire web site to be crawled in that period of time --
so you can still get fairly accurate, up-to-date stats -- but each URL
on the site will be crawled only *once* (max) per time period, and the
load is spread across potentially many different crawlers, so no single
grub client gets identified as a DoS agent.
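A minimal sketch of that cap, just to make the idea concrete (the 5%
and 100 figures are the made-up numbers above, and the function names
are hypothetical):

    # Hypothetical per-client, per-host quota for one 24-hour window:
    # each grub client may crawl at most max(5% of the known URLs on
    # the site, 100) URLs from a given web server in that window.
    def client_quota(known_urls_on_site, pct=0.05, floor=100):
        return max(int(known_urls_on_site * pct), floor)

    def may_schedule(urls_given_to_client_today, known_urls_on_site):
        return urls_given_to_client_today < client_quota(known_urls_on_site)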
That, paired with a minimum-time-before-recrawling metric (say, each URL
doesn't need to be re-crawled for at least 2 days, or perhaps something
more intelligent, such as URLs that don't change for a while getting
progressively longer periods between re-crawls, etc.), would go a long
way toward ensuring that people aren't penalized for running the grub
client by getting cease-and-desist notices.
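Something along these lines, for instance (the 2-day floor is the number
above; the growth factor and ceiling are placeholders):

    # Hypothetical recrawl back-off: a URL starts at the 2-day minimum
    # interval; each time it comes back unchanged, the interval doubles,
    # up to a ceiling.  A change resets it to the floor.
    MIN_INTERVAL_DAYS = 2
    MAX_INTERVAL_DAYS = 32

    def next_interval(current_interval_days, content_changed):
        if content_changed:
            return MIN_INTERVAL_DAYS
        return min(current_interval_days * 2, MAX_INTERVAL_DAYS)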
{+} Jeff Squyres
{+} sq...@cs...
{+} Perpetual Obsessive Notre Dame Student Craving Utter Madness
{+} "I came to ND for 4 years and ended up staying for a decade"