From: Lowell H. <lha...@vi...> - 2001-07-16 23:14:30
> More importantly, though, it probably needs to limit the number of URLs on
> a given web server crawled by each grub client in a specific time period.

Knowing what "server" a url is on would be difficult, because the grub master would then have to keep track of the ip resolution for each url, which changes often and can consume a lot of resources. Just tracking the hostname should be enough for this application. Any server hosting hundreds of domains should be beefy enough to handle a few crawls at a time, one to each domain, so that shouldn't be a problem (as an isp admin, I already hit >1000 domains once every 30 seconds just for monitoring).

> So perhaps each grubber can crawl (max(5% of all know URLs on that site,
> 100 URLs)) from a given web server in a 24 hour period. I made up these
> specific numbers, but you get the idea -- use some kind of maximum metric
> that each grub client will crawl in a given period of time.

That would be a good idea (imho at least) .. the number would have to be much higher though .. a site with 10k urls in the database would take too long to complete to be useful.

> This allows an entire web site to be crawled in that period of time -- so
> you can still get fairly accurate, up-to-date stats -- but each URL on the
> site will only be crawled *once* (max) per time period, and by potentially
> many different crawlers so that no one grub client is identified as a DoS
> agent.

One thing that would be ideal is if the server/scheduler handed urls to clients in sets that were specifically generated, instead of just spewing out the next 500 in the table (or if the table were generated with the client sets in mind). That would allow some nifty checks to be added for crawl limiting. For example, each packet sent to a client could carry at most 20 urls for a given hostname, spaced no closer than one every 15 urls (a rough sketch of what I mean is further down).

> That, paired with a minimum-time-before-recrawling metric (say, each URL
> doesn't need to be re-crawled for at least 2 days, or perhaps something
> more intelligent, such as URLs that don't change for a while get
> progressively longer periods between re-crawling, etc.), would go a long
> way to ensuring not to penalize people for running the grub client by
> getting cease and desist notices.

The idea of a minimum time between recrawls is another great one ... right now I'm seeing the url list cycle about once a day. If the crawlers were busy finding new urls instead of recrawling unchanged urls for the 2nd time that day, it would be a lot better.

Backing off crawls because a url hasn't changed could be bad, though ... a url that stays static for 3 weeks and gets backed off could take a few extra days to pick up an update, and there are a lot of sites like that on the net. Unless the threshold was a couple of months or something, it wouldn't be very useful.

One thing that would be useful is discovering dynamically generated urls and backing them off. More and more sites, especially geocities/yahoo hosted pages and other dynamic-banner-and-fluff sites, are going to change every time you hit them because of the advertising rotated on each hit. Backing off some of these sites, or flagging them somehow so they're only monitored, or something like that, would free up the crawlers a bit. Since the goal for these pages is not to index them but only to determine if the site has been updated, grub could just return those urls to the outgoing feed every x hours and keep the crawlers busy doing something else.
Detecting a dynamically generated site like this could be as simple as scheduling a url into 10 different packets. If they all come back with a different CRC, you've got one. Grubdex it, flag it for crawling only once a week (to find new urls), and there ya go. That would eliminate a large percentage of the urls being crawled every day. There is a chance a new, unseen link gets posted on the page between those weekly crawls, but until there is a much bigger crawler base there just won't be time to crawl all of those anyway.

Lowell
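P.S. Here's roughly the kind of check I have in mind for the CRC comparison -- again, purely an illustration, the names and numbers are made up and not anything from the real scheduler:

import zlib

PROBE_COUNT = 10                 # how many separate packets the url gets scheduled into
WEEKLY = 7 * 24 * 3600           # recrawl interval for pages flagged as dynamic
DAILY = 24 * 3600                # the normal cycle I'm seeing right now

def looks_dynamic(bodies):
    """True if every one of the probes came back with a different CRC."""
    crcs = [zlib.crc32(body) for body in bodies]
    return len(bodies) == PROBE_COUNT and len(set(crcs)) == PROBE_COUNT

def classify(url, bodies, schedule):
    if looks_dynamic(bodies):
        # content changes on every hit (rotating banners etc.), so only
        # recrawl occasionally to pick up new links instead of re-checking daily
        schedule[url] = WEEKLY
    else:
        schedule[url] = DAILY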