From: Lowell H. <lha...@vi...> - 2001-07-16 23:14:30
> More importantly, though, it probably needs to limit the number of URLs on
> a given web server crawled by each grub client in a specific time period.

Knowing what "server" a url is on would be difficult, because the grub master would then have to keep track of the ip resolution for each url, which changes often and can consume a lot of resources. Just tracking the hostname should be enough for this application. Any server hosting hundreds of domains should be beefy enough to handle a few crawls at a time, one to each domain, so that shouldn't be a problem (as an isp admin, I already hit >1000 domains once every 30 seconds just for monitoring).

> So perhaps each grubber can crawl (max(5% of all know URLs on that site,
> 100 URLs)) from a given web server in a 24 hour period. I made up these
> specific numbers, but you get the idea -- use some kind of maximum metric
> that each grub client will crawl in a given period of time.

That would be a good idea (imho at least) .. the number would have to be much higher though .. a site with 10k urls in the database would take too long to complete to be useful.

> This allows an entire web site to be crawled in that period of time -- so
> you can still get fairly accurate, up-to-date stats -- but each URL on the
> site will only be crawled *once* (max) per time period, and by potentially
> many different crawlers so that no one grub client is identified as a DoS
> agent.

One thing that would be ideal is if the server/scheduler handed urls to clients in sets that were specifically generated, instead of just spewing out the next 500 in the table (or if the table were generated with the client sets in mind). That would allow some nifty checks to be added for crawl limiting. For example, each packet sent to a client could carry at most 20 urls for a given hostname, spaced no closer than one every 15 urls (a rough sketch of what I mean is further down).

> That, paired with a minimum-time-before-recrawling metric (say, each URL
> doesn't need to be re-crawled for at least 2 days, or perhaps something
> more intelligent, such as URLs that don't change for a while get
> progressively longer periods between re-crawling, etc.), would go a long
> way to ensuring not to penalize people for running the grub client by
> getting cease and desist notices.

The idea of a minimum time between recrawls is another great one ... right now I'm seeing the url list cycle about once a day. If the crawlers were busy finding new urls instead of recrawling unchanged urls for the 2nd time that day, it would be a lot better.

Backing off crawls because a url hasn't changed could be bad, though ... a url that stays static for 3 weeks and gets backed off could take a few extra days to pick up an update, and there are a lot of sites like that on the net. Unless the threshold was a couple of months or something, it wouldn't be very useful.

One thing that would be useful is discovering dynamically generated urls and backing them off. More and more sites, especially geocities/yahoo hosted pages and other dynamic-banner-and-fluff sites, are going to change every time you hit them because of the advertising rotated on each hit. Backing off some of these sites, or flagging them somehow so they're only monitored, or something like that, would free up the crawlers a bit. Since the goal for these pages is not to index them but only to determine if the site has been updated, grub could just return those urls to the outgoing feed every x hours and keep the crawlers busy doing something else.
Detecting a dynamically generated site like this could be as simple as scheduling a url into 10 different packets. If they all come back with a different CRC, you've got one. Grubdex it, flag it for crawling only once a week (to find new urls), and there ya go. That would eliminate a large percentage of the urls being crawled every day. There is a chance a new, unseen link gets posted on the page between those weekly crawls, but until there is a much bigger crawler base there just won't be time to crawl all of those anyway.

Lowell
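P.S. Here's roughly the kind of check I have in mind for the CRC comparison -- again, purely an illustration, the names and numbers are made up and not anything from the real scheduler:

import zlib

PROBE_COUNT = 10                 # how many separate packets the url gets scheduled into
WEEKLY = 7 * 24 * 3600           # recrawl interval for pages flagged as dynamic
DAILY = 24 * 3600                # the normal cycle I'm seeing right now

def looks_dynamic(bodies):
    """True if every one of the probes came back with a different CRC."""
    crcs = [zlib.crc32(body) for body in bodies]
    return len(bodies) == PROBE_COUNT and len(set(crcs)) == PROBE_COUNT

def classify(url, bodies, schedule):
    if looks_dynamic(bodies):
        # content changes on every hit (rotating banners etc.), so only
        # recrawl occasionally to pick up new links instead of re-checking daily
        schedule[url] = WEEKLY
    else:
        schedule[url] = DAILY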