From: Vaclav B. <vb...@co...> - 2001-07-17 19:17:51
Lowell Hamilton wrote:
> Jeff Squyres wrote:
> > That, paired with a minimum-time-before-recrawling metric (say, each
> > URL doesn't need to be re-crawled for at least 2 days, or perhaps
> > something more intelligent, such as URLs that don't change for a
> > while getting progressively longer periods between re-crawls, etc.),
> > would go a long way toward ensuring that people aren't penalized for
> > running the grub client by getting cease-and-desist notices.
>
> The idea of a minimum time for recrawl is another great idea ... right
> now I'm seeing the url list cycle about once a day ... if the crawlers
> were busy finding new urls instead of recrawling unchanged urls for
> the 2nd time that day, it would be a lot better. Backing off the
> crawls based on the url being unchanged could be bad, though ... a url
> that stays static for 3 weeks and is backed off could take a few days
> extra to update .. and that is an example of many sites on the

IMHO, first of all, grub should respect the HTTP header (forgot which
one, sorry) saying how long the URI should be cached. Then we can see
whether it helps - although I'm sceptical about the majority of
webmasters specifically setting caching to limit the load on their
servers, perhaps at least those big sites take care... Even if it
doesn't help in practice, at least we can respond to the cease & desist
with "since your page says it's fresh every minute, we want to see it
every minute" (don't try this at home :-) ).

> One thing that would be useful is discovery of dynamically generated
> urls and backing them off. More and more sites, especially
> geocities/yahoo hosted and other dynamic banner and

One alternative would be to never index uncacheable content - but I
would certainly want to see what percentage of web content is
uncacheable before proposing to skip it all...

> advertising made on each hit. Backing off some of these sites, or
> flagging them somehow to be only monitored, or something, would free
> crawlers up a bit. Since the goal of grub is not to index these
> pages, but only to determine if a site has been updated, grub could
> just return these urls to the outgoing feed every x hours and keep the
> crawlers busy doing something else. Determining a site like this
> could be just having a url scheduled in 10 different packets. If they
> all come back with a

Perhaps, but second-guessing stupid or antagonistic webmasters should
IMHO come *after* cooperation with those who are willing and able to
cooperate...

Bye
Vasek
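
The progressive back-off that Jeff and Lowell discuss, together with
Lowell's worry about backed-off static sites updating late, fits in a
few lines. This is only a sketch in Python, not grub code; the interval
constants and the function name are invented for illustration:

# Double the recrawl interval each time a page comes back unchanged, but
# cap it so a site that has been static for weeks is still rechecked
# within a bounded window (Lowell's concern about slow updates).
MIN_INTERVAL = 2 * 24 * 3600   # the 2-day floor suggested in the thread
MAX_INTERVAL = 7 * 24 * 3600   # invented cap; tune to taste

def next_recrawl_interval(current_interval, content_changed):
    """Seconds to wait before recrawling this URL again."""
    if content_changed:
        return MIN_INTERVAL                        # page is active again: reset
    return min(current_interval * 2, MAX_INTERVAL)

With a one-week cap, a page that sat still for three weeks is at most a
week late when it finally changes, instead of the extra delay growing
without bound.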
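The HTTP headers Vasek is reaching for are presumably Expires and
Cache-Control: max-age. A rough Python sketch, with invented clamping
defaults, of turning those freshness hints into a "do not recrawl
before" timestamp:

import email.utils

def earliest_recrawl(headers, fetched_at, floor=2*24*3600, ceiling=30*24*3600):
    """Unix timestamp before which the URL should not be recrawled."""
    lifetime = None
    # Prefer Cache-Control: max-age=N if present.
    for part in headers.get("Cache-Control", "").split(","):
        part = part.strip()
        if part.startswith("max-age="):
            try:
                lifetime = int(part[len("max-age="):])
            except ValueError:
                pass
    # Otherwise fall back to the absolute Expires date.
    if lifetime is None and "Expires" in headers:
        try:
            expires = email.utils.parsedate_to_datetime(headers["Expires"])
            lifetime = expires.timestamp() - fetched_at
        except (TypeError, ValueError):
            pass
    if lifetime is None:
        lifetime = floor               # no hint at all: use the 2-day minimum
    return fetched_at + max(floor, min(lifetime, ceiling))

The floor also defuses the "fresh every minute" joke: even a page
claiming max-age=60 never gets polled more often than the crawl-wide
minimum.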
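The truncated quote about scheduling a url in "10 different packets"
presumably ends with all the fetches coming back different. One way to
express that test, with an invented helper name and SHA-1 as the
comparison, purely as an illustration:

import hashlib

def looks_dynamic(bodies, min_samples=10):
    """bodies: raw page bytes fetched for the same URL at different times."""
    if len(bodies) < min_samples:
        return False                        # not enough evidence yet
    digests = {hashlib.sha1(b).hexdigest() for b in bodies}
    return len(digests) == len(bodies)      # every fetch differed -> flag as dynamic

URLs flagged this way could be handed back to the outgoing feed "every
x hours", as Lowell suggests, rather than occupying a crawler on every
cycle - though, as Vasek says, only after the cooperative cache-header
path above has been tried.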