Re: [Archive-access-discuss] snoozing issue with heritrix


Best to ask Heritrix-specific questions on its project list:

http://tech.groups.yahoo.com/group/archive-crawler/

That said, a focused crawl typically collects URIs rapidly at first,
while there are many different sites to contact. Later, once all URIs
from the smaller and fast/responsive sites have been collected, only
the sites that are large, slow, and/or unresponsive are left. The rate
of URI collection then drops to what can be requested politely from
those remaining sites.
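
To make the slowdown concrete, here is a rough back-of-the-envelope
model in Python. It is only a sketch: it assumes each host is fetched
one URI at a time with a fixed politeness pause after every fetch, and
the host counts, fetch times, and delays are made-up illustrative
numbers, not Heritrix defaults or measurements.

```python
def polite_rate(hosts):
    """Approximate upper bound on crawl throughput in URIs/second.

    `hosts` is a list of (fetch_seconds, delay_seconds) pairs, one per
    host that still has URIs queued. Assumes one in-flight request per
    host, so each host contributes at most 1 / (fetch + delay) URIs/s.
    """
    return sum(1.0 / (fetch_s + delay_s) for fetch_s, delay_s in hosts)


# Hypothetical early phase: 50 small, responsive hosts still have work.
early = [(0.2, 1.0)] * 50

# Hypothetical late phase: only 2 large, slow hosts are left.
late = [(2.0, 10.0)] * 2

print(f"early: {polite_rate(early):.1f} URIs/s")  # early: 41.7 URIs/s
print(f"late:  {polite_rate(late):.2f} URIs/s")   # late:  0.17 URIs/s
```

Under these assumed numbers the crawl is still healthy in the late
phase; it just has very few hosts it is allowed to talk to at any
given moment.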

Also, when a site doesn't respond at all, a 'long retry snooze' (default
15 minutes) occurs before trying that host again, so that all the
configured retries (default 30) aren't used up rapidly due to a
transient server/network problem.
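
As a rough illustration of what those defaults imply, the Python
sketch below computes how long a single URI on a dead host could stay
snoozed before the crawler gives up on it. It assumes every retry is
preceded by one full snooze and ignores the time spent waiting for each
attempt to time out, so treat the result as a ballpark figure. The
constant and function names are hypothetical, chosen for this example.

```python
from datetime import timedelta

# Defaults as described above; the names are illustrative only.
RETRY_SNOOZE = timedelta(minutes=15)
MAX_RETRIES = 30


def time_until_give_up(snooze=RETRY_SNOOZE, retries=MAX_RETRIES):
    """Total time spent snoozing if every retry against a host fails."""
    return snooze * retries


print(time_until_give_up())  # 7:30:00
```

So a crawl whose only remaining URIs belong to unresponsive hosts can
sit apparently idle for hours while it is in fact waiting out retry
snoozes.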

More info is available at the FAQ:

https://webarchive.jira.com/wiki/display/Heritrix/unexpectedly+slow+crawling+on+idle+crawler

- Gordon @ IA

On 7/14/11 6:56 AM, Thakur, Pramila wrote:
> Hi Everyone,
> 
> Most of the time when I crawl a site, after some time it snoozes a few
> URLs and the crawl appears to hang: it is not terminated, but keeps
> running without any activity.
> 
> Has anyone else faced this situation? Is there a workaround that
> can solve this issue?
> 
> Thanks,
> 
> --Pramila Thakur
> 