From: Gordon M. <go...@ar...> - 2011-07-14 18:39:47
Best to ask Heritrix-specific questions on its project list:

http://tech.groups.yahoo.com/group/archive-crawler/

But also, a typical pattern for a focused crawl is for it to collect URIs rapidly when there are many different sites to contact. But later, once all URIs from smaller and fast/responsive sites have been collected, only those sites that are large, slow, and/or unresponsive are left. The rate of URI collection thus drops to what can be requested politely from those sites.

Also, when a site doesn't respond at all, a 'long retry snooze' (default 15 minutes) occurs before trying that host again, so that all the configured retries (default 30) aren't used up rapidly due to a transient server/network problem.

More info is available at the FAQ:

https://webarchive.jira.com/wiki/display/Heritrix/unexpectedly+slow+crawling+on+idle+crawler

- Gordon @ IA

On 7/14/11 6:56 AM, Thakur, Pramila wrote:
> Hi Everyone,
>
> Most of the time when I crawl a site, after some time it snoozes few
> urls and the crawling process is like hanging, not terminated just
> going on without any activity.
>
> Has any one of you faced this situation? Is there a work around it
> that can solve this issue?
>
> Thanks,
>
> --Pramila Thakur
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
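
For a rough sense of scale on the retry snooze described in the reply: with a 15-minute snooze and 30 retries, a host that never answers can keep its queue parked for hours before the crawler gives up on it. The sketch below is only back-of-the-envelope arithmetic in Python. It assumes every retry fails and that each failure is followed by exactly one full snooze; the crawler's real scheduling may differ, and the names RETRY_SNOOZE, MAX_RETRIES and worst_case_hold are made up for illustration, not Heritrix settings.

```python
from datetime import timedelta

# Defaults quoted in the reply; these names are illustrative only,
# not actual Heritrix configuration keys.
RETRY_SNOOZE = timedelta(minutes=15)
MAX_RETRIES = 30


def worst_case_hold(retries: int = MAX_RETRIES,
                    snooze: timedelta = RETRY_SNOOZE) -> timedelta:
    """Rough upper bound on how long an unresponsive host stays snoozed.

    Assumes every retry fails and each failure triggers one full snooze.
    """
    return retries * snooze


if __name__ == "__main__":
    print(worst_case_hold())  # 7:30:00
```

Under those assumptions, a crawl whose only remaining URIs belong to dead hosts could look idle for about seven and a half hours while still being alive, which matches the "not terminated, just no activity" symptom in the question.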