
Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 hold paused crawl at 'end', allowing all in-progress ops - ID: 1069331
Last Update: Comment added ( karl-ia )

Even when a crawl 'ends' -- URIs exhausted, limits
exceeded, etc. -- an operator might still want to
perform operations, such as re-adding certain
URIs/seeds, that can only be done to a 'live' crawl.
So, there should be an option to leave a crawl 'open'
even when ostensibly 'finished'.

This could be a setting that is turned on or off; if no
other jobs are pending, the crawl could always be held open.


Submitted: Gordon Mohr ( gojomo ) - 2004-11-19 10:53
Priority: 7
Status: Closed
Resolution: None
Assigned: Karl Thiessen
Category: Usability/UI
Group: 1.6.0
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 01:36
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-858 -- please add further
comments at that location.


Date: 2005-08-11 22:05
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Pause-at-start option implemented. Commit comment:

Implementation of [ 1069331 ] hold paused crawl at 'end',
allowing all in-progress ops (pause-at-start addition)
* Frontier.java
    add 'start()' method as specific place to do crawl-start
    activity
* CrawlController.java
    have crawl start use frontier.start(); remove vestigial
    previous attempt at begin-paused functionality; have
    requestCrawlPause() complete-the-pause if no threads are
    in-progress
* AbstractFrontier.java
    add 'pause-at-start' option (default false); when set,
    start() does a requestCrawlPause() rather than unpause() to
    begin
* AdaptiveRevisitFrontier.java, HostQueuesFrontier.java
    add default start() implementation: just unpause
* CrawlJobHandler.java
    remove redundant explicit setting of job status that
    introduced a bug when CrawlController.requestCrawlPause()
    can complete the pause immediately

Assigning to Karl for verification.
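
For readers following the commit comment above, here is a minimal sketch of the pause-at-start behavior it describes: start() requests a pause instead of unpausing when the option is set, and a pause request completes immediately when no threads are in progress. The class PauseAtStartSketch, its State enum, and the threadStarted()/threadFinished() helpers are hypothetical names for illustration; this is not the actual Heritrix source.

```java
// Illustrative only: all names below are hypothetical stand-ins,
// not Heritrix's actual Frontier/CrawlController code.
class PauseAtStartSketch {
    enum State { PREPARED, RUNNING, PAUSING, PAUSED }

    private final boolean pauseAtStart; // mirrors the 'pause-at-start' option (default false)
    private int activeThreads = 0;      // worker threads currently processing a URI
    private State state = State.PREPARED;

    PauseAtStartSketch(boolean pauseAtStart) {
        this.pauseAtStart = pauseAtStart;
    }

    // Single place for crawl-start activity.
    void start() {
        if (pauseAtStart) {
            requestCrawlPause();
        } else {
            unpause();
        }
    }

    // Ask for a pause; finish it right away if nothing is in progress.
    void requestCrawlPause() {
        state = State.PAUSING;
        if (activeThreads == 0) {
            state = State.PAUSED;
        }
    }

    void unpause() {
        state = State.RUNNING;
    }

    void threadStarted() {
        activeThreads++;
    }

    // The last thread to finish completes a pending pause.
    void threadFinished() {
        activeThreads--;
        if (state == State.PAUSING && activeThreads == 0) {
            state = State.PAUSED;
        }
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        PauseAtStartSketch held = new PauseAtStartSketch(true);
        held.start();
        System.out.println("pause-at-start=true  -> " + held.state());

        PauseAtStartSketch normal = new PauseAtStartSketch(false);
        normal.start();
        System.out.println("pause-at-start=false -> " + normal.state());
    }
}
```

Because the pause can complete inside requestCrawlPause() itself, a caller that sets its own status afterwards could overwrite the final state, which is the kind of bug the CrawlJobHandler.java change appears to address.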


Date: 2004-12-15 18:25
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Fixed for crawl end; commit comment:

Implementation of [ 1069331 ] hold paused crawl at 'end',
allowing all in-progress ops
* CrawlController.java
    separate atFinish() test from checkFinish() operation;
    raise visibility of beginCrawlStop; improve comments
* AbstractFrontier.java
    add pause-at-finish setting; if set, only truly 'end'
    crawl if finish condition is met coming out of a pause.
    Otherwise, treat finish condition as request to pause.

Leaving open at lower priority to add pause-at-crawl-start
capability.
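
The pause-at-finish rule in the commit comment above can be sketched roughly as follows: a finish condition met during normal running is treated as a pause request, and the crawl only truly ends if the condition still holds when coming out of a pause. Everything here (PauseAtFinishSketch, step(), resume(), schedule(), the comingOutOfPause flag) is a hypothetical simplification, and "finish" is reduced to "no URIs pending"; the real checks live in CrawlController and AbstractFrontier and may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: all names below are hypothetical stand-ins,
// not Heritrix's actual AbstractFrontier/CrawlController code.
class PauseAtFinishSketch {
    enum State { RUNNING, PAUSED, FINISHED }

    private final boolean pauseAtFinish; // mirrors the pause-at-finish setting
    private final Deque<String> pending = new ArrayDeque<>();
    private State state = State.RUNNING;
    private boolean comingOutOfPause = false;

    PauseAtFinishSketch(boolean pauseAtFinish) {
        this.pauseAtFinish = pauseAtFinish;
    }

    // Operators may add URIs while the crawl is running or held paused.
    void schedule(String uri) {
        pending.add(uri);
    }

    // Pure test of the finish condition, kept separate from acting on it.
    boolean atFinish() {
        return pending.isEmpty();
    }

    void resume() {
        if (state == State.PAUSED) {
            state = State.RUNNING;
            comingOutOfPause = true;
        }
    }

    // One unit of crawl work: act on the finish condition, or process a URI.
    void step() {
        if (state != State.RUNNING) {
            return;
        }
        if (atFinish()) {
            if (pauseAtFinish && !comingOutOfPause) {
                state = State.PAUSED;   // hold the crawl open for the operator
            } else {
                state = State.FINISHED; // truly end the crawl
            }
            return;
        }
        pending.poll();                 // stand-in for fetching the URI
        comingOutOfPause = false;
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        PauseAtFinishSketch crawl = new PauseAtFinishSketch(true);
        crawl.schedule("http://example.com/");
        crawl.step(); // processes the only URI
        crawl.step(); // finish condition met: pauses instead of ending
        System.out.println("after exhaustion: " + crawl.state());
        crawl.resume();
        crawl.step(); // still nothing to do coming out of the pause
        System.out.println("after resume:     " + crawl.state());
    }
}
```

If the operator schedules more URIs while the crawl is held paused, resuming continues the crawl, and the next time the URIs run out the crawl pauses again rather than ending.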


Date: 2004-11-19 18:42
Sender: nobody

Logged In: NO

I've been thinking we need something like this at crawl
start too: you'd set up an order and start it, and
though there is nothing yet to crawl, the crawler
would sit in a 'ready-to-crawl' holding pattern waiting on
submission of URLs to crawl. (While in this state, it
wouldn't quit until it got a shutdown message.) We'd use
the 'ready-to-crawl' state in a multimachine configuration: we'd
ready the hive of crawlers by putting them all into the
'ready-to-crawl' state and then start injecting
URLs into the cluster.
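
One way to picture the 'ready-to-crawl' holding pattern proposed above is a worker that blocks on an inbox of submitted URLs and exits only on an explicit shutdown message. This is a rough sketch under assumptions: ReadyToCrawlSketch, the SHUTDOWN sentinel, and runUntilShutdown() are invented for illustration, and an in-memory java.util.concurrent.BlockingQueue stands in for whatever mechanism a real cluster would use to inject URLs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: names and the sentinel-based shutdown are assumptions,
// not part of Heritrix.
class ReadyToCrawlSketch {
    // Hypothetical shutdown message.
    static final String SHUTDOWN = "__SHUTDOWN__";

    // Waits for URLs even when the inbox is empty; returns only on shutdown
    // (or if the waiting thread is interrupted).
    static List<String> runUntilShutdown(BlockingQueue<String> inbox) {
        List<String> crawled = new ArrayList<>();
        try {
            while (true) {
                String uri = inbox.take(); // blocks while there is nothing to crawl
                if (SHUTDOWN.equals(uri)) {
                    return crawled;
                }
                crawled.add(uri);          // stand-in for actually crawling the URL
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return crawled;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        inbox.add("http://example.com/a");
        inbox.add("http://example.com/b");
        inbox.add(SHUTDOWN);
        System.out.println(runUntilShutdown(inbox));
    }
}
```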


Attached File

No Files Currently Attached

Changes ( 11 )

Field              Old Value                                                   Date              By
summary            hold paused crawl at \'end\', allowing all in-progress ops  2006-09-11 22:32  karl-ia
summary            hold paused crawl at 'end', allowing all in-progress ops    2006-09-11 22:31  karl-ia
close_date         -                                                           2005-12-02 17:29  stack-sf
status_id          Open                                                        2005-12-02 17:29  stack-sf
artifact_group_id  None                                                        2005-09-23 20:53  gojomo
priority           6                                                           2005-09-23 18:37  gojomo
assigned_to        gojomo                                                      2005-08-11 22:05  gojomo
priority           9                                                           2004-12-15 18:25  gojomo
priority           7                                                           2004-12-03 22:50  gojomo
assigned_to        nobody                                                      2004-12-03 01:02  gojomo
summary            hold crawl at 'end', allowing all in-progress ops           2004-12-03 01:02  gojomo