Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 unattended checkpointing - ID: 1302207
Last Update: Comment added ( karl-ia )

Checkpoints currently require manual pause, choose
checkpoint, manual resume. Should be possible to
pause/checkpoint/resume in one operator step, and to
schedule auto pause-checkpoint-resume at specified
intervals.
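
The request above can be pictured roughly as follows. This is a minimal sketch, not Heritrix code: the `Crawl` interface, its `pause`/`checkpoint`/`resume` methods, and the 50 ms interval used in `main` are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the crawler's control surface (not a Heritrix API). */
interface Crawl {
    void pause();
    void checkpoint();
    void resume();
}

/** Records the calls it receives so their order can be inspected. */
class RecordingCrawl implements Crawl {
    private final List<String> calls = new ArrayList<>();
    public synchronized void pause() { calls.add("pause"); }
    public synchronized void checkpoint() { calls.add("checkpoint"); }
    public synchronized void resume() { calls.add("resume"); }
    synchronized List<String> snapshot() { return new ArrayList<>(calls); }
}

class UnattendedCheckpoint {
    /** One operator step: pause, checkpoint, then resume even if the checkpoint throws. */
    static void checkpointOnce(Crawl crawl) {
        crawl.pause();
        try {
            crawl.checkpoint();
        } finally {
            crawl.resume();
        }
    }

    /** Repeats the one-step operation at a fixed delay until the returned scheduler is shut down. */
    static ScheduledExecutorService schedule(Crawl crawl, long period, TimeUnit unit) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(() -> checkpointOnce(crawl), period, period, unit);
        return timer;
    }

    public static void main(String[] args) throws InterruptedException {
        RecordingCrawl crawl = new RecordingCrawl();
        ScheduledExecutorService timer = schedule(crawl, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(300);               // let a few scheduled checkpoints run
        timer.shutdown();
        timer.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println(crawl.snapshot());
    }
}
```

Resuming in a `finally` block is one possible design choice; an implementation could equally leave the crawl paused after a failed checkpoint so an operator can investigate.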


Submitted: Gordon Mohr ( gojomo ) - 2005-09-23 22:00
Priority: 7
Status: Closed
Resolution: None
Assigned: Karl Thiessen
Category: None
Group: 1.8.0
Visibility: Public


Comments ( 5 )

Date: 2007-03-14 01:44
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-970 -- please add further
comments at that location.


Date: 2006-05-03 21:01
Sender: karl-ia

Logged In: YES
user_id=1269624

Verified per Gordon; closing.


Date: 2005-12-19 22:17
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Added running of checkpointing using a timer thread if a
period is set in heritrix.properties.

Assigning Karl.

TO TEST:

There are two things to test:

1. Checkpointing is now an option in the UI and from JMX
even when the crawler is crawling. Test invoking
checkpoints with crawler in mid-flight. Does the
checkpointing thread properly manage the pausing,
checkpoint, resume? Can you confuse the crawler by running
a particular set of steps in a particular sequence (e.g. try
and resume crawl or recheckpoint while a checkpoint is running)?
2. Test the automated checkpoint on a period.
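
The first test case asks whether a resume or a second checkpoint request can confuse the crawler while a checkpoint is running. One way such requests could be refused is sketched below. `CheckpointGuard` and its methods are hypothetical and are not the Heritrix Checkpointer; the sketch only illustrates the behaviour a tester might expect to observe.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical guard (not the Heritrix Checkpointer): admits one checkpoint at a time. */
class CheckpointGuard {
    private final AtomicBoolean checkpointing = new AtomicBoolean(false);

    boolean isCheckpointing() {
        return checkpointing.get();
    }

    /**
     * Runs the given work as a checkpoint unless one is already in progress.
     * Returns false, without running anything, when the request is rejected.
     */
    boolean tryCheckpoint(Runnable work) {
        if (!checkpointing.compareAndSet(false, true)) {
            return false;
        }
        try {
            work.run();
            return true;
        } finally {
            checkpointing.set(false);
        }
    }

    /**
     * Refuses an operator resume while a checkpoint is in progress.
     * This is check-then-act; a fuller implementation would hold a lock across both steps.
     */
    boolean tryResume(Runnable resume) {
        if (checkpointing.get()) {
            return false;
        }
        resume.run();
        return true;
    }
}

class CheckpointGuardDemo {
    public static void main(String[] args) throws InterruptedException {
        CheckpointGuard guard = new CheckpointGuard();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // First request: holds the checkpoint open until released.
        Thread first = new Thread(() -> guard.tryCheckpoint(() -> {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        first.start();
        started.await();

        // Mid-checkpoint: both a second checkpoint and a resume are refused.
        System.out.println("second checkpoint accepted: " + guard.tryCheckpoint(() -> { }));
        System.out.println("resume accepted: " + guard.tryResume(() -> { }));

        release.countDown();
        first.join();
        System.out.println("checkpoint after completion accepted: " + guard.tryCheckpoint(() -> { }));
    }
}
```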

Both new features are discussed in the manual.
Checkpointing is here:
http://crawler.archive.org/articles/user_manual.html#checkpoint.
How to do the automated checkpointing is discussed here:
http://crawler.archive.org/articles/user_manual.html#automated_chkpt

Of note, the progress stats now show a note about
checkpointing and their format has changed a little.

Here is the last commit:

* src/articles/user_manual.xml
Updated note on checkpointing to include discussion of checkpointing
managing pause of crawler and resume after checkpointing if crawler was
'crawling' when checkpointing was invoked. Also added description of
automated checkpointing on a schedule feature.
* src/conf/heritrix.properties
Added checkpointing section with INFO level logging for Checkpointer
and with commented out automated checkpointing config.
* src/java/org/archive/crawler/Heritrix.java
Add passing in of CrawlJobHandler on construction rather than as a
setJobHandler call. Moved configureTrustStore to configureContainer.
Change selftest to use new constructor.
* src/java/org/archive/crawler/framework/Checkpointer.java
Add running of a timer thread if a period set in heritrix.properties.
Make a common initialize method used both by constructor and after
deserialization. Renamed recoveryAmendments as recovery.
* src/java/org/archive/crawler/framework/CrawlController.java
Formatting. Added call to cleanup of checkpointer thread (removes timer
if one is installed).
(getCheckpointsDisk): Added.
* src/java/org/archive/crawler/frontier/AbstractFrontier.java
Allow recover to be null.
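
The Checkpointer.java change in this comment combines a timer thread (started when a period is configured) with a single initialize method shared by the constructor and deserialization, plus a cleanup call that removes the timer. A minimal sketch of that pattern follows. `SketchCheckpointer`, its fields, and the millisecond period passed to the constructor are assumptions for illustration; the actual class reads its period from heritrix.properties and is not reproduced here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Timer;
import java.util.TimerTask;

/** Hypothetical sketch, not org.archive.crawler.framework.Checkpointer. */
class SketchCheckpointer implements Serializable {
    private static final long serialVersionUID = 1L;

    /** Assumed unit: milliseconds. Zero or less disables the automatic timer. */
    private final long periodMillis;

    /** Timers cannot be serialized, so the field is transient and rebuilt on load. */
    private transient Timer timer;

    SketchCheckpointer(long periodMillis) {
        this.periodMillis = periodMillis;
        initialize();
    }

    /** Shared setup: called by the constructor and again after deserialization. */
    private void initialize() {
        if (periodMillis > 0) {
            timer = new Timer("sketch-checkpointer", true); // daemon thread
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    // A real implementation would trigger a checkpoint here.
                }
            }, periodMillis, periodMillis);
        }
    }

    /** Serialization hook: restore the saved fields, then rebuild transient state. */
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        initialize();
    }

    boolean hasTimer() {
        return timer != null;
    }

    /** Removes the timer if one is installed. */
    void cleanup() {
        if (timer != null) {
            timer.cancel();
            timer = null;
        }
    }

    /** Serializes and deserializes an instance in memory, as recovery from disk would. */
    static SketchCheckpointer roundTrip(SketchCheckpointer c) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(c);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SketchCheckpointer) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SketchCheckpointer original = new SketchCheckpointer(60_000L);
        SketchCheckpointer restored = roundTrip(original);
        System.out.println("restored has timer: " + restored.hasTimer());
        original.cleanup();
        restored.cleanup();
    }
}
```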




Date: 2005-12-17 01:11
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Did the commit below against this issue.

Work toward '[ 1302207 ] unattended checkpointing'.
Call to checkpoint now will pause a running crawler, run actual checkpoint,
then resume. Needs more testing. CheckpointContext renamed Checkpointer
and checkpointing classes moved into framework so can run the checkpoint
invoking default access methods on CrawlController.
* src/conf/heritrix.properties
Add logging of checkpointing.
* src/java/org/archive/crawler/admin/CrawlJob.java
Go to new CheckpointUtils to get serializing utility method.
(isCheckpointing): Added.
* src/java/org/archive/crawler/framework/CrawlController.java
Go to CheckpointUtils for checkpointing utility.
CheckpointContext renamed as Checkpointer.
Moved getjeLogsFilter here where it's actually used.
(isCheckpointing): Added.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
Use new CheckpointUtils to get serializing methods.
* src/webapps/admin/index.jsp
Move checkpointing to be a running job function from a function that
only appears when crawler is paused.
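
The behaviour described in this commit (a checkpoint call pauses a running crawler, checkpoints, then resumes) might be sketched as below. `SketchController` and `CrawlState` are hypothetical names and are not the real CrawlController; the point illustrated is that the crawler returns to whatever state it was in before the checkpoint was requested.

```java
/** Hypothetical crawl states; the real controller's state names may differ. */
enum CrawlState { RUNNING, PAUSED }

/** Minimal stand-in for a crawl controller (not the Heritrix CrawlController). */
class SketchController {
    private CrawlState state;
    private int checkpointsWritten = 0;

    SketchController(CrawlState initial) {
        this.state = initial;
    }

    synchronized CrawlState getState() { return state; }
    synchronized int getCheckpointsWritten() { return checkpointsWritten; }

    synchronized void pause() { state = CrawlState.PAUSED; }
    synchronized void resume() { state = CrawlState.RUNNING; }

    /** The actual checkpoint is only valid while the crawl is paused. */
    private void writeCheckpoint() {
        if (state != CrawlState.PAUSED) {
            throw new IllegalStateException("checkpoint requires a paused crawl");
        }
        checkpointsWritten++;
    }

    /**
     * Pauses first if the crawl is running, writes the checkpoint, and resumes
     * only if the crawl was running when the checkpoint was requested.
     */
    synchronized void checkpoint() {
        boolean wasRunning = (state == CrawlState.RUNNING);
        if (wasRunning) {
            pause();
        }
        try {
            writeCheckpoint();
        } finally {
            if (wasRunning) {
                resume();
            }
        }
    }

    public static void main(String[] args) {
        SketchController running = new SketchController(CrawlState.RUNNING);
        running.checkpoint();
        System.out.println("was running, now: " + running.getState());

        SketchController paused = new SketchController(CrawlState.PAUSED);
        paused.checkpoint();
        System.out.println("was paused, now: " + paused.getState());
    }
}
```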


Date: 2005-11-02 20:34
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Deferring to post 1.6.


Attached File

No Files Currently Attached

Changes ( 9 )

Field              Old Value  Date              By
close_date         -          2006-05-03 21:01  karl-ia
status_id          Open       2006-05-03 21:01  karl-ia
assigned_to        stack-sf   2005-12-19 22:17  stack-sf
artifact_group_id  None       2005-12-17 01:11  stack-sf
priority           6          2005-12-13 18:30  stack-sf
priority           7          2005-11-02 20:34  gojomo
artifact_group_id  1.6.0      2005-11-02 20:34  gojomo
assigned_to        nobody     2005-09-29 18:27  gojomo
artifact_group_id  None       2005-09-23 22:03  gojomo