Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 One-click recover - ID: 1093609
Last Update: Comment added ( karl-ia )

It should be easy to launch a recovery crawl from a
prior crawl -- just a click or two, and have reasonable
defaults used for arcs, logs, etc. in a way that
doesn't clobber old logs.


Submitted: Gordon Mohr ( gojomo ) - 2004-12-31 03:10
Priority: 7
Status: Closed
Resolution: None
Assigned: Gordon Mohr
Category: None
Group: None
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 01:37
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-878 -- please add further
comments at that location.


Date: 2005-03-24 23:59
Sender: gojomo (Project Admin)

Added crude protection against a recovery into the same
absolute directories clobbering, or being disturbed by,
older data: whenever the 'logs' or 'state' directory of a
recovered crawl is non-empty, '-R' is appended to its path
until an empty directory is found.
Commit comment:

completion of [ 1093609 ] One-click recover
* CrawlJobHandler.java
    While 'logs' and 'state' directories of recovered crawl
    are not empty, append '-R' to their paths. Eventually a path
    that generates a new empty directory will be reached, making
    it safe to 'one-click-recover' even crawls with absolute
    'disk', 'logs', or 'state' paths.
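
The suffixing rule in the commit comment above can be illustrated with a minimal sketch. This is not the code in CrawlJobHandler.java: the class name `RecoverDirs` and both method names are hypothetical, and only the '-R' suffix and the "repeat until empty" rule come from the comment. A directory that does not exist yet is treated as usable, on the assumption that it would be created empty.

```java
import java.io.File;

class RecoverDirs {
    // '-R' is the suffix named in the comment; all other names are illustrative.
    static final String SUFFIX = "-R";

    /** True only if the path is an existing directory with at least one entry. */
    static boolean isNonEmptyDir(File dir) {
        String[] entries = dir.list(); // null when missing or not a directory
        return entries != null && entries.length > 0;
    }

    /** Appends '-R' to the path until it names a missing or empty directory. */
    static File firstUnusedVariant(File dir) {
        File candidate = dir;
        while (isNonEmptyDir(candidate)) {
            candidate = new File(candidate.getPath() + SUFFIX);
        }
        return candidate;
    }

    public static void main(String[] args) {
        // Build a throwaway 'logs' directory that already holds old data.
        File base = new File(System.getProperty("java.io.tmpdir"),
                "recover-demo-" + System.nanoTime());
        File logs = new File(base, "logs");
        new File(logs, "old-data").mkdirs(); // makes 'logs' non-empty
        System.out.println(firstUnusedVariant(logs).getPath());
    }
}
```

Because the check is repeated on each candidate, recovering a crawl that was itself a recovery keeps moving forward ('logs-R', then 'logs-R-R', and so on) rather than reusing a directory that already holds data.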

Also documented effect of 'recover' link option on 'new job'
'based on old job' job listing page. Commit comment:

completion of [ 1093609 ] One-click recover
* user_manual.xml
    Explanation of effect of new 'Recover' option on new job
    'based on old job' job listing.

Closing as implemented.


Date: 2005-03-23 22:26
Sender: gojomo (Project Admin)

Partially implemented: should work well for crawls that use
default relative paths for 'logs', 'scratch', 'state'. Might
cause problems if recovering a job that uses absolute paths
-- especially in 'logs', with new files clobbering old.
Still working on that case.

Commit comment for partial implementation:

partial implementation of [ 1093609 ] One-click recover
* webapps/admin/jobs/basedon.jsp, webapps/admin/jobs/new.jsp
    add 'recover' links after completed jobs; carry forward
    as flag on new job creation
* CrawlJobHandler.java
    take 'isRecover' on newJob based-on-other-job methods,
    use to indicate old job's recover log should be put into
    new job's 'recover-path'
* Heritrix.java
    use new wider newJob method
* CrawlOrder.java
    convenience method to get order-relative paths without
    needing CrawlController (method adapted from there)
* AbstractFrontier.java, FrontierJournal.java,
  RecoveryJournal.java
    make constants used to construct standard 'recover.gz'
    log name reside in more public/shared places
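
As a rough illustration of how the pieces in this commit fit together, the sketch below derives the value a new job's 'recover-path' might receive from the old job's order directory and 'logs' setting. It is an assumption-laden sketch, not the Heritrix API: `RecoverPath`, `resolveOrderRelative`, and `recoverPathFor` are hypothetical names, and it assumes the 'recover.gz' log sits directly inside the old job's 'logs' directory.

```java
import java.io.File;

class RecoverPath {
    // The log name 'recover.gz' comes from the commit comment; the constant
    // name and its location in this class are illustrative only.
    static final String RECOVER_LOG_NAME = "recover.gz";

    /** Resolves a configured path against the order directory unless it is absolute. */
    static File resolveOrderRelative(File orderDir, String path) {
        File f = new File(path);
        return f.isAbsolute() ? f : new File(orderDir, path);
    }

    /**
     * Returns the absolute location of the old job's recover log, i.e. the
     * value a recovering job could use as its 'recover-path'.
     * Assumes the log lives directly in the old job's 'logs' directory.
     */
    static String recoverPathFor(File oldOrderDir, String oldLogsPath) {
        File logsDir = resolveOrderRelative(oldOrderDir, oldLogsPath);
        return new File(logsDir, RECOVER_LOG_NAME).getAbsolutePath();
    }

    public static void main(String[] args) {
        // Hypothetical old job that uses the default relative 'logs' path.
        File oldOrderDir = new File("jobs", "old-job");
        System.out.println(recoverPathFor(oldOrderDir, "logs"));
    }
}
```

Resolving against the order directory is what lets the same logic serve both the default relative paths and the absolute-path case mentioned earlier in this comment.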


Attached File

No Files Currently Attached

Changes ( 2 )

Field       Old Value   Date               By
status_id   Open        2005-03-24 23:59   gojomo
close_date  -           2005-03-24 23:59   gojomo