Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

9 Recover from crawl initialized with a recovery log - ID: 1054849
Last Update: Comment added ( karl-ia )

The recovery log from a crawl prepopulated with a
recovery log from a previous crawl cannot be used to
prepopulate a subsequent crawl. Fix.


Submitted: Michael Stack ( stack-sf ) - 2004-10-26 20:32
Priority: 9
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: i/o
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 01:35
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-842 -- please add further
comments at that location.


Date: 2004-10-26 23:21
Sender: stack-sf (Project Admin)

Fixed on branch and head. Below is the commit:

Fix for '[ archive-crawler-Feature Requests-1054849 ] Recover from crawl
initialized with a recovery log' and '[ archive-crawler-Feature
Requests-1054851 ] Import gzipped or non-gzipped recovery log'.
* src/java/org/archive/crawler/frontier/FrontierJournal.java
  Added interface for frontier journaling. Current implementation
  writes the Recovery log (RecoveryJournal).
* src/java/org/archive/crawler/framework/CrawlController.java
  Moved recover journal out of CC and into Frontier. It is kind of a
  crawl log, but it's more coherent having it live in the package that
  is emitting the events being journaled.
  (recover): Removed.
* src/java/org/archive/crawler/framework/Frontier.java
  Added new method, getFrontierJournal. Returns the frontier journal
  instance.
* src/java/org/archive/crawler/frontier/AbstractFrontier.java
  Added calls to the FrontierJournal methods on key Frontier events.
  Current implementation instantiates RecoveryJournal as the
  FrontierJournal.
  (recover, LOGNAME_RECOVER, initialize, getFrontierJournal,
  doJournal*): Added.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
* src/java/org/archive/crawler/frontier/ExperimentalFrontier.java
  Call super.initialize. Call the doJournal* methods.
* src/java/org/archive/crawler/frontier/DomainSensitiveFrontier.java
  Go via getFrontierJournal.
* src/java/org/archive/crawler/frontier/HostQueuesFrontier.java
  Copy the recover and doJournal* methods from AbstractFrontier.
  Also do recover log setup in initialize.
* src/java/org/archive/crawler/frontier/RecoveryJournal.java
  Refactoring. Made it an implementation of FrontierJournal.
  (finishSuccess): Added override for uuris.
  (getBufferedReader): Added an open method that behaves differently
  depending on the recover log's ending (gzipped or non-gzipped).
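
For readers unfamiliar with the getBufferedReader change, here is a minimal sketch of opening a recovery log that may or may not be gzipped. It is not the actual RecoveryJournal code: the class name RecoverLogReaderSketch, the method name open, the ".gz" suffix check, and the UTF-8 charset are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of Heritrix.
public class RecoverLogReaderSketch {

    /**
     * Opens a recovery log for line-by-line reading.
     * Assumption: a gzipped log is recognised by a ".gz" file-name ending;
     * any other ending is treated as plain text.
     */
    public static BufferedReader open(File log) throws IOException {
        InputStream in = new FileInputStream(log);
        try {
            if (log.getName().toLowerCase().endsWith(".gz")) {
                // Wrap the raw stream so callers see decompressed lines.
                in = new GZIPInputStream(in);
            }
        } catch (IOException e) {
            // Do not leak the file handle if the gzip header is invalid.
            in.close();
            throw e;
        }
        return new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

With a helper of this shape, the code that replays the log does not need to know whether the previous crawl's log was compressed; it reads lines from the returned reader either way.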



Attached File

No Files Currently Attached

Changes ( 2 )

Field       Old Value  Date              By
status_id   Open       2004-10-26 23:21  stack-sf
close_date  -          2004-10-26 23:21  stack-sf