Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 recovery log of recovered crawl insufficient to recover - ID: 1052578
Last Update: Comment added ( karl-ia )

The recovery log of a recovered crawl records only the
successful frontier adds. It omits the 'already included'
entries that the first recovery loaded into the frontier.
A recovery from a recovered crawl therefore does the wrong
thing: it recrawls URIs already crawled in the first crawl.

A workaround is to concatenate the two recovery logs and
use the result for the second recovery. The real fix is to
ensure the recovery log of a recovered crawl contains all
necessary entries.
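
The workaround can be sketched as a small standalone helper. This is a minimal illustration, not Heritrix code: the class name RecoveryLogConcat and its method are hypothetical, and it assumes both recovery logs are plain, uncompressed text files that fit the one-entry-per-line layout described in this report.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper (not part of Heritrix): joins recovery logs in order.
class RecoveryLogConcat {

    // Copies every line of each log, in the order given, into 'combined'.
    // Assumes plain-text logs; a compressed log would need decompressing first.
    static void concatenate(List<Path> logs, Path combined) throws IOException {
        try (BufferedWriter out =
                 Files.newBufferedWriter(combined, StandardCharsets.UTF_8)) {
            for (Path log : logs) {
                // Stream line by line so large logs are never held in memory.
                try (BufferedReader in =
                         Files.newBufferedReader(log, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        out.write(line);
                        out.newLine();
                    }
                }
            }
        }
    }
}
```

Passing the first crawl's log before the recovered crawl's log keeps the entries in chronological order; the combined file is then the input to the second recovery.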


Gordon Mohr ( gojomo ) - 2004-10-23 01:03

Priority: 9

Status: Closed

Resolution: Duplicate

Assigned to: Michael Stack

None

None

Public


Comments ( 3 )

Date: 2007-03-14 00:17
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-270 -- please add further
comments at that location.


Date: 2004-11-03 19:25
Sender: stack-sf (Project Admin)

Duplicate of '[ 1054849 ] Recover from crawl initialized
with a recovery log':
http://sourceforge.net/tracker/?group_id=73833&atid=539102&func=detail&aid=1054849+


That change makes the recovery process record the
already-included lines in the new recovery.log (see the
frontier.getFrontierJournal().finishedSuccess(u) line below).


     // Scan log for all 'Fs' lines: add as 'alreadyIncluded'
+    BufferedReader reader = getBufferedReader(source);
     String read;
     try {
         while ((read = reader.readLine()) != null) {
             if (read.startsWith(F_SUCCESS)) {
-                UURI u;
                 String args[] = read.split("\\s+");
                 try {
-                    u = UURIFactory.getInstance(args[1]);
+                    UURI u = UURIFactory.getInstance(args[1]);
                     frontier.considerIncluded(u);
+                    frontier.getFrontierJournal().finishedSuccess(u);
                 } catch (URIException e) {
                     e.printStackTrace();
                 }
@@ -171,10 +159,9 @@
         reader.close();
     }
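
To see why the added finishedSuccess(u) call matters, the scan can be modeled outside Heritrix. The sketch below is only an illustration under stated assumptions: RecoverySketch is a hypothetical class, a Set<String> stands in for the frontier's already-included structure, a Writer stands in for the frontier journal, plain strings replace UURI, and the "Fs " prefix is an assumed value for F_SUCCESS.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.Set;

// Hypothetical stand-in for the recovery scan; not the Heritrix classes.
class RecoverySketch {

    // Assumed value of the 'finished successfully' line prefix.
    static final String F_SUCCESS = "Fs ";

    // Scans an old recovery log. Each 'Fs' line is marked as already
    // included AND written to the new journal, so the new journal is
    // sufficient on its own for a later recovery.
    static void recover(BufferedReader source,
                        Set<String> alreadyIncluded,
                        Writer newJournal) throws IOException {
        String read;
        while ((read = source.readLine()) != null) {
            if (!read.startsWith(F_SUCCESS)) {
                continue; // other entry types are ignored in this sketch
            }
            String[] args = read.split("\\s+");
            if (args.length < 2) {
                continue; // malformed line: no URI after the prefix
            }
            String uri = args[1];
            alreadyIncluded.add(uri);                  // models considerIncluded(u)
            newJournal.write(F_SUCCESS + uri + "\n");  // models finishedSuccess(u)
        }
    }
}
```

Without the journal write, the new log would contain only URIs fetched after the recovery, which is the failure this report describes.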



Changes ( 5 )

Field          Old Value  Date              By
status_id      Open       2004-11-03 19:25  stack-sf
resolution_id  None       2004-11-03 19:25  stack-sf
close_date     -          2004-11-03 19:25  stack-sf
assigned_to    nobody     2004-11-03 19:17  gojomo
priority       5          2004-11-03 19:14  gojomo