Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 Gzipped recover log corrupt at end; last <32K unrecoverable - ID: 998184
Last Update: Comment added ( karl-ia )

Gzipping the recover log assumes the crawl terminates
gracefully, so that the gzipped stream gets a close
and a chance to write out the required gzip
CRC trailer. Without this close, the written gzipped file is
uninterpretable. (The close is called from
CrawlController#completeStop.)

This makes the recovery log mechanism effectively
useless if it is intended, as I understand it, for
any case other than graceful termination, e.g. as a recovery
mechanism for when the crawler crashes.
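
The failure mode described above can be reproduced outside Heritrix with the standard java.util.zip classes. The following is only a minimal sketch, not Heritrix code: the class name, the method names, and the "F+ <uri>" line format are placeholders chosen for illustration, and an in-memory buffer stands in for the log file. It writes the same lines twice, once closing the stream and once abandoning it as a crash would, then checks whether each result can be read to the end.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class UnclosedGzipDemo {
    /** Gzips some placeholder log lines; when close is false the stream is abandoned. */
    static byte[] gzip(int lines, boolean close) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the log file
            GZIPOutputStream gz = new GZIPOutputStream(sink);
            for (int i = 0; i < lines; i++) {
                gz.write(("F+ http://example.com/" + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
            if (close) {
                gz.close(); // finishes the deflate stream and writes the CRC/length trailer
            }
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true only if the whole stream, including the trailer, can be read. */
    static boolean readsCleanly(byte[] gzipped) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // discard the data; only the success of the read matters here
            }
            return true;
        } catch (IOException e) {
            return false; // truncated stream: the reader hits end of input before the trailer
        }
    }

    public static void main(String[] args) {
        System.out.println("closed stream reads cleanly:   " + readsCleanly(gzip(1000, true)));
        System.out.println("unclosed stream reads cleanly: " + readsCleanly(gzip(1000, false)));
    }
}
```

Note that the reader only fails when it reaches the truncated end, which matches the later observation in this thread that the earlier part of the log remains usable.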


Submitted: Michael Stack ( stack-sf ) - 2004-07-26 18:03
Priority: 5
Status: Closed
Resolution: Wont Fix
Assigned: Gordon Mohr
Category: Disk I/O
Group: None
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:14
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-200 -- please add further
comments at that location.


Date: 2004-10-14 23:10
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Not a bug. The recover log is only supposed to approach the
state of the crawler at the time of a problem, not have
resolution of every frontier operation, so losing a little
due to unclosed gzip state at the end is an OK tradeoff.


Date: 2004-08-02 18:20
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

The recover log should be closed properly on operator
crawl-termination, but that's not the scenario for which the
recover log is designed. Random failure, up to a whole-machine
crash, will unavoidably leave the gzip stream corrupt at the end, but
the loss is limited, and the recover log isn't supposed to recover
perfectly, for every URI of progress, just save "most" work.

Possibly could flush recover log more often, ensuring lost
end-segment is smaller.

Probably should just doc and close.
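
The "flush more often" idea can be sketched with the syncFlush option of java.util.zip.GZIPOutputStream. This is an illustration under assumptions, not what Heritrix does: the class and method names and the "F+ <uri>" lines are placeholders, and an in-memory buffer stands in for the log file. Without syncFlush, flush() only flushes the underlying stream and the deflater keeps its pending data; with syncFlush, flush() forces the compressed data written so far out to the sink, so less is lost if the stream is never closed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class RecoverLogFlushDemo {
    /** Returns how many compressed bytes have reached the sink after flush(), without close(). */
    static int bytesReachingSink(boolean syncFlush, int lines) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the log file
            GZIPOutputStream gz = new GZIPOutputStream(sink, syncFlush);
            for (int i = 0; i < lines; i++) {
                gz.write(("F+ http://example.com/" + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
            gz.flush(); // with syncFlush=true this pushes pending compressed data to the sink
            return sink.size(); // gz is deliberately left open to mimic a crash
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("plain flush: " + bytesReachingSink(false, 100) + " bytes in sink");
        System.out.println("sync flush:  " + bytesReachingSink(true, 100) + " bytes in sink");
    }
}
```

The tradeoff is that each sync flush adds a few bytes and cuts the compressor's current block short, so flushing very frequently reduces the compression ratio.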



Date: 2004-07-30 01:42
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

So, not the whole log is useless... just the last compressed
32k block is corrupted. Everything before that can be read by the
frontier.
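
Reading everything up to the corrupt tail can be sketched as below. This is a hypothetical reader, not the frontier's actual recovery code: the class and method names and the "F+ <uri>" lines are made up for illustration, and the crash is simulated by cutting a complete gzip file in half. The reader keeps whatever decompresses before the error and drops a trailing partial line.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class TruncatedRecoverLogDemo {
    /** Builds a complete, properly closed gzip of placeholder log lines. */
    static byte[] gzipLines(int lines) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
                for (int i = 0; i < lines; i++) {
                    gz.write(("F+ http://example.com/page/" + i + "\n")
                            .getBytes(StandardCharsets.UTF_8));
                }
            }
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns every complete line that can be decompressed before the stream fails. */
    static List<String> salvageLines(byte[] gzipped) {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buf = new byte[1 << 16];
            int n;
            while ((n = in.read(buf)) != -1) {
                plain.write(buf, 0, n); // bytes from earlier reads survive a later failure
            }
        } catch (IOException e) {
            // Truncated or corrupt tail: keep what was decoded before the error.
        }
        String text = new String(plain.toByteArray(), StandardCharsets.UTF_8);
        List<String> lines = new ArrayList<>();
        int start = 0;
        int newline;
        while ((newline = text.indexOf('\n', start)) != -1) {
            lines.add(text.substring(start, newline));
            start = newline + 1;
        }
        return lines; // a final fragment without '\n' is dropped as incomplete
    }

    public static void main(String[] args) {
        byte[] full = gzipLines(20000);
        byte[] cut = Arrays.copyOf(full, full.length / 2); // simulate a crash mid-write
        System.out.println("recovered " + salvageLines(cut).size() + " of 20000 lines");
    }
}
```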



Changes ( 5 )

Field          Old Value                     Date              By
status_id      Open                          2004-10-14 23:10  gojomo
resolution_id  None                          2004-10-14 23:10  gojomo
assigned_to    nobody                        2004-10-14 23:10  gojomo
close_date     -                             2004-10-14 23:10  gojomo
summary        Revisit gzipping recover log  2004-08-02 18:20  gojomo