Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 Compress recover.log - ID: 964493
Last Update: Comment added ( karl-ia )

The recover.log could be compressed at very little CPU
cost, with great disk savings. (This would also move it
away from the Logger framework, which is overkill for
its needs.)


Submitted: Gordon Mohr ( gojomo ) - 2004-06-01 19:21
Priority: 5
Status: Closed
Resolution: None
Assigned: Nobody/Anonymous
Category: None
Group: None
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 01:30
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-783 -- please add further
comments at that location.


Date: 2004-07-30 18:40
Sender: stack-sf (Project Admin)

Here are notes made while testing this feature (copied from
http://crawler.archive.org/cgi-bin/wiki.pl?CrawlTestPlan):

Ran a general crawl. After a while, killed the crawling
process. Ran 'gzip -t' on the recover.gz file; gzip reports
that the file terminated abnormally. 'zcat' outputs recover
log lines until it hits the incomplete 32k compression
block, the one being written when we killed the program.
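
The zcat behaviour noted above (complete lines are readable up to the
unfinished block) can be reproduced with a small Java sketch. This is
not Heritrix code: the class and method names are made up for
illustration, and the decoded stream is buffered in memory for brevity,
which would not suit a large recover log.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipException;

// Hypothetical helper, not part of Heritrix.
class TruncatedGzipDemo {

    /** Returns every complete line that can be decoded from a possibly truncated gzip stream. */
    static List<String> readAvailableLines(InputStream raw) throws IOException {
        ByteArrayOutputStream decoded = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(raw)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                decoded.write(buf, 0, n);
            }
        } catch (EOFException | ZipException e) {
            // The stream ended mid-block or its tail is damaged:
            // keep whatever was decoded before the failure.
        }
        String text = new String(decoded.toByteArray(), StandardCharsets.UTF_8);
        List<String> lines = new ArrayList<>();
        // Anything after the last newline is a partial line; drop it.
        int lastNewline = text.lastIndexOf('\n');
        if (lastNewline >= 0) {
            lines.addAll(Arrays.asList(text.substring(0, lastNewline).split("\n")));
        }
        return lines;
    }

    /** Gzips the given lines into a byte array, for simulating a log file. */
    static byte[] gzipLines(List<String> lines) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(sink), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                w.write(line);
                w.write('\n');
            }
        }
        return sink.toByteArray();
    }
}
```

Cutting the output of gzipLines short and passing it to
readAvailableLines simulates a killed crawl: the result is a prefix of
the original lines, and the lines in the unfinished tail are lost.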

Restarted the crawl. Created a new job based on the killed
job. Near the bottom of the settings page is a
'recovery-path' option. I filled in the full path to the
recover.gz from the killed job and hit 'Submit Job'. After a
while I was taken to a status screen that said there were no
jobs to run and suggested I create one (I suppose the
recover.log was being loaded, but the UI gave no indication
of this). This was disorienting. After a while I hit reload
and then saw the crawl continue. Redid it, and this time the
hang was at the 'Submit Job' button, with lots of disk
access; I guess it was loading the recover log. When the
status page came back this time, it said the crawler was
running, and a reload got me the status bar. This is better
than the first time I tried it.

Nothing in any of the logs says that we've run the recover
log. What about the URIs that were dropped because of the
corrupted gzipped recover log? Should the user be worried
about these?

Documented this feature in the user manual.


Date: 2004-07-21 01:39
Sender: gojomo (Project Admin)

Fixed while addressing:

[ 926329 ] referral URL should be stored in recover.log

There is now a helper class, RecoveryJournal, in the
frontier package which handles the writing and re-scanning
of the recovery journal. It avoids using Java logging,
gzip-compresses the stream, and can internalize any future
expansion of the CrawlURI info saved.
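
As a rough illustration of the approach described above (a journal
written straight to a gzip stream, without Java logging), here is a
minimal sketch. It is not the actual RecoveryJournal class: the class
name, the method names, and the "ADD"/"DONE" tags are all assumptions
made for the example. It relies on the standard
GZIPOutputStream(OutputStream, boolean syncFlush) constructor, so that
flush() pushes everything written so far to disk in decodable form.

```java
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of a gzip-compressed journal writer; not Heritrix code.
class GzipJournalSketch implements Closeable {
    private final Writer out;

    GzipJournalSketch(OutputStream sink) throws IOException {
        // syncFlush=true: flush() performs a deflate SYNC_FLUSH, so data
        // written before the flush can be decoded even if the process dies.
        this.out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(sink, true), StandardCharsets.UTF_8));
    }

    /** Appends one journal line: a tag, a space, then the URI. */
    synchronized void record(String tag, String uri) throws IOException {
        out.write(tag);
        out.write(' ');
        out.write(uri);
        out.write('\n');
    }

    /** Forces buffered lines through the compressor and on to the sink. */
    synchronized void checkpoint() throws IOException {
        out.flush();
    }

    /** Finishes the gzip stream (writes the trailer) and closes the sink. */
    @Override
    public synchronized void close() throws IOException {
        out.close();
    }
}
```

Flushing after every line would hurt the compression ratio, so a real
writer would presumably call something like checkpoint() only
occasionally; lines written after the last flush can still be lost if
the process is killed.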


Changes ( 2 )

Field       Old Value  Date              By
close_date  -          2004-07-21 01:39  gojomo
status_id   Open       2004-07-21 01:39  gojomo