Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

3 Change arc download dir mid-crawl - ID: 953994
Last Update: Comment added ( karl-ia )

Below is a note from Kris:

"...What I'm thinking is that when the user changes the
setting any new ARC files created after that point will
be written to the new location but any currently open
ARCs will be completed where they are.

"The reason for this is that a crawl here was a bit
misconfigured and is downloading ARCs to a disk with
insufficient space for the amount of data we are
gathering. Even moving the ARCs away the crawler will
exhaust available space overnight..."

This will be awkward to do because the arc dir is set
once on initialization of ARCWriterProcessor. The
arcdir location is passed into the ARCWriterPool which
in turn passes it to ARCWriters as they are created.

One thing to look at is having ARCWriterProcessor
monitor for arcdir changes. It could then update
ARCWriterPool, which could in turn pass the new location
on to all members of the ARCWriter pool. (If the arcdir
were a static, it'd make things a little easier.)
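
A minimal sketch of the idea above, following Kris's suggestion that open ARCs finish where they are and only new ones go to the new location. The class and method names here (SketchArcWriter, SketchArcWriterPool, SketchArcWriterProcessor, checkForArcDirChange) are hypothetical stand-ins, not the real Heritrix classes or their API. Under that assumption the pool does not need to message existing writers at all; it only needs to hand the current directory to each writer it creates.

```java
import java.io.File;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for an ARC writer: bound to one directory for life. */
final class SketchArcWriter {
    private final File dir;

    SketchArcWriter(File dir) {
        this.dir = dir;
    }

    File getDir() {
        return dir;
    }
}

/** Hypothetical stand-in for the pool: holds the directory used for NEW writers. */
final class SketchArcWriterPool {
    private final AtomicReference<File> arcDir;

    SketchArcWriterPool(File initialDir) {
        this.arcDir = new AtomicReference<>(initialDir);
    }

    /** Called when the configured directory changes mid-crawl. */
    void setArcDir(File newDir) {
        arcDir.set(newDir);
    }

    /** Each new writer captures whatever directory is current at creation time. */
    SketchArcWriter newWriter() {
        return new SketchArcWriter(arcDir.get());
    }
}

/** Hypothetical stand-in for the processor: notices setting changes and updates the pool. */
final class SketchArcWriterProcessor {
    private final SketchArcWriterPool pool;
    private File lastSeenDir;

    SketchArcWriterProcessor(File initialDir) {
        this.lastSeenDir = initialDir;
        this.pool = new SketchArcWriterPool(initialDir);
    }

    /** Assumed to be called periodically with the currently configured value. */
    void checkForArcDirChange(File configuredDir) {
        if (!configuredDir.equals(lastSeenDir)) {
            lastSeenDir = configuredDir;
            pool.setArcDir(configuredDir);
        }
    }

    SketchArcWriterPool getPool() {
        return pool;
    }
}
```

With this shape, a writer created before the change keeps reporting the old directory from getDir(), while any writer created after checkForArcDirChange() sees the new one.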


Submitted: Michael Stack ( stack-sf ) - 2004-05-14 14:21
Priority: 3
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: i/o
Group: None
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 01:30
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-776 -- please add further
comments at that location.


Date: 2005-01-13 02:54
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Closing. Implemented in [ 988276 ] "ARC writer pool config.
to write multiple disks".


Date: 2004-05-14 15:19
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

An alternative would be to ensure that if disk space runs out
where the ARCs (or in fact any other files, come to think of
it) are written, Heritrix gracefully pauses and alerts the user.
The problem I encountered was that once out of space,
Heritrix essentially crashed (i.e. I was not able to recover
the crawl once I'd freed up space).
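
As a rough illustration of the check suggested above, here is a sketch of a guard that reports when the partition holding a directory has dropped below a free-space threshold. DiskSpaceGuard and shouldPause are hypothetical names, not part of Heritrix, and the sketch assumes a JVM that provides File.getUsableSpace() (Java 6 or later). How the crawler would actually pause and alert the user is left out.

```java
import java.io.File;

/** Hypothetical guard; a sketch only, not part of Heritrix. */
final class DiskSpaceGuard {
    private final long minFreeBytes;

    DiskSpaceGuard(long minFreeBytes) {
        this.minFreeBytes = minFreeBytes;
    }

    /** True if the partition holding dir has fewer usable bytes than the threshold. */
    boolean shouldPause(File dir) {
        // getUsableSpace() returns 0 when the path does not name a partition;
        // this sketch treats that as "no space" and therefore a reason to pause.
        return dir.getUsableSpace() < minFreeBytes;
    }
}
```

A caller might run shouldPause() against the ARC directory before opening each new file and pause the crawl while it returns true, so that the crawl can resume once space has been freed.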


Changes ( 2 )

Field       Old Value  Date              By
status_id   Open       2005-01-13 02:54  stack-sf
close_date  -          2005-01-13 02:54  stack-sf