Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 enable crawl-end at target compressed-ARC-data size - ID: 1078008
Last Update: Comment added ( karl-ia )

The current cutoff for ending a crawl at a collected data size
is based on uncompressed size. A budget for compressed size
would be more useful, e.g. "get exactly 1TB of ARCs".


Submitted: Gordon Mohr ( gojomo ) - 2004-12-02 23:35
Priority: 7
Status: Closed
Resolution: None
Assigned to: Michael Stack
Category: Configuration
Group: None
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 01:36
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-861 -- please add further
comments at that location.


Date: 2005-03-23 20:03
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Done. Closing. Here is the commit:

Implement '[ 1078008 ] enable crawl-end at target compressed-ARC-data size'
ARCWriterProcessor now keeps track of how many ARC bytes it has written and
will call requestCrawlStop if a maximum has been exceeded. The feature is
implemented as a configuration attribute on ARCWriterProcessor, not on
CrawlOrder.
* src/java/org/archive/crawler/framework/CrawlController.java
(requestCrawlStop): Added override that takes a message.
* src/java/org/archive/crawler/writer/ARCWriterProcessor.java
Added new expert attribute total-bytes-to-write. Added accounting of bytes
written (accounting works for compressed streams).
(totalBytesWritten, checkBytesWritten, getMaxToWrite): Added.
* src/java/org/archive/io/arc/ARCWriter.java
Added data member fos so we can get at the channel and get the current
position in the file.
(checkARCFileSize): Changed access to public so it can be called from
ARCWriterProcessor (AWP).
(getPosition): Added.
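
For readers following along, the accounting described in the commit can be sketched roughly as below. This is not the Heritrix code: the class `WriteBudgetSketch`, the `StopHandler` interface and the `recordWritten` method are hypothetical stand-ins, and treating a limit of 0 or less as "no limit" is an assumption. Only the names totalBytesWritten, checkBytesWritten, getMaxToWrite and requestCrawlStop come from the commit message.

```java
final class WriteBudgetSketch {

    /** Hypothetical stand-in for the controller method that takes a message. */
    interface StopHandler {
        void requestCrawlStop(String message);
    }

    private final long maxToWrite;      // assumption: <= 0 means "no limit"
    private final StopHandler controller;
    private long totalBytesWritten;
    private boolean stopRequested;

    WriteBudgetSketch(long maxToWrite, StopHandler controller) {
        this.maxToWrite = maxToWrite;
        this.controller = controller;
    }

    /** Called after each record with the number of bytes actually added to the file. */
    synchronized void recordWritten(long bytesAddedToFile) {
        totalBytesWritten += bytesAddedToFile;
        checkBytesWritten();
    }

    /** Requests a stop once, the first time the running total exceeds the maximum. */
    private void checkBytesWritten() {
        if (maxToWrite <= 0 || stopRequested) {
            return;
        }
        if (totalBytesWritten > maxToWrite) {
            stopRequested = true;
            controller.requestCrawlStop("Maximum bytes to write exceeded: "
                    + totalBytesWritten + " > " + maxToWrite);
        }
    }

    synchronized long getTotalBytesWritten() {
        return totalBytesWritten;
    }

    long getMaxToWrite() {
        return maxToWrite;
    }

    public static void main(String[] args) {
        WriteBudgetSketch budget = new WriteBudgetSketch(100, System.out::println);
        budget.recordWritten(60); // under the limit, nothing happens
        budget.recordWritten(60); // total is now over the limit, stop is requested
    }
}
```

The methods are synchronized on the assumption that several crawl threads may report writes at the same time. Because the check runs after a record is written, the total can overshoot the limit by up to one record.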


Date: 2005-03-23 17:42
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Ok. Had another idea. Just get the file position at the start and
end of record writing. This seems to work. I'll code this up.
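
A minimal sketch of the position idea, assuming each record is written as its own gzip member and fully finished before the end position is read. The class and method names (`RecordPositionSketch`, `writeRecord`, `writeAll`) are hypothetical and this is not the ARCWriter code. It only shows that the difference between the two channel positions gives the compressed length.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

final class RecordPositionSketch {

    /** Writes one record as its own gzip member and returns the bytes it added to the file. */
    static long writeRecord(FileOutputStream fos, byte[] record) throws IOException {
        long start = fos.getChannel().position();
        // Shield the file stream so that closing the gzip stream does not close the file.
        OutputStream shield = new FilterOutputStream(fos) {
            @Override
            public void write(byte[] b, int off, int len) throws IOException {
                out.write(b, off, len);
            }

            @Override
            public void close() throws IOException {
                flush();
            }
        };
        // Closing the gzip stream finishes the deflater and writes the gzip trailer.
        try (GZIPOutputStream gz = new GZIPOutputStream(shield)) {
            gz.write(record);
        }
        return fos.getChannel().position() - start;
    }

    /** Writes all records to a temp file; returns {sum of measured lengths, file length}. */
    static long[] writeAll(byte[][] records) {
        try {
            File f = File.createTempFile("records", ".gz");
            f.deleteOnExit();
            long measured = 0;
            try (FileOutputStream fos = new FileOutputStream(f)) {
                for (byte[] r : records) {
                    measured += writeRecord(fos, r);
                }
            }
            return new long[] { measured, f.length() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[][] records = {
            "first record".getBytes(StandardCharsets.UTF_8),
            new byte[10000]
        };
        long[] result = writeAll(records);
        System.out.println("measured=" + result[0] + " fileLength=" + result[1]);
    }
}
```

The measurement is only reliable because nothing is left in a buffer or in the deflater when the end position is read. Here the file stream is unbuffered and the gzip stream is closed first.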


Date: 2005-03-22 20:44
Sender: nobody

Logged In: NO

To do this, would need to be able to get back from writers
the compressed length written.

Our writers, compressed and uncompressed, implement
OutputStream. The OutputStream#write methods do not return
the (compressed) length written.

I could go behind the OutputStream interface and try to get
the position of the underlying file, but the stream is
buffered and goes through a deflater, so the reported
position will lag the actual position and the lengths will
be off. (The same holds if I ask the deflater where we
currently are.)
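
The lag is easy to reproduce with plain java.util.zip classes. The sketch below assumes a GZIPOutputStream over a BufferedOutputStream over a FileOutputStream, which is not necessarily how the ARC writers are layered, and the class name `BufferedLagSketch` is hypothetical. It reads the file position right after a small write and again after finishing and flushing.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

final class BufferedLagSketch {

    /** Returns {file position right after write(), file position after finish and flush}. */
    static long[] positions(byte[] record) {
        try {
            File f = File.createTempFile("lag", ".gz");
            f.deleteOnExit();
            try (FileOutputStream fos = new FileOutputStream(f);
                 GZIPOutputStream gz = new GZIPOutputStream(new BufferedOutputStream(fos))) {
                gz.write(record);
                // The gzip header and any deflated bytes may still sit in the
                // deflater or the buffer, so the file has not grown yet.
                long before = fos.getChannel().position();
                gz.finish(); // push remaining deflater output and the gzip trailer
                gz.flush();  // empty the BufferedOutputStream into the file
                long after = fos.getChannel().position();
                return new long[] { before, after };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] p = positions(new byte[100]);
        System.out.println("before flush=" + p[0] + " after flush=" + p[1]);
    }
}
```

For a small record the first reading stays behind the second, so a length computed from an unflushed stream would undercount.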

Putting this issue aside till triage meeting. Looks like a
bit of work -- a day or more -- with lots of changes to
ARCWriting.


Changes ( 3 )

Field        Old Value  Date              By
status_id    Open       2005-03-23 20:03  stack-sf
close_date   -          2005-03-23 20:03  stack-sf
assigned_to  nobody     2004-12-03 22:54  gojomo