Heritrix: Internet Archive Web Crawler

Tracker: Bugs

inactiveQueuesMemoryLoadTarget mechanism behaves poorly - ID: 1002332
Last Update: Comment added ( karl-ia )

Currently, when operating in site-first/hold-queues
mode, there is a target maximum for the total number of
CrawlURIs that all INACTIVE queues, together, should
hold in memory. If the INACTIVE queues together hold
more than this target, a separate hard per-queue quota
is decremented, and that quota is enforced whenever a
queue exceeding it is encountered in normal operation.
If the INACTIVE queues together hold less than this
target, the per-queue quota is incremented.

The idea was that the per-queue quota would rise and
drop as necessary to keep the actual number of
in-memory CrawlURIs held by INACTIVE queues tending
towards the target.
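
For illustration, the adjustment rule described above can be sketched
roughly as follows. This is a minimal sketch and not the Heritrix
implementation: the class name InactiveQuotaSketch, its fields, and
the enforce() method are all hypothetical, and it assumes the quota
moves by one step each time a queue is touched.

```java
/** Hypothetical sketch of the adaptive per-queue quota; names are not from Heritrix. */
final class InactiveQuotaSketch {
    final int memoryLoadTarget; // target total of in-memory CrawlURIs over all INACTIVE queues
    int perQueueQuota;          // hard per-queue cap that is nudged up or down
    int totalInMemory;          // current in-memory total over all INACTIVE queues

    InactiveQuotaSketch(int memoryLoadTarget, int initialQuota) {
        this.memoryLoadTarget = memoryLoadTarget;
        this.perQueueQuota = initialQuota;
    }

    /** Move the per-queue quota one step towards whatever keeps the total near the target. */
    void adjustQuota() {
        if (totalInMemory > memoryLoadTarget) {
            perQueueQuota--;
        } else if (totalInMemory < memoryLoadTarget) {
            perQueueQuota++;
        }
    }

    /**
     * Called when an INACTIVE queue holding queueInMemory URIs is encountered.
     * Returns how many of its URIs would be flushed to disk to respect the quota.
     * A negative quota is treated as "keep nothing in memory".
     */
    int enforce(int queueInMemory) {
        adjustQuota();
        int allowed = Math.max(0, perQueueQuota);
        int excess = Math.max(0, queueInMemory - allowed);
        totalInMemory -= excess;
        return excess;
    }
}
```

Note that the quota only takes effect for the queue currently being
touched, so the in-memory total can lag well behind the quota.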

However, when the number of INACTIVE queues exceeds the
target number of in-memory CrawlURIs (as happens easily
in broad crawls), even a single in-memory CrawlURI for
every queue puts the total far over target. The
per-queue quota thus tends to decrement into negative
values, and every CrawlURI destined for an inactive
queue is immediately flushed to disk, defeating the
batching efficiency this mechanism was intended to
create.

(Further, it may never in practice effectively batch
again.)
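
The drift into negative values can be reproduced with a toy
simulation. This is only a sketch under simplifying assumptions that
are not taken from the Heritrix source: URIs arrive round-robin, one
per inactive queue; the quota is adjusted on every arrival; and the
quota is enforced only on the queue that just received a URI. The
class QuotaDriftDemo and its simulate() method are hypothetical.

```java
/** Hypothetical toy model of the failure mode; not Heritrix code. */
final class QuotaDriftDemo {
    /** Returns {lowest quota seen, number of URIs written straight to disk}. */
    static int[] simulate(int queueCount, int target, int initialQuota, int enqueues) {
        int[] inMemory = new int[queueCount]; // in-memory URIs per inactive queue
        int total = 0;                        // in-memory URIs over all inactive queues
        int quota = initialQuota;
        int minQuota = quota;
        int immediateFlushes = 0;
        for (int i = 0; i < enqueues; i++) {
            int q = i % queueCount;           // assumed round-robin arrival
            inMemory[q]++;
            total++;
            // Adjust the per-queue quota towards the global target.
            if (total > target) {
                quota--;
            } else if (total < target) {
                quota++;
            }
            minQuota = Math.min(minQuota, quota);
            // Enforce the quota only on the queue just touched.
            int allowed = Math.max(0, quota);
            if (inMemory[q] > allowed) {
                if (allowed == 0) {
                    immediateFlushes++;       // nothing may stay in memory
                }
                total -= inMemory[q] - allowed;
                inMemory[q] = allowed;
            }
        }
        return new int[] {minQuota, immediateFlushes};
    }
}
```

Calling, for example, QuotaDriftDemo.simulate(1000, 100, 5, 1000)
models 1000 inactive queues against a target of 100. In this toy
model the quota passes below zero within the first pass over the
queues, after which each arriving URI goes straight to disk. How the
real frontier behaves afterwards depends on details this sketch does
not capture.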

This process needs a redesign; the current
implementation is probably not offering any benefit for
its complexity. A plain always-write-through policy
might be just as good and would be much simpler.


Submitted: Gordon Mohr ( gojomo ) - 2004-08-02 23:28
Priority: 6
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: None
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:15
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-218 -- please add further
comments at that location.


Date: 2004-10-20 21:47
Sender: gojomo (Project Admin)

This problematic mechanism has been removed from
HostQueuesFrontier in HEAD/1.1+. To the extent that the old
mechanism was working at all -- which I don't think it was
-- this change may have slightly increased this frontier's
memory footprint, a problem that is being dealt with on
other issues.


Changes ( 7 )

Field          Old Value  Date              By
resolution_id  None       2004-10-20 21:47  gojomo
assigned_to    nobody     2004-10-20 21:47  gojomo
close_date     -          2004-10-20 21:47  gojomo
status_id      Open       2004-10-20 21:47  gojomo
priority       5          2004-10-20 21:47  gojomo
priority       6          2004-10-20 21:45  gojomo
priority       5          2004-09-01 21:57  gojomo