
Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 [contrib] WorkQueueFrontier: Store allQueues in RAM if poss. - ID: 1207898
Last Update: Comment added ( karl-ia )

Hi,
here is another contribution, this time for WorkQueueFrontiers
(e.g. BdbFrontier).

Currently, the "allQueues" Map is always backed by a
Bdb database (using a CachedBdbBigMap).

In fact, this is not necessary in all cases, especially
when the BucketQueueAssignmentPolicy is in use (it has
a fixed number of queues).
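
To illustrate why a bucketing policy bounds the number of queues, here is a minimal sketch. It is not the actual BucketQueueAssignmentPolicy code: the class name BucketKeySketch, the bucket count of 1024, and the use of the host name's hash code are all assumptions made for illustration.

```java
public class BucketKeySketch {
    /** Hypothetical bucket count; not taken from Heritrix. */
    public static final int BUCKETS = 1024;

    /**
     * Maps a host name to one of BUCKETS queue keys.
     * floorMod keeps the result in [0, BUCKETS) even for negative hash codes.
     */
    public static String queueKeyFor(String host) {
        int bucket = Math.floorMod(host.hashCode(), BUCKETS);
        return Integer.toString(bucket);
    }

    /** Upper bound on the number of distinct keys this sketch can produce. */
    public static int maximumNumberOfKeys() {
        return BUCKETS;
    }
}
```

However many hosts are crawled, such a policy can never yield more than BUCKETS distinct keys, so the map of all queues stays small.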

Instead, allQueues can be stored in a HashMap if the
assigned QueueAssignmentPolicy instance has a maximum
number of queues lower than a certain threshold (is 3000
OK?) and if the frontier implementation stores its
WorkQueue payload on hard disk.

This patch provides hooks to retrieve this information
(QueueAssignmentPolicy.maximumNumberOfKeys() and
WorkQueueFrontier.workQueueDataOnDisk()), so it should
be straightforward to apply this to non-standard
implementations, too.
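
The decision described above could be sketched roughly as follows. This is not the patch itself: the class AllQueuesChoice, the nested stand-in interface, the convention that -1 means "no known upper bound", and the supplier used in place of the Bdb-backed map are assumptions for illustration. Only the two hook names and the candidate threshold of 3000 come from this request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class AllQueuesChoice {
    /** Candidate threshold; the request asks whether 3000 is acceptable. */
    public static final int MAX_QUEUES_FOR_HASHMAP = 3000;

    /** Stand-in for the policy hook named in the request. */
    public interface QueueAssignmentPolicy {
        /** Upper bound on distinct queue keys; -1 if unbounded (assumed convention). */
        int maximumNumberOfKeys();
    }

    /**
     * Returns true when a plain in-memory map is acceptable: the policy has a
     * known bound below the threshold and the queue payload lives on disk.
     */
    public static boolean useInMemoryMap(QueueAssignmentPolicy policy,
                                         boolean workQueueDataOnDisk) {
        int max = policy.maximumNumberOfKeys();
        return workQueueDataOnDisk && max >= 0 && max < MAX_QUEUES_FOR_HASHMAP;
    }

    /**
     * Picks the map implementation for allQueues. The diskBackedFactory is a
     * hypothetical placeholder for whatever creates the Bdb-backed map.
     */
    public static <V> Map<String, V> createAllQueues(
            QueueAssignmentPolicy policy,
            boolean workQueueDataOnDisk,
            Supplier<Map<String, V>> diskBackedFactory) {
        if (useInMemoryMap(policy, workQueueDataOnDisk)) {
            return new HashMap<>();
        }
        return diskBackedFactory.get();
    }
}
```

A policy without a fixed bound would report -1 in this sketch and so keep the disk-backed map, which matches the current behaviour.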

Christian


Submitted: Christian Kohlschütter ( ck-heritrix ) - 2005-05-24 15:32
Priority: 5
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 01:41
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-933 -- please add further
comments at that location.


Date: 2005-05-25 22:06
Sender: stack-sf (Project Admin)


Committed. Message below. Closing.

(Thanks Christian).

[debord 648] heritrix > more /tmp/diff.txt
Apply patch '[ 1207898 ] [contrib] WorkQueueFrontier: Store allQueues in RAM if poss.'.
Contributed by Christian Kohlschuetter. Tested and reviewed by St.Ack

Here is Christian's comment on the changes below:

"Currently, the "allQueues" Map is always backed by a
Bdb database (using a CachedBdbBigMap).

In fact, this is not necessary in all cases, especially
when the BucketQueueAssignmentPolicy is in use (it has
a fixed number of queues).

Instead, allQueues can be stored in a HashMap, if the
assigned QueueAssignmentPolicy instance has a maximum
number of queues lower than a certain number (is 3000
ok?) and if the frontier implementation stores its
WorkQueue payload on harddisk.

This patch provides hooks to retrieve this information
(QueueAssignmentPolicy.maximumNumberOfKeys() and
WorkQueueFrontier.workQueueDataOnDisk()), so it should
be straightforward to apply this to non-standard
implementations, too."

* src/java/org/archive/crawler/frontier/BdbFrontier.java
* src/java/org/archive/crawler/frontier/BucketQueueAssignmentPolicy.java
(workQueueDataOnDisk): Added.
* src/java/org/archive/crawler/frontier/QueueAssignmentPolicy.java
(maximumNumberOfKeys): Added default implementation.
* src/java/org/archive/crawler/frontier/WorkQueueFrontier.java
Added a test, which overrides can influence, that determines whether
or not to use BigMap for allQueues.
(workQueueDataOnDisk): Added abstract method that overrides must implement.



Attached File ( 1 )

Filename Description Download
AllQueuesHashMap.patch allQueues/BucketQueueAssignmentPolicy patch Download

Changes ( 5 )

Field Old Value Date By
assigned_to nobody 2005-11-23 23:36 gojomo
artifact_group_id None 2005-09-23 21:08 gojomo
close_date - 2005-05-25 22:06 stack-sf
status_id Open 2005-05-25 22:06 stack-sf
File Added 135753: AllQueuesHashMap.patch 2005-05-24 15:32 ck-heritrix