Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

6 [contrib] Generalize/Refactor BDB Frontier - ID: 1176934
Last Update: Comment added ( karl-ia )

Hi,

I have just refactored the BdbFrontier class (and its
companion, BdbWorkQueue) into a more general, abstract
"WorkQueueFrontier" (and "WorkQueue", respectively).
BdbFrontier is now a subclass of WorkQueueFrontier and
contains only Bdb-specific code, whereas all management
code now resides in the abstract classes. The separation
will probably help in creating new queue implementations
(besides Sleepycat BDB) and in integrating other frontier
concepts like the AdaptiveRevisitFrontier into a common
frontier base.
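
For readers without the patch at hand, the shape of the split can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class names SketchWorkQueue, SketchWorkQueueFrontier and InMemoryFrontier, and the methods schedule/next/createQueue, are made up for this example, and an in-memory deque stands in for the BDB-backed storage.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Storage-agnostic queue; subclasses decide where the items live. (Hypothetical name.) */
abstract class SketchWorkQueue {
    final String key;

    SketchWorkQueue(String key) {
        this.key = key;
    }

    abstract void insert(String uri);

    /** Returns the oldest item, or null when the queue is empty. */
    abstract String poll();
}

/** Management code lives here and never mentions a storage backend. (Hypothetical name.) */
abstract class SketchWorkQueueFrontier {
    private final Map<String, SketchWorkQueue> queues = new LinkedHashMap<>();

    /** Backend hook: the only thing a concrete frontier has to provide in this sketch. */
    protected abstract SketchWorkQueue createQueue(String key);

    void schedule(String queueKey, String uri) {
        queues.computeIfAbsent(queueKey, this::createQueue).insert(uri);
    }

    /** Returns the next URI from the first non-empty queue, or null if all are empty. */
    String next() {
        for (SketchWorkQueue queue : queues.values()) {
            String uri = queue.poll();
            if (uri != null) {
                return uri;
            }
        }
        return null;
    }
}

/** Plays the role BdbFrontier has after the refactoring: backend-specific code only. */
class InMemoryFrontier extends SketchWorkQueueFrontier {
    @Override
    protected SketchWorkQueue createQueue(String key) {
        return new SketchWorkQueue(key) {
            private final Deque<String> items = new ArrayDeque<>();

            @Override
            void insert(String uri) {
                items.addLast(uri);
            }

            @Override
            String poll() {
                return items.pollFirst();
            }
        };
    }
}

class FrontierSketchDemo {
    public static void main(String[] args) {
        SketchWorkQueueFrontier frontier = new InMemoryFrontier();
        frontier.schedule("example.org", "http://example.org/a");
        frontier.schedule("example.net", "http://example.net/b");
        System.out.println(frontier.next());
        System.out.println(frontier.next());
    }
}
```

Adding another backend in this sketch means writing one more subclass with its own createQueue(); the scheduling logic in the abstract class stays untouched.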

Please find the patch file attached, feel free to
use/integrate it.

Best regards,
Christian


Christian Kohlschütter ( ck-heritrix ) - 2005-04-05 10:07

6

Closed

None

Michael Stack

API

1.6.0

Public


Comments ( 6 )

Date: 2007-03-14 01:40
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-914 -- please add further
comments at that location.


Date: 2005-05-07 01:17
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed. Closing.

Thanks for the patch Christian.

Here is commit:

Fix for '[ 1176934 ] [contrib] Generalize/Refactor BDB Frontier'
and for '[ 1122692 ] [contribution] New fixed number of queues policy'.
Patches contributed by Christian Kohlschuetter. Tested and reviewed
by St.Ack (Small and broad crawl run. Passes selftest.).
Here is comment from Christian on refactoring of frontier:

"I have just refactored the BdbFrontier class (and its
companion, BdbWorkQueue) into a more general, abstract
"WorkQueueFrontier" (and "WorkQueue", respectively).
BdbFrontier is now a subclass of WorkQueueFrontier and
contains only Bdb-specific code, whereas all management
code now resides in the abstract classes. The separation
will probably help in creating new queue implementations
(besides Sleepycat BDB) and in integrating other frontier
concepts like the AdaptiveRevisitFrontier into a common
frontier base."

Here is comment on his bucket queue assignment policy:

> Currently, I am performing broad crawls using BroadScope/BdbFrontier. However,
> due to the number of host- or IP-keyed queues, an OutOfMemoryError occurs
> very quickly after starting the crawl. One reason for this is the RAM-based
> bookkeeping of subqueues -- the more queues, the more heap.
>
> I have evaded this by writing a BucketQueueAssignmentPolicy class, which
> produces a _fixed_ number of subqueues ("buckets"), not one per host or per
> IP. The queue key is computed by hashing the hostname (or the IP, if
> available) modulo N (a fixed number, such as 1000).
>
> This way, I was able to increase the number of fetched pages from ca. 400,000
> to 1,000,000. For some other reason, I still get OOMEs, but I think that is
> caused by a different problem -- the number of queues did not grow over the
> specified limit.
>
> Furthermore, I have modified AbstractFrontier to be able to choose arbitrary
> queue assignment policies and replaced the current "ip-politeness" option by a
> selectbox.
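
The bucket idea described in the quote can be sketched in a few lines. This is an illustration only, not the BucketQueueAssignmentPolicy from the patch: the class name BucketKeySketch and the method bucketKey are hypothetical, and using String.hashCode() as the hash is an assumption, since the quote does not say which hash function is used.

```java
import java.util.HashSet;
import java.util.Set;

class BucketKeySketch {
    /**
     * Maps a hostname (or IP) to one of bucketCount queue keys.
     * Assumption: String.hashCode() serves as the hash; the original
     * policy may use a different one.
     */
    static String bucketKey(String hostOrIp, int bucketCount) {
        if (bucketCount <= 0) {
            throw new IllegalArgumentException("bucketCount must be positive");
        }
        // floorMod keeps the result in [0, bucketCount) even for negative hash codes.
        int bucket = Math.floorMod(hostOrIp.hashCode(), bucketCount);
        return Integer.toString(bucket);
    }

    public static void main(String[] args) {
        int n = 1000; // fixed number of buckets, as in the example from the quote
        Set<String> keys = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            keys.add(bucketKey("host" + i + ".example.org", n));
        }
        // However many hosts are seen, the number of distinct queue keys never exceeds n.
        System.out.println("distinct queue keys: " + keys.size());
    }
}
```

The trade-off is that unrelated hosts end up sharing a queue, so whatever per-queue handling the frontier applies acts on a whole bucket rather than on a single host or IP.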


* AbstractFrontier.java
Lots of whitespace added. Line lengths fixed. Javadoc.
Refer to new WorkQueues abstraction rather than to
BdbWorkQueues. Made queue assignments a list. List includes
Christian's Bucket policy.
* BdbFrontier.java
Refactoring to extend WorkQueueFrontier. Bulk of this
class moved back into new class WQF.
* BdbMultipleWorkQueues.java
(countEntries): Added.
* BdbWorkQueue.java
Subclass new WorkQueue class. New class WQ now has bulk
of this class.




Date: 2005-05-06 13:34
Sender: ck-heritrix

Logged In: YES
user_id=1220421

Hi Michael,

a new version of the patch is attached. The problem was that
the BdbFrontier itself was referenced by the
BdbWorkQueues (the frontier is not serializable in a
BdbMap). Sleepycat bdb "kindly" hides the original exception
message behind a RuntimeExceptionWrapper (does not use
initCause()/getCause() but getDetail()).

Now the frontier instance is again passed to the queue
instance for each call to peek/insert/delete. In addition, I
have added some javadoc comments and removed the
countEntries() method, which currently is not recommended
for general use (too slow; I am working on a better solution).
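
To illustrate the serialization problem and the fix, here is a minimal sketch that uses plain Java serialization as a stand-in for the BDB map binding (an assumption made for this example). The classes SketchFrontier, QueueHoldingFrontier and QueueWithoutFrontier are hypothetical and much simpler than the real ones.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the frontier; deliberately NOT Serializable. */
class SketchFrontier {
    final List<String> stored = new ArrayList<>();

    void store(String queueKey, String uri) {
        stored.add(queueKey + " " + uri);
    }
}

/** Problem case: the queue keeps a reference to the frontier as a field. */
class QueueHoldingFrontier implements Serializable {
    private static final long serialVersionUID = 1L;
    final String key;
    final SketchFrontier frontier; // dragged along whenever the queue is serialized

    QueueHoldingFrontier(String key, SketchFrontier frontier) {
        this.key = key;
        this.frontier = frontier;
    }
}

/** Fixed case: the frontier is handed in on every call instead of being stored. */
class QueueWithoutFrontier implements Serializable {
    private static final long serialVersionUID = 1L;
    final String key;
    long count;

    QueueWithoutFrontier(String key) {
        this.key = key;
    }

    void insert(SketchFrontier frontier, String uri) {
        frontier.store(key, uri);
        count++;
    }
}

class SerializationSketchDemo {
    /** Returns true if the object graph can be written with Java serialization. */
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SketchFrontier frontier = new SketchFrontier();
        QueueWithoutFrontier queue = new QueueWithoutFrontier("example.org");
        queue.insert(frontier, "http://example.org/");

        System.out.println(canSerialize(new QueueHoldingFrontier("example.org", frontier)));
        System.out.println(canSerialize(queue));
    }
}
```

Marking the field transient would also keep it out of the serialized form, but the reference would then have to be restored after every deserialization; passing the frontier per call sidesteps that.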

The patch also contains my BucketQueueAssignmentPolicy
addition. Feel free to remove it before committing.

Christian



Date: 2005-05-05 20:05
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Here's a start for a commit message when all's working:

Fix for '[ 1176934 ] [contrib] Generalize/Refactor BDB Frontier'
Patch contributed by Christian Kohlschütter. Tested by St.Ack
Here is comment from Christian:

"I have just refactored the BdbFrontier class (and its
companion, BdbWorkQueue) into a more general, abstract
"WorkQueueFrontier" (and "WorkQueue", respectively).
BdbFrontier is now a subclass of WorkQueueFrontier and
contains only Bdb-specific code, whereas all management
code now resides in the abstract classes. The separation
will probably help in creating new queue implementations
(besides Sleepycat BDB) and in integrating other frontier
concepts like the AdaptiveRevisitFrontier into a common
frontier base."

* AbstractFrontier.java
Refer to new WorkQueues abstraction rather than to
BdbWorkQueues.
* BdbFrontier.java
Refactoring to extend WorkQueueFrontier. Bulk of this
class moved back into WQF.
* BdbMultipleWorkQueues.java
(countEntries): Added.
* BdbWorkQueue.java
Subclass new WorkQueue class. WQ now has bulk of this
class.



Date: 2005-05-05 19:43
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Hey Christian:

I've been trying to apply your refactoring patch before we make
any radical changes in the src tree. The patch doesn't apply
properly. It seems to be an issue w/ the patch itself rather
than the code having changed out from under it.

I hacked around and came up w/ the attached patch. This
will apply against HEAD. It passes selftest but when I try
to do a simple crawl I'm having serialization problems (see
below). Were you seeing this kind of issue? Do you want
to try out the attached patch on your end?

Exception in thread "ToeThread #1" com.sleepycat.util.RuntimeExceptionWrapper: org.archive.crawler.frontier.BdbMultipleWorkQueues
    at com.sleepycat.bind.serial.SerialBinding.objectToEntry(SerialBinding.java:123)
    at com.sleepycat.collections.DataView.useValue(DataView.java:501)
    at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:626)
    at com.sleepycat.collections.DataCursor.put(DataCursor.java:560)
    at com.sleepycat.collections.StoredContainer.put(StoredContainer.java:281)
    at com.sleepycat.collections.StoredMap.put(StoredMap.java:230)
    at org.archive.util.CachedBdbMap.expungeStaleEntry(CachedBdbMap.java:452)
    at org.archive.util.CachedBdbMap.expungeStaleEntries(CachedBdbMap.java:425)
    at org.archive.util.CachedBdbMap.get(CachedBdbMap.java:330)
    at org.archive.crawler.frontier.WorkQueueFrontier.activateInactiveQueue(WorkQueueFrontier.java:571)
    at org.archive.crawler.frontier.WorkQueueFrontier.next(WorkQueueFrontier.java:513)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:135)


Date: 2005-04-06 23:06
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Christian:

Thank you for the patch. Looks sweet. We'll apply after
1.4.0 release (imminent).


Attached Files ( 3 )

Filename                      Description
frontier-refactored.patch     Refactored frontier
frontier-refactored.patch2    Patch that works against HEAD
frontier-refactored-v3.patch  Patched patch ;-)

Changes ( 9 )

Field              Old Value                             Date              By
artifact_group_id  None                                  2005-09-23 21:08  gojomo
status_id          Open                                  2005-05-07 01:17  stack-sf
close_date         -                                     2005-05-07 01:17  stack-sf
assigned_to        nobody                                2005-05-07 01:17  stack-sf
File Added         133260: frontier-refactored-v3.patch  2005-05-06 13:34  ck-heritrix
File Added         133171: frontier-refactored.patch2    2005-05-05 19:43  stack-sf
summary            Generalize/Refactor BDB Frontier      2005-04-06 23:06  stack-sf
priority           5                                     2005-04-06 23:06  stack-sf
File Added         128554: frontier-refactored.patch     2005-04-05 11:34  ck-heritrix