Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 DiskIncludedFrontier performance is awful - ID: 1000840
Last Update: Comment added ( karl-ia )

Running a broad, polite, site-first crawl with a
DiskIncludedFrontier and 150 toethreads on a red box
(labcrawl02):

(1) Starting the crawl took many minutes while the
disk hashtable was zeroed. During this time the job
appeared as neither pending nor in-progress, which was
confusing.

(2) Once the crawl started, performance never exceeded 10
URIs/second and was more often below 5 URIs/second.

DiskIncludedFrontier's current weaknesses should be
documented, and improved where possible, perhaps with a
completely different approach.
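
To illustrate point (1), here is a minimal sketch of why eagerly zeroing a fixed-size, disk-backed table makes startup time grow with table size. This is not Heritrix code: the class name `ZeroFillSketch`, the method `zeroFill`, the 8-byte slot width, and the slot count are all assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class ZeroFillSketch {
    // Hypothetical helper: writes `slots` empty 8-byte entries to `file`
    // and returns the number of bytes written. The work is proportional
    // to the table size, so a very large table delays startup.
    static long zeroFill(Path file, long slots) throws IOException {
        byte[] block = new byte[8 * 1024]; // Java arrays start out zeroed
        long remaining = slots * 8L;       // assumed slot width: 8 bytes
        long written = 0;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            while (remaining > 0) {
                int n = (int) Math.min(block.length, remaining);
                raf.write(block, 0, n);
                remaining -= n;
                written += n;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("included", ".tbl");
        try {
            long start = System.nanoTime();
            long bytes = zeroFill(file, 1_000_000L); // assumed slot count
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(bytes + " bytes zeroed in " + ms + " ms");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Raising the slot count in `main` shows the startup cost scaling with table size. An approach that allocates storage on demand avoids paying that cost before the first URI is crawled.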


Submitted: Gordon Mohr ( gojomo ) - 2004-07-30 17:58
Priority: 7
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: Frontier
Group: None
Visibility: Public


Comments ( 5 )

Date: 2007-03-14 00:14
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-207 -- please add further
comments at that location.


Date: 2004-10-27 16:11
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

DiskIncludedFrontier removed from HEAD. Commit comment:

Addressing [ 1000840 ] DiskIncludedFrontier performance is awful
* DiskIncludedFrontier.java
Eliminating in favor of BdbFrontier.


Date: 2004-10-21 18:20
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

I agree with the proposal (Gordon says the BDB
implementation performs better than the DiskIncludedFrontier).


Date: 2004-10-21 00:37
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

I agree with the proposal (Gordon says the BDB
implementation performs better than the DiskIncludedFrontier).


Date: 2004-10-20 21:43
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Propose: removing DiskIncludedFrontier from 1.2 in favor of
BdbFrontier.


Attached File

No Files Currently Attached

Changes ( 6 )

Field          Old Value  Date              By
status_id      Open       2004-10-27 16:11  gojomo
resolution_id  None       2004-10-27 16:11  gojomo
close_date     -          2004-10-27 16:11  gojomo
priority       6          2004-10-20 21:43  gojomo
assigned_to    nobody     2004-10-20 21:43  gojomo
priority       5          2004-09-01 21:57  gojomo