Heritrix: Internet Archive Web Crawler

Tracker: Bugs

frontier report unusable in big crawls; frontier info needed - ID: 1002335
Last Update: Comment added ( karl-ia )

In a crawl with tens of thousands of active (or
inactive) hosts, the frontier report is gigantic and
will often fail with an OOM, just when its basic info (like
the total number of queues in each state) is most
interesting.

A summary version of the frontier report that always
succeeds is needed, and some top-level info -- like the number
of hosts active/inactive -- should be promoted to the
console or progress-statistics.log.


Submitted: Gordon Mohr ( gojomo ) - 2004-08-02 23:35
Priority: 6
Status: Closed
Resolution: Fixed
Assigned: Nobody/Anonymous
Category: Frontier
Group: None
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 00:15
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-219 -- please add further
comments at that location.


Date: 2004-10-14 22:53
Sender: gojomo (Project Admin)

Also capped the number of queues displayed in the
HostQueuesFrontier and BdbFrontier reports, so that (at a loss
of potentially useful info) composing the frontier reports
uses a bounded, manageable amount of memory.
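
The capping idea described above can be sketched as follows. This is only a minimal illustration, not the actual HostQueuesFrontier or BdbFrontier code: the class name CappedReport, the method report, and the maxQueues parameter are all hypothetical, and queues are represented by plain name strings.

```java
import java.util.List;

// Hypothetical sketch; none of these names come from Heritrix itself.
final class CappedReport {
    // Appends at most maxQueues queue names, then one line noting how many
    // were left out, so the report buffer stays bounded in size.
    static String report(List<String> queueNames, int maxQueues) {
        StringBuilder sb = new StringBuilder();
        int shown = 0;
        for (String name : queueNames) {
            if (shown == maxQueues) {
                break; // cap reached: stop adding detail lines
            }
            sb.append(name).append('\n');
            shown++;
        }
        int omitted = queueNames.size() - shown;
        if (omitted > 0) {
            sb.append("... ").append(omitted).append(" more queues not shown\n");
        }
        return sb.toString();
    }
}
```

The trade-off is the one noted in the comment: queues beyond the cap are invisible in the report, but its memory use no longer grows with the size of the crawl.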


Date: 2004-08-05 22:40
Sender: gojomo (Project Admin)

Added oneLineReport to Frontier, visible in the UI's 'reports'
tab, which shows total queues and queues in the
ready/snoozed/inactive states, even without drilling into
the detailed report.

This should let us see how many queues are typical for OOMs,
and thus document how large a crawl we recommend doing
with 1.0.0.
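
A one-line summary of this kind can be produced without walking every queue if per-state counters are maintained as queues change state. The sketch below assumes that approach; it is not the actual Frontier implementation, and the class QueueSummary, its State enum, the enter/leave methods, and the output format are all hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch; only the idea of a one-line report comes from the comment above.
final class QueueSummary {
    enum State { READY, SNOOZED, INACTIVE }

    // One counter per state: memory use is constant however many queues exist.
    private final Map<State, Integer> counts = new EnumMap<>(State.class);

    // Record that a queue has entered the given state.
    void enter(State s) {
        counts.merge(s, 1, Integer::sum);
    }

    // Record that a queue has left the given state.
    void leave(State s) {
        counts.merge(s, -1, Integer::sum);
    }

    int count(State s) {
        return counts.getOrDefault(s, 0);
    }

    // Builds the summary from the counters alone, never from the queues themselves.
    String oneLineReport() {
        int total = 0;
        for (State s : State.values()) {
            total += count(s);
        }
        return total + " queues: " + count(State.READY) + " ready; "
                + count(State.SNOOZED) + " snoozed; "
                + count(State.INACTIVE) + " inactive";
    }
}
```

Because the summary depends only on three integers, it should succeed even when composing the detailed report would run out of memory.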



Changes ( 4 )

Field          Old Value  Date              By
status_id      Open       2004-10-14 22:53  gojomo
resolution_id  None       2004-10-14 22:53  gojomo
close_date     -          2004-10-14 22:53  gojomo
priority       5          2004-09-01 21:57  gojomo