Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 per-crawler 'load' summary numbers - ID: 1302208
Last Update: Comment added ( karl-ia )

To assist in balancing a multi-machine crawl, each crawler
should tally and report a running measure of its 'load',
based on its backlog of queues and its longest queues.
Examining these values (exact definitions TBD) should offer
guidance on whether more or less of the URI space should be
shifted to any one crawler.


Gordon Mohr ( gojomo ) - 2005-09-23 22:02

7

Closed

None

Karl Thiessen

None

1.6.0

Public


Comments ( 5 )

Date: 2007-03-14 01:44
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-971 -- please add further
comments at that location.


Date: 2005-11-12 01:02
Sender: gojomo (Project Admin)

A basic set of 3 new statistics has been implemented:
congestionRatio (the 'overhang' of the previous comment),
deepestUri (length of the longest eligible queue), and
averageDepth (length of the average queue). All three are
logged in progress-statistics and displayed in the main
console.
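
As a rough illustration of the two depth figures, here is a minimal Java sketch. It is not the committed Heritrix code: the class `QueueDepthSketch`, the `QueueInfo` holder, and both method names are hypothetical, and it assumes that retired queues are excluded from both figures (the commit notes only say they are left out of longest-queue tracking).

```java
import java.util.Arrays;
import java.util.List;

final class QueueDepthSketch {
    /** Hypothetical stand-in for a work queue: a URI count plus a 'retired' flag. */
    static final class QueueInfo {
        final long depth;
        final boolean retired;

        QueueInfo(long depth, boolean retired) {
            this.depth = depth;
            this.retired = retired;
        }
    }

    /** Depth of the longest non-retired queue, or 0 if there is none. */
    static long deepest(List<QueueInfo> queues) {
        long max = 0;
        for (QueueInfo q : queues) {
            if (!q.retired && q.depth > max) {
                max = q.depth;
            }
        }
        return max;
    }

    /** Mean depth across non-retired queues (an assumption), or 0 if there is none. */
    static double averageDepth(List<QueueInfo> queues) {
        long total = 0;
        int count = 0;
        for (QueueInfo q : queues) {
            if (!q.retired) {
                total += q.depth;
                count++;
            }
        }
        return count == 0 ? 0.0 : (double) total / count;
    }

    public static void main(String[] args) {
        List<QueueInfo> queues = Arrays.asList(
                new QueueInfo(120, false),
                new QueueInfo(30, false),
                new QueueInfo(9000, true)); // retired, so ignored
        System.out.println(deepest(queues));      // prints 120
        System.out.println(averageDepth(queues)); // prints 75.0
    }
}
```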

As the main console (and header) was getting congested, and
had a number of slight alignment/terminology/wordiness
issues, I've also done a major rearrangement of the
'console', and slight tweaks to the common page 'head',
while adding the 3 new stats to the readouts.

Commit comment #1 (basic stat tracking/logging):

Implementation of [ 1302208 ] per-crawler 'load' summary numbers
* StatisticsTracker.java
    accessors and fields for congestionRatio, deepestUri,
    averageDepth; add new stats to end of progress log-lines
* AbstractTracker.java
    updated progress log-line legend
* Frontier.java, StatisticsTracking.java
    accessors for new stats
* AdaptiveRevisitFrontier.java
    dummy (-1) values for new stats until proper
    implementation possible
* WorkQueue.java
    add 'retired' flag, so that retired queues can be left
    out of longest-queue tracking
* WorkQueueFrontier.java
    accessors which calc new stats; remember longest queue
    seen; maintain 'retired' flag on queues

Commit comment #2 (web UI support and rearrangement):

UI implementation (and more) for [ 1302208 ] per-crawler
'load' summary numbers
* ArchiveUtils.java
    new formatMilliseconds... option leaving off 'ms' amount
* index.jsp
    rearrangement of console: separate 'crawler' and 'job'
    status boxes; move controls to titlebar of appropriate box
    (or inside box for checkpoint/edit-frontier); add new stats;
    group old stats; improve progress bar detail and color.
* head.jsp
    harmonize terms with new console; minimize text; tweak
    alignments
* heritrix.css
    CSS styles supporting above JSP changes

Assigning to Karl for review/any-desired-verifications/closing.



Date: 2005-11-09 21:40
Sender: stack-sf (Project Admin)

After talking with Danny this morning: we should make it
possible to ask a 'container' what its 'loading' is, where a
'container' is host to usually one, but possibly more than
one, instance of Heritrix, and the container 'loading' would
be the sum of the 'loadings' of its Heritrix instances.
(Danny would like this figure so that he can avoid adding
new Heritrix instances/jobs to already loaded containers.)
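
The container idea can be sketched in a few lines of Java. Everything named here is an assumption made for illustration: the `LoadReporting` interface, `containerLoad`, and `leastLoaded` are hypothetical and do not correspond to any existing Heritrix or JMX API, and the per-instance 'loading' is treated as a single number whose definition is still open.

```java
import java.util.Arrays;
import java.util.List;

final class ContainerLoadSketch {
    /** Hypothetical view of one Heritrix instance that can report its 'loading'. */
    interface LoadReporting {
        double load();
    }

    /** Container 'loading' as the plain sum of its instances' loadings. */
    static double containerLoad(List<LoadReporting> instances) {
        double sum = 0.0;
        for (LoadReporting instance : instances) {
            sum += instance.load();
        }
        return sum;
    }

    /** Index of the container with the smallest summed loading, or -1 if none given. */
    static int leastLoaded(List<List<LoadReporting>> containers) {
        int best = -1;
        double bestLoad = Double.POSITIVE_INFINITY;
        for (int i = 0; i < containers.size(); i++) {
            double load = containerLoad(containers.get(i));
            if (load < bestLoad) {
                bestLoad = load;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<LoadReporting> busy = Arrays.<LoadReporting>asList(() -> 7.5, () -> 0.5);
        List<LoadReporting> idle = Arrays.<LoadReporting>asList(() -> 0.4);
        System.out.println(containerLoad(busy));                   // prints 8.0
        System.out.println(leastLoaded(Arrays.asList(busy, idle))); // prints 1
    }
}
```

A new instance or job would then go to the container returned by `leastLoaded`, which is one simple way to avoid piling work onto an already loaded host.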


Date: 2005-11-09 20:52
Sender: gojomo (Project Admin)

I can think of 3 key stats that can help in this role:

(1) longest 'queue' length -- essentially the longest string
of URIs whose fetching must be serialized and
politeness-throttled

(2) average (or perhaps median) 'queue' length

(3) 'overhang' (?) -- defined as follows: Look at how many
in-process+snoozed queues the actual number of ToeThreads
is handling. Extrapolate how many additional ToeThreads
would be necessary for *all* queues to be in-process or
snoozed (assuming additional ToeThreads would be 'as
productive' as the current set, with no diminishing
returns). In such a state, there would be no 'ready' or
'inactive' queues. 'overhang' is the 'total hypothetical
threads needed' divided by the 'actual number of threads'.
A roughly similar analysis would be applied when not all
threads are needed: 'overhang' would be 'actual threads
being active' divided by 'all threads available'.

'overhang' would be <1 when the crawler has excess capacity,
and >1 when it is fully engaged. Further, in a very rough
way, values >1 could be considered an estimate of how many
additional equally-provisioned crawlers would be necessary
to be making maximal possible polite progress. That is, an
'overhang' of 7.5 would mean 6.5 more crawlers would have to
split up the current workload to have no backlog of
ready-but-uncontacted hosts/queues.

(There's probably a better word for this than 'overhang'.)
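
The 'overhang' definition above can be written out as a small Java sketch. This is only one reading of the description, not the eventual implementation: the class `OverhangSketch`, its method names, and its parameters are hypothetical, and it assumes that a crawler with any 'ready' or 'inactive' queues is already using all of its threads.

```java
final class OverhangSketch {
    /**
     * busyQueues    = in-process + snoozed queues
     * waitingQueues = ready + inactive queues
     * activeThreads = threads currently doing work
     * totalThreads  = all threads available to the crawler
     */
    static double overhang(int busyQueues, int waitingQueues,
                           int activeThreads, int totalThreads) {
        if (totalThreads <= 0 || activeThreads < 0 || activeThreads > totalThreads
                || busyQueues < 0 || waitingQueues < 0) {
            throw new IllegalArgumentException("inconsistent queue/thread counts");
        }
        if (waitingQueues == 0) {
            // No backlog: the fraction of threads actually in use.
            return (double) activeThreads / totalThreads;
        }
        if (busyQueues == 0) {
            // Nothing is being handled, so there is no rate to extrapolate from.
            return Double.NaN;
        }
        // Backlog: extrapolate from how many queues each current thread sustains.
        double queuesPerThread = (double) busyQueues / totalThreads;
        double threadsNeeded = (busyQueues + waitingQueues) / queuesPerThread;
        return threadsNeeded / totalThreads;
    }

    /** Rough count of extra equally-provisioned crawlers implied by an overhang. */
    static double additionalCrawlers(double overhang) {
        return Math.max(0.0, overhang - 1.0);
    }

    public static void main(String[] args) {
        // 50 threads sustain 200 queues while 1300 more wait.
        double backlog = overhang(200, 1300, 50, 50);
        System.out.println(backlog);                     // prints 7.5
        System.out.println(additionalCrawlers(backlog)); // prints 6.5
        // 20 of 50 threads busy and nothing waiting: excess capacity.
        System.out.println(overhang(20, 0, 20, 50));     // prints 0.4
    }
}
```

The first call reproduces the 7.5 example from the comment: 6.5 more crawlers would be needed to leave no ready-but-uncontacted queues.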


Date: 2005-11-02 20:33
Sender: gojomo (Project Admin)

Should these be added to progress-stats.log and made JMX-queryable?


Changes ( 8 )

Field              Old Value                             Date              By
summary            per-crawler \'load\' summary numbers  2006-09-11 22:32  karl-ia
summary            per-crawler 'load' summary numbers    2006-09-11 22:31  karl-ia
status_id          Open                                  2005-12-02 17:29  stack-sf
close_date         -                                     2005-12-02 17:29  stack-sf
assigned_to        gojomo                                2005-11-12 01:02  gojomo
assigned_to        nobody                                2005-11-02 19:31  gojomo
priority           5                                     2005-09-23 22:06  gojomo
artifact_group_id  None                                  2005-09-23 22:03  gojomo