Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 upping total budget doesn't update/unretire queues - ID: 1440656
Last Update: Comment added ( karl-ia )

Igor reports and I've verified that increasing
queue-total-budget does not have the desired effect of
unretiring queues and increasing their effective budgets.

The queues are momentarily unretired, but retain their
original budget, and thus are immediately retired again.

The problem appears to be that changes to the
total-budget setting are only noticed in
WorkQueueFrontier.noteAboutToEmit(). Meanwhile, the
temporary unretiring of queues (from
WorkQueueFrontier.kickUpdate) only moves them to
inactive. When their chance at activation comes up,
they are still over the stale budget, so no CrawlURIs
are emitted and the old total-budget is never updated
from settings.
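
The failure mode described above can be sketched with a toy model. This is not Heritrix code: ToyQueue, BuggyToyFrontier, tryActivate and the fields below are hypothetical stand-ins, and only the name noteAboutToEmit is borrowed from the report. The one assumption carried over is that the cached budget is refreshed only when a URI is about to be emitted.

```java
// Toy model of the reported behavior. NOT Heritrix code: every class,
// field, and method body here is a hypothetical stand-in.
class ToyQueue {
    long totalBudget;          // cached working copy of the setting
    long totalExpenditure = 0; // cost charged to this queue so far

    ToyQueue(long totalBudget) {
        this.totalBudget = totalBudget;
    }

    boolean isOverBudget() {
        return totalExpenditure >= totalBudget;
    }
}

class BuggyToyFrontier {
    long totalBudgetSetting; // stands in for the operator-editable setting

    BuggyToyFrontier(long totalBudgetSetting) {
        this.totalBudgetSetting = totalBudgetSetting;
    }

    // The only place the cached budget is refreshed, as the report
    // describes for noteAboutToEmit().
    void noteAboutToEmit(ToyQueue q, long cost) {
        q.totalBudget = totalBudgetSetting;
        q.totalExpenditure += cost;
    }

    // One activation attempt. Returns true if a URI was emitted, false if
    // the queue was retired again before emitting anything.
    boolean tryActivate(ToyQueue q, long cost) {
        if (q.isOverBudget()) {
            return false; // nothing emitted, so the cached budget stays stale
        }
        noteAboutToEmit(q, cost);
        return true;
    }
}
```

In this sketch, once a queue has spent its budget, raising totalBudgetSetting has no effect: tryActivate returns false before noteAboutToEmit can copy the new value into the queue.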

The reason the settings are not consulted every time
the total budget is needed for a check is that usually,
during the queue-juggling steps where budget is
important, there is no current CrawlURI to serve as the
settings context. In fact, in replenishSessionBalance,
this is worked around by peeking for a temporary
CrawlURI to use as the context.

(This could cause other confusing behavior in the
future -- if CrawlURIs subject to different overrides
share the same queue (for example, under IP politeness
or when subdomains are forced into one queue), the
settings in effect for the queue would be somewhat
arbitrary, depending on which CrawlURI was in position
when the queue was determining its working values.)

I thought this had worked previously, but looking at
the relevant code back a few revisions before 1.6
doesn't show a recent change that would break it, nor
do I recall a change, and I don't see anything in the
current code suggesting it was disrupted from a
previously working condition.

A potential fix would be to reset the total budget at
the same time as the session balances are replenished
-- this would give any queue coming into active status
a chance to operate under its new budget parameters. It
would still leave the potentially confusing behavior
that a total-budget change would not affect an active
queue until after it goes inactive (by going over its
session or total budget) and then active again... but
that seems a minor concern.
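
A minimal sketch of this proposed fix, again as a toy model rather than the actual WorkQueueFrontier code. BudgetQueue, FixedToyFrontier, activate and emit are hypothetical names; only replenishSessionBalance is borrowed from the report. The assumption is that the session balance is replenished each time a queue comes up for activation.

```java
// Toy model of the proposed fix. NOT Heritrix code: all names except
// replenishSessionBalance are hypothetical.
class BudgetQueue {
    long totalBudget;      // cached working copy of the total-budget setting
    long totalExpenditure; // cost charged over the queue's lifetime
    long sessionBalance;   // cost remaining in the current active session

    boolean isOverBudget() {
        return totalExpenditure >= totalBudget;
    }
}

class FixedToyFrontier {
    long totalBudgetSetting;    // operator-editable total budget
    long sessionBalanceSetting; // operator-editable per-session balance

    // Proposed fix: refresh the total budget at the same time as the
    // session balance is replenished.
    void replenishSessionBalance(BudgetQueue q) {
        q.sessionBalance = sessionBalanceSetting;
        q.totalBudget = totalBudgetSetting;
    }

    // Called when a queue comes up for activation. Returns false if the
    // queue is retired instead of activated.
    boolean activate(BudgetQueue q) {
        replenishSessionBalance(q);
        return !q.isOverBudget();
    }

    // Charges one URI. Returns true while the queue may stay active.
    boolean emit(BudgetQueue q, long cost) {
        q.totalExpenditure += cost;
        q.sessionBalance -= cost;
        return q.sessionBalance > 0 && !q.isOverBudget();
    }
}
```

The caveat from the paragraph above shows up here too: a change to totalBudgetSetting while a queue is active is not copied into the queue until the next call to activate.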


Submitted: Gordon Mohr ( gojomo ) - 2006-02-28 23:55
Priority: 7
Status: Closed
Resolution: None
Assigned: Karl Thiessen
Category: Frontier
Group: 1.8.0
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 01:04
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-541 -- please add further
comments at that location.


Date: 2006-05-15 23:25
Sender: gojomo (Project Admin)

Please create a regression test if you have time; otherwise
simply verify and close as fixed.


Date: 2006-03-02 01:20
Sender: gojomo (Project Admin)

I've committed the potential fix listed above. Total-budget
changes do not take effect immediately, but require either
(1) the session balance to be depleted; or (2) the old
total-budget to be depleted -- then the queue is
inactivated, and when it comes up again for consideration it
gets its new total-budget number.

This also works for un-retired queues, as they also will
have their total-budget updated when they come back up for
consideration.

Commit comment:
Fix for [ 1440656 ] upping total budget doesn't
update/unretire queues
* WorkQueueFrontier.java
move update of totalBudget from noteAboutToEmit to
replenishSessionBalance; in some cases this will cause a
change in totalBudget to be noticed later, but ensures that
it's possible to unretire queues.

Assigning to Karl for verification. (Should be easy to
auto-test via trivial crawl with tiny total-budget,
non-zero-cost policy, and pause-at-finish: let crawl reach
pause, ensure queue is retired, increase budget, unpause
crawl, ensure queue makes additional progress.)
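
The verification steps above can be walked through against a toy stand-in for a crawl. This is only a sketch of the scenario's shape, not a Heritrix test: ToyCrawl, runUntilPause and the unit cost of 1 per URI are all assumptions made for illustration, and a real regression test would drive an actual crawl job instead.

```java
// Toy stand-in for the verification scenario. NOT Heritrix code: ToyCrawl
// and all of its members are hypothetical.
class ToyCrawl {
    static final long UNIT_COST = 1; // assumed non-zero cost per URI

    long totalBudgetSetting;   // operator-editable setting
    private long cachedBudget; // the single queue's working budget
    private long spent;        // cost charged so far
    private int pendingUris;   // URIs still waiting in the queue

    ToyCrawl(long totalBudget, int uriCount) {
        this.totalBudgetSetting = totalBudget;
        this.cachedBudget = totalBudget;
        this.pendingUris = uriCount;
    }

    boolean isRetired() {
        return spent >= cachedBudget;
    }

    // Runs until the queue is retired or empty (the "pause" point) and
    // returns the number of URIs fetched during this run.
    int runUntilPause() {
        cachedBudget = totalBudgetSetting; // refreshed when the queue comes up again
        int fetched = 0;
        while (pendingUris > 0 && spent < cachedBudget) {
            pendingUris--;
            spent += UNIT_COST;
            fetched++;
        }
        return fetched;
    }
}
```

Following the steps in order: start with a tiny budget, run to the pause, confirm isRetired(), raise totalBudgetSetting, run again, and check that the second run fetched more URIs.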


Changes ( 5 )

Field              Old Value                                           Date              By
status_id          Open                                                2006-09-11 22:03  karl-ia
summary            upping total budget doesn't update/unretire queues  2006-09-11 22:03  karl-ia
close_date         -                                                   2006-09-11 22:03  karl-ia
artifact_group_id  None                                                2006-03-02 01:20  gojomo
assigned_to        gojomo                                              2006-03-02 01:20  gojomo