
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

Failed URIs should be 'free' (no cost against queue budget) - ID: 1209046
Last Update: Comment added ( karl-ia )

Once a large crawl has been under way for a while, I
inevitably need to drop in some new filters to
eliminate some junk. The URIs they catch all wind up with a -500X status.

If I'm using budgeted hold queues, this means that
all the -500Xs are counted against the queue,
slowing its progress. A -500X takes very little time to
process, after all, so the queue is quickly made
inactive. Other queues, however, still take their usual
amount of time. Basically this temporarily starves (or at
least limits) the queue. Additionally, if a queue has a
max total budget, do we really want these to count
against it? (Probably no negative-value URIs should.)

-9998 should maybe also be free? In fact, I'm wondering
whether all negative codes should be free, with only ACTUAL
crawling counted against the queue's budget.
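
To make the proposal concrete, here is a minimal sketch of the two charging rules side by side. It is not Heritrix code: the class name, method names, unit cost and session balance are all hypothetical, and the status values are just examples of the negative codes mentioned above.

```java
import java.util.List;

/** Hypothetical illustration of the proposal; not Heritrix's queue implementation. */
class BudgetPolicySketch {

    /** Cost charged for one finished URI. Under the proposed rule, negative statuses cost nothing. */
    static int costOf(int fetchStatus, int unitCost, boolean negativeCodesAreFree) {
        if (negativeCodesAreFree && fetchStatus < 0) {
            return 0; // nothing was actually crawled, so nothing is charged
        }
        return unitCost;
    }

    /** Counts how many URIs a queue gets through before its (assumed) session balance runs out. */
    static int processedBeforeDeactivation(List<Integer> statuses, int sessionBalance,
                                           int unitCost, boolean negativeCodesAreFree) {
        int balance = sessionBalance;
        int processed = 0;
        for (int status : statuses) {
            if (balance <= 0) {
                break; // budget spent: the queue would be made inactive here
            }
            balance -= costOf(status, unitCost, negativeCodesAreFree);
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        // Three filtered URIs at the head of the queue, then real fetches mixed with a -9998.
        List<Integer> statuses = List.of(-5001, -5001, -5001, 200, -9998, 200, 200);
        System.out.println(processedBeforeDeactivation(statuses, 3, 1, false));
        System.out.println(processedBeforeDeactivation(statuses, 3, 1, true));
    }
}
```

With every URI charged, the three filtered URIs use up the whole balance before a single real fetch happens; with negative codes free, the same balance covers all three real fetches.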


Submitted By: Kristinn Sigurdsson ( kristinn_sig ) - 2005-05-26 10:33
Priority: 7
Status: Closed
Resolution: Fixed
Assigned To: Gordon Mohr
Category: Frontier
Group: 1.6.0
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:53
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-421 -- please add further
comments at that location.


Date: 2005-07-22 02:50
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

All 'disregarded' URIs (robots-precluded,
blocked-by-processor, out-of-scope, blocked-by-user,
too-many-hops, deleted-by-user) now refund their cost back
to their origin queue, making them essentially 'free'.
Commit comment:

Fix for [ 1209046 ] Failed URIs should be 'free'
Work towards [ 1056429 ] More compact, processable Frontier, Threads reports
Improvement for [ 1219259 ] broad crawls slow; most threads stuck retrying missing sites
* WorkQueue.java
    add tracking of errors charged against queue, last dequeue time (for reporting)
    enable refund of cost for disregarded URIs
* WorkQueueFrontier.java
    refund queue balance for disregarded (scope-recheck-fail and robots, etc.) URIs
    charge queues penalties for error URIs: gets 'dead' queues to inactive/retirement thresholds faster
    add 'nonempty' compact report of all nonempty queues; is dumped from UI and at end of crawl; should allow offline analysis of queue anomalies
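
The accounting described in this commit comment can be sketched roughly as follows. This is an illustration only, not the contents of WorkQueue.java or WorkQueueFrontier.java: the class, the Outcome enum, the method names and the fixed error penalty are hypothetical, and charging the penalty on top of the normal cost is an assumption.

```java
/** Hypothetical sketch of refund-and-penalty accounting; not the actual Heritrix classes. */
class RefundingQueueSketch {

    /** Simplified outcome of processing one URI (assumed three-way split). */
    enum Outcome { FETCHED, DISREGARDED, ERROR }

    private int balance;
    private int errorsCharged;      // tracked for reporting
    private final int errorPenalty; // assumed fixed extra charge per error URI

    RefundingQueueSketch(int initialBalance, int errorPenalty) {
        this.balance = initialBalance;
        this.errorPenalty = errorPenalty;
    }

    /** Charges the URI's cost when it is dequeued for processing. */
    void charge(int cost) {
        balance -= cost;
    }

    /** Settles the account once the outcome of the URI is known. */
    void finish(Outcome outcome, int cost) {
        switch (outcome) {
            case DISREGARDED:
                balance += cost;         // refund: the URI ends up 'free'
                break;
            case ERROR:
                balance -= errorPenalty; // extra charge pushes 'dead' queues toward retirement
                errorsCharged++;
                break;
            default:
                break;                   // a real fetch keeps its normal charge
        }
    }

    int balance() { return balance; }

    int errorsCharged() { return errorsCharged; }

    boolean exhausted() { return balance <= 0; }
}
```

In this sketch a disregarded URI leaves the balance where it started, while an error URI costs more than a normal fetch, so a queue full of failing URIs reaches its inactive or retirement threshold sooner.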





Changes ( 7 )

Field              Old Value                     Date              By
artifact_group_id  None                          2005-09-23 18:02  gojomo
summary            Failed URIs should be 'free'  2005-08-09 17:35  gojomo
close_date         -                             2005-07-22 02:50  gojomo
resolution_id      None                          2005-07-22 02:50  gojomo
status_id          Open                          2005-07-22 02:50  gojomo
priority           5                             2005-06-22 18:59  gojomo
assigned_to        nobody                        2005-06-22 18:59  gojomo