Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

9 'Economic' frontier which defers low-value URIs - ID: 1078016
Last Update: Comment added ( karl-ia )

For large scale crawls that exceed our ability to prune
traps, we'd like a frontier that tends to put less
valuable URLs at the end, and does not go too deep
(into low value URLs) on one site before getting higher
value URLs on another.

This suggests an 'economic' frontier that assesses the
value of URLs, and deactivates active queues whose
topmost content is low-value in favor of inactive
queues whose topmost content is more valuable. Or,
similarly, track expenditure on a particular
queue/host, and rotate to other queues after certain
levels of expenditure have been reached.

It is expected this would be an enhancement to the
BdbFrontier.

This implies:
- an adjustable way to assign a value/cost to URLs,
using only the static info available (path, via)
- a way to activate/deactivate queues, much like
site-first, but in response to
topmost-value/expenditure data
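As a rough illustration of the first point, a cost could be derived purely from the URI string and its 'via' (the URI it was discovered from). The sketch below is not Heritrix code: the class name, the depth threshold of 3, and the surcharges for query strings are all made-up assumptions, chosen only to show that static info is enough to rank URIs.

```java
import java.net.URI;

/** Hypothetical cost heuristic based only on static info (path, via). */
final class StaticCostSketch {

    /** Returns a cost of at least 1; deeper and more "dynamic" URIs cost more. */
    static int costOf(String uri, String via) {
        URI u = URI.create(uri);
        int cost = 1; // assumption: every URI costs at least one unit

        // Count non-empty path segments; charge for each one beyond depth 3.
        String path = u.getPath() == null ? "" : u.getPath();
        int segments = 0;
        for (String s : path.split("/")) {
            if (!s.isEmpty()) {
                segments++;
            }
        }
        cost += Math.max(0, segments - 3);

        // Assumption: query strings often indicate generated (trap-prone) pages.
        if (u.getQuery() != null) {
            cost += 2;
        }

        // Assumption: a URI found via another query-string page is more suspect.
        if (via != null && URI.create(via).getQuery() != null) {
            cost += 1;
        }
        return cost;
    }

    public static void main(String[] args) {
        System.out.println(costOf("http://example.com/", null));
        System.out.println(costOf("http://example.com/a/b/c/d/e?x=1",
                "http://example.com/cal?month=1"));
    }
}
```

Any real policy would need to be adjustable by the operator; the fixed constants here only stand in for such settings.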

Related features could involve:
- a UI to bump-up or knock-down the budget for a
specific queue. (corner cases would leave a queue
always-active or always-inactive)
- an overall crawl pause once a certain amount of
effort has been expended, awaiting operator approval to
continue


Submitted: Gordon Mohr ( gojomo ) - 2004-12-02 23:48

Priority: 9

Status: Closed

Resolution: None

Assigned: Gordon Mohr

Category: Configuration

Group: None

Visibility: Public


Comments ( 5 )

Date: 2007-03-14 01:36
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-862 -- please add further
comments at that location.


Date: 2004-12-15 18:32
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Core implementation completed. Commit comment:

Implementation (essentially complete) for [ 1078016 ]
'Economic' frontier which defers low-value URIs
* BdbFrontier.java
Add tracking of total/lifetime budget and expenditures
per queue;
Add concept of 'retired' queue, which still has items
but is over total budget, and so permanently inactive
(unless the operator intervenes)
Clean up terminology to better distinguish 'budget'
(limit/threshold) and 'balance' (running total of units left
in current activation)
Improve per-queue reporting to show retired queues,
budgeting figures
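To make the terminology concrete, here is a minimal sketch of per-queue bookkeeping that separates 'budget' (a limit), 'balance' (units left in the current activation), and the 'retired' state. It is not the BdbFrontier code; the class, field, and method names are hypothetical, and treating a queue as retired once expenditure reaches the total budget is an assumption about where the threshold sits.

```java
/** Hypothetical per-queue accounting; names are illustrative only. */
final class QueueBudgetSketch {
    private long totalBudget;          // lifetime limit for this queue
    private long totalExpenditure = 0; // lifetime units spent so far
    private long sessionBalance = 0;   // units left in the current activation

    QueueBudgetSketch(long totalBudget) {
        this.totalBudget = totalBudget;
    }

    /** Called when the queue becomes active: refill the per-activation balance. */
    void activate(long sessionBudget) {
        sessionBalance = sessionBudget;
    }

    /** Called for each URI attempt: spend from both balance and lifetime total. */
    void charge(long cost) {
        sessionBalance -= cost;
        totalExpenditure += cost;
    }

    /** True when the current activation has used up its balance. */
    boolean shouldDeactivate() {
        return sessionBalance <= 0;
    }

    /** True when lifetime spending has reached the total budget (assumed threshold). */
    boolean isRetired() {
        return totalExpenditure >= totalBudget;
    }

    /** Stand-in for operator intervention: raising the limit un-retires the queue. */
    void raiseTotalBudget(long extra) {
        totalBudget += extra;
    }
}
```

With a total budget of 5 and a per-activation budget of 3, such a queue would deactivate after 3 unit-cost attempts, then retire partway through its second activation unless the total budget is raised.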




Date: 2004-12-15 18:28
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Implemented a zero-cost policy as a way to eliminate the budgeting
on/off toggle -- instead, budgeting can be disabled by making all
URIs zero-cost. Commit comment:

Support for [ 1078016 ] 'Economic' frontier which defers
low-value URIs
* ZeroCostAssignmentPolicy.java
No-op costing policy; allows elimination of
'use-budgetting' setting, instead just set URIs to have zero
cost, and budget mechanisms never have effect.
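The reason a zero-cost policy can replace an on/off setting is simple arithmetic: if every charge is 0, the balance never falls and deactivation never triggers. A minimal sketch, assuming a hypothetical single-method costing interface (this is not the actual Heritrix policy API):

```java
/** Hypothetical costing interface; not the real Heritrix type. */
interface CostPolicySketch {
    int costOf(String uri);
}

final class ZeroCostDemo {
    /** Every URI is free, so budgets are never consumed. */
    static final CostPolicySketch ZERO = uri -> 0;

    /** Every URI costs one unit. */
    static final CostPolicySketch UNIT = uri -> 1;

    /** Returns the balance left after charging for the given number of attempts. */
    static int balanceAfter(CostPolicySketch policy, int budget, int attempts) {
        int balance = budget;
        for (int i = 0; i < attempts; i++) {
            balance -= policy.costOf("http://example.com/" + i);
        }
        return balance;
    }

    public static void main(String[] args) {
        System.out.println(balanceAfter(ZERO, 3000, 10000)); // balance untouched
        System.out.println(balanceAfter(UNIT, 3000, 3000));  // balance used up
    }
}
```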


Date: 2004-12-09 00:21
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Commit:

Work (in progress) towards [ 1078016 ] 'Economic' frontier
which defers low-value URIs
* BdbFrontier.java
Make budget-driven deactivation (and thus rotation
between queues) configurable: whether it is in effect at
all, and how large the budget should be on each
reactivation. Current cost of all URIs is '1'.

More notes:
* Using hold-queues=false and use-budgetted-rotation=false
retains classic behavior, where all queues ever created are
round-robined
* Using hold-queues=true and use-budgetted-rotation=false
mimics 'site-first' behavior: a queue remains active until
it is exhausted of URIs. Inactive queues only become active
when the crawler has nothing on active queues ready to
crawl. A site in endless trap-junk that's active will stay
active, starving other sites of attention.
* Using hold-queues=true and use-budgetted-rotation=true
will cause a site that has been active for a while (until
its 'budget' is used up) to be deactivated and placed at the
back of the inactive queues list. This increases the chances
the crawler will need to activate queues from the front of
the inactive queues list. The intended effect is to
concentrate on a site to a degree, but then even if it's not
finished, let other sites have a chance to catch up.

The current 'budget' per activation is 3000, and each
attempt at processing a URI (even attempts that fail or
trigger prerequisite tries) costs 1 budget unit. So queues
rotate out of the active positions about every 3000 URIs.
Configurable URI costing is coming up next.
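The difference between the hold-queues=true modes can be seen in a toy simulation. This is a heavily simplified sketch, not the frontier's logic: it assumes a single crawler thread (so one active queue at a time), unit cost per URI, and hypothetical names throughout.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy single-threaded model of queue rotation; illustrative only. */
final class RotationSketch {

    /**
     * Returns the queue name served at each step. With budgetedRotation=false
     * a queue stays active until empty (site-first); with true it is sent to
     * the back of the inactive list once its per-activation budget is spent.
     */
    static List<String> crawlOrder(Map<String, Integer> queueSizes,
                                   boolean budgetedRotation, int budget) {
        Deque<String> inactive = new ArrayDeque<>(queueSizes.keySet());
        Map<String, Integer> remaining = new HashMap<>(queueSizes);
        List<String> order = new ArrayList<>();

        while (!inactive.isEmpty()) {
            String active = inactive.pollFirst(); // activate from the front
            int balance = budget;
            while (remaining.get(active) > 0 && (!budgetedRotation || balance > 0)) {
                order.add(active);
                remaining.merge(active, -1, Integer::sum);
                balance -= 1; // each attempt costs 1 unit
            }
            if (remaining.get(active) > 0) {
                inactive.addLast(active); // over budget but not exhausted
            }
        }
        return order;
    }
}
```

For queues A (5 URIs) and B (2 URIs) with a budget of 2, the budgeted mode serves A, A, B, B, A, A, A, while the site-first mode serves all of A before B ever gets a turn.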


Date: 2004-12-08 00:06
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

First step committed:

Work towards [ 1078016 ] 'Economic' frontier which defers
low-value URIs
* BdbFrontier.java
Add 'hold-queues' option (site-first) to keep queues
'inactive' until they are needed to keep crawler threads
fully busy


Changes ( 5 )

Field        Old Value  Date              By
close_date   -          2004-12-15 18:32  gojomo
status_id    Open       2004-12-15 18:32  gojomo
assigned_to  nobody     2004-12-03 22:52  gojomo
priority     7          2004-12-03 22:50  gojomo
priority     5          2004-12-03 00:07  gojomo