For large-scale crawls that exceed our ability to prune
traps, we'd like a frontier that tends to put less
valuable URLs at the end, and does not go too deep
(into low-value URLs) on one site before getting
higher-value URLs on another.
This suggests an 'economic' frontier that assesses the
value of URLs, and deactivates active queues whose
topmost content is low-value in favor of inactive
queues whose topmost content is more valuable.
Alternatively, it could track expenditure on a
particular queue/host, and rotate to other queues once
certain levels of expenditure have been reached.
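
The expenditure-tracking variant could be sketched roughly as follows. This is only an illustration in Python, not Heritrix or BdbFrontier code: `HostQueue`, `RotatingFrontier`, and `session_budget` are hypothetical names, and the per-URL costs are assumed to be supplied by some separate cost-assignment step.

```python
from collections import deque


class HostQueue:
    """Hypothetical per-host queue that tracks spending per activation."""

    def __init__(self, host, urls_with_costs, session_budget):
        if session_budget <= 0:
            raise ValueError("session_budget must be positive")
        self.host = host
        self.pending = deque(urls_with_costs)  # (url, cost) pairs, in crawl order
        self.session_budget = session_budget   # spend allowed before rotating out
        self.spent_this_session = 0


class RotatingFrontier:
    """Hypothetical frontier: serve the front queue until its budget is spent."""

    def __init__(self, queues):
        self.ring = deque(queues)  # simplification: a single ring of queues

    def next_url(self):
        while self.ring:
            queue = self.ring[0]
            if not queue.pending:
                self.ring.popleft()  # exhausted queue is retired
                continue
            if queue.spent_this_session >= queue.session_budget:
                # Budget used up: reset it and send the queue to the back.
                queue.spent_this_session = 0
                self.ring.rotate(-1)
                continue
            url, cost = queue.pending.popleft()
            queue.spent_this_session += cost
            return url
        return None  # nothing left to crawl


frontier = RotatingFrontier([
    HostQueue("a.example", [("a/1", 1), ("a/2", 1), ("a/3", 1)], session_budget=2),
    HostQueue("b.example", [("b/1", 1)], session_budget=2),
])
order = []
while (url := frontier.next_url()) is not None:
    order.append(url)
print(order)
```

Because low-value URLs would carry higher costs, a queue stuck in a trap uses up its budget after fewer fetches and yields to other hosts sooner.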
It is expected this would be an enhancement to the
BdbFrontier.
This implies:
- an adjustable way to assign a value/cost to URLs,
using only the static info available (path, via)
- a way to activate/deactivate queues, much like
site-first, but in response to
topmost-value/expenditure data
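
As one possible shape for the adjustable value/cost assignment, here is a minimal Python sketch that scores a URL from its path and its via only. The function name `url_cost`, the specific signals (path depth, query string, repeated path segments, a via with a query string), and all default weights are assumptions chosen for illustration, not anything specified in this request.

```python
from urllib.parse import urlsplit


def url_cost(url, via=None, base=1, depth_free=3, per_extra_segment=1,
             query_penalty=2, repeat_penalty=4, via_query_penalty=1):
    """Hypothetical static cost: higher cost means a less valuable URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]

    cost = base
    # Charge for each path segment beyond a free depth allowance.
    cost += max(0, len(segments) - depth_free) * per_extra_segment
    # Query strings often indicate generated (possibly endless) content.
    if parts.query:
        cost += query_penalty
    # Repeated segments (e.g. /cal/cal/cal) are a common trap signature.
    if len(segments) != len(set(segments)):
        cost += repeat_penalty
    # Assumed signal: a URL discovered via a query-string page is more suspect.
    if via is not None and urlsplit(via).query:
        cost += via_query_penalty
    return cost


print(url_cost("http://example.com/"))
print(url_cost("http://example.com/a/b/c/d/e?x=1"))
print(url_cost("http://example.com/cal/cal/cal/cal"))
```

Keeping every weight a keyword argument is one way to make the assignment adjustable per crawl.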
Related features could involve:
- a UI to bump-up or knock-down the budget for a
specific queue. (corner cases would leave a queue
always-active or always-inactive)
- an overall crawl pause once a certain amount of
effort has been expended, awaiting operator approval to
continue
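
Both related features could be sketched in Python along these lines. The names `adjust_budget` and `EffortGate` are hypothetical, and treating a budget of 0 as always-inactive and an infinite budget as always-active is an assumption about how the corner cases might be represented.

```python
def adjust_budget(budgets, queue_key, delta):
    """Bump up (delta > 0) or knock down (delta < 0) one queue's budget.

    Assumed corner cases: a budget of 0 leaves the queue always inactive,
    and float("inf") leaves it always active.
    """
    new_budget = max(0, budgets.get(queue_key, 0) + delta)
    budgets[queue_key] = new_budget
    return new_budget


class EffortGate:
    """Hypothetical crawl-wide gate that pauses after a set amount of effort."""

    def __init__(self, approval_interval):
        if approval_interval <= 0:
            raise ValueError("approval_interval must be positive")
        self.approval_interval = approval_interval
        self.total_spent = 0
        self.approved_up_to = approval_interval

    @property
    def paused(self):
        return self.total_spent >= self.approved_up_to

    def charge(self, cost):
        # Refuse further spending until the operator approves.
        if self.paused:
            raise RuntimeError("crawl paused, awaiting operator approval")
        self.total_spent += cost

    def approve(self):
        # Operator grants another interval of effort from the current total.
        self.approved_up_to = self.total_spent + self.approval_interval


budgets = {"example.com": 10}
adjust_budget(budgets, "example.com", -25)   # clamped at 0: always inactive
adjust_budget(budgets, "example.org", 5)     # unseen queue starts from 0

gate = EffortGate(approval_interval=10)
gate.charge(6)
gate.charge(6)        # total is now 12, past the approved 10
print(gate.paused)
gate.approve()        # operator lets the crawl continue up to 22
print(gate.paused)
```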
Gordon Mohr
Configuration
None
Public
Date: 2007-03-14 01:36
Date: 2004-12-15 18:32
Date: 2004-12-15 18:28
Date: 2004-12-09 00:21
Date: 2004-12-08 00:06
| Field | Old Value | Date | By |
|---|---|---|---|
| close_date | - | 2004-12-15 18:32 | gojomo |
| status_id | Open | 2004-12-15 18:32 | gojomo |
| assigned_to | nobody | 2004-12-03 22:52 | gojomo |
| priority | 7 | 2004-12-03 22:50 | gojomo |
| priority | 5 | 2004-12-03 00:07 | gojomo |