Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

9 IP-based politeness - ID: 903845
Last Update: Comment added ( karl-ia )

Should have an option to enforce politeness by IP
address, rather than hostname, for situations where a
large number of hostnames all resolve to same IP. (For
example, *.senate.gov.)


Gordon Mohr ( gojomo ) - 2004-02-25 01:15

9

Closed

None

Gordon Mohr

None

None

Public


Comments ( 5 )

Date: 2007-03-14 01:26
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-726 -- please add further
comments at that location.


Date: 2004-10-29 21:58
Sender: gojomo (Project Admin)

Completed with checkins of the 27th. Commit comment:

Implementation of [ 903845 ] IP-based politeness
* AbstractFrontier.java
  Make IP-based queue assignment (and thus politeness)
  toggleable
* BdbFrontier.java
  Move requeueing line of code
* HostQueuesFrontier.java
  Make queueing policy toggleable; move CrawlURIs to new
  queue if their assignment changes while queued (as when IP
  becomes known)

==
On either HostQueuesFrontier or BdbFrontier, CrawlURIs are
now checked when coming off a queue, to see if that's still
the queue they would currently be assigned to. If not, the
CrawlURI is moved to its new queue.

An expert setting now allows IP-based queueing to be turned
on. If off, the classic hostname-based strategy
(HostnameQueueAssignmentPolicy) is used to assign URIs to
queues, and a URI's assigned queue will never change while
it is stored. If IP-based assignment is on, then a URI for
an unknown host will use the classic assignment approach,
but as soon as the IP is known, URIs that come off
hostname-based queues will be requeued to IP-based queues.
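
The check-on-dequeue behaviour described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the Heritrix code: `UriSketch`, `RequeueSketch`, `resolvedIps`, `queueKeyFor` and `next` are hypothetical names, and the real frontiers store their queues differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a CrawlURI: just a hostname plus the URI string. */
class UriSketch {
    final String host;
    final String uri;

    UriSketch(String host, String uri) {
        this.host = host;
        this.uri = uri;
    }
}

/** Hypothetical frontier that re-checks queue assignment when a URI comes off a queue. */
class RequeueSketch {
    final boolean ipBased;                                     // the "expert setting"
    final Map<String, String> resolvedIps = new HashMap<>();   // host -> IP, once known
    final Map<String, Deque<UriSketch>> queues = new HashMap<>();

    RequeueSketch(boolean ipBased) {
        this.ipBased = ipBased;
    }

    /** Queue key: the IP if IP-based queueing is on and the IP is known, else the hostname. */
    String queueKeyFor(UriSketch u) {
        String ip = ipBased ? resolvedIps.get(u.host) : null;
        return ip != null ? ip : u.host;
    }

    void schedule(UriSketch u) {
        queues.computeIfAbsent(queueKeyFor(u), k -> new ArrayDeque<>()).addLast(u);
    }

    /**
     * Takes the head of the named queue. Returns it if it still belongs there;
     * otherwise moves it to the queue it would now be assigned to and returns null.
     */
    UriSketch next(String queueKey) {
        Deque<UriSketch> q = queues.get(queueKey);
        if (q == null || q.isEmpty()) {
            return null;
        }
        UriSketch u = q.pollFirst();
        if (!queueKeyFor(u).equals(queueKey)) {
            schedule(u);   // assignment changed while queued, e.g. the IP became known
            return null;
        }
        return u;
    }
}
```

With `ipBased` set to false, `queueKeyFor` always returns the hostname, so a URI never changes queues while stored.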

This will have some effect on the ordering of URI visits, at
least in the HostQueuesFrontier. Those URIs that are
scheduled before the IP is known can wind up behind others
that were later scheduled directly to the IP queue.
(BdbFrontier's method of sorting URIs by serial number
prevents this.)
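
To illustrate why ordering by serial number avoids the problem, here is a small sketch, assuming each URI is stamped with an increasing serial number when first scheduled and that each queue is kept sorted by that number. The class names and the in-memory `PriorityQueue` are assumptions for illustration only; BdbFrontier's actual storage is not shown here.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical queue entry: a URI plus the serial number it received when first scheduled. */
class SerialEntrySketch {
    final long serial;
    final String uri;

    SerialEntrySketch(long serial, String uri) {
        this.serial = serial;
        this.uri = uri;
    }
}

/** Hypothetical per-IP queue that always yields the lowest serial number first. */
class SerialOrderSketch {
    private long nextSerial = 0;
    final PriorityQueue<SerialEntrySketch> ipQueue =
            new PriorityQueue<>(Comparator.comparingLong((SerialEntrySketch e) -> e.serial));

    /** Stamps a URI at first scheduling; the stamp survives later requeueing. */
    SerialEntrySketch stamp(String uri) {
        return new SerialEntrySketch(nextSerial++, uri);
    }

    void enqueue(SerialEntrySketch e) {
        ipQueue.add(e);
    }
}
```

A URI stamped early but moved into the IP queue late still comes out ahead of URIs stamped after it, whereas a plain FIFO queue would place it behind them.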

Another option is available short of IP-based politeness: a
per-domain/per-host override can be used to force URIs into
a particular named queue, regardless of the assignment
policy in effect. This could be used manually on domains
known to all be from the same small set of IPs (e.g. blogspot,
dailykos, etc.) to simulate IP-based politeness, or could be
used if you wanted to enforce politeness over a whole
domain, even though the subdomains are split across many
IPs. This is the 'force-queue' setting.
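
As a rough sketch of how such an override could take precedence over the assignment policy: the lookup below walks from the full hostname up through its parent domains and uses the first forced queue name it finds. `ForceQueueSketch` and `forcedQueueByDomain` are hypothetical names, and the suffix-walking lookup is an assumption about how per-domain overrides apply, not a description of Heritrix's settings code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-domain override table consulted before the normal assignment policy. */
class ForceQueueSketch {
    // domain or host -> forced queue name (illustrative stand-in for the 'force-queue' setting)
    final Map<String, String> forcedQueueByDomain = new HashMap<>();

    /** Returns the forced queue for the host or any parent domain, else the hostname itself. */
    String queueKeyFor(String host) {
        String candidate = host;
        while (true) {
            String forced = forcedQueueByDomain.get(candidate);
            if (forced != null) {
                return forced;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                break;
            }
            candidate = candidate.substring(dot + 1);   // strip the leftmost label
        }
        return host;   // no override: fall back to hostname-based assignment
    }
}
```

For example, mapping `blogspot.com` to a queue named `blogspot` would place every `*.blogspot.com` host in that single queue.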



Date: 2004-10-06 02:30
Sender: gojomo (Project Admin)

UPDATE/ partial work done. Commit comment:
==========================================
Work towards [ 903845 ] IP-based politeness
* CrawlURI.java
  Move classKey (target queue name) calculation to Frontier
* AbstractFrontier.java
  Receive classKey calculation from CrawlURI; delegate to
  a QueueAssignmentPolicy (not yet configurable).
  Add 'force-queue' option, so that overrides can force
  all subdomains into a single queue.
* BdbFrontier.java
  When CrawlURI comes off a queue different than where it
  would currently be assigned, move it to its new queue --
  support for changing queues after IPs are looked up.
* HostQueuesFrontier.java
  Receive classKey calculation from CrawlURI; delegate to
  a QueueAssignmentPolicy (not yet configurable).
* QueueAssignmentPolicy.java,
  HostnameQueueAssignmentPolicy.java, IPQueueAssignmentPolicy.java
  Swappable policies for choosing which queue a CrawlURI
  is assigned to. HostnameQueueAssignmentPolicy matches
  Heritrix 1.0 hostname-based technique.
  IPQueueAssignmentPolicy uses IPs when available.
======================================
Still need to:
- test IP-based queueing, make a choosable option
- further update HostQueuesFrontier to support requeuing
- fix an issue where dns: URIs don't see same overrides as
http: URIs
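
The swappable-policy arrangement in the commit comment above might look something like the following. This is only a sketch of the pattern: the interface and class names carry a `Sketch` suffix to mark them as hypothetical, the method signature is invented for illustration, and the `host:port` key format is an assumption taken from the hostname:port queues mentioned elsewhere in this thread.

```java
import java.util.Optional;

/** Hypothetical policy interface: maps a URI's host details to a queue name (classKey). */
interface AssignmentPolicySketch {
    String classKeyFor(String host, int port, Optional<String> knownIp);
}

/** Hostname-based assignment; ignores any known IP. */
class HostnamePolicySketch implements AssignmentPolicySketch {
    @Override
    public String classKeyFor(String host, int port, Optional<String> knownIp) {
        return host + ":" + port;   // assumed key format
    }
}

/** IP-based assignment; uses the IP when available, else defers to the hostname policy. */
class IpPolicySketch implements AssignmentPolicySketch {
    private final AssignmentPolicySketch fallback = new HostnamePolicySketch();

    @Override
    public String classKeyFor(String host, int port, Optional<String> knownIp) {
        return knownIp.orElseGet(() -> fallback.classKeyFor(host, port, knownIp));
    }
}
```

A frontier holding a single `AssignmentPolicySketch` reference could then switch strategies without changing its queueing code.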



Date: 2004-09-10 18:23
Sender: stack-sf (Project Admin)

smugmug.com is a good example of why we'd want IP-based
politeness (from Dan).


Date: 2004-08-09 21:10
Sender: gojomo (Project Admin)

One option would be to recast the URIWorkQueues
(specifically KeyedQueues) to be IP-based rather than
hostname:port based. However, sometimes items need to be
queued before their IP is known -- the initial seed for the
site, the initial DNS lookup, etc. And if hostname was used
initially, then IP when the IP is known, items would have to
move to the right IP-based queue -- perhaps from multiple
provisional hostname-based queues -- when the IP becomes known
(or changes).

I think a better option would thus be to have a separate
pool of created-on-demand per-IP locks. A URIWorkQueue would
only be able to provide a URI for crawling if the
corresponding IP lock (if IP is known) is available. If not,
the queue is somehow prevented from providing the URI --
perhaps it never enters READY state, instead sitting in some sort
of IP_SNOOZED state, or is READY but also held off to the side.
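
A pool of created-on-demand per-IP locks, as proposed above, could be sketched like this. `IpLockPoolSketch` and its methods are hypothetical, and using a one-permit `Semaphore` per IP is one possible choice rather than anything the proposal specifies.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical pool of per-IP locks, created the first time each IP is seen. */
class IpLockPoolSketch {
    private final Map<String, Semaphore> locks = new ConcurrentHashMap<>();

    /**
     * Returns true if a queue may hand out a URI for this IP right now.
     * A null IP means "not yet known", in which case no IP lock applies.
     */
    boolean tryAcquire(String ip) {
        if (ip == null) {
            return true;
        }
        return locks.computeIfAbsent(ip, k -> new Semaphore(1)).tryAcquire();
    }

    /** Releases the lock for an IP; call only after a successful tryAcquire for that IP. */
    void release(String ip) {
        if (ip == null) {
            return;
        }
        Semaphore s = locks.get(ip);
        if (s != null) {
            s.release();
        }
    }
}
```

A queue whose `tryAcquire` call fails would stay out of the READY set (the IP_SNOOZED idea) until the holder releases the lock.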

The interactions with existing per-host settings may be
difficult to define intuitively: what if sub1.domain.com has
a 5-second politeness pause, while sub2.domain.com has a
20-second politeness pause, even though they are the same IP?
The easiest thing to do in the code might be to let the
last-processed URI determine what happens, since it will be
the one at hand when making the decision about where and how
long to 'snooze' a queue/IP so it is not revisited.
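
The "last-processed URI decides" rule can be illustrated with a short sketch. All names here (`IpSnoozeSketch`, `delayMsByHost`, `finished`, `isReady`) are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-IP snooze table driven by the politeness delay of the last-processed host. */
class IpSnoozeSketch {
    final Map<String, Long> delayMsByHost = new HashMap<>();   // per-host politeness settings
    final Map<String, Long> wakeTimeByIp = new HashMap<>();    // IP -> earliest next visit
    final long defaultDelayMs;

    IpSnoozeSketch(long defaultDelayMs) {
        this.defaultDelayMs = defaultDelayMs;
    }

    /** Called when a URI finishes: the host at hand decides how long the shared IP sleeps. */
    void finished(String host, String ip, long nowMs) {
        long delay = delayMsByHost.getOrDefault(host, defaultDelayMs);
        wakeTimeByIp.put(ip, nowMs + delay);   // overwrites whatever an earlier host set
    }

    boolean isReady(String ip, long nowMs) {
        return nowMs >= wakeTimeByIp.getOrDefault(ip, 0L);
    }
}
```

With sub1.domain.com at 5 seconds and sub2.domain.com at 20 seconds on the same IP, finishing a sub1 URI after a sub2 URI shortens the IP's snooze to 5 seconds from that moment, which shows the trade-off of this simple rule.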



Changes ( 5 )

Field        Old Value  Date              By
close_date   -          2004-10-29 21:58  gojomo
status_id    Open       2004-10-29 21:58  gojomo
assigned_to  nobody     2004-09-01 23:20  gojomo
priority     6          2004-09-01 21:49  gojomo
priority     5          2004-08-09 20:29  gojomo