Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

9 Multimachine Crawl Splitter Processor - ID: 1261506
Last Update: Comment added ( karl-ia )

It's likely that all multimachine logic can be done
within the confines of a single processor, a 'Crawl
Splitter' (or 'Host Splitter' as it's called in
Mercator) processor. There would likely be two 'Crawl
Splitter' instances in the processing chain: one
looking at URLs just before they are passed to the
Frontier, and one looking at them as they come out of
the Frontier, in case the number of crawlers changed
after Frontier insert. This processor would have a
(pluggable?) algorithm for figuring out what is to be
crawled by the local host and, for those URLs not meant
for the local crawler, a means of figuring out which of
the remote machines to pass the discovered URIs to. The
Crawl Splitter would be responsible for passing URLs to
other crawlers. Ideally the algorithm would have the
'contravariant' properties described in the UbiCrawler
paper, where a minimal number of URLs are reshuffled
amongst crawlers on addition/removal of crawler
instances. It would also be sweet if we could adjust
crawler 'capacity' on the fly so it's possible to
lighten the load of overloaded crawlers. Optimizations
would include passing batches rather than single URLs
to remote crawlers.
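
The "minimal reshuffling" property asked for above is the one
consistent hashing is known for. The following is only a rough
sketch of that general technique, not the UbiCrawler algorithm
itself and not Heritrix code. The class name ConsistentAssigner,
the replica count, the use of MD5, and the choice of the host
name as the key are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: consistent hashing of hosts onto crawler names. */
final class ConsistentAssigner {
    // Virtual points per crawler on the ring (assumed value, tune as needed).
    private static final int REPLICAS = 100;
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** Maps a string to a ring position using the first 8 bytes of its MD5. */
    private static long position(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            // MD5 support is required of every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    void addCrawler(String name) {
        for (int i = 0; i < REPLICAS; i++) {
            ring.put(position(name + "#" + i), name);
        }
    }

    void removeCrawler(String name) {
        for (int i = 0; i < REPLICAS; i++) {
            // Two-argument remove only deletes points still owned by this crawler.
            ring.remove(position(name + "#" + i), name);
        }
    }

    /** Returns the crawler owning the first ring point at or after the host. */
    String crawlerFor(String host) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no crawlers registered");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(position(host));
        // Wrap around to the first point when the host hashes past the last one.
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

With this layout, adding a crawler only moves hosts onto the new
crawler; hosts never move between the existing ones. Giving a
crawler more or fewer virtual points would be one way to
approximate the adjustable 'capacity' mentioned above.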


Michael Stack ( stack-sf ) - 2005-08-17 00:19

9

Closed

None

Gordon Mohr

multimachine

1.6.0

Public


Comments ( 5 )

Date: 2007-03-14 01:43
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-961 -- please add further
comments at that location.


Date: 2005-12-02 01:36
Sender: gojomo (Project Admin)

CrawlSplitter eliminated. Commit comment:

Followup for [ 1261506 ] Multimachine Crawl Splitter Processor
* CrawlSplitter.java, Processor.options
    remove CrawlSplitter; superseded by CrawlMapper



Date: 2005-10-11 21:31
Sender: gojomo (Project Admin)

'CrawlMapper', a more sophisticated replacement for
CrawlSplitter, created. Commit comment:

Work for [ 1261506 ] Multimachine Crawl Splitter Processor
* CrawlMapper.java
    updated replacement for CrawlSplitter
    can load a 'map' specifying multiple
      key-range-to-crawling-node-name assignments
    keeps and rotates its own destination-specific logs of
      URIs that must be diverted to sibling crawl nodes
    can operate on CrawlURIs or extracted outlinks (early or
      late in processing chain)
* RegexpLineIterator.java
    add new utility pattern useful for scanning the map
      specifications

Date: 2005-09-29 20:07
Sender: gojomo (Project Admin)

A map of key (host/surt/queue-key) ranges to nodes could be
effected in a way analogous to SurtPrefixSet operation.

SurtPrefixSet uses TreeSet.headSet to find the last entry,
in sorted order, before the lookup value -- then checks to
see if that entry is a prefix of the lookup value, and if
so, considers the lookup value in the set.
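
The headSet lookup just described can be sketched roughly as
below. This is an illustration of the idea only, not the
SurtPrefixSet source; the class name PrefixSetSketch and its
method names are hypothetical. It assumes that no stored prefix
is itself an extension of another stored prefix, since otherwise
the last entry before the lookup value might hide a shorter
matching prefix.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch of the headSet-based prefix test. */
final class PrefixSetSketch {
    // Assumption: callers never store a prefix that extends another stored prefix.
    private final TreeSet<String> prefixes = new TreeSet<>();

    void add(String prefix) {
        prefixes.add(prefix);
    }

    /** Returns true if some stored prefix is a prefix of the given value. */
    boolean containsPrefixOf(String value) {
        // An exact match is trivially a prefix of the value.
        if (prefixes.contains(value)) {
            return true;
        }
        // headSet(value) holds every entry that sorts strictly before the value.
        SortedSet<String> head = prefixes.headSet(value);
        // Only the last such entry can be a prefix, given the assumption above.
        return !head.isEmpty() && value.startsWith(head.last());
    }
}
```

For example, with the single prefix 'http://(org,archive,'
stored, a lookup of 'http://(org,archive,www,)/index.html'
finds that prefix as its predecessor and succeeds, while
'http://(org,example,)/' finds the same predecessor but fails
the startsWith check.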

For mapping, the TreeSet would become a TreeMap, mapping
prefixes to target nodes (by name/file/whatever). But, as
with the prefix checks, we'd be using headMap to find the
last key before where the lookup value would be (if it were
present), and take its value as the destination. (Assuming
wraparound semantics, if there is no predecessor key, the
last key would be assumed -- but we could also put in
boundary entries, '.' and '~', as necessary.)

So a map of ranges to crawlers 'A', 'B', 'C' might look like:

. A
com,r B
net,b C

...meaning, everything up to 'com,r' (exclusive) goes to A,
everything from 'com,r' (inclusive) to 'net,b' (exclusive)
goes to B, everything from 'net,b' (inclusive) to end goes to C.
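
A minimal sketch of such a range map, using the three-line
example above, might look like the following. It is not the
CrawlMapper source; the class name RangeMapSketch, the parse
and nodeFor methods, and the whitespace-separated line format
are assumptions for illustration. It uses TreeMap.floorEntry
(Java 6 and later) rather than headMap so that range starts are
inclusive without special casing.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a key-range-to-node map. */
final class RangeMapSketch {
    // Maps the first key of each range to the node that handles the range.
    private final TreeMap<String, String> rangeStarts = new TreeMap<>();

    /** Parses lines of the assumed form "first-key node-name". */
    static RangeMapSketch parse(String spec) {
        RangeMapSketch m = new RangeMapSketch();
        for (String line : spec.split("\\R")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                m.rangeStarts.put(parts[0], parts[1]);
            }
        }
        return m;
    }

    /** Returns the node whose range contains the given key. */
    String nodeFor(String key) {
        if (rangeStarts.isEmpty()) {
            throw new IllegalStateException("empty map");
        }
        // floorEntry gives the greatest range start <= key (inclusive start).
        Map.Entry<String, String> e = rangeStarts.floorEntry(key);
        // Wraparound: a key sorting before every range start goes to the last range.
        return (e != null ? e : rangeStarts.lastEntry()).getValue();
    }
}
```

Loaded with the map above, 'com,example' would resolve to A,
'com,r' and 'net,a' to B, and 'net,b' and 'org,archive' to C.
A key that sorts before '.', such as one starting with '+',
would wrap around to C.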





Date: 2005-09-23 21:52
Sender: gojomo (Project Admin)

Basic CrawlSplitter exists; enhancements required for next
big multi-machine crawl include:

(1) knowledge of not just URI range to handle locally, but
labelled ranges of URIs to divert, so that URIs destined for
different peer crawlers are distinguished.
(2) ability to dynamically change mapping of ranges in sync
with peer crawlers, possibly by fetching mapping from remote
shared source
(3) multiple logs on disk for the multiple 'target ranges'
of URIs (eventually becoming: multiple out channels for
direct peer-to-peer URI transmission); rotation of outgoing
logs at regular interval
(4) ability to apply mapping to either CrawlURI itself, or
its outlinks (so that splitter can be placed later in
processor chain)
(5) a configurable buffer remembering outgoing links, to
suppress (if not eliminate) duplicate sends of same URI to
peer crawlers. (Perhaps via a bloom filter with background
decay.)
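
As a rough illustration of item (5), a bounded buffer of
recently diverted URIs could be sketched as below. Note that
this swaps in a simple least-recently-used set built on
LinkedHashMap rather than the bloom filter with decay suggested
above; it trades memory for exactness. The class name
RecentSendBuffer and the shouldSend method are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: bounded memory of recently diverted URIs. */
final class RecentSendBuffer {
    private final Map<String, Boolean> recent;

    RecentSendBuffer(final int capacity) {
        // accessOrder=true keeps entries ordered least-recently-used first.
        this.recent = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                // Forget the least recently seen URI once over capacity.
                return size() > capacity;
            }
        };
    }

    /** Returns true if the URI should be sent, false if it was sent recently. */
    boolean shouldSend(String uri) {
        // put returns null only when the URI was not already remembered.
        return recent.put(uri, Boolean.TRUE) == null;
    }
}
```

Because old entries are eventually forgotten, this suppresses
rather than eliminates duplicate sends, which matches the
wording of item (5).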


Changes ( 5 )

Field              Old Value  Date              By
status_id          Open       2005-12-02 17:29  stack-sf
close_date         -          2005-12-02 17:29  stack-sf
assigned_to        nobody     2005-10-03 19:00  gojomo
artifact_group_id  None       2005-09-23 20:53  gojomo
priority           7          2005-09-23 20:39  gojomo