It's likely that all multimachine logic can be done
within the confines of a single processor, a 'Crawl
Splitter' processor (or 'Host Splitter', as it's called
in Mercator). There would likely be two 'Crawl
Splitter' instances in the processing chain: one would
look at URLs just before they are passed to the
Frontier, the other as they come out of the Frontier,
in case the number of crawlers changed after
Frontier insert. This processor would have a
(pluggable?) algorithm for figuring out what is to be
crawled by the local host and, for those URLs not meant
for the local crawler, a means of figuring out which of
the remote machines to pass the discovered URIs to. The
Crawl Splitter would be responsible for passing URLs to
other crawlers. Ideally the algorithm would have the
'contravariant' properties described in the UbiCrawler
paper, where a minimal number of URLs are reshuffled
amongst crawlers on addition or removal of crawler
instances. It would also be sweet if we could adjust
crawler 'capacity' on the fly so it's possible to
lighten the load of overloaded crawlers. Optimizations
would include passing batches rather than single URLs
to remote crawlers.
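
To make the idea concrete, here is a rough sketch of one possible
splitter algorithm. It is not Heritrix code. It assumes consistent
hashing on the URL's host as the assignment function (one common way
to keep reshuffling small when crawlers come and go), models
'capacity' as the number of ring points a crawler gets, and batches
URLs bound for remote crawlers. The class name, method names, crawler
names and the in-memory `sent` list (a stand-in for a real network
send) are all hypothetical.

```python
import bisect
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def _hash(key):
    """Stable 64-bit hash; the built-in hash() is salted per process."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class CrawlSplitter:
    """Hypothetical splitter: assigns each URL's host to one crawler."""

    def __init__(self, local, capacities, batch_size=100):
        self.local = local                # name of the crawler running this splitter
        self.batch_size = batch_size
        self.pending = defaultdict(list)  # remote crawler -> URLs not yet sent
        self.sent = []                    # (crawler, batch) pairs; stand-in for a network send
        self.set_crawlers(capacities)

    def set_crawlers(self, capacities):
        """Rebuild the hash ring. A crawler's capacity is its number of ring points."""
        ring = sorted(
            (_hash(f"{name}#{i}"), name)
            for name, capacity in capacities.items()
            for i in range(capacity)
        )
        if not ring:
            raise ValueError("at least one crawler with capacity > 0 is required")
        self._points = [point for point, _ in ring]
        self._owners = [owner for _, owner in ring]

    def owner(self, url):
        """Return the crawler owning the first ring point after the host's hash."""
        host = urlsplit(url).hostname or ""
        i = bisect.bisect_right(self._points, _hash(host)) % len(self._points)
        return self._owners[i]

    def route(self, url):
        """Return True if the URL stays local; otherwise queue it for its owner."""
        target = self.owner(url)
        if target == self.local:
            return True
        batch = self.pending[target]
        batch.append(url)
        if len(batch) >= self.batch_size:
            self.flush(target)
        return False

    def flush(self, target):
        """Hand the queued batch for one remote crawler to the (stubbed) sender."""
        batch = self.pending.pop(target, [])
        if batch:
            self.sent.append((target, batch))
```

In this sketch, calling `set_crawlers` with an extra crawler only
moves hosts onto the new crawler, and lowering a crawler's capacity
removes some of its ring points so that part of its load shifts
elsewhere. The same `route` call could be used both before Frontier
insert and on the way out of the Frontier.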
Gordon Mohr
multimachine
1.6.0
Public
Date: 2007-03-14 01:43
Date: 2005-12-02 01:36
Date: 2005-10-11 21:31
Date: 2005-09-29 20:07
Date: 2005-09-23 21:52
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2005-12-02 17:29 | stack-sf |
| close_date | - | 2005-12-02 17:29 | stack-sf |
| assigned_to | nobody | 2005-10-03 19:00 | gojomo |
| artifact_group_id | None | 2005-09-23 20:53 | gojomo |
| priority | 7 | 2005-09-23 20:39 | gojomo |