Share

Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 [contrib] API for replicating already-seen across instances - ID: 1547216
Last Update: Comment added ( karl-ia )

This simple patch provides a configuration option and
plugin API for replicating the already-seen (already
included) list across heritrix instances. This is a
simple solution to one of the first steps at a
distributed heritrix. Although it does nothing for
choosing which URI to distribute to which instance,
etc., it simply ensures that heritrix instances don't
crawl the same URI's twice. Complete replication is
the easiest (and fastest access) way to do this. Note
that this API does nothing for automatically adding new
instances to the group, they must be loaded with the
full list manually for now. It just ensures every
instance maintains the same, replicated, already-seen
while the group is running.

The implementation of such replication is likely to
depend on further libraries, so it is left to
instantiation via reflection and the build of it is
external to the default heritrix binary distribution
for now. My own personal JGroups implementation is
forthcoming.

Note this patch is made against 1.8.0, but is very
simple so it should port to the 1.9/10 head easily.


Eric C. Jensen ( ecjensen ) - 2006-08-26 17:05

5

Closed

None

Nobody/Anonymous

multimachine

1.8.0

Public


Comments ( 8 )

Date: 2007-03-14 01:50
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-1044 -- please add further
comments at that location.


Date: 2006-08-30 18:05
Sender: ecjensen

Logged In: YES
user_id=705615

Just added "stop" to API to stop the replicator and a call
to it in crawlEnded. I also tested to make sure it patches
and builds on 1.9 again. Since I added serialVersionUID for
1.8, that chunk fails on 1.9 which is what is desirable.
This patch doesn't require a serialVersionUID change since
the replicator is transient. This should be the final API
change.


Date: 2006-08-29 22:32
Sender: ecjensen

Logged In: YES
user_id=705615

You're absolutely right, this does nothing for partitioning
the crawler state to enable larger crawls. Its sole
purpose is to make crawls heritrix is currently capable of
faster. However, I think it's important to keep in mind
that many people are not using heritrix to crawl the entire
web, but rather to do focused crawls of particular
domains. In that case, distribution for the sake of
speeding up the crawl is important, and any network
requests required to query the already-seen list are just
going to slow it down.

CrawlMapper is a good start at attacking the distribution
problem from a "these URL's should be crawled by this
instance" point of view, but it can't provide the
same "just don't crawl URL's other instances have crawled"
functionality as replicating the already seen list.
Rather, this greedy solution enables those who need
multiple heritrix instances for speed purposes to skip the
problem of mapping URI's to instances, which has associated
request/response messaging problems to solve as I described
in my last comment. Simply replicating the already-seen
just works without having to set up scripts to aggregate
and import URI's from the mapping logs or coordinate at all
between instances. At the very least, I think it's a very
non-invasive complementary strategy (leaving my example
JGroups-specific implementation out of heritrix proper and
just including the replication API).

I'm not particular about the class names, but "Distributor"
seems to suggest some kind of partitioning to me, which my
patch is not doing. I'd be fine with "Notifier" or
something related to the fact that the API is really just
about broadcasting unique URI's.



Date: 2006-08-29 22:03
Sender: ecjensen

Logged In: YES
user_id=705615

I was looking for the simplest way to run a crawl using
multiple instances of the current version of heritrix. As
such, this API is basic, but it is not incomplete. In
fact, I'm currently using it and the very basic replicator
class also attached to run a very large crawl with more
than ten instances of heritrix on a LAN. In the long run,
there's lots of thinking to be done about the best way to
modify heritrix' architecture to make it a flexible,
efficient distributed crawler. For now, this works great,
efficiently, and is very non-invasive.

I obviously didn't explain how my patch works adequately.
I named it "replication" because it means literally that
the entire already-seen list is replicated on every
instance. Thus, querying already-seen is always a local
operation...no requests to other hosts are ever made. This
ensures querying it is as fast as in a single heritrix
instance, and eliminates the complexity of needing a
request/response (messaging) framework with its associated
problems of timeouts, etc. Rather, when a unique URI is
found, it is broadcast to be pushed into every other
instances already-seen list asynchronously so that they
won't crawl that same URI. Thus, the UriUniqFilter used
must have a thread-safe "note" method. I don't believe any
of the other ones except for BdbUriUniqFilter have this,
but I could be wrong.

With just this simple functionality, distributing a crawl
across multiple heritrix instances is a simple matter of
starting them all up with similar order.xml's (all this
actually requires is identical canonicalization),
partitioning the seeds file across instances, and letting
them rip. Obviously, this doesn't provide very good load
balancing if you're not using many disparate seeds. But,
for many situations (including mine) where the crawl is
easily partitioned by the seeds this solves the problem
entirely.

Since there's no distributed state (only replicated),
dynamically adding and removing instances is also not a
problem, simply checkpoint one instance and start your new
instance by recovering from that checkpoint. It will then
have the current already-seen and immediately begin
participating in the broadcasts. There's actually an API
in the JGroup's MessageBus that I use in my implemenation
which would deal with catching-up state when an instance
joins, but even that is more complex than this patch needs
to be since it would require knowing about the specific
UriUniqFilter implementation being used on each instance
(right now it doesn't matter if you use a Bloom on one and
a Bdb on another). I suppose if you're going to add new
instances, you would need to either come up with a new seed
or have been using depth-first crawling so that you could
use some of the later seeds from your original instances if
they hadn't gotten to them yet.

I agree with you that subclassing the frontier is probably
the way to go in the long run. I think that would be where
you'd want to solve the next obvious issue of balancing the
load by partitioning across something smarter than seeds
(probably domain, alphabetically or whatever people usually
do). But, this patch provides a perfectly functional,
efficient distributed heritrix without making any of the
larger architecture changes that would require, and without
having to worry about all the request/response issues of
scheduling URI's on remote instances, etc.


Date: 2006-08-29 21:38
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Thanks for the explanatory note. Made me look into the
jgroup patch (smile). There I saw the jgroup listener
implementation, the other end of the jgroup bus, adding URLs
to local already-seen.

Tell me this: Do you find that you are able to do bigger
crawls using your replication implementation? Or is it just
that you can crawl more in a shorter amount of time?

At least for us, the already-seen starts to get in the way
of large crawls. The bdbfilter slows to a crawl at about
30/50M entries and the bloom filter starts to clog at 120M
(unless its allocated more memory). Rather than dumping all
thats already-seen by a distributed crawl into every
crawler's already-seen -- making it so already-seen gets
clogged even faster? -- I'd think we'd want to distribute
already-seen or move it outside of the heritrix instance to
some purposed already-seen application, one with loads of
memory dedicated to lookups only, not to lookups + crawling?

With the above in mind, 'replication' seems to be of limited
use and why I view the API as 'incomplete'.

Did you take a look at the CrawlMapper class? Wouldn't it
get you to the same end? You'd have to add in something
like your JGroups Bus to pass the URIs between hosts but it
would the advantage of 'distributing' the already-seen.

Point taken on the ThreadSafe marker API. Currently, we
assume only frontier will be making addition whereas in your
replication case, both jgroup listener and frontier can be
making addition.

For sure I'd say, the frontier additions should go into a
subclass, a distributed frontier subclass.

I'd suggest Distributor rather than Replicator as the
qualifier used on your UriUniq interface.


Date: 2006-08-29 20:28
Sender: ecjensen

Logged In: YES
user_id=705615

I was looking for the simplest way to run a crawl using
multiple instances of the current version of heritrix. As
such, this API is basic, but it is not incomplete. In
fact, I'm currently using it and the very basic replicator
class also attached to run a very large crawl with more
than ten instances of heritrix on a LAN. In the long run,
there's lots of thinking to be done about the best way to
modify heritrix' architecture to make it a flexible,
efficient distributed crawler. For now, this works great,
efficiently, and is very non-invasive.

I obviously didn't explain how my patch works adequately.
I named it "replication" because it means literally that
the entire already-seen list is replicated on every
instance. Thus, querying already-seen is always a local
operation...no requests to other hosts are ever made. This
ensures querying it is as fast as in a single heritrix
instance, and eliminates the complexity of needing a
request/response (messaging) framework with its associated
problems of timeouts, etc. Rather, when a unique URI is
found, it is broadcast to be pushed into every other
instances already-seen list asynchronously so that they
won't crawl that same URI. Thus, the UriUniqFilter used
must have a thread-safe "note" method. I don't believe any
of the other ones except for BdbUriUniqFilter have this,
but I could be wrong.

With just this simple functionality, distributing a crawl
across multiple heritrix instances is a simple matter of
starting them all up with similar order.xml's (all this
actually requires is identical canonicalization),
partitioning the seeds file across instances, and letting
them rip. Obviously, this doesn't provide very good load
balancing if you're not using many disparate seeds. But,
for many situations (including mine) where the crawl is
easily partitioned by the seeds this solves the problem
entirely.

Since there's no distributed state (only replicated),
dynamically adding and removing instances is also not a
problem, simply checkpoint one instance and start your new
instance by recovering from that checkpoint. It will then
have the current already-seen and immediately begin
participating in the broadcasts. There's actually an API
in the JGroup's MessageBus that I use in my implemenation
which would deal with catching-up state when an instance
joins, but even that is more complex than this patch needs
to be since it would require knowing about the specific
UriUniqFilter implementation being used on each instance
(right now it doesn't matter if you use a Bloom on one and
a Bdb on another). I suppose if you're going to add new
instances, you would need to either come up with a new seed
or have been using depth-first crawling so that you could
use some of the later seeds from your original instances if
they hadn't gotten to them yet.

I agree with you that subclassing the frontier is probably
the way to go in the long run. I think that would be where
you'd want to solve the next obvious issue of balancing the
load by partitioning across something smarter than seeds
(probably domain, alphabetically or whatever people usually
do). But, this patch provides a perfectly functional,
efficient distributed heritrix without making any of the
larger architecture changes that would require, and without
having to worry about all the request/response issues of
scheduling URI's on remote instances, etc.


Date: 2006-08-29 19:41
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Here's a couple of comments:

+ This API seems incomplete. Its just doing 'replication' of
already-seen. This seems of little use without a frontier
being able to somehow check the store that holds the
replicated URIs. Missing is a means of chaining the
already-seen lookups: a check of the local already-seen and
then a check of the alternative (distributed?) already-seen.
+ Why not subclass bdbfrontier and add the 'distributed'
aspects there?
+ Regards the 'thread-safe' marker interface, should we just
change the doc. on the UriUniqFilter to state that
implementations are assumed thread safe?

Would the following work for you Eric as a means of getting
your replicating implemenation going without having to have
it in core Heritrix?

Build a jar that had your subclasses of the Heritrix
Frontier and UriUniqFilters to add your 'replication' that
also included your new interfaces/classes and your JGroup
implementation. It would also include a src/conf/modules/*
file that listed your your new Frontier and processors so it
became available in the interface.

With luck you should be able to just drop this new jar into
the $HERITRIX_HOME/lib directory and have your new facility
available in the crawler.


Date: 2006-08-28 21:58
Sender: ecjensen

Logged In: YES
user_id=705615

I have made some small modifications to this API, now it is
final...patch applies to 1.8 and the latest head with some
fuzziness.


Attached Files ( 2 )

Filename Description Download
uri_unique_replication.patch fixed fixed fixed (added "transient" and serialVersionUID and stop) patch to support uri uniq replication Download
NotificationBusUniqUriReplicator.java fixed fixed JGroups already-seen replicator using their NotificationBus building block with debug info Download

Changes ( 20 )

Field Old Value Date By
close_date - 2007-03-14 01:50 karl-ia
status_id Open 2007-03-14 01:50 karl-ia
File Deleted 191238: 2006-08-30 18:06 ecjensen
File Added 191372: NotificationBusUniqUriReplicator.java 2006-08-30 18:06 ecjensen
File Deleted 191237: 2006-08-30 18:05 ecjensen
File Added 191371: uri_unique_replication.patch 2006-08-30 18:05 ecjensen
File Deleted 191035: 2006-08-30 02:34 ecjensen
File Added 191238: NotificationBusUniqUriReplicator.java 2006-08-30 02:34 ecjensen
File Deleted 191216: 2006-08-30 02:33 ecjensen
File Deleted 191204: 2006-08-30 02:33 ecjensen
File Added 191237: uri_unique_replication.patch 2006-08-30 02:33 ecjensen
File Added 191216: uri_unique_replication.patch 2006-08-29 22:03 ecjensen
File Added 191204: uri_unique_replication.patch 2006-08-29 20:28 ecjensen
File Deleted 191150: 2006-08-29 20:28 ecjensen
File Added 191150: uri_unique_replication.patch 2006-08-29 14:17 ecjensen
File Deleted 191034: 2006-08-29 14:17 ecjensen
File Added 191035: NotificationBusUniqUriReplicator.java 2006-08-28 21:59 ecjensen
File Deleted 190779: 2006-08-28 21:58 ecjensen
File Added 191034: uri_unique_replication.patch 2006-08-28 21:58 ecjensen
File Added 190779: uri_unique_replication.patch 2006-08-26 17:05 ecjensen