Heritrix: Internet Archive Web Crawler

Tracker: Bugs

8 When multiple instances, there's always a runt in the litter - ID: 1415942
Last Update: Comment added ( karl-ia )

When running multiple instances of Heritrix in a single
container, one always lags badly. Studying it, its threads
seem to be mostly just waiting. Even with the others paused,
the laggard behavior keeps up, even after upping the number
of threads and when there look to be plenty of queues to
work on.


Submitted: Michael Stack ( stack-sf ) - 2006-01-27 02:53
Priority: 8
Status: Closed
Resolution: Fixed
Assigned: Michael Stack
Category: configuration
Group: 1.8.0
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 01:04
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-537 -- please add further
comments at that location.


Date: 2006-01-27 22:18
Sender: stack-sf (Project Admin)

Below is the commit message, with an outline of the issue.

TO TEST:

Set up an infiniteurl instance generating many new hosts
quickly (this problem is probably easiest seen with
infiniteurl; I was able to reproduce it using two instances
of only 3 threads each).

Create 3 instances of 3 threads each with no delay between
fetches. Tail prog-stats for all 3 instances. You'll see
the odd 0 for occupied threads, but usually they'll all be
occupied (before the patch, you'd see at least one instance
reporting 0 occupied threads for tens of minutes of prog-stats).


Commit message:

Fix for [ 1415942 ] When multiple instances, there's always
a runt in the litter

All Heritrix instances were using the last ServerCache
instance manufactured. This meant that instances would do a
lookup, the server wouldn't be found, and the fetch would
fail with a -2. Eventually all threads would be snoozed
waiting on retry, because this looked like a transient
network failure.

Setup of the httpclient on each FetchHTTP instantiation
involves passing it a socket factory for http and https
sockets. We have our own factories that get IPs from the
ServerCache. Only, our socket factories were singletons.
Each FetchHTTP creation per Heritrix instance passed in the
current ServerCache (by passing in a controller), overwriting
the previous ServerCache reference kept in the singleton
socket factory.

Fixed by making the socket factories dynamic.

* src/java/org/archive/crawler/fetcher/FetchHTTP.java
  Removed unnecessary cast. Removed unneeded cleanup of
  HeritrixProtocolSocketFactory now it's dynamic. The socket
  factory API changed. Pass the ServerCache rather than the
  controller to the http and https socket factories.
* src/java/org/archive/crawler/fetcher/HeritrixProtocolSocketFactory.java
  Change from being singleton to dynamic. Take ServerCache in
  constructor instead of on initialization.
  (initialize, cleanup): Removed.
  (getHostAddress): Takes cache argument.
* src/java/org/archive/crawler/fetcher/HeritrixSSLProtocolSocketFactory.java
  Change from being singleton to dynamic. Take ServerCache in
  constructor instead of on initialization.
  (getHostAddress): Removed. Call partner class
  HeritrixProtocolSocketFactory explicitly.
* src/java/org/archive/httpclient/ConfigurableX509TrustManagerTest.java
  API for ssl socket factory changed. Now takes a ServerCache
  instance.
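
The overwrite described in the commit message can be sketched in a
few lines of Java. This is only an illustration under stated
assumptions, not the Heritrix code: HostCache, SharedFactory and
PerCrawlFactory are hypothetical stand-ins for the ServerCache and
the socket factories, and a plain map lookup replaces the real IP
resolution.

```java
import java.util.HashMap;
import java.util.Map;

public class SocketFactoryDemo {
    /** Hypothetical stand-in for a per-crawl ServerCache: host name to IP. */
    static class HostCache {
        private final Map<String, String> ips = new HashMap<>();
        void put(String host, String ip) { ips.put(host, ip); }
        String lookup(String host) { return ips.get(host); }
    }

    /** Before the fix: one shared factory; each initialize() replaces the cache. */
    static class SharedFactory {
        private static final SharedFactory INSTANCE = new SharedFactory();
        private HostCache cache;
        static SharedFactory initialize(HostCache c) {
            INSTANCE.cache = c; // silently drops the previous crawl's cache
            return INSTANCE;
        }
        String getHostAddress(String host) { return cache.lookup(host); }
    }

    /** After the fix: one factory per fetcher, cache fixed at construction. */
    static class PerCrawlFactory {
        private final HostCache cache;
        PerCrawlFactory(HostCache c) { this.cache = c; }
        String getHostAddress(String host) { return cache.lookup(host); }
    }

    public static void main(String[] args) {
        HostCache crawlA = new HostCache();
        crawlA.put("a.example", "10.0.0.1");
        HostCache crawlB = new HostCache();
        crawlB.put("b.example", "10.0.0.2");

        // Two crawls set up in turn; the second call wins for both.
        SharedFactory sharedA = SharedFactory.initialize(crawlA);
        SharedFactory sharedB = SharedFactory.initialize(crawlB);
        System.out.println("shared, crawl A: " + sharedA.getHostAddress("a.example"));
        System.out.println("shared, crawl B: " + sharedB.getHostAddress("b.example"));

        // Each crawl keeps its own cache reference.
        PerCrawlFactory ownA = new PerCrawlFactory(crawlA);
        PerCrawlFactory ownB = new PerCrawlFactory(crawlB);
        System.out.println("per-crawl, crawl A: " + ownA.getHostAddress("a.example"));
        System.out.println("per-crawl, crawl B: " + ownB.getHostAddress("b.example"));
    }
}
```

With the shared factory, crawl A's lookup comes back null because it
is answered from crawl B's cache, which mirrors the "server wouldn't
be found" failure above. With one factory per crawl, both lookups
succeed.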



Changes ( 5 )

Field              Old Value                                                    Date              By
artifact_group_id  None                                                         2006-03-17 19:58  gojomo
status_id          Open                                                         2006-01-27 22:18  stack-sf
resolution_id      None                                                         2006-01-27 22:18  stack-sf
summary            When multile instances, there's always a runt in the litter  2006-01-27 22:18  stack-sf
close_date         -                                                            2006-01-27 22:18  stack-sf