Heritrix: Internet Archive Web Crawler

Tracker: Bugs

6 MultiThreadedConnectionManager bottleneck - ID: 1080925
Last Update: Comment added ( karl-ia )

In a crawl testing other features, I've reached a
situation where most (and often all) of the 100 ToeThreads
are held up in FetchHTTP at any point in time, waiting for
a connection from the MultiThreadedHttpConnectionManager.
As a result, progress has become very slow.

A representative stack dump (from a thread suspended in
a debugger):

Thread [ToeThread #1] (Suspended)
    Object.wait(long) line: not available [native method]
    MultiThreadedHttpConnectionManager.doGetConnection(HostConfiguration, long) line: 497
    MultiThreadedHttpConnectionManager.getConnectionWithTimeout(HostConfiguration, long) line: 382
    HttpMethodDirector.executeMethod(HttpMethod) line: 161
    HttpClient.executeMethod(HostConfiguration, HttpMethod, HttpState) line: 437
    HttpClient.executeMethod(HttpMethod) line: 324
    FetchHTTP.innerProcess(CrawlURI) line: 306
    FetchHTTP(Processor).process(CrawlURI) line: 102
    ToeThread.processCrawlUri() line: 264
    ToeThread.run() line: 140

This test run has been through an atypical series of
steps, including several all-thread breakpoint-stops
and resumes, and during-crawl increases and decreases
in the count of ToeThreads. (The decreases did not take
effect as they should; a separate bug will be filed.)
It appears the cause is that the crawl was initially
started with max-toe-threads of 2, so the connection
manager's maxTotalConnections is set to 4 in
FetchHTTP.configureHttp(), and never increases when the
number of ToeThreads is increased.
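
The stall described above can be shown in miniature. This is only a
sketch, not Heritrix or HttpClient code: FixedCapPool is a hypothetical
stand-in for a connection manager whose cap is set once at construction
and never revisited, modelled here with java.util.concurrent.Semaphore.
With a cap of 4, a fifth caller waits until its timeout expires, however
many worker threads were added after startup.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a connection manager with a cap fixed at startup. */
class FixedCapPool {
    private final Semaphore permits;

    /** The cap is chosen once here and never changes afterwards. */
    FixedCapPool(int maxTotalConnections) {
        this.permits = new Semaphore(maxTotalConnections);
    }

    /** Returns true if a connection slot was obtained within the timeout. */
    boolean acquire(long timeoutMillis) {
        try {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller and report failure.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Gives a slot back so a waiting caller can proceed. */
    void release() {
        permits.release();
    }
}
```

Once the four slots are taken, every further caller blocks in acquire()
until some holder calls release(), which is the same shape as the
Object.wait(long) frame at the top of the stack dump.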

However, in no case should we let HttpClient's
connection pooling be a bottleneck. We don't yet need
the biggest reason for connection pooling -- reusing an
already-open connection. We have no need for any cap on
the number of connections created, beyond the inherent
limit that any ToeThread will only ask for one at a time.

So the pooling logic (and background thread) overhead
of this connection manager only saves us the occasional
reallocation of an HttpConnection object. I'm not sure
there are any noticeable savings given the way any
returned reused connection gets wrapped in a new
HttpConnectionAdapter (subclass of HttpConnection)
before being returned.

I think the most efficient fix would be to dump
MultiThreadedHttpConnectionManager in favor of a
homegrown simpleminded connection 'manager' that just
supplies a new connection each time one is needed.
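
A rough sketch of what such a simpleminded manager might look like
follows. The names SketchConnection and FreshConnectionManager are
invented for illustration and are not HttpClient's HttpConnection or
HttpConnectionManager types; a real replacement would have to implement
HttpClient's own interface. The point is only that nothing here can
block: every request gets a new object, and release just closes it.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical minimal connection type; not HttpClient's HttpConnection. */
class SketchConnection {
    final String host;
    final int port;
    private boolean open = true;

    SketchConnection(String host, int port) {
        this.host = host;
        this.port = port;
    }

    boolean isOpen() {
        return open;
    }

    void close() {
        open = false;
    }
}

/** Hands out a fresh connection on every request; never pools, never waits. */
class FreshConnectionManager {
    private final AtomicInteger created = new AtomicInteger();

    /** Always allocates; there is no cap and no queue. */
    SketchConnection getConnection(String host, int port) {
        created.incrementAndGet();
        return new SketchConnection(host, port);
    }

    /** Nothing is kept for reuse, so releasing simply closes the connection. */
    void releaseConnection(SketchConnection connection) {
        connection.close();
    }

    /** Number of connections handed out so far (for diagnostics only). */
    int createdCount() {
        return created.get();
    }
}
```

Because each thread only ever asks for one connection at a time, the
number of live connections stays bounded by the thread count without
any explicit limit.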

Only if this reallocation is shown to be costly would
we try to reuse HttpConnections, and then we could
establish a system of one cached connection per client
thread (like the one HttpRecorder per thread).
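
If reuse did turn out to matter, the one-per-thread idea could be
sketched with a ThreadLocal. CachedConnection and
PerThreadConnectionCache are hypothetical names, and the lambda syntax
assumes a modern Java runtime rather than the one Heritrix targeted at
the time; this shows the shape of the approach, not a proposed patch.

```java
/** Hypothetical connection object that remembers which thread created it. */
class CachedConnection {
    final String ownerThread;

    CachedConnection(String ownerThread) {
        this.ownerThread = ownerThread;
    }
}

/** Keeps exactly one connection object per calling thread. */
class PerThreadConnectionCache {
    // Each thread lazily gets its own instance on first access.
    private final ThreadLocal<CachedConnection> slot =
            ThreadLocal.withInitial(
                    () -> new CachedConnection(Thread.currentThread().getName()));

    /** Returns the calling thread's connection, creating it on first use. */
    CachedConnection get() {
        return slot.get();
    }
}
```

No locking is needed because a thread never sees another thread's
connection, so there is still nothing for a worker to wait on.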



Submitted: Gordon Mohr ( gojomo ) - 2004-12-07 22:32
Priority: 6
Status: Closed
Resolution: Fixed
Assigned: Michael Stack
Category: Protocols
Group: None
Visibility: Public


Comments ( 6 )

Date: 2007-03-14 00:18
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-306 -- please add further
comments at that location.


Date: 2005-03-07 22:41
Sender: stack-sf (Project Admin)

'[ 1143892 ] [contribution] SingleConnectionManager, range
and close hdrs' moved us back to a single connection
manager. No more MultiThreadedHttpConnectionManager.

Closing. Marking as 'fixed'.


Date: 2005-02-15 18:31
Sender: stack-sf (Project Admin)

A comment was made on this issue out on the list (Upping
priority):

... I also think that the current MultiThreadedConnectionManager
does not make much sense if connections will not be reused anyway.

Christian
--
Christian Kohlschütter
mailto: ck -at- NewsClub.de


Date: 2004-12-09 19:13
Sender: stack-sf (Project Admin)

The comment on Socket#connect sounds right.

I made the commit below for now to take the heat off this issue.

When I tried to bring along the simple CM that had been in
place before the 3.0.x lib upgrade, sockets were being reused
before being properly closed.

Take heat off '[ 1080925 ] MultiThreadedConnectionManager
bottleneck' by setting a minimum on max total connections
for httpclient.
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
  Added lower minimum on httpclient connection manager max
  total connections.
  Formatting, javadoc warning fixes, and eclipse
  recommended optimizations.
  (getMaxImmediateRetries): Removed. No longer used.
  (getHttp): Added. Eclipse suggested optimization.
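
The idea of putting a floor under the pool size might be sketched as
below. This is not the committed FetchHTTP code: ConnectionLimits and
both parameter names are hypothetical, and the actual floor value used
in the commit is not stated in this report, so it is left as an
argument.

```java
/** Hypothetical helper illustrating a lower bound on the pool size. */
class ConnectionLimits {
    /**
     * Returns the pool size to configure: the requested size, unless it
     * falls below the floor, in which case the floor wins.
     */
    static int maxTotalConnections(int requested, int floor) {
        return Math.max(requested, floor);
    }
}
```

With a floor at least as large as the highest thread count a crawl is
expected to reach, a crawl started with very few threads no longer
ends up with a pool too small for the threads added later.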



Date: 2004-12-08 20:15
Sender: gojomo (Project Admin)

A temporary workaround I'm using locally is just to use a
maxTotalConnections value so high (10000) that practically,
the manager will never refuse to create a new instance.

I don't see a place where the HttpClient code actually
reuses an old Socket instance to connect to a new remote
host -- I think that would require the use of a
Socket.connect(..) method, and I can't find any references
to such methods in HttpClient...


Date: 2004-12-07 23:17
Sender: stack-sf (Project Admin)

Taking this issue. Usage is atypical so will give low
priority. Socket allocations are expensive so we should
reuse if we can. Will investigate savings from reuse. Will
make sure the pool is never smaller than the number of
ToeThreads.


Changes ( 6 )

Field          Old Value  Date              By
status_id      Open       2005-03-07 22:41  stack-sf
resolution_id  None       2005-03-07 22:41  stack-sf
close_date     -          2005-03-07 22:41  stack-sf
priority       4          2005-02-15 18:31  stack-sf
assigned_to    nobody     2004-12-07 23:17  stack-sf
priority       5          2004-12-07 23:17  stack-sf