In a crawl testing other features, I've reached a
situation where most (and often all) of the 100 ToeThreads are
held up in FetchHTTP at any point in time, waiting for a
connection from the MultiThreadedHttpConnectionManager. As
a result, progress has become very slow.
A representative stack dump (from a thread suspended in
a debugger):
Thread [ToeThread #1] (Suspended)
    Object.wait(long) line: not available [native method]
    MultiThreadedHttpConnectionManager.doGetConnection(HostConfiguration, long) line: 497
    MultiThreadedHttpConnectionManager.getConnectionWithTimeout(HostConfiguration, long) line: 382
    HttpMethodDirector.executeMethod(HttpMethod) line: 161
    HttpClient.executeMethod(HostConfiguration, HttpMethod, HttpState) line: 437
    HttpClient.executeMethod(HttpMethod) line: 324
    FetchHTTP.innerProcess(CrawlURI) line: 306
    FetchHTTP(Processor).process(CrawlURI) line: 102
    ToeThread.processCrawlUri() line: 264
    ToeThread.run() line: 140
This test run has been through an atypical series of
steps, including several all-thread breakpoint-stops
and resumes, and during-crawl increases and decreases
in the count of ToeThreads. (The decreases did not take
effect as they should; a separate bug will be filed.)
It appears the cause is that the crawl was initially
started with max-toe-threads of 2, so the connection
manager's maxTotalConnections is set to 4 in
FetchHTTP.configureHttp(), and never increases when the
number of ToeThreads is increased.
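
To make the suspected failure mode concrete, here is a minimal sketch of a pool whose cap is fixed when it is built. CappedPool and CappedPoolDemo are hypothetical stand-ins written for this report, not HttpClient or Heritrix code; the cap of 4 and the 100 workers are the figures described above.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a connection pool whose cap is fixed at construction. */
class CappedPool {
    private final Semaphore permits;

    CappedPool(int maxTotalConnections) {
        this.permits = new Semaphore(maxTotalConnections);
    }

    /** Returns true if a connection slot was obtained within the timeout. */
    boolean acquire(long timeoutMillis) {
        try {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Hands a slot back so that a waiting caller can proceed. */
    void release() {
        permits.release();
    }
}

class CappedPoolDemo {
    public static void main(String[] args) {
        CappedPool pool = new CappedPool(4); // cap chosen when the crawl started
        int granted = 0;
        // 100 workers each ask for one connection; none has been released yet.
        for (int i = 0; i < 100; i++) {
            if (pool.acquire(1)) {
                granted++;
            }
        }
        System.out.println("granted=" + granted); // prints "granted=4"
    }
}
```

Raising the worker count has no effect on the cap in this sketch, so every request beyond the fourth waits until a slot is released. That matches the stack dump, where the thread sits in Object.wait() under doGetConnection().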
However, in no case should we let HttpClient's
connection pooling be a bottleneck. We don't yet need
the biggest reason for connection pooling -- reusing an
already-open connection. We have no need for any cap on
the number of connections created, beyond the inherent
limit that any ToeThread will only ask for one at a time.
So the pooling logic (and background thread) overhead
of this connection manager only saves us the occasional
reallocation of an HttpConnection object. I'm not sure
there are any noticeable savings, given that any
reused connection gets wrapped in a new
HttpConnectionAdapter (subclass of HttpConnection)
before being returned.
I think the most efficient fix would be to drop
MultiThreadedHttpConnectionManager in favor of a
homegrown, simple-minded connection 'manager' that just
supplies a new connection each time one is needed.
Only if this reallocation is shown to be costly would
we try to reuse HttpConnections, and then we could
establish a system of one cached connection per client
thread (like the one HttpRecorder per thread).
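
A rough sketch of the two options just described, using hypothetical stand-in types rather than the real HttpClient interfaces. SketchConnection, FreshConnectionManager and PerThreadConnectionManager are names invented for illustration, and reusing the cached connection only while the host matches is an assumption, not something decided above.

```java
/** Hypothetical stand-in for HttpConnection; not the HttpClient class. */
class SketchConnection {
    final String host;

    SketchConnection(String host) {
        this.host = host;
    }
}

/** Option 1: supply a new connection on every request. No pool, no cap, no waiting. */
class FreshConnectionManager {
    SketchConnection getConnection(String host) {
        return new SketchConnection(host);
    }

    void releaseConnection(SketchConnection connection) {
        // Nothing to hand back to; the object is simply left for garbage collection.
    }
}

/** Option 2: one cached connection per calling thread. */
class PerThreadConnectionManager {
    private final ThreadLocal<SketchConnection> cached = new ThreadLocal<>();

    SketchConnection getConnection(String host) {
        SketchConnection connection = cached.get();
        // Assumption: reuse only when the cached connection targets the same host.
        if (connection == null || !connection.host.equals(host)) {
            connection = new SketchConnection(host);
            cached.set(connection);
        }
        return connection;
    }
}
```

Neither variant can block a caller, since no thread ever waits on another thread's release. The per-thread variant would only be worth its extra state if allocation is actually measured to be costly.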
Michael Stack
Protocols
None
Public
Date: 2007-03-14 00:18
Date: 2005-03-07 22:41
Date: 2005-02-15 18:31
Date: 2004-12-09 19:13
Date: 2004-12-08 20:15
Date: 2004-12-07 23:17

| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2005-03-07 22:41 | stack-sf |
| resolution_id | None | 2005-03-07 22:41 | stack-sf |
| close_date | - | 2005-03-07 22:41 | stack-sf |
| priority | 4 | 2005-02-15 18:31 | stack-sf |
| assigned_to | nobody | 2004-12-07 23:17 | stack-sf |
| priority | 5 | 2004-12-07 23:17 | stack-sf |