From Christian Kohlscheutter:
Dear all,
I have probably found another code hotspot that unnecessarily slows down crawling and also leads to OOMEs.
After extending ToeThread report() to also show the stack trace for each thread (a feature introduced with Java 5.0), I realized that most of my 200 ToeThreads hang in HttpClient code for several seconds, sometimes even minutes, trying either to add cookie information to HttpState or to get it from there. For example:
ToeThread #6
#6
http://www.freenet.hamilton.on.ca/Information/NEST/nature/niaghawk/statistics.htm
(0 attempts)
L http://www.freenet.hamilton.on.ca/link/niaghawk/
Current processor: HTTP
ACTIVE for 8m3s803ms
Where: PROCESSING for 483813ms
org.apache.commons.httpclient.HttpState.getCookies(HttpState.java:172)
org.apache.commons.httpclient.HttpMethodBase.addCookieRequestHeader(HttpMethodBase.java:1183)
org.apache.commons.httpclient.HttpMethodBase.addRequestHeaders(HttpMethodBase.java:1307)
org.apache.commons.httpclient.HttpMethodBase.writeRequestHeaders(HttpMethodBase.java:2027)
org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:1912)
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:997)
org.archive.httpclient.HttpRecorderGetMethod.execute(HttpRecorderGetMethod.java:117)
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:382)
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:168)
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:393)
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:397)
org.archive.crawler.framework.Processor.process(Processor.java:103)
org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:283)
org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
and
ToeThread #54
#54
http://www.ibbd.org/modules.php?op=modload&name=Recommend_Us&file=index&req=FriendSend&sid=36
(0 attempts)
LL http://www.ibbd.org/modules.php?op=modload&name=News&file=article&sid=36&mode=thread&order=0&thold=0
Current processor: HTTP
ACTIVE for 25s448ms
Where: PROCESSING for 25450ms
org.apache.commons.httpclient.HttpState.addCookie(HttpState.java:124)
org.apache.commons.httpclient.HttpMethodBase.processResponseHeaders(HttpMethodBase.java:1502)
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1591)
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:999)
org.archive.httpclient.HttpRecorderGetMethod.execute(HttpRecorderGetMethod.java:117)
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:382)
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:168)
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:393)
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:397)
org.archive.crawler.framework.Processor.process(Processor.java:103)
org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:283)
org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
and so on.
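For reference, per-thread stack traces like the ones above can be gathered with the Thread API added in Java 5.0. What follows is only a minimal sketch, not the actual ToeThread report() change; the names ThreadReportSketch, report and reportAll are made up for illustration.

```java
import java.util.Map;

/** Hypothetical sketch: report what every live thread is doing (Java 5+ API). */
final class ThreadReportSketch {

    /** Formats one thread's name, state and stack frames, one frame per line. */
    static String report(Thread thread, StackTraceElement[] frames) {
        StringBuilder sb = new StringBuilder();
        sb.append(thread.getName()).append(" [").append(thread.getState()).append("]\n");
        for (StackTraceElement frame : frames) {
            // StackTraceElement.toString() yields "class.method(File.java:line)"
            sb.append("    ").append(frame).append('\n');
        }
        return sb.toString();
    }

    /** Formats a report for all live threads in the JVM. */
    static String reportAll() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            sb.append(report(e.getKey(), e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(reportAll());
    }
}
```

Many threads showing the same top frame for a long time, as with getCookies and addCookie here, usually point at contention on a shared lock.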
Right now, I have crawled 1,810,981 pages from various hosts in 13 hours. I guess I have accumulated many cookies during the crawl (I cannot tell how many).
As far as I understand, FetchHTTP lets HttpClient always accept cookies, and HttpClient stores them in an in-memory ArrayList (!) in HttpState. In order to skip duplicate cookies, HttpState.addCookie seems to *iterate* over that list and compare each entry with the new cookie. Of course, this takes ages if you have a lot of cookies in that list.
Since the addCookie() and getCookies() methods are synchronized, all other threads that try to get the cookie list are simply blocked. Also, the getCookies() method is rather slow because it returns the list of cookies as an array (copying all the entries).
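To illustrate the access pattern described above, here is a rough sketch of a list-backed store with a synchronized, linear duplicate check and a copy-on-read getter. It is not HttpClient's actual code: ListBackedCookieStore and SimpleCookie are hypothetical names, and treating (domain, path, name) as a cookie's identity is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical list-backed store showing the costs described above. */
final class ListBackedCookieStore {

    /** Stand-in for a cookie; identity is assumed to be (domain, path, name). */
    static final class SimpleCookie {
        final String domain;
        final String path;
        final String name;
        final String value;

        SimpleCookie(String domain, String path, String name, String value) {
            this.domain = domain;
            this.path = path;
            this.name = name;
            this.value = value;
        }

        boolean sameIdentity(SimpleCookie other) {
            return domain.equals(other.domain)
                    && path.equals(other.path)
                    && name.equals(other.name);
        }
    }

    private final List<SimpleCookie> cookies = new ArrayList<SimpleCookie>();

    /** O(n) per call: scans the list under the lock to replace a duplicate. */
    synchronized void addCookie(SimpleCookie cookie) {
        for (int i = 0; i < cookies.size(); i++) {
            if (cookies.get(i).sameIdentity(cookie)) {
                cookies.remove(i);
                break;
            }
        }
        cookies.add(cookie);
    }

    /** O(n) per call: copies every entry into a new array under the same lock. */
    synchronized SimpleCookie[] getCookies() {
        return cookies.toArray(new SimpleCookie[0]);
    }

    public static void main(String[] args) {
        ListBackedCookieStore store = new ListBackedCookieStore();
        for (int i = 0; i < 5000; i++) {
            store.addCookie(new SimpleCookie("host" + i + ".example", "/", "sid", "v" + i));
        }
        System.out.println("cookies stored: " + store.getCookies().length);
    }
}
```

With one store shared by all threads, every request pays for a full scan or a full copy while holding the only lock, so the cost per fetch grows with the total number of cookies collected so far.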
To sum it up, we have three problems here:
1. (Heritrix) There is no way to turn off cookie support (would be a quick-fix).
2. (Heritrix/HttpClient) Cookies are kept in memory (instead of bdb etc.)
3. (HttpClient) HttpState access to cookies is sluggish (by implementation and API).
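One possible direction for problem 3, sketched under assumptions: key the cookies by host, so that adding a cookie is a hash lookup and a request only looks at its own host's cookies. HostKeyedCookieStore is a hypothetical class, not part of Heritrix or HttpClient; it matches hosts exactly (real cookie domain matching is more involved), ignores expiry, and still keeps everything in memory.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical host-keyed cookie store; exact host match only, no expiry. */
final class HostKeyedCookieStore {

    // host -> (path + '\n' + name -> value)
    private final ConcurrentHashMap<String, ConcurrentHashMap<String, String>> byHost =
            new ConcurrentHashMap<String, ConcurrentHashMap<String, String>>();

    /** Adds or replaces a cookie without scanning other hosts' cookies. */
    void add(String host, String path, String name, String value) {
        byHost.computeIfAbsent(host, h -> new ConcurrentHashMap<String, String>())
              .put(path + '\n' + name, value);
    }

    /** Returns a read-only view of one host's cookies; nothing is copied. */
    Map<String, String> cookiesFor(String host) {
        Map<String, String> cookies = byHost.get(host);
        if (cookies == null) {
            return Collections.<String, String>emptyMap();
        }
        return Collections.unmodifiableMap(cookies);
    }

    /** Number of hosts that currently have at least one stored cookie. */
    int hostCount() {
        return byHost.size();
    }

    public static void main(String[] args) {
        HostKeyedCookieStore store = new HostKeyedCookieStore();
        store.add("www.ibbd.org", "/", "sid", "36");
        store.add("www.ibbd.org", "/", "sid", "37"); // replaces the previous value
        System.out.println(store.cookiesFor("www.ibbd.org").size() + " cookie(s) for www.ibbd.org");
    }
}
```

Because there is no single global lock or global list, threads fetching from different hosts would not block each other in this layout.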
I would still regard them as RFEs rather than bugs, because breadth-first crawls are still experimental. However, I would assign them a high priority.
....
Christian
--
Christian Kohlschütter
mailto: ck -at- NewsClub.de
Me again:
On 1. above, this feature was implemented before this issue was filed.
On 2. above, we've noticed this in past memory profiling.
On 3., stands to reason.
We should come up with a solution that we can contribute back to the HttpClient people.
Gordon Mohr
Performance
1.6.0
Public
Date: 2007-03-14 01:42
Date: 2005-09-27 00:06
Date: 2005-08-11 22:44
Date: 2005-06-06 21:53
Date: 2005-06-06 17:49
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2005-09-27 00:06 | gojomo |
| close_date | - | 2005-09-27 00:06 | gojomo |
| artifact_group_id | None | 2005-09-23 20:53 | gojomo |
| priority | 6 | 2005-09-23 18:37 | gojomo |
| priority | 7 | 2005-08-11 22:44 | gojomo |
| assigned_to | nobody | 2005-06-04 00:18 | gojomo |