Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 Cookies are thread traffic jam and memory hog - ID: 1208757
Last Update: Comment added ( karl-ia )

From Christian Kohlscheutter:

Dear all,

I have probably found another code hotspot, which unnecessarily slows down
crawling and also leads to OOME.

After extending ToeThread report() to also show the stack trace for each
thread (a feature introduced with Java 5.0), I realized that most of my 200
ToeThreads hang in HttpClient code for several seconds, sometimes even
minutes, trying either to add cookie information to HttpState or to get it
from there. For example:

ToeThread #6
#6
http://www.freenet.hamilton.on.ca/Information/NEST/nature/niaghawk/statistics.htm
(0 attempts)
L http://www.freenet.hamilton.on.ca/link/niaghawk/
Current processor: HTTP
ACTIVE for 8m3s803ms
Where: PROCESSING for 483813ms

org.apache.commons.httpclient.HttpState.getCookies(HttpState.java:172)
org.apache.commons.httpclient.HttpMethodBase.addCookieRequestHeader(HttpMethodBase.java:1183)
org.apache.commons.httpclient.HttpMethodBase.addRequestHeaders(HttpMethodBase.java:1307)
org.apache.commons.httpclient.HttpMethodBase.writeRequestHeaders(HttpMethodBase.java:2027)
org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:1912)
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:997)
org.archive.httpclient.HttpRecorderGetMethod.execute(HttpRecorderGetMethod.java:117)
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:382)
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:168)
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:393)
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:397)
org.archive.crawler.framework.Processor.process(Processor.java:103)
org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:283)
org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)

and

ToeThread #54
#54
http://www.ibbd.org/modules.php?op=modload&name=Recommend_Us&file=index&req=FriendSend&sid=36
(0 attempts)
LL http://www.ibbd.org/modules.php?op=modload&name=News&file=article&sid=36&mode=thread&order=0&thold=0
Current processor: HTTP
ACTIVE for 25s448ms
Where: PROCESSING for 25450ms

org.apache.commons.httpclient.HttpState.addCookie(HttpState.java:124)
org.apache.commons.httpclient.HttpMethodBase.processResponseHeaders(HttpMethodBase.java:1502)
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1591)
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:999)
org.archive.httpclient.HttpRecorderGetMethod.execute(HttpRecorderGetMethod.java:117)
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:382)
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:168)
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:393)
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:397)
org.archive.crawler.framework.Processor.process(Processor.java:103)
org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:283)
org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)

and so on.

Right now, I have crawled 1,810,981 pages from various hosts in 13 hours.
I guess I have accumulated many cookies during the crawl (I cannot tell how
many).

As far as I understand, FetchHTTP lets HttpClient always accept cookies, and
HttpClient stores them in an in-memory ArrayList (!) in HttpState. In order
to skip cookie duplicates, HttpState.addCookie() seems to *iterate* over that
list and compare each entry with the new cookie. Of course, this takes ages
if you have a lot of cookies in that list.

Since the addCookie() and getCookies() methods are synchronized, all other
threads that try to get the cookie list are simply blocked. Also, the
getCookies() method is rather slow because it returns the list of cookies as
an array (copying all the entries).
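
To make the contention pattern above concrete, here is a minimal sketch of a list-backed store with the same shape. It is not HttpClient's actual source: the class name ListCookieStore and the use of a String[] of {domain, path, name, value} as a cookie are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a list-backed cookie store (illustration only). */
class ListCookieStore {
    // Each entry is {domain, path, name, value}.
    private final List<String[]> cookies = new ArrayList<>();

    /** Adds a cookie, replacing any entry with the same domain, path and name. */
    synchronized void addCookie(String domain, String path, String name, String value) {
        // Linear scan: every insert is compared against every stored cookie,
        // and all other threads wait on the lock while this runs.
        for (int i = 0; i < cookies.size(); i++) {
            String[] c = cookies.get(i);
            if (c[0].equals(domain) && c[1].equals(path) && c[2].equals(name)) {
                cookies.remove(i);
                break;
            }
        }
        cookies.add(new String[] {domain, path, name, value});
    }

    /** Returns a copy of the whole list as an array, again while holding the lock. */
    synchronized String[][] getCookies() {
        return cookies.toArray(new String[0][]);
    }
}
```

With millions of fetched pages and a few hundred threads sharing one such object, both methods cost time proportional to the total number of cookies on every request, which matches the stack traces above.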

To sum it up, we have three problems here:
1. (Heritrix) There is no way to turn off cookie support (would be a
   quick fix).
2. (Heritrix/HttpClient) Cookies are kept in memory (instead of bdb etc.).
3. (HttpClient) HttpState access to cookies is sluggish (by implementation
   and API).

I would still regard them not as bugs but as RFEs, because breadth-first
crawls are still experimental. However, I would assign a high priority.

....


Christian
--
Christian Kohlschütter
mailto: ck -at- NewsClub.de

Me again:

On 1. in the above, this feature was implemented before this issue was
filed.
On 2. in the above, we've noticed this in past memory profiling.
On 3., stands to reason.

We should come up with a solution that we can contribute back to the
HttpClient people.


Submitted: Michael Stack ( stack-sf ) - 2005-05-25 21:08
Priority: 7
Status: Closed
Resolution: None
Assigned: Gordon Mohr
Category: Performance
Group: 1.6.0
Visibility: Public


Comments ( 5 )

Date: 2007-03-14 01:42
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-940 -- please add further
comments at that location.


Date: 2005-09-27 00:06
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

The challenge with contributing the change back to
HttpClient is ensuring that the Iterators created from the
configured SortedMap are closed properly (when they happen
to be BDB StoredIterators).

There's no quick/elegant fix; discussion with Sleepycats
just highlighted other issues.

So, considering this issue fixed, even without doing the
contribution back to HttpClient at this time. Closing.


Date: 2005-08-11 22:44
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Note: this has been resolved in our code as of the 6/6
commit, but I'd like to contribute the change back to
HttpClient, if the Bdb dependency can be eliminated
(StoredIterator.close()). Lowering the priority.



Date: 2005-06-06 21:53
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Using Sets instead of Maps would present a challenge in
defining the boundaries for sub(set|map)s -- couldn't use a
synthesized String as easily, might have to use a
synthesized Cookie instance. So sticking with Maps-based
approach for now.
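
The boundary problem with Sets can be sketched as follows. This is only an illustration under assumptions: SimpleCookie and CookieSetIndex are hypothetical classes, not HttpClient or Heritrix types, and matching is reduced to an exact domain comparison. The point is that a sorted set can only be range-queried with probe elements of its own type, so the bounds have to be synthesized cookie instances.

```java
import java.util.Comparator;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical minimal cookie value type (illustration only). */
class SimpleCookie {
    final String domain, path, name, value;

    SimpleCookie(String domain, String path, String name, String value) {
        this.domain = domain;
        this.path = path;
        this.name = name;
        this.value = value;
    }
}

/** Hypothetical set-based index ordered by domain, then path, then name. */
class CookieSetIndex {
    static final Comparator<SimpleCookie> ORDER =
            Comparator.comparing((SimpleCookie c) -> c.domain)
                      .thenComparing(c -> c.path)
                      .thenComparing(c -> c.name);

    private final SortedSet<SimpleCookie> cookies = new TreeSet<>(ORDER);

    synchronized void add(SimpleCookie c) {
        // TreeSet.add keeps the old element when the comparator says "equal",
        // so remove first to let the new value win.
        cookies.remove(c);
        cookies.add(c);
    }

    synchronized SortedSet<SimpleCookie> forDomain(String domain) {
        // Synthesized probe cookies mark the range boundaries.
        SimpleCookie low = new SimpleCookie(domain, "", "", "");
        SimpleCookie high = new SimpleCookie(domain + '\u0000', "", "", "");
        return new TreeSet<>(cookies.subSet(low, high));
    }
}
```

With a map keyed by a synthesized String, the same range can be expressed with two plain strings, which is the convenience referred to above.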

Use of StoredIterator.close(), just in case iterators are
over stored collections, will probably make the exact current
code uninteresting to the HttpClient project for integration...
needs further investigation into how we can enable
StoredSortedMap usage without leaking open iterators.
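
One way to picture the constraint is a library-neutral helper that closes an iterator only when it actually holds resources. The sketch below uses java.io.Closeable as a stand-in for that capability; it does not use the Bdb API, and TrackingIterator and IteratorCloser are hypothetical names introduced for illustration.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

/** Hypothetical iterator that holds a resource and records whether it was closed. */
class TrackingIterator<T> implements Iterator<T>, Closeable {
    private final Iterator<T> delegate;
    boolean closed = false;

    TrackingIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    @Override public boolean hasNext() { return delegate.hasNext(); }
    @Override public T next() { return delegate.next(); }
    @Override public void close() { closed = true; }
}

/** Hypothetical helper: callers never need to know what backs the collection. */
class IteratorCloser {
    /** Closes the iterator if it is Closeable; plain in-memory iterators are left alone. */
    static void closeQuietly(Iterator<?> it) {
        if (it instanceof Closeable) {
            try {
                ((Closeable) it).close();
            } catch (IOException e) {
                // Ignored in this sketch; real code would log or rethrow.
            }
        }
    }

    /** Walks the iterator and always releases it, even if iteration throws. */
    static int countAndClose(Iterator<?> it) {
        int n = 0;
        try {
            while (it.hasNext()) {
                it.next();
                n++;
            }
        } finally {
            closeQuietly(it);
        }
        return n;
    }
}
```

The try/finally is the part that matters: any code path that obtains an iterator over a possibly stored collection has to guarantee the close, which is what ties the current code to the Bdb classes.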


Date: 2005-06-06 17:49
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

HttpClient classes changed to use SortedMap instead of
ArrayList; narrowly look for matching cookies from those
with plausible domains, rather than whole list. FetchHTTP
changed to optionally insert bdb-backed StoredSortedMap in
place of default HttpClient cookie map.
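
A rough sketch of the idea, not the committed code: SortedCookieIndex, its key layout (domain, path and name joined by a NUL separator) and the exact-domain lookup are all assumptions for illustration. The actual change matches against plausible domains, which involves more than an exact comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sorted-map cookie index (illustration only). */
class SortedCookieIndex {
    // Assumed separator: sorts below every character that can occur in a domain.
    private static final String SEP = "\u0000";

    private final SortedMap<String, String> cookies;

    /** Uses an in-memory TreeMap by default. */
    SortedCookieIndex() {
        this(new TreeMap<String, String>());
    }

    /** Allows the backing map to be replaced, e.g. by a disk-backed SortedMap. */
    SortedCookieIndex(SortedMap<String, String> backing) {
        this.cookies = backing;
    }

    /** Sort key that keeps all cookies of one domain adjacent in sort order. */
    static String sortKey(String domain, String path, String name) {
        return domain + SEP + path + SEP + name;
    }

    synchronized void addCookie(String domain, String path, String name, String value) {
        // put() replaces a duplicate in O(log n); no scan over the whole store.
        cookies.put(sortKey(domain, path, name), value);
    }

    /** Returns only the values stored for this exact domain. */
    synchronized List<String> cookiesFor(String domain) {
        // Every key for this domain lies in [domain + NUL, domain + U+0001).
        SortedMap<String, String> range = cookies.subMap(domain + SEP, domain + "\u0001");
        return new ArrayList<>(range.values());
    }
}
```

Both operations now touch only the entries near the requested domain, so the time spent inside the lock no longer grows with the total number of cookies collected during the crawl.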

HttpClient changes are in the form of our own versions of
those source files until facility can be
packaged/contributed/integrated to main HttpClient.

Commit comment:
Implementation for [ 1208757 ] Cookies are thread traffic jam and memory hog
* org.apache.commons.httpclient.Cookie.java
    add a 'sort key' that places candidate cookies (same domain)
    adjacent in sort order
* org.apache.commons.httpclient.HttpState.java
    change cookies to SortedMap; allow replacement of default map
* org.apache.commons.httpclient.HttpMethodBase.java
    use updated map-based match()
* org.apache.commons.httpclient.cookie.CookieSpec.java
    add map-based match() prototype; deprecate array-based
* org.apache.commons.httpclient.cookie.CookieSpecBase.java
    add map-based match() implementation, deprecate array-based
    also: apply fix to domainMatch from HttpClient bug
    http://issues.apache.org/bugzilla/show_bug.cgi?id=35225
* org.archive.crawler.fetcher.FetchHTTP.java
    enable optional use of Bdb-backed StoredSortedMap for cookies
    (default: bdb is used)

Looking now, I think SortedSet and StoredSortedValueSet with
an appropriate Comparator could be used in place of the maps
and 'sort key' approach; may update to do that later.


Changes ( 6 )

Field              Old Value  Date              By
status_id          Open       2005-09-27 00:06  gojomo
close_date         -          2005-09-27 00:06  gojomo
artifact_group_id  None       2005-09-23 20:53  gojomo
priority           6          2005-09-23 18:37  gojomo
priority           7          2005-08-11 22:44  gojomo
assigned_to        nobody     2005-06-04 00:18  gojomo