Heritrix: Internet Archive Web Crawler

Tracker: Bugs

8 Contain HttpClient HttpParser's OutOfMemoryError risk - ID: 1489132
Last Update: Comment added ( karl-ia )

HttpClient's HttpParser offers no default or optional
limits on header sizes or on the number of headers in an
HTTP response. As a result, sufficiently large or malformed
input can cause the HttpParser to use an unbounded amount
of memory, leading to an OutOfMemoryError.

Some discussion of this issue is in HttpClient's old
Bugzilla system at:

http://issues.apache.org/bugzilla/show_bug.cgi?id=25468

And I've filed a new issue with HttpClient's new JIRA
system at:

http://issues.apache.org/jira/browse/HTTPCLIENT-566

However, we will likely have to work around this in our
own code -- the HttpClient committers tend to write off
these kinds of shortcomings as not the library's concern.

--
A URL which triggered an HttpParser-related OOME on
recent .IT crawls was:

http://peeper.axisinc.com/nph-update3.cgi

(It appears to be a faulty implementation of MIME
multipart-replace server-push functionality.)


Submitted: Gordon Mohr ( gojomo ) - 2006-05-15 21:00
Priority: 8
Status: Closed
Resolution: Fixed
Assigned To: Karl Thiessen
Category: None
Group: 1.10.0
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 01:07
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-567 -- please add further
comments at that location.


Date: 2006-06-09 01:22
Sender: gojomo (Project Admin)

RecordingOutputStream already makes note of when headers end
and content begins, via the markContentBegin() method (and
contentBeginMark internal field).

I've added a (hardcoded) limit of 1MB for header material,
and all of the RecordingOutputStream write methods check to
ensure this limit is not exceeded; if it is, a
RecorderTooMuchHeaderException is thrown.
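
The check described above can be sketched roughly as follows. This is
an illustrative wrapper only, not the actual RecordingOutputStream
code: the class names HeaderLimitedOutputStream and
TooMuchHeaderException and the getPosition() helper are hypothetical
stand-ins, and making the exception a subclass of IOException is an
assumption made for the sketch.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical stand-in for RecorderTooMuchHeaderException. */
class TooMuchHeaderException extends IOException {
    TooMuchHeaderException(String message) {
        super(message);
    }
}

/** Illustrative wrapper; NOT the real RecordingOutputStream. */
class HeaderLimitedOutputStream extends OutputStream {
    /** 1MB cap on header material, matching the hardcoded limit described above. */
    static final long MAX_HEADER_MATERIAL = 1024 * 1024;

    private final OutputStream wrapped;
    private long position = 0;
    /** Stays at -1 until the end of the headers has been marked. */
    private long contentBeginMark = -1;

    HeaderLimitedOutputStream(OutputStream wrapped) {
        this.wrapped = wrapped;
    }

    /** Note that everything written from here on is content, not header. */
    void markContentBegin() {
        contentBeginMark = position;
    }

    /** Total bytes written so far (hypothetical helper for the sketch). */
    long getPosition() {
        return position;
    }

    @Override
    public void write(int b) throws IOException {
        checkHeaderLimit(1);
        wrapped.write(b);
        position++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        checkHeaderLimit(len);
        wrapped.write(buf, off, len);
        position += len;
    }

    /** Enforce the cap only while the content-begin mark is still unset. */
    private void checkHeaderLimit(int incoming) throws TooMuchHeaderException {
        if (contentBeginMark < 0 && position + incoming > MAX_HEADER_MATERIAL) {
            throw new TooMuchHeaderException(
                    "header material exceeds " + MAX_HEADER_MATERIAL + " bytes");
        }
    }
}
```

In this sketch, once markContentBegin() has been called the cap no
longer applies, so large response bodies are unaffected; only an
endless or oversized header section trips the exception.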

FetchHTTP now catches that exception and treats it
analogously to too-long and too-much-time truncations: the
material recorded so far is retained, and an annotation is
added to the crawl.log.
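
A minimal sketch of that catch-and-annotate behaviour, under stated
assumptions: FetchSketch, HeaderOverflowException, HEADER_LIMIT and the
"header-truncated" annotation text are all hypothetical names chosen
for illustration, and the real FetchHTTP and crawl.log handling are
considerably more involved.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical exception signalling too much header material. */
class HeaderOverflowException extends RuntimeException {
    HeaderOverflowException(String message) {
        super(message);
    }
}

/** Illustrative fetch loop; NOT the real FetchHTTP processor. */
class FetchSketch {
    /** Assumed cap, mirroring the 1MB limit mentioned above. */
    static final int HEADER_LIMIT = 1024 * 1024;

    /** Material recorded so far; kept even if the fetch is cut short. */
    final ByteArrayOutputStream recorded = new ByteArrayOutputStream();
    /** Stand-in for the annotations that end up in the crawl log. */
    final List<String> annotations = new ArrayList<>();

    /** Record one chunk of header bytes, refusing to grow past the cap. */
    private void recordHeaderBytes(byte[] chunk) {
        if (recorded.size() + chunk.length > HEADER_LIMIT) {
            throw new HeaderOverflowException("too much header material");
        }
        recorded.write(chunk, 0, chunk.length);
    }

    /** Treat header overflow like any other truncation: keep data, annotate. */
    void fetch(Iterable<byte[]> headerChunks) {
        try {
            for (byte[] chunk : headerChunks) {
                recordHeaderBytes(chunk);
            }
        } catch (HeaderOverflowException e) {
            // Hypothetical annotation text; the recorded bytes are retained.
            annotations.add("header-truncated");
        }
    }
}
```

The point of the design is that an oversized header section ends the
fetch early without discarding what was already recorded, so the
partial response can still be inspected later.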

Commit comment:

Fix for [ 1489132 ] Contain HttpClient HttpParser's
OutOfMemoryError risk
* RecordingOutputStream.java
    check size against new MAX_HEADER_MATERIAL limit until
    contentBeginMark is set; throw
    RecorderTooMuchHeaderException if limit exceeded
* RecorderTooMuchHeaderException.java
    exception to indicate excessive header material
* FetchHTTP.java
    catch the new exception and treat like other time/length
    truncations

Assigning to Karl for verification and possible development
of regression test.


Attached File

No Files Currently Attached

Changes ( 7 )

Field              Old Value                                              Date              By
summary            Contain HttpClient HttpParser's OutOfMemoryError risk  2006-09-11 22:03  karl-ia
status_id          Open                                                   2006-09-11 22:03  karl-ia
close_date         -                                                      2006-09-11 22:03  karl-ia
assigned_to        gojomo                                                 2006-06-09 01:22  gojomo
resolution_id      None                                                   2006-06-09 01:22  gojomo
artifact_group_id  None                                                   2006-06-01 23:27  gojomo
assigned_to        nobody                                                 2006-06-01 23:27  gojomo