Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

8 [contribution] SingleConnectionManager, range and close hdrs - ID: 1143892
Last Update: Comment added ( karl-ia )

Attached are two patches by Christian Kohlschuetter.
They add a facility for choosing the connection manager,
a Single Connection Manager implementation, a
'Connection: close' header, and a Range header that is
sent when a download limit is set.

Related to [ 1080925 ] MultiThreadedConnectionManager
bottleneck.


Michael Stack ( stack-sf ) - 2005-02-18 16:24

8

Closed

None

Michael Stack

None

None

Public


Comments ( 5 )

Date: 2007-03-14 01:39
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-904 -- please add further
comments at that location.


Date: 2005-03-07 21:44
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Implemented with below commit message.

Here are also other comments by Christian regarding this patch
that were sent to the list:


Hi,

I have modified FetchHTTP a little bit and would like to contribute my
changes. Please tell me if you are interested in the following features:

- Provide an expert configuration setting to choose the HttpConnectionManager
(I use my own instead of MultiThreadedHttpConnectionManager)

- Be polite to the HTTP servers and send the "Range" header, stating that you
are only interested in the first n bytes (if max-length-bytes > 0).
This results in a "206 Partial Content" status, which is probably better than
just cutting the full response afterwards.

- Always send a "Connection: close" header

- Shut down the connection manager at the finalTasks step, if possible.
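
To illustrate the Range and "Connection: close" items in the list above, here
is a minimal sketch of how the extra request headers could be computed. It is
not taken from the attached patch: the class name PoliteHeadersSketch, the
method extraHeaders, and its parameters are hypothetical, and it uses only
java.util rather than HttpClient so that it stays self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the code from fetchhttp-improvements.patch. */
final class PoliteHeadersSketch {

    /**
     * Builds the extra request headers discussed above.
     *
     * @param maxLengthBytes      download limit; a value <= 0 means "no limit"
     * @param sendConnectionClose whether to ask the server to close the connection
     */
    static Map<String, String> extraHeaders(long maxLengthBytes, boolean sendConnectionClose) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (sendConnectionClose) {
            headers.put("Connection", "close");
        }
        if (maxLengthBytes > 0) {
            // Byte ranges are zero-based and inclusive, so the first n bytes are 0..n-1.
            headers.put("Range", "bytes=0-" + (maxLengthBytes - 1));
        }
        return headers;
    }
}
```

With a limit of 1000 bytes this yields "Range: bytes=0-999"; with no limit, no
Range header is produced at all.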

The patch is attached (I was too lazy to split it into several ones, please
excuse).

Greetings,

Christian


On Thursday 17 February 2005 17:48, stack wrote:
> I'd say we should just decide for the operator which CM to use. It's a
> detail an operator need not be concerned with.
>
> Do you find a single connection CM more performant somehow, or is it just
> that it makes sense to you that the CM be advertised as single
> connection only?
>
> Your patch doesn't include your single connection CM. Send it over and
> I'll just add it in place of the MTCM.

I think the choice of CM depends on the crawl scope. Whenever long-term
(inter-thread) connection reuse is possible (at some later point where we
would allow keep-alives), a MultiThreadedHttpConnectionManager would be the
best choice.

However, for broad crawls, it is unlikely that connections can be reused if we
have lots of different hosts (issuing subsequent requests to one host does
not require inter-thread connection reuse).

For my specific crawl (broad-scope, breadth-first around ODP), I use a
ThreadLocal variant of HttpClient's SimpleHttpConnectionManager.
Right now, I do not have really good benchmarks, but it seems that I was able
to increase the number of processed documents from about 25 to 40.

The connections are closed by a specific CloserThread, thereby removing any
unnecessary pauses for closing sockets in the ToeThread.
Connections can be reused only if consecutive queries match the same host
configuration.
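
The idea described above (one connection per worker thread, reuse only for the
same host, socket closes handed to a background thread) can be sketched
roughly as follows. This is an illustration under stated assumptions, not the
attached ThreadLocalHttpConnectionManager: the class
ThreadLocalConnectionSketch, its acquire and shutdown methods, and the plain
String host key are all hypothetical, and connections are modelled as
java.io.Closeable instead of HttpClient connection objects.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Supplier;

/** Hypothetical sketch; not the attached ThreadLocalHttpConnectionManager. */
final class ThreadLocalConnectionSketch<C extends Closeable> {

    /** Per-thread slot: the host key last used and its open connection. */
    private static final class Slot<T> {
        String hostKey;
        T connection;
    }

    private final ThreadLocal<Slot<C>> slot = ThreadLocal.withInitial(() -> new Slot<C>());
    private final BlockingQueue<Closeable> closeQueue = new LinkedBlockingQueue<>();
    private final Thread closer;

    ThreadLocalConnectionSketch() {
        closer = new Thread(this::drainCloseQueue, "closer-sketch");
        closer.setDaemon(true);
        closer.start();
    }

    /** Returns this thread's connection, reusing it only if the host key matches. */
    C acquire(String hostKey, Supplier<C> opener) {
        Slot<C> s = slot.get();
        if (s.connection != null && hostKey.equals(s.hostKey)) {
            return s.connection; // consecutive request to the same host: reuse
        }
        if (s.connection != null) {
            closeQueue.add(s.connection); // do not block the worker on close()
        }
        s.hostKey = hostKey;
        s.connection = opener.get();
        return s.connection;
    }

    /** Stops the background closer thread. */
    void shutdown() {
        closer.interrupt();
    }

    private void drainCloseQueue() {
        try {
            while (true) {
                Closeable c = closeQueue.take();
                try {
                    c.close();
                } catch (IOException e) {
                    // A failed close is ignored in this sketch.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The design choice being illustrated is that the worker thread never waits on a
socket close: it only enqueues the old connection and moves on to the new host.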

I have attached my ThreadLocalHttpConnectionManager to this mail. Please note
that it should remain Apache-licensed as I am probably going to get it
incorporated into HttpClient.

> > - Be polite to the HTTP servers and send the "Range" header, stating
(...)
> > - Always send a "Connection: close" header
>
> The above look like good additions but are HTTP/1.1 features when the
> crawler is advertising itself as HTTP 1.0. Are you finding evidence
> that they improve the way servers react to the crawler?

Yes, I should have stated explicitly that these headers are HTTP/1.1 :)

Even if the request was issued with HTTP/1.0, I sometimes got unwanted
"Connection: keep-alive" responses, so sending the "Connection: close" was a
solution for me.

Interestingly, after applying the Range header, 36.4% (1,891,996 so far) of
all HTTP 2xx responses I got were "206 Partial Content", so yes, I think it
improves behaviour. On the other hand, I just found out that I sometimes
(0.039% of all responses) get 416 Requested Range Not Satisfiable -- when the
Content-Length was 0, for example. In this case, requests have to be reissued
(this needs an additional patch, though).
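
The response handling described above could be sketched as a small decision
function. This is an assumption about what the "additional patch" might look
like, not code from it: the class RangeResponseSketch, the Action enum, and
the classify method are hypothetical names. Note that a server may also ignore
the Range header and answer 200 with the full body, so truncating at the
length limit is still needed in that case.

```java
/** Hypothetical sketch of reacting to responses after sending a Range header. */
final class RangeResponseSketch {

    enum Action {
        ACCEPT_PARTIAL,      // 206: the server honoured the Range header
        ACCEPT_FULL,         // other 2xx: Range was ignored; the caller must still truncate
        RETRY_WITHOUT_RANGE, // 416: reissue the request without the Range header
        OTHER                // anything else is handled as it was before
    }

    static Action classify(int statusCode, boolean rangeHeaderSent) {
        if (statusCode == 206) {
            return Action.ACCEPT_PARTIAL;
        }
        if (statusCode == 416 && rangeHeaderSent) {
            // Seen, for example, when the resource has Content-Length 0.
            return Action.RETRY_WITHOUT_RANGE;
        }
        if (statusCode >= 200 && statusCode < 300) {
            return Action.ACCEPT_FULL;
        }
        return Action.OTHER;
    }
}
```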

Besides that, I somehow feel better having a correct but partial response
instead of simply cutting a full response after some number of bytes :-)


Christian


Implemented '[ 1114133 ] Add referer header' and '[ 1143892 ] [contribution]
SingleConnectionManager, range and close hdrs'.
Bulk of this commit is based on a patch contributed by
Christian Kohlschuetter -- ck-heritrix at newsclub dot de
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
Added new defines for Range, Referer, Connection.
Added new options to enable/disable the Connection: close, Referer, and Range
headers. Connection: close and Referer are on by default; Range is off by
default.
Using Christian's new ThreadLocalHttpConnectionManager in place of
MultiThreadedHttpConnectionManager (this change also has an effect on
memory retention issues -- it's improving things).
* src/java/org/archive/httpclient/ThreadLocalHttpConnectionManager.java
Added contribution by Christian. The difference between this SingleCM
and the old one is that it adds timeouts and that, rather than waiting on
socket close before opening a new one, it puts the socket close onto a queue
for a background thread to close when it gets cycles.



Date: 2005-03-07 21:32
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Changed summary. Won't add choice of connection manager till
it makes more sense. Instead, have this RFE be about adding
the rest of Christian's patches.


Date: 2005-03-02 20:02
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Evaluate adopting single CM as default if possible.

Separate out 'connection:close' and 'range:' options... make
them optional facilities if there's any risk that they will
prevent content from being retrieved.


Date: 2005-03-02 07:11
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

I'd prefer that the operator not have to choose the
ConnectionManager implementation -- we should just do the
right thing.

If in fact there are extra HTTP features that can be
optionally enabled, we could have settings for those (and
then if a different CM is required, just use it).




Attached Files ( 2 )

Filename                               Description
ThreadLocalHttpConnectionManager.java  Single Connection Manager implementation
fetchhttp-improvements.patch           FetchHttp improvements.

Changes ( 5 )

Field       Old Value                                      Date              By
status_id   Open                                           2005-03-07 21:44  stack-sf
close_date  -                                              2005-03-07 21:44  stack-sf
summary     [contribution] Choice of Connection Manager    2005-03-07 21:32  stack-sf
File Added  121794: fetchhttp-improvements.patch           2005-02-18 16:25  stack-sf
File Added  121792: ThreadLocalHttpConnectionManager.java  2005-02-18 16:24  stack-sf