Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 Upgrade httpclient to 3.0.x - ID: 1037304
Last Update: Comment added ( karl-ia )

Opening an issue for this upgrade because it took a
good deal of time, and so that it gets listed among the
features added for 1.2.


Michael Stack ( stack-sf ) - 2004-09-29 21:15

5

Closed

None

Michael Stack

None

None

Public


Comments ( 2 )

Date: 2007-03-14 01:34
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-832 -- please add further
comments at that location.


Date: 2004-09-29 21:49
Sender: stack-sf (Project Admin)

Committed. Here's the message used:

Fix for [ 1037304 ] Upgrade httpclient to 3.0.x
New httpclient adds improved performance, configuration granularity,
and timeouts on ssl connections. The IBM 141 JVM throws NPEs when
connecting to https and setting TcpNoDelay on ssl sockets with
timeouts. The IBM 142 JVM says socket not connected when setting
timeouts on https sockets. Sun JVM works fine.
Reintroduced MultiThreadedHttpConnectionManager in place of
SingleHttpConnectionManager; with SHCM, a close was shutting down
another thread's stream. This commit also allows setting command-line
args in heritrix.properties. The command-line content overrides
whatever is found in heritrix.properties.
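
The override behaviour described above can be sketched roughly as follows. This is only an illustration, not the code in Heritrix.java: the class name, the `merge` helper, the `--key=value` argument syntax and the property keys are all hypothetical.

```java
import java.util.Properties;

// Hypothetical sketch: defaults come from a properties file,
// and anything given on the command line wins.
final class CommandLineDefaults {

    /** Copy the file defaults, then let "--key=value" args override them. */
    static Properties merge(Properties fileDefaults, String[] args) {
        Properties merged = new Properties();
        merged.putAll(fileDefaults);
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // Ignore anything that is not of the assumed "--key=value" form.
            if (!arg.startsWith("--") || eq < 0) {
                continue;
            }
            merged.setProperty(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return merged;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("port", "8080");       // assumed key
        defaults.setProperty("login", "admin");     // assumed key
        Properties merged = merge(defaults, new String[] {"--port=9090"});
        System.out.println(merged.getProperty("port"));   // overridden value
        System.out.println(merged.getProperty("login"));  // file default kept
    }
}
```

Copying the defaults first and applying the arguments second keeps the precedence rule in one place: whatever is written last wins.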
* .classpath
* project.properties
* project.xml
Updated commons-logging from 1.0.3 to 1.0.4.
Added commons-codec-1.3. Needed by httpclient.
Upgraded httpclient from 2.0 to 3.0-alpha2.
* src/conf/heritrix.properties
Added properties for port and login.
* src/java/org/archive/crawler/Heritrix.java
Look for commandline defaults from heritrix.properties.
(DEFAULT_ENCODING): Added.
* src/java/org/archive/crawler/datamodel/UURI.java
Javadoc. Changes to match changes in the parent API, mostly the
fact that parent now takes a boolean of whether the URI is
escaped or not.
(getHost): If host is null -- i.e. dns -- parent now throws an
exception, so check for null before going in there.
* src/java/org/archive/crawler/datamodel/UURIFactory.java
Changes to match parent api changes.
(create): Added validity check to override.
(isEscaped): Make it public so can be used by UURI.
* src/java/org/archive/crawler/datamodel/credential/Rfc2617Credential.java
Refactoring to match rewritten parent auth model.
* src/java/org/archive/crawler/extractor/ExtractorHTTP.java
Use HttpMethod, the super for GetMethod and PostMethod,
rather than
GetMethod explicitly.
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
Refactoring to exploit new facility in httpclient 3.0.
Removed the immediateRetry facility. The httpclient 3.0 now does
retries itself internally. Set cookies all in one header,
non-fail if ambiguous status line, non-strict transfer encoding
and read ten lines of garbage before giving up on getting a status.
(handle401): Refactored because of new auth system in
httpclient 3.0.
(getAuthScheme): Refactored because of new auth system
in httpclient 3.0.
(setupHttp): Set connection and socket timeouts. Set max
connections and max per host. Set it so we use Nagle's algorithm
(conserves bandwidth).
* src/java/org/archive/httpclient/ConfigurableTrustManagerProtocolSocketFactory.java
Set a timeout on created ssl socket if timeout is non-zero.
* src/java/org/archive/httpclient/HttpRecorderGetMethod.java
* src/java/org/archive/httpclient/HttpRecorderPostMethod.java
HttpRecoverableException no longer exists.
* src/java/org/archive/io/arc/ARCConstants.java
Added DEFAULT_ENCODING
* src/java/org/archive/io/arc/ARCRecord.java
Use new ARCConstants constant for DEFAULT_ENCODING.
* src/java/org/archive/io/arc/ARCWriter.java
Use new ARCConstants constant for DEFAULT_ENCODING.
* src/java/org/apache/commons/httpclient/HttpConnection.java
Updated the HttpConnection overlay to be that from httpclient 3.0.
Here is the patch of what we add in the overlay:
[debord 964] heritrix > diff -u ~/bin/commons-httpclient-3.0-alpha2/src/java/org/apache/commons/httpclient/HttpConnection.java src/java/org/apache/commons/httpclient/HttpConnection.java
--- /home/stack/bin/commons-httpclient-3.0-alpha2/src/java/org/apache/commons/httpclient/HttpConnection.java 2004-09-19 13:41:08.000000000 -0700
+++ src/java/org/apache/commons/httpclient/HttpConnection.java 2004-09-29 12:42:20.000000000 -0700
@@ -49,6 +49,9 @@
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

+// HERITRIX import.
+import org.archive.util.HttpRecorder;
+
/**
* An abstraction of an HTTP {@link InputStream} and {@link OutputStream}
* pair, together with the relevant attributes.
@@ -676,7 +679,6 @@
highly interactive environments, such as some client/server
situations. In such cases, nagling may be turned off through
use of the TCP_NODELAY sockets option." */
-
socket.setTcpNoDelay(this.params.getTcpNoDelay());
socket.setSoTimeout(this.params.getSoTimeout());

@@ -701,8 +703,23 @@
if (inbuffersize > 2048) {
inbuffersize = 2048;
}
- inputStream = new BufferedInputStream(socket.getInputStream(), inbuffersize);
- outputStream = new BufferedOutputStream(socket.getOutputStream(), outbuffersize);
+ // START HERITRIX Change
+ HttpRecorder httpRecorder = HttpRecorder.getHttpRecorder();
+ if (httpRecorder == null) {
+ inputStream = new BufferedInputStream(
+ socket.getInputStream(), inbuffersize);
+ outputStream = new BufferedOutputStream(
+ socket.getOutputStream(), outbuffersize);
+ } else {
+ inputStream = httpRecorder.inputWrap((InputStream)
+ (new BufferedInputStream(socket.getInputStream(),
+ inbuffersize)));
+ outputStream = httpRecorder.outputWrap((OutputStream)
+ (new BufferedOutputStream(socket.getOutputStream(),
+ outbuffersize)));
+ }
+ // END HERITRIX change.
+
isOpen = true;
used = false;
} catch (IOException e) {
* src/java/org/apache/commons/httpclient/HttpParser.java
Updated the HttpParser overlay to be that from httpclient 3.0.
Here is the patch of what we add in the overlay:
[debord 966] heritrix > diff -u ~/bin/commons-httpclient-3.0-alpha2/src/java/org/apache/commons/httpclient/HttpParser.java src/java/org/apache/commons/httpclient/HttpParser.java
--- /home/stack/bin/commons-httpclient-3.0-alpha2/src/java/org/apache/commons/httpclient/HttpParser.java 2004-09-19 13:41:05.000000000 -0700
+++ src/java/org/apache/commons/httpclient/HttpParser.java 2004-09-29 14:23:03.000000000 -0700
@@ -185,11 +185,21 @@
// Otherwise we should have normal HTTP header line
// Parse the header name and value
int colon = line.indexOf(":");
+ // START HERITRIX Change
+ // Don't throw an exception if can't parse. We want to keep
+ // going even though header is bad. Rather, create
+ // pseudo-header.
if (colon < 0) {
- throw new ProtocolException("Unable to parse header: " + line);
+ // throw new ProtocolException("Unable to parse header: " +
+ // line);
+ name = "HttpClient-Bad-Header-Line-Failed-Parse";
+ value = new StringBuffer(line);
+
+ } else {
+ name = line.substring(0, colon).trim();
+ value = new StringBuffer(line.substring(colon + 1).trim());
}
- name = line.substring(0, colon).trim();
- value = new StringBuffer(line.substring(colon + 1).trim());
+ // END HERITRIX change.
}

}
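
For anyone reading the HttpParser patch above, here is a standalone sketch of the same idea: a header line without a colon is kept under a pseudo-header name instead of raising an exception. The class and method names are made up for illustration; only the pseudo-header name is taken from the patch, and the real overlay works on httpclient's own types rather than a String pair.

```java
// Hypothetical, self-contained illustration of the lenient header parse.
final class LenientHeaderParser {

    // Pseudo-header name as used in the patch above.
    static final String BAD_HEADER_NAME =
        "HttpClient-Bad-Header-Line-Failed-Parse";

    /** Returns {name, value}; never throws on a line that lacks a colon. */
    static String[] parseHeaderLine(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            // Keep going with the whole line stored as the value.
            return new String[] {BAD_HEADER_NAME, line};
        }
        return new String[] {
            line.substring(0, colon).trim(),
            line.substring(colon + 1).trim()
        };
    }

    public static void main(String[] args) {
        String[] lines = {"Content-Type: text/html", "garbage without a colon"};
        for (String line : lines) {
            String[] header = parseHeaderLine(line);
            System.out.println(header[0] + " => " + header[1]);
        }
    }
}
```

For a crawler the trade-off seems reasonable: one malformed header from a sloppy server should not cost the whole response, and the pseudo-header leaves a trace of what was received.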




Changes ( 2 )

Field        Old Value   Date               By
status_id    Open        2004-09-29 21:49   stack-sf
close_date   -           2004-09-29 21:49   stack-sf