Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 MultiThreadedHttpConnectionManager https already connected - ID: 1059237
Last Update: Comment added ( karl-ia )

20041103031304383 -2 - https://login.yahoo.com/robots.txt LRXRP https://login.yahoo.com/config/login?.src=chat&.done=http://chat.yahoo.com/c/roomlist.html&.intl=us no-type #041 - - -
java.net.SocketException: already connected
    at java.net.Socket.connect(Socket.java:433)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.connect(DashoA6275)
    at org.archive.crawler.fetcher.HeritrixSSLProtocolSocketFactory.createSocket(HeritrixSSLProtocolSocketFactory.java:135)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:669)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:369)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:178)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:437)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
    at org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:299)
    at org.archive.crawler.framework.Processor.process(Processor.java:102)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:255)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:131)
20041103031309994 -2 - https://sec.yimg.com/robots.txt LREEP https://sec.yimg.com/i/b5/arrow.gif no-type #025 - - -
java.net.SocketException: already connected
    at java.net.Socket.connect(Socket.java:433)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.connect(DashoA6275)
    at org.archive.crawler.fetcher.HeritrixSSLProtocolSocketFactory.createSocket(HeritrixSSLProtocolSocketFactory.java:135)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:669)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:369)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:178)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:437)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
    at org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:299)
    at org.archive.crawler.framework.Processor.process(Processor.java:102)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:255)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:131)

Looks like a problem in my SSL factory:

public Socket createSocket(String host, int port,
        InetAddress localAddress,
        int localPort, HttpConnectionParams params)
throws IOException, UnknownHostException {
    // Below code is from the DefaultSSLProtocolSocketFactory#createSocket
    // method only it has workarounds to deal with pre-1.4 JVMs. I've
    // cut these out.
    if (params == null) {
        throw new IllegalArgumentException("Parameters may not be null");
    }
    Socket socket = null;
    int timeout = params.getConnectionTimeout();
    if (timeout == 0) {
        socket = createSocket(host, port, localAddress, localPort);
    } else {
        socket = this.sslfactory.createSocket();
        InetAddress hostAddress = getHostAddress(host);
        InetSocketAddress address = (hostAddress != null)?
            new InetSocketAddress(hostAddress, port):
            new InetSocketAddress(host, port);
        socket.connect(address, timeout);
        try {
            socket.connect(address, timeout);
        } catch (SocketTimeoutException e) {
            // Add timeout info. to the exception.
            throw new SocketTimeoutException(e.getMessage() +
                ": timeout set at " + Integer.toString(timeout) + "ms.");
        }
        assert socket.isConnected(): "Socket not connected " + host;
    }
    return socket;
}


I need to somehow ask httpclient if it already has a
socket, and if so, reuse it rather than make a new
connection.
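
For reference, the exception in the traces above can be reproduced without Heritrix or httpclient. The following is a minimal sketch, assuming a plain java.net.Socket connecting to a throwaway loopback listener in place of the SSL socket; the class and method names are hypothetical and not part of the crawler. It calls connect() twice on the same socket, as the factory code above does, and reports the SocketException raised by the second call.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;

// Hypothetical stand-alone demo; not part of Heritrix.
class DoubleConnectDemo {

    /**
     * Connects one socket twice to a local listener.
     * Returns the message of the SocketException raised by the
     * second connect(), or null if no exception was raised.
     */
    static String secondConnectError() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port. The listen backlog
        // completes the TCP handshake, so no accept() call is needed.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket socket = new Socket()) {
            InetSocketAddress address =
                new InetSocketAddress(loopback, server.getLocalPort());
            socket.connect(address, 1000);      // first connect succeeds
            try {
                socket.connect(address, 1000);  // second connect, same socket
                return null;
            } catch (SocketException e) {
                return e.getMessage();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second connect: " + secondConnectError());
    }
}
```

If this behaves the same way against the SSL socket, the duplicated socket.connect(address, timeout) call in the factory would be enough to explain the trace, independent of any connection reuse inside httpclient.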


Michael Stack ( stack-sf ) - 2004-11-03 03:31

Priority: 9

Status: Closed

Resolution: Fixed

Assigned to: Michael Stack

Category: 3rd-party libs

Group: None

Visibility: Public


Comments ( 6 )

Date: 2007-03-14 00:17
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-279 -- please add further
comments at that location.


Date: 2004-11-03 21:28
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed. Closing.

Fix for '[ 1059237 ] MultiThreadedHttpConnectionManager https already connected'
* src/java/org/archive/crawler/fetcher/HeritrixSSLProtocolSocketFactory.java
  Removed one of two connects added mistakenly in last edit.



Date: 2004-11-03 20:36
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

+1
One connect looks like plenty to me!


Date: 2004-11-03 20:24
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Assigning to gordon for review. Assign back when done.


Date: 2004-11-03 20:24
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Assigning to gordon for review. Assign back when done.


Date: 2004-11-03 18:02
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Here's the fix:

Index: src/java/org/archive/crawler/fetcher/HeritrixSSLProtocolSocketFactory.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/fetcher/HeritrixSSLProtocolSocketFactory.java,v
retrieving revision 1.1
diff -u -r1.1 HeritrixSSLProtocolSocketFactory.java
--- src/java/org/archive/crawler/fetcher/HeritrixSSLProtocolSocketFactory.java	28 Oct 2004 17:59:27 -0000	1.1
+++ src/java/org/archive/crawler/fetcher/HeritrixSSLProtocolSocketFactory.java	3 Nov 2004 18:01:44 -0000
@@ -130,7 +130,6 @@
             InetSocketAddress address = (hostAddress != null)?
                 new InetSocketAddress(hostAddress, port):
                 new InetSocketAddress(host, port);
-            socket.connect(address, timeout);
             try {
                 socket.connect(address, timeout);
             } catch (SocketTimeoutException e) {

Upped the priority so it gets consideration.
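
With the patch applied, the timeout branch makes exactly one connect() call. The following is a minimal sketch of that branch in isolation, assuming a javax.net.SocketFactory is passed in where the real class uses its sslfactory field, and a loopback listener stands in for a remote host. The class name, the connectOnce and connectsToLocalListener helpers, and the close() on timeout are illustrative additions, not the committed code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import javax.net.SocketFactory;

// Hypothetical stand-alone sketch; not the Heritrix class itself.
class SingleConnectSketch {

    /** Creates an unconnected socket and connects it exactly once. */
    static Socket connectOnce(SocketFactory factory,
            InetSocketAddress address, int timeout) throws IOException {
        Socket socket = factory.createSocket();
        try {
            socket.connect(address, timeout);
        } catch (SocketTimeoutException e) {
            // Assumption: release the socket before rethrowing
            // (the code in the report does not do this).
            socket.close();
            // Add timeout info to the exception, as in the original.
            throw new SocketTimeoutException(e.getMessage()
                + ": timeout set at " + timeout + "ms.");
        }
        return socket;
    }

    /** Exercises connectOnce against a throwaway loopback listener. */
    static boolean connectsToLocalListener() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket socket = connectOnce(SocketFactory.getDefault(),
                 new InetSocketAddress(loopback, server.getLocalPort()),
                 1000)) {
            return socket.isConnected();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("connected: " + connectsToLocalListener());
    }
}
```

Passing SSLSocketFactory.getDefault() instead of SocketFactory.getDefault() should give the HTTPS variant, though that path is not exercised here.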


Changes ( 6 )

Field          Old Value  Date              By
status_id      Open       2004-11-03 21:28  stack-sf
resolution_id  None       2004-11-03 21:28  stack-sf
close_date     -          2004-11-03 21:28  stack-sf
assigned_to    gojomo     2004-11-03 20:36  gojomo
assigned_to    stack-sf   2004-11-03 20:24  stack-sf
priority       5          2004-11-03 18:02  stack-sf