Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 Fetching simple URLs fails with S_CONNECT_FAILED (-2) error - ID: 908719
Last Update: Comment added ( karl-ia )

After only 20-30 seconds of running a crawl, URLs with a
-2 error code and 10 tries (10t) show up in crawl.log.

Some crawl configuration used:
max-tries = 10
retry-delay-seconds = 900
timeout-seconds = 1200
sotimeout-ms = 20000

First line of the crawl.log:

20040303013639842 1 55 #1 dns:www.cecc.gov 2132 text/dns P http://www.cecc.gov/
.
.
.

About 20s later:

20040303013658845 -2 . #4 http://www.csce.gov/images/map-lft.gif . . 10t E http://www.csce.gov/helsinki.cfm
20040303013658849 -2 . #3 http://www.csce.gov/images/text-search.gif . . 10t E http://www.csce.gov/helsinki.cfm
20040303013658853 -2 . #2 http://www.csce.gov/images/menu-privacy.gif . . 10t E http://www.csce.gov/helsinki.cfm
.
.
.
.

From local-errors.log (this image is retried 10 times
in only about 4 seconds):

First try:
20040303013649559 -2 . #4 http://www.csce.gov/images/map-lft.gif . . E http://www.csce.gov/helsinki.cfm
java.net.SocketException: Socket is closed
    at java.net.Socket.setSoTimeout(Socket.java:918)
    at org.apache.commons.httpclient.HttpConnection.setSoTimeout(HttpConnection.java:623)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.setSoTimeout(MultiThreadedHttpConnectionManager.java:1174)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:658)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
    at org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:178)
    at org.archive.crawler.framework.Processor.process(Processor.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.processingLoop(ToeThread.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:100)

Last try (same error) at timestamp:
20040303013653835
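
To make the timing mismatch concrete, the sketch below compares the two timestamps quoted above with the delay the configuration would suggest. It assumes the 17-digit log timestamps are laid out as yyyyMMddHHmmss followed by three millisecond digits, and that retry-delay-seconds is meant to apply before every try except the first. The class and method names (RetryTiming, parseLogTimestamp, elapsedMillis, minimumExpectedSeconds) are hypothetical and are not part of Heritrix.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class RetryTiming {
    // Assumed layout: first 14 digits are yyyyMMddHHmmss, last 3 are milliseconds.
    private static final DateTimeFormatter SECONDS_PART =
            DateTimeFormatter.ofPattern("uuuuMMddHHmmss");

    /** Parses a 17-digit log timestamp such as 20040303013649559. */
    public static LocalDateTime parseLogTimestamp(String ts) {
        if (ts == null || ts.length() != 17) {
            throw new IllegalArgumentException("expected 17 digits: " + ts);
        }
        LocalDateTime toSecond = LocalDateTime.parse(ts.substring(0, 14), SECONDS_PART);
        long millis = Long.parseLong(ts.substring(14));
        return toSecond.plus(millis, ChronoUnit.MILLIS);
    }

    /** Milliseconds elapsed between two log timestamps. */
    public static long elapsedMillis(String first, String last) {
        return Duration.between(parseLogTimestamp(first), parseLogTimestamp(last)).toMillis();
    }

    /** Assumption: the delay is applied before every try except the first. */
    public static long minimumExpectedSeconds(int maxTries, long retryDelaySeconds) {
        return (maxTries - 1) * retryDelaySeconds;
    }
}
```

Under those assumptions, elapsedMillis("20040303013649559", "20040303013653835") gives 4276 ms between the first and last try, while minimumExpectedSeconds(10, 900) gives 8100 s for the listed max-tries and retry-delay-seconds values.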


Submitted: Nobody/Anonymous ( nobody ) - 2004-03-03 02:00
Priority: 9
Status: Closed
Resolution: Fixed
Assigned To: Gordon Mohr
Category: General
Group: None
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 00:08
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-86 -- please add further
comments at that location.


Date: 2004-03-29 23:20
Sender: gojomo (Project Admin)

Believed fixed. Patched HTTPClient to avoid calling
setSoTimeout() on possibly-closed socket. Remedied mistaken
interpretation of operator-specified retry value as ms
instead of seconds. Altered Frontier retry mechanism so that
it is always queues, rather than individual items, that are
snoozed for a future retry.
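
The first two remedies described in this comment can be illustrated roughly as follows. This is only a sketch under stated assumptions, not the actual patch to HTTPClient or Heritrix: the class RetryFixSketch and its methods are hypothetical names, and only java.net.Socket and java.util.concurrent.TimeUnit are real library classes.

```java
import java.net.Socket;
import java.net.SocketException;
import java.util.concurrent.TimeUnit;

public class RetryFixSketch {
    /**
     * Sets the read timeout only when the socket is still open.
     * Returns true if the timeout was applied, false if the socket was
     * null, already closed, or rejected the call with a SocketException.
     */
    public static boolean setSoTimeoutIfOpen(Socket socket, int timeoutMs) {
        if (socket == null || socket.isClosed()) {
            return false; // avoids "SocketException: Socket is closed"
        }
        try {
            socket.setSoTimeout(timeoutMs);
            return true;
        } catch (SocketException e) {
            return false; // socket became unusable between the check and the call
        }
    }

    /** Converts an operator-specified delay in seconds to milliseconds. */
    public static long retryDelayMillis(long retryDelaySeconds) {
        return TimeUnit.SECONDS.toMillis(retryDelaySeconds);
    }
}
```

With the configuration from the report, retryDelayMillis(900) yields 900000 ms. Treating the raw value 900 as milliseconds would instead give a delay of under one second per retry.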




Date: 2004-03-04 00:16
Sender: gojomo (Project Admin)

Not completely sure what's going on, but while trying
different variants of the listed seeds, I have also triggered
variants of bug 900826: assertion errors caused by the
forceFetch(). (Earlier, such permutations of the triggering
seeds caused a different problem -- an infinite loop
attempting to remove() the top snoozed item -- on crawl12.
It's possible that in a VM without assertions enabled, that
would be a symptom of the 900826 problem.)

In any case, I think there may be several bugs interrelated
here -- starting with the string of initial connect-failures
(which only happen when there are other seeds mixed in),
continuing through the too-rapid-retries, possibly related
to the unsnooze infinite-loop and the assertion errors.

It may make sense to accelerate the refactoring of Frontier
implied by 896766 (hold all of class when one URI snoozed)
and 896772 (site-first prioritization) to correct -- or at
least simplify -- the causes of these problem retries.
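
The idea of holding a whole class of URIs when one of them is snoozed can be sketched as below. This is a toy model under stated assumptions, not the Heritrix Frontier: the class QueueSnoozeSketch, its methods, and the use of the host name as the class key are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueueSnoozeSketch {
    // One FIFO queue of URIs per class key (here assumed to be the host name).
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    // Wake-up time in milliseconds for each snoozed queue.
    private final Map<String, Long> wakeTimes = new HashMap<>();

    public void schedule(String classKey, String uri) {
        queues.computeIfAbsent(classKey, k -> new ArrayDeque<>()).addLast(uri);
    }

    /** Puts the failed URI back at the front and snoozes its whole queue. */
    public void snooze(String classKey, String failedUri, long wakeTimeMs) {
        queues.computeIfAbsent(classKey, k -> new ArrayDeque<>()).addFirst(failedUri);
        wakeTimes.put(classKey, wakeTimeMs);
    }

    /** Returns the next URI from a queue that is awake at nowMs, or null if none. */
    public String next(long nowMs) {
        for (Map.Entry<String, Deque<String>> entry : queues.entrySet()) {
            Long wake = wakeTimes.get(entry.getKey());
            if (wake != null && nowMs < wake) {
                continue; // the whole queue is held, not just the failed URI
            }
            wakeTimes.remove(entry.getKey());
            String uri = entry.getValue().pollFirst();
            if (uri != null) {
                return uri;
            }
        }
        return null;
    }
}
```

Because the wake-up time is attached to the queue, sibling URIs on the same host cannot be tried (and fail) while the first failure is waiting out its retry delay.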


Attached File

No Files Currently Attached

Changes ( 5 )

Field          Old Value  Date              By
status_id      Open       2004-03-29 23:20  gojomo
resolution_id  None       2004-03-29 23:20  gojomo
close_date     -          2004-03-29 23:20  gojomo
priority       5          2004-03-03 02:05  ia_igor
assigned_to    nobody     2004-03-03 02:05  ia_igor