Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

9 Hang in http fetcher when mid-fetch aborts - ID: 1067095
Last Update: Comment added ( karl-ia )

We hang in http fetcher if using midFetchAbort filters.

Reported by Tom Emerson. See below.

Tom gave me an order file and seeds to reproduce the
problem with. Happens fairly soon after startup using
his order and seeds.

When hung, thread dump showed we were stuck trying to
get a connection from host connection pool. See below.

Turns out we were not returning aborted connections to
the connection pool (didn't think it was necessary,
since doing the release after the abort threw ugly
"Connection is not Open" exceptions).

Added to this issue is a patch that first releases a
connection before calling abort. Does it for the
midfetch, and for timer and length aborts.
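
A minimal sketch of the failure mode described above, using only the Java standard library. HostPool, runAbortedFetches and the pool size of 2 are hypothetical stand-ins, not Heritrix or commons-httpclient code. The sketch only models whether a slot is handed back when a fetch is aborted; it does not reproduce the release-before-abort ordering of the attached patch.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class AbortedConnectionLeak {
    /** Hypothetical stand-in for a per-host connection pool (not the HttpClient API). */
    public static final class HostPool {
        private final Semaphore slots;

        public HostPool(int maxConnections) {
            slots = new Semaphore(maxConnections);
        }

        /** Waits up to timeoutMs for a free slot; returns false if none became free. */
        public boolean acquire(long timeoutMs) {
            try {
                return slots.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        public void release() {
            slots.release();
        }
    }

    /**
     * Simulates fetches that are all aborted mid-fetch and returns how many
     * of them managed to obtain a connection from the pool.
     */
    public static int runAbortedFetches(HostPool pool, int fetches, boolean releaseOnAbort) {
        int obtained = 0;
        for (int i = 0; i < fetches; i++) {
            // The sketch gives up after 50 ms; the hung threads in the dump
            // appear to have kept waiting at the equivalent point.
            if (!pool.acquire(50)) {
                continue;
            }
            obtained++;
            // ... a mid-fetch filter rejects the response here ...
            if (releaseOnAbort) {
                pool.release(); // hand the slot back when aborting
            }
            // Without the release, the slot never returns to the pool.
        }
        return obtained;
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + runAbortedFetches(new HostPool(2), 5, false)); // leaky: 2
        System.out.println("fixed: " + runAbortedFetches(new HostPool(2), 5, true));  // fixed: 5
    }
}
```

With the leak, only as many fetches as there are slots ever get a connection; every later fetch waits on an empty pool, which matches the picture of all toe threads parked in the HTTP processor.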

Giving to Tom to test.


I'm running a build synched from CVS head this afternoon.
All 50 threads are stuck: here's a subset of the toe
threads report:

Toe threads report - 200411130004
Job being crawled: Vietnamese1
Number of toe threads in pool: 50 (50 active)
ToeThread #1
#1 http://www.saigonnet.vn/homepage-data/tb/2004/tb-taikhoan.htm (0 attempts)
X http://www.saigonnet.vn/
Current processor: HTTP
ACTIVE for 21m33s576ms
Where: ABOUT_TO_BEGIN_PROCESSOR for 1293575ms
ToeThread #2
#2 http://www.mot.gov.vn/en/index.asp (0 attempts)
L http://www.mot.gov.vn/
Current processor: HTTP
ACTIVE for 21m23s194ms
Where: ABOUT_TO_BEGIN_PROCESSOR for 1283193ms
ToeThread #3
#3 http://203.162.1.217/ASX_01042004/041111canhac_motthoangtaynguyen.wmv (0 attempts)
ELL http://vnntelevision.net/VOD/index.asp?offset=10
Current processor: HTTP
ACTIVE for 21m41s48ms
Where: ABOUT_TO_BEGIN_PROCESSOR for 1301047ms

and so on and so on for all 50.

This has happened in two crawls with two different seed
lists.

Is this a problem on my side (we had some network
issues earlier, which were fixed) or is this
indicative of something else?

-tree
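
As a reading aid for the toe threads report above: converting each "ACTIVE for" value to milliseconds and comparing it with the "Where: ... for" value shows that each listed thread has spent all but about 1 ms of its active time in ABOUT_TO_BEGIN_PROCESSOR. The sketch below does that arithmetic. ToeReportDurations and toMillis are hypothetical helper names, and the duration pattern is an assumption that covers only the forms seen in this report (minutes, seconds, milliseconds).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToeReportDurations {
    // Matches strings such as "21m33s576ms"; the minute and second parts are optional.
    private static final Pattern DURATION =
            Pattern.compile("(?:(\\d+)m)?(?:(\\d+)s)?(\\d+)ms");

    /** Converts a duration like "21m33s576ms" to milliseconds. */
    public static long toMillis(String text) {
        Matcher m = DURATION.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized duration: " + text);
        }
        long minutes = m.group(1) == null ? 0 : Long.parseLong(m.group(1));
        long seconds = m.group(2) == null ? 0 : Long.parseLong(m.group(2));
        long millis = Long.parseLong(m.group(3));
        return minutes * 60_000 + seconds * 1_000 + millis;
    }

    public static void main(String[] args) {
        // Values copied from the three toe threads quoted in the report.
        String[] active = {"21m33s576ms", "21m23s194ms", "21m41s48ms"};
        long[] inProcessor = {1293575L, 1283193L, 1301047L};
        for (int i = 0; i < active.length; i++) {
            long gap = toMillis(active[i]) - inProcessor[i];
            System.out.println("ToeThread #" + (i + 1) + ": " + gap + " ms outside the processor");
        }
        // Each line reports a gap of 1 ms.
    }
}
```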


Submitted: Michael Stack ( stack-sf ) - 2004-11-16 02:07
Priority: 9
Status: Closed
Resolution: None
Assigned To: Michael Stack
Category: i/o
Group: None
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 01:36
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-855 -- please add further
comments at that location.


Date: 2004-11-16 18:56
Sender: stack-sf (Project Admin)

Tested by Tom. Closing.




Date: 2004-11-16 18:47
Sender: tree

This fix appears to have unstuck me. Thanks!



Date: 2004-11-16 02:21
Sender: stack-sf (Project Admin)

Here's what the stack dump shows... 50 threads all stuck here:

Thread 2451: (state = BLOCKED)
- java.lang.Object.wait(long) (Interpreted frame)
- org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(org.apache.commons.httpclient.HostConfiguration, long) @bci=326, line=497 (Compiled frame)
- org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(org.apache.commons.httpclient.HostConfiguration, long) @bci=74, line=382 (Interpreted frame)
- org.apache.commons.httpclient.HttpMethodDirector.executeMethod(org.apache.commons.httpclient.HttpMethod) @bci=177, line=161 (Interpreted frame)
- org.apache.commons.httpclient.HttpClient.executeMethod(org.apache.commons.httpclient.HostConfiguration, org.apache.commons.httpclient.HttpMethod, org.apache.commons.httpclient.HttpState) @bci=247, line=437 (Interpreted frame)
- org.apache.commons.httpclient.HttpClient.executeMethod(org.apache.commons.httpclient.HttpMethod) @bci=35, line=324 (Interpreted frame)
- org.archive.crawler.fetcher.FetchHTTP.innerProcess(org.archive.crawler.datamodel.CrawlURI) @bci=154, line=299 (Interpreted frame)
- org.archive.crawler.framework.Processor.process(org.archive.crawler.datamodel.CrawlURI) @bci=50, line=102 (Compiled frame)
- org.archive.crawler.framework.ToeThread.processCrawlUri() @bci=127, line=255 (Compiled frame)
- org.archive.crawler.framework.ToeThread.run() @bci=97, line=131 (Interpreted frame)



Attached File ( 1 )

Filename Description Download
diff.txt Patch that releases aborted connections. Download

Changes ( 3 )

Field Old Value Date By
status_id Open 2004-11-16 18:56 stack-sf
close_date - 2004-11-16 18:56 stack-sf
File Added 108923: diff.txt 2004-11-16 02:07 stack-sf