Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 completed crawls show as 'aborted by user' - ID: 1055854
Last Update: Comment added ( karl-ia )

Ran two small jobs in heritrix-1.1.0-200410271652. Each
completed by finishing the domain in question; they
show in the completed jobs page as "Aborted by user."

(1) They should indicate normal finish.
(2) Even the "Aborted by user" terminology could be
better; let's do "ended by operator".




Gordon Mohr ( gojomo ) - 2004-10-28 04:10

9

Closed

Fixed

Gordon Mohr

None

None

Public


Comments ( 6 )

Date: 2007-03-14 00:17
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-274 -- please add further
comments at that location.


Date: 2004-11-03 22:06
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Change committed. Commit comment:

Fix for [ 1055854 ] completed crawls show as 'aborted by user'
* HostQueuesFrontier.java
For test of isEmpty() near end of next(), don't try to
end crawl, just 'continue' through so next
controller.checkFinish() can end things cleanly.


Date: 2004-11-03 21:45
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Testing: I see the change take effect because I can see
status for small crawls before patch logged as 'aborted by
user' but post patch, the same crawl has 'finished'. I also
see that if I actually terminate the crawl, I get 'aborted
by user' as I should.

+1 on this patch.

Back to G.


Date: 2004-11-03 20:31
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Looking into this, I see that the isEmpty() check and block near
the end of HostQueuesFrontier's next() is causing problems by
trying to end the crawl itself. In particular, because it uses
controller.requestCrawlStop(), a 'finished normally' looks like
an 'ended by operator'.

The simplest fix is to let the controller.checkFinish() (at
the top of next()) end things cleanly on the next loop pass.
Thus, upon detecting an empty frontier, the near-end isEmpty()
check only needs to make sure there's no extra waiting for state
to change, and it can do this with a 'continue' of the
while(true) loop.
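
To illustrate the control flow described here, a minimal
self-contained sketch follows. It is not the Heritrix source: the
classes FrontierLoopSketch, Controller and Frontier, the Status
enum, the 'patched' flag and the snoozedEmptyQueues counter are
hypothetical stand-ins. Only the names checkFinish(),
requestCrawlStop(), wakeReadyQueues(), isEmpty(), next() and
EndedException are taken from this discussion, and their bodies
here are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FrontierLoopSketch {
    public enum Status { RUNNING, FINISHED, ENDED_BY_OPERATOR }

    static class EndedException extends Exception {
        EndedException(String reason) { super(reason); }
    }

    /** Hypothetical stand-in for the crawl controller. */
    static class Controller {
        Status status = Status.RUNNING;

        /** Operator-style stop; the unpatched frontier called this on exhaustion. */
        void requestCrawlStop() { status = Status.ENDED_BY_OPERATOR; }

        /** Normal end-of-crawl check, run at the top of every next() pass. */
        void checkFinish(Frontier frontier) throws EndedException {
            if (frontier.isEmpty()) {
                status = Status.FINISHED;
                throw new EndedException("finished");
            }
        }
    }

    /** Hypothetical, heavily simplified frontier. */
    static class Frontier {
        final Controller controller;
        final boolean patched;
        final Deque<String> ready = new ArrayDeque<>();
        int snoozedEmptyQueues; // politeness-snoozed queues holding no URIs

        Frontier(Controller controller, boolean patched) {
            this.controller = controller;
            this.patched = patched;
        }

        boolean isEmpty() { return ready.isEmpty() && snoozedEmptyQueues == 0; }

        /** Woken queues that turn out to be empty are discarded. */
        void wakeReadyQueues() { snoozedEmptyQueues = 0; }

        String next() throws EndedException {
            while (true) {
                controller.checkFinish(this);
                wakeReadyQueues();
                if (!ready.isEmpty()) {
                    return ready.poll();
                }
                // See if URIs exhausted
                if (isEmpty()) {
                    if (patched) {
                        continue; // next controller.checkFinish() will end cleanly
                    }
                    controller.requestCrawlStop();
                    throw new EndedException("exhausted");
                }
            }
        }
    }

    /** Runs a one-URI crawl and reports how the controller recorded its end. */
    public static Status run(boolean patched) {
        Controller controller = new Controller();
        Frontier frontier = new Frontier(controller, patched);
        frontier.ready.add("http://example.com/");
        try {
            frontier.next();                 // hands out the only URI
            frontier.snoozedEmptyQueues = 1; // its queue is snoozed but now empty
            frontier.next();                 // frontier is exhausted on this call
        } catch (EndedException expected) {
            // the crawl ended; the controller's status says how
        }
        return controller.status;
    }

    public static void main(String[] args) {
        System.out.println("unpatched: " + run(false));
        System.out.println("patched:   " + run(true));
    }
}
```

In this sketch run(false) reports ENDED_BY_OPERATOR and run(true)
reports FINISHED, which mirrors the mislabelled and the corrected
end status respectively.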

(It appears that the only way the initial checkFinish() can
return false while the end-of-method isEmpty() returns true is
if a politeness-snoozed queue woke in the call to
wakeReadyQueues(), was empty, and got discarded. Thus another
potential fix would be to move the mid-method wakeReadyQueues()
line to before the controller.checkFinish(). Then the near-end
isEmpty() check would be superfluous. However, this is slightly
higher risk, so I'll defer such a fix to later.)
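
For comparison, the deferred alternative (waking queues before
the finish check) might look roughly like the sketch below. This
is an assumption-laden illustration, not a tested change to
Heritrix: WakeFirstSketch, its Frontier class and the
snoozedEmptyQueues counter are hypothetical, and returning null
stands in for controller.checkFinish() ending the crawl.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class WakeFirstSketch {
    /** Hypothetical, heavily simplified frontier; not the Heritrix class. */
    static class Frontier {
        final Deque<String> ready = new ArrayDeque<>();
        int snoozedEmptyQueues; // politeness-snoozed queues holding no URIs

        boolean isEmpty() { return ready.isEmpty() && snoozedEmptyQueues == 0; }

        /** Woken queues that turn out to be empty are discarded. */
        void wakeReadyQueues() { snoozedEmptyQueues = 0; }

        /** Returns the next URI, or null where checkFinish() would end the crawl. */
        String next() {
            wakeReadyQueues();   // moved ahead of the finish check
            if (isEmpty()) {
                return null;     // stand-in for controller.checkFinish()
            }
            return ready.poll(); // non-null: nothing is snoozed, so ready is non-empty
        }
    }

    /** True if an exhausted frontier is caught by the top-of-method check alone. */
    public static boolean endsAtTopCheck() {
        Frontier frontier = new Frontier();
        frontier.ready.add("http://example.com/");
        String first = frontier.next();
        frontier.snoozedEmptyQueues = 1; // queue snoozed for politeness, but empty
        return first != null && frontier.next() == null;
    }

    public static void main(String[] args) {
        System.out.println(endsAtTopCheck());
    }
}
```

Because the empty snoozed queue is discarded before the finish
check runs, no second isEmpty() test is needed at the end of the
method in this simplified model.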

Proposed fix, which cleanly 'finishes' my short
HostQueuesFrontier test crawls that had appeared 'operator
ended' before:

Index: HostQueuesFrontier.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/frontier/HostQueuesFrontier.java,v
retrieving revision 1.24
diff -u -r1.24 HostQueuesFrontier.java
--- HostQueuesFrontier.java 29 Oct 2004 01:40:52 -0000 1.24
+++ HostQueuesFrontier.java 3 Nov 2004 20:11:07 -0000
@@ -626,11 +626,10 @@
enqueueToKeyed(curi);
}
}
-
+
// See if URIs exhausted
if(isEmpty()) {
- this.controller.requestCrawlStop();
- throw new EndedException("exhausted");
+ continue; // next controller.checkFinish() will end cleanly
}

if(alreadyIncluded.pending() > 0) {



Date: 2004-11-03 19:24
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Changed terminology. 'Finished' crawls showing as 'ended by
operator' only occurs from HostQueuesFrontier... (works
right in BdbFrontier). Investigating.


Attached File

No Files Currently Attached

Changes ( 7 )

Field          Old Value  Date              By
status_id      Open       2004-11-03 22:06  gojomo
close_date     -          2004-11-03 22:06  gojomo
resolution_id  None       2004-11-03 22:06  gojomo
assigned_to    stack-sf   2004-11-03 21:45  stack-sf
assigned_to    gojomo     2004-11-03 20:31  gojomo
assigned_to    nobody     2004-11-03 19:17  gojomo
priority       5          2004-11-03 19:14  gojomo