Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 Lost crawl job after terminate running job with jobs pending - ID: 1024120
Last Update: Comment added ( karl-ia )

Queue up three pending jobs. Set the crawler to start
crawling. Crawler goes to tackle first job. Let it
run for a while. Terminate the job. Any subsequent
refresh says the crawler is 'running' but no status bar
and no apparent way of getting at the currently running
job.

The above happened on heritrix_1.0 branch. Check bug
is in HEAD too.


Submitted: Michael Stack ( stack-sf ) - 2004-09-08 03:04
Priority: 7
Status: Closed
Resolution: Fixed
Assigned To: Michael Stack
Category: Usability/UI
Group: None
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:16
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-243 -- please add further
comments at that location.


Date: 2005-03-23 00:01
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed a fix though I ain't exactly sure why the commit
fixes observed issue. Commit message below. Will mark it
fixed and open a new issue for any new manifestations of the
behavior seen in this issue -- hopefully the new forms will
give a better clue as to what the timing/jsp issue is.

Fix for '[ 1024120 ] Lost crawl job after terminate running job with jobs
pending'.
* src/java/org/archive/crawler/admin/CrawlJobHandler.java
Formatting. Changed notify to notifyAll so UI gets its wakeup sooner --
rather than having to wait on timeout. Moved start of any new pending
jobs to crawlEnded from crawlEnding. This fixes issue and it looks like it
also fixes problem where when 3 jobs are queued, the last gets stuck at
startup -- the UI doesn't refresh. I'm not sure exactly why the move of
start new job to crawlEnded fixes observed issue -- timing, or state in the
jsp -- but I've spent enough time trying to figure it out.
We may be bitten by something similar again later.
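
For readers following the commit message, here is a minimal sketch of the two changes it describes: waking all waiters with notifyAll, and starting the next pending job only once the previous one has fully ended. This is not the real CrawlJobHandler code. The class JobHandlerSketch, its fields, and the simplified no-argument crawlEnding/crawlEnded signatures are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Objects;

/** Hypothetical stand-in for a job handler; NOT the real CrawlJobHandler. */
class JobHandlerSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private String current;   // job the UI should report as running
    private boolean ending;   // true between crawlEnding and crawlEnded

    synchronized void addPending(String job) {
        pending.addLast(job);
    }

    /** Start the first pending job, if any. */
    synchronized void startNext() {
        current = pending.pollFirst();
        ending = false;
        notifyAll();
    }

    /** The running job was asked to stop but has not finished yet. */
    synchronized void crawlEnding() {
        ending = true;
        // Deliberately no job start here: the old job still owns `current`.
        notifyAll();
    }

    /** The running job has fully stopped; only now hand over to the next one. */
    synchronized void crawlEnded() {
        ending = false;
        current = pending.pollFirst(); // null when nothing is pending
        // notifyAll wakes every waiting thread; notify would wake only one,
        // leaving the others to sleep until their wait timeout expires.
        notifyAll();
    }

    synchronized String currentJob() { return current; }
    synchronized boolean isEnding() { return ending; }
    synchronized int pendingCount() { return pending.size(); }

    /** What a UI thread might do: block until the current job changes or time runs out. */
    synchronized String awaitChange(String seen, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (Objects.equals(current, seen)) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                break; // timed out without seeing a change
            }
            wait(remaining);
        }
        return current;
    }
}
```

In this sketch the job reported by currentJob() stays the same throughout the ending phase, so there is no window in which one job is winding down while the next has already taken its place. Whether that is the actual reason the real fix works is not established in this ticket.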


Date: 2005-01-13 18:54
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Here is a related comment from Kris:

Hey Michael,

It occurred to me that now that you lose all the reports
etc. when a job finishes/is terminated, we might want to
change the UI so that the console doesn't drop a job until
it is actually ended, rather than in the process of ending.
This would also make it less likely that people terminate a
job and then shut down the application before the job has
had a chance to fully terminate and write its reports. As
it is you have to monitor the logs to know when it is safe
to shut off Heritrix, if you just terminated a crawl.

- Kris
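
Kris's suggestion could be sketched roughly as follows: the console keeps showing a job until it has actually ended, and reports shutdown as safe only when no job is still running or ending. The class ConsoleSketch, the State enum, and all method names are hypothetical and are not part of the Heritrix API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical console model; names are illustrative, not Heritrix classes. */
class ConsoleSketch {
    enum State { RUNNING, ENDING, ENDED }

    // Insertion-ordered so jobs are listed in the order they were started.
    private final Map<String, State> jobs = new LinkedHashMap<>();

    synchronized void started(String job) { jobs.put(job, State.RUNNING); }
    synchronized void ending(String job)  { jobs.put(job, State.ENDING); }
    synchronized void ended(String job)   { jobs.put(job, State.ENDED); }

    /** Jobs the console should still display: everything not yet ENDED. */
    synchronized List<String> visibleJobs() {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, State> e : jobs.entrySet()) {
            if (e.getValue() != State.ENDED) {
                visible.add(e.getKey());
            }
        }
        return visible;
    }

    /** Safe only when no job is RUNNING or still ENDING (e.g. writing reports). */
    synchronized boolean safeToShutDown() {
        return visibleJobs().isEmpty();
    }
}
```

With a model like this, an operator would not need to watch the logs after terminating a crawl: the job stays listed while it is in the ENDING state and drops off only once it reaches ENDED.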


Date: 2004-11-15 05:14
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Seen in HEAD at NLA workshop. 'Second' crawl was active, but
invisible to UI as it proceeded. Eventually, it finished (on
its own), and the third job began, appearing in the UI.
There was no evidence of the 2nd job in the UI after its
completion.

Suspect thread safety/timing issue.


Changes ( 10 )

Field          Old Value                                                    Date              By
close_date     -                                                            2005-03-23 00:01  stack-sf
resolution_id  None                                                         2005-03-23 00:01  stack-sf
status_id      Open                                                         2005-03-23 00:01  stack-sf
priority       6                                                            2005-02-11 22:43  gojomo
priority       8                                                            2004-12-03 22:48  gojomo
priority       7                                                            2004-11-15 05:14  gojomo
assigned_to    nobody                                                       2004-10-21 19:40  stack-sf
summary        Lost crawl job after terminat running job with jobs pending  2004-10-20 23:30  gojomo
priority       5                                                            2004-10-20 23:30  gojomo
summary        Lost crawl job                                               2004-10-20 22:21  gojomo