Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 Confusion: CrawlController and CrawlJob States - ID: 1104696
Last Update: Comment added ( karl-ia )

From kris:

Hey Michael,

I just found a nasty little bug. It seems that the
CrawlJob states and the CrawlController states are
being mixed up, or rather, the CC states are being sent
to the CrawlJob instances, instead of proper CrawlJob
constants.

The best example: once a crawl is started, its state
should be 'Running' (CrawlJob constant) but is
currently set to 'RUNNING' (CrawlController constant).
As a result, the UI will, in at least one instance,
incorrectly interpret the state of the crawler, since it
compares the reported states against the CrawlJob
constants. This really, really, really needs to be fixed.
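
A minimal sketch of why the mix-up matters, assuming the UI does a plain string comparison against the CrawlJob constant. The class and method names here are hypothetical stand-ins, not Heritrix code; only the two string values come from the report above.

```java
// Simplified stand-in for the state mix-up; not actual Heritrix code.
final class StateMismatchDemo {
    // Value quoted in the report as the CrawlJob constant.
    static final String JOB_STATUS_RUNNING = "Running";
    // Value quoted in the report as the CrawlController constant.
    static final String CONTROLLER_RUNNING = "RUNNING";

    // Hypothetical UI check: compares the reported status
    // against the CrawlJob constant (case-sensitive).
    static boolean uiThinksRunning(String reportedStatus) {
        return JOB_STATUS_RUNNING.equals(reportedStatus);
    }

    public static void main(String[] args) {
        // Controller constant leaks through: the UI misreads a running crawl.
        System.out.println(uiThinksRunning(CONTROLLER_RUNNING)); // prints false
        // Proper CrawlJob constant: the UI sees the crawl as running.
        System.out.println(uiThinksRunning(JOB_STATUS_RUNNING)); // prints true
    }
}
```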

My suggestion for a fix: change the

sendCrawlStateChangeEvent(Object newState, String message)

to only accept one parameter (newState), then send a
message based on the new state using CrawlJob
constants. For example:

if (newState.equals(PAUSED)) {
    l.crawlPaused(CrawlJob.STATUS_PAUSED);
}

Problems with this:
1. May need to add more CrawlJob constants. At least
'waiting to finish' (may need to make changes to UI to
accommodate these).

2. The finish message will need more detail (finish,
time limit, ended by operator, etc.)
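
The single-parameter idea above could be sketched roughly as follows. This is an illustration under stated assumptions, not the Heritrix implementation: the listener interface, the class names, and the "Paused" string value are all hypothetical, and only the 'Running' value is quoted from the report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener; method names are assumptions for illustration.
interface SketchStatusListener {
    void crawlPaused(String jobStatus);
    void crawlResuming(String jobStatus);
}

// Hypothetical stand-in for the CrawlJob status constants.
final class SketchJobStatus {
    static final String PAUSED = "Paused";   // assumed value
    static final String RUNNING = "Running"; // value quoted in the report
}

final class SketchController {
    // Controller-side states, kept separate from job statuses.
    static final Object PAUSED = "PAUSED";
    static final Object RUNNING = "RUNNING";

    private final List<SketchStatusListener> listeners = new ArrayList<>();

    void addListener(SketchStatusListener l) {
        listeners.add(l);
    }

    // Single-parameter variant: the message is derived from the new
    // state, so callers cannot pass a controller constant by mistake.
    void sendCrawlStateChangeEvent(Object newState) {
        for (SketchStatusListener l : listeners) {
            if (newState.equals(PAUSED)) {
                l.crawlPaused(SketchJobStatus.PAUSED);
            } else if (newState.equals(RUNNING)) {
                l.crawlResuming(SketchJobStatus.RUNNING);
            }
        }
    }
}
```

The two problems listed above still apply to this sketch: any controller state without a matching job status is silently ignored, and a finish event would need extra detail that a bare state cannot carry.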

In any case, this needs to be resolved. Currently,
things are a complete mess.

- Kris

Marking this as a higher than normal bug (I'm guessing
this explains why the cmdline client sometimes gets
bogus state from the crawler).


Michael Stack ( stack-sf ) - 2005-01-18 18:43

7

Closed

None

Nobody/Anonymous

None

1.6.0

Public


Comments ( 5 )

Date: 2007-03-14 01:38
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-881 -- please add further
comments at that location.


Date: 2005-11-02 20:45
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Situation was improved by previous fixes; other needs should
be raised as new issues if/when need/plan is clearer.


Date: 2005-03-01 22:58
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Applied suggested patch with minor mod (Changed
'sendCrawlStateChangeEvent(state, state.toString());' to
'sendCrawlStateChangeEvent(state, jobState);').

Below is commit.

Lowered priority.

Moved issue to RFE database. Let's address crawl job and
crawl controller state confusion refactoring as an RFE.

Palliative for '[ 1104696 ] Confusion: CrawlController and
CrawlJob States'
Below is a Gordon patch.
* src/java/org/archive/crawler/framework/CrawlController.java
(sendCrawlStateChangeEvent): On startup, pass in
CrawlJob.STATUS_* rather than CrawlController.* status
as message. Previously we passed in
CrawlController state, which confused CrawlJob at start time.



Date: 2005-02-21 11:23
Sender: ck-heritrix

Logged In: YES
user_id=1220421

I have just suffered from this bug. Are there any objections
to merging the suggested patch into CVS HEAD?

Christian



Date: 2005-01-21 09:24
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Long-term, there should be a clear distinction between
CrawlController (API to an abstract 'crawl') and CrawlJob
(UI/workflow object). CrawlController would have no
references to CrawlJob, with CrawlJob just one possible
client of CrawlController's generic capabilities.
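
One way to picture that long-term separation, as a rough sketch only: the controller publishes its own states to generic listeners and knows nothing about jobs, while the job-side client does the mapping to UI statuses. All types and status strings below are hypothetical (only 'Running' is quoted from the report), and none of this reflects actual Heritrix classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical controller-side states; no job statuses appear here.
enum SketchCrawlState { STARTED, RUNNING, PAUSED, FINISHED }

// Generic controller: publishes its own states and has no
// reference to any job or UI type.
final class GenericControllerSketch {
    private final List<Consumer<SketchCrawlState>> listeners = new ArrayList<>();

    void addListener(Consumer<SketchCrawlState> l) {
        listeners.add(l);
    }

    void changeState(SketchCrawlState newState) {
        for (Consumer<SketchCrawlState> l : listeners) {
            l.accept(newState);
        }
    }
}

// Job-side client: owns the mapping from controller states to the
// status strings the UI compares against.
final class JobClientSketch {
    private String status = "Pending"; // assumed initial status

    void attachTo(GenericControllerSketch controller) {
        controller.addListener(this::onStateChange);
    }

    private void onStateChange(SketchCrawlState s) {
        switch (s) {
            case STARTED:  status = "Pending";  break; // assumed value
            case RUNNING:  status = "Running";  break; // value from the report
            case PAUSED:   status = "Paused";   break; // assumed value
            case FINISHED: status = "Finished"; break; // assumed value
        }
    }

    String getStatus() {
        return status;
    }
}
```

With this shape, adding another client (a command-line tool, say) would not require touching the controller at all.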

Currently, it looks like the intent of the code is for the
'newState' argument of sendCrawlStateChangeEvent() to be
drawn from CrawlController states, and the 'message'
argument drawn from CrawlJob statuses. The code respects
this split, *except* in requestCrawlStart(), where the
CrawlController states are used in both positions.

So a potential fix that delays larger refactoring would be
to harmonize requestCrawlStart()'s use of
sendCrawlStateChangeEvent() with the other uses. Here's a patch
that should do the trick:

Index: CrawlController.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/framework/CrawlController.java,v
retrieving revision 1.90
diff -u -r1.90 CrawlController.java
--- CrawlController.java 11 Jan 2005 03:55:51 -0000 1.90
+++ CrawlController.java 21 Jan 2005 09:19:41 -0000
@@ -835,8 +835,16 @@
 
         // Assume Frontier state already loaded.
         logger.info("Starting crawl.");
-        sendCrawlStateChangeEvent(STARTED, STARTED.toString());
-        state = beginPaused ? PAUSED : RUNNING;
+
+        sendCrawlStateChangeEvent(STARTED, CrawlJob.STATUS_PENDING);
+        String jobState;
+        if(beginPaused) {
+            state = PAUSED;
+            jobState = CrawlJob.STATUS_PAUSED;
+        } else {
+            state = RUNNING;
+            jobState = CrawlJob.STATUS_RUNNING;
+        }
         sendCrawlStateChangeEvent(state, state.toString());
         // A proper exit will change this value.
         this.sExit = CrawlJob.STATUS_FINISHED_ABNORMAL;
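
As posted, the patch computes jobState but the last call still passes state.toString(); the 2005-03-01 comment notes that the applied version passes jobState instead. A minimal standalone sketch of the start logic with that modification, assuming plain strings for both kinds of state (the class, the method, and the "Paused" value are hypothetical, not Heritrix code):

```java
// Standalone illustration of the state/message pairing at crawl start.
final class StartStateSketch {
    // Hypothetical stand-ins for CrawlJob.STATUS_* values.
    static final String STATUS_PAUSED = "Paused";   // assumed value
    static final String STATUS_RUNNING = "Running"; // value from the report

    // Returns { controllerState, jobStatus }: the pair that would be
    // handed to sendCrawlStateChangeEvent(state, jobState).
    static String[] chooseStartStates(boolean beginPaused) {
        String state;
        String jobState;
        if (beginPaused) {
            state = "PAUSED";
            jobState = STATUS_PAUSED;
        } else {
            state = "RUNNING";
            jobState = STATUS_RUNNING;
        }
        // The message slot carries the job status, not state.toString().
        return new String[] { state, jobState };
    }

    public static void main(String[] args) {
        String[] pair = chooseStartStates(false);
        System.out.println(pair[0] + " / " + pair[1]); // prints RUNNING / Running
    }
}
```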




Attached File

No Files Currently Attached

Changes ( 7 )

Field              Old Value  Date              By
close_date         -          2005-11-02 20:45  gojomo
status_id          Open       2005-11-02 20:45  gojomo
artifact_group_id  None       2005-09-23 20:53  gojomo
priority           5          2005-09-23 19:00  gojomo
priority           7          2005-03-01 22:58  stack-sf
data_type          539102     2005-03-01 22:58  stack-sf
priority           5          2005-01-18 18:43  stack-sf