Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

6 Stopping, pausing, checkpointing from command line/scripts - ID: 894467
Last Update: Comment added ( karl-ia )

Currently, the only way to stop the crawler is to find the
running process in the process list and kill -9 it.
Add a means of stopping the crawler gracefully.


Submitted: Michael Stack ( stack-sf ) - 2004-02-10 16:51
Priority: 6
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: multimachine
Group: None
Visibility: Public


Comments ( 12 )

Date: 2007-03-14 01:23
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-689 -- please add further
comments at that location.


Date: 2005-01-13 02:32
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Closing. Igor has at least tried it. Just added autostart
of the JMX Agent by Heritrix if on SUN JDK 1.5.0.


Date: 2004-12-13 17:58
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Needs testing by someone other than me. Then I'll close this.


Date: 2004-11-17 02:53
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Made a HeritrixMBean and added a cmdline option to start the
MBean Server. Added a JMXClient to run the functions Heritrix
advertises. It can 'start' and 'stop'; more need to be added.
Changed category.



Date: 2004-11-15 16:09
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Another issue was opened and closed as a duplicate of this
one. Here is text from it:

Feature Requests item #1066441, was opened at 2004-11-14 19:49
Message generated for change (Tracker Item Submitted) made
by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=539102&aid=1066441&group_id=73833

Category: Configuration
Group: None
Status: Open
Priority: 5
[ 1066441 ] command-line options to trigger recrawls (eg
from scripts)

Submitted By: Gordon Mohr (gojomo)
Assigned to: Nobody/Anonymous (nobody)
Summary: command-line options to trigger recrawls (eg from
scripts)

Initial Comment:
From NLA workshop/Matt: Ability to pipe commands
(starting/configuring previous crawl profiles) from
scripts, at specific times.



Date: 2004-07-15 22:30
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

I've started a wiki page for discussing a JMX solution to this
problem:
http://crawler.archive.org/cgi-bin/wiki.pl?JmxControlStatus


Date: 2004-06-22 10:59
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

A call for cmd-line shutdown has been made on the list:
http://groups.yahoo.com/group/archive-crawler/message/547.

If we put up a JMX server on crawler start, it should be easy
enough to send a shutdown with a JMX client.
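
As a rough sketch of that idea, the standard javax.management API lets a
process register an MBean that exposes a stop operation, and lets a client
invoke that operation by name. Everything named here is an assumption made
for illustration: the CrawlerControl class, its requestStop operation, and
the org.example:type=CrawlerControl object name are hypothetical and are not
Heritrix's actual interface. For brevity the call goes through the in-process
platform MBeanServer; a real command-line client would connect remotely
(for example via JMXConnectorFactory) and make the same invoke call.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxStopSketch {
    // Hypothetical management interface. A standard MBean interface must be
    // public and named <ImplementationClass>MBean.
    public interface CrawlerControlMBean {
        void requestStop();
        boolean isStopRequested();
    }

    // Hypothetical implementation: only records that a stop was requested.
    public static class CrawlerControl implements CrawlerControlMBean {
        private volatile boolean stopRequested = false;

        @Override
        public void requestStop() {
            stopRequested = true;
        }

        @Override
        public boolean isStopRequested() {
            return stopRequested;
        }
    }

    // Registers the control bean under an illustrative object name.
    public static ObjectName register(MBeanServer server, CrawlerControl control)
            throws Exception {
        ObjectName name = new ObjectName("org.example:type=CrawlerControl");
        if (server.isRegistered(name)) {
            server.unregisterMBean(name);
        }
        server.registerMBean(control, name);
        return name;
    }

    // What a JMX client would do: invoke the operation by name.
    public static void sendStop(MBeanServer server, ObjectName name) throws Exception {
        server.invoke(name, "requestStop", new Object[0], new String[0]);
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        CrawlerControl control = new CrawlerControl();
        ObjectName name = register(server, control);
        sendStop(server, name);
        System.out.println("stop requested: " + control.isStopRequested());
    }
}
```

In a real crawler the bean would start an orderly shutdown instead of setting
a flag, and access would need the authorization discussed elsewhere in this
thread.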


Date: 2004-06-03 16:35
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Lowering priority because it's been open a while without
getting attention.

Aside from giving us symmetrical start/stop from the command-line,
so our server works like any other server in this regard,
examples of why we'd want to control the crawler from the
command-line might include a cron job that starts and stops the
crawler to snapshot a site periodically.

As to the server being unreachable, at least on the
command-line we have the kill -9 fallback. We should do as
the other servers do and package the finding of the pid and
the sending of the kill behind a script.
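
A minimal sketch of such a stop helper follows, written in Java to match the
crawler's own language. It assumes the start script has written the process
id to a file whose path is passed as the first argument; that file, the class
name, and the ten-second grace period are all hypothetical. It relies on
ProcessHandle, so it needs Java 11 or later. destroy() asks for termination
(whether that is a normal, catchable termination is platform-dependent), and
destroyForcibly() plays the role of the kill -9 fallback.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PidFileStopSketch {
    // Reads a process id from a file that holds a single number.
    public static long readPid(Path pidFile) throws IOException {
        return Long.parseLong(Files.readString(pidFile).trim());
    }

    // Waits up to the given number of seconds for the process to exit.
    private static void awaitExit(ProcessHandle p, long seconds)
            throws InterruptedException {
        try {
            p.onExit().get(seconds, TimeUnit.SECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // Still running, or the wait failed; the caller checks isAlive().
        }
    }

    // Requests termination, then falls back to a forced kill.
    // Returns true if the process is gone when the method returns.
    public static boolean stop(long pid, long graceSeconds) throws InterruptedException {
        Optional<ProcessHandle> handle = ProcessHandle.of(pid);
        if (handle.isEmpty()) {
            return true; // no such process
        }
        ProcessHandle p = handle.get();
        p.destroy();
        awaitExit(p, graceSeconds);
        if (p.isAlive()) {
            p.destroyForcibly(); // the kill -9 fallback
            awaitExit(p, graceSeconds);
        }
        return !p.isAlive();
    }

    public static void main(String[] args) throws Exception {
        long pid = readPid(Path.of(args[0]));
        System.out.println(stop(pid, 10) ? "stopped" : "still running");
    }
}
```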


Date: 2004-06-03 10:47
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

As I have no intention of getting into the whole
"RMI/SOAP/Remote JMX/JMS/HTTP POST/etc" I'm reassigning this
to Michael.

Personally I'm content with only providing graceful shutdown
via WUI. Odds are that if that isn't responding then RMI and
similar wouldn't be either. And no one has given me a really
good example of why a script would want to do this.


Date: 2004-02-18 19:17
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Signal handling is JVM particular.

From the cmdline, we can't do any more than a "kill -9 `cat
heritrix.pid`". For a graceful shutdown we need to write a
Java RMI/SOAP/Remote JMX/JMS/HTTP POST/etc. client to send
the message w/ appropriate authorizations.

Here's a list of cmd-line tasks such a client might do:

+ Shutdown
+ Graceful shutdown (Checkpoint?, shutdown of webserver,
stop of all threads and sys.exit()).
+ (Emergency) checkpoint

What else would we ever want the client to do?


Date: 2004-02-17 22:48
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Investigate: is there an alternate signal that can be caught
in Java and trigger a graceful shutdown (or, when available,
a checkpoint & shutdown)?
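
One avenue worth noting: the JVM runs threads registered with
Runtime.addShutdownHook when it exits normally or is terminated by a
catchable signal such as SIGTERM (a plain kill) or SIGINT (Ctrl-C). The hooks
do not run on SIGKILL (kill -9). The sketch below shows the pattern only; the
class name and the cleanup action are hypothetical stand-ins for whatever
orderly shutdown or checkpoint the crawler would perform.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownHookSketch {
    private final AtomicBoolean cleanedUp = new AtomicBoolean(false);
    private final Runnable cleanup;

    public ShutdownHookSketch(Runnable cleanup) {
        this.cleanup = cleanup;
    }

    // Runs the cleanup action at most once, however it is triggered.
    public void shutdownGracefully() {
        if (cleanedUp.compareAndSet(false, true)) {
            cleanup.run();
        }
    }

    // Registers the cleanup to run when the JVM begins shutting down.
    public Thread install() {
        Thread hook = new Thread(this::shutdownGracefully, "graceful-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) throws InterruptedException {
        ShutdownHookSketch sketch = new ShutdownHookSketch(
                () -> System.out.println("stopping threads and flushing state"));
        sketch.install();
        System.out.println("running; send SIGTERM (plain kill) or press Ctrl-C");
        Thread.sleep(Long.MAX_VALUE);
    }
}
```

Hooks should finish quickly, since the operating system or an impatient
operator may follow up with a forced kill.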


Date: 2004-02-10 22:46
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

I've implemented this for the UI. The option is available under the
Console.

Some steps are taken to terminate ongoing crawls in an
orderly manner.


Attached File

No Files Currently Attached

Changes ( 10 )

Field        Old Value                                      Date              By
status_id    Open                                           2005-01-13 02:32  stack-sf
close_date   -                                              2005-01-13 02:32  stack-sf
category_id  Usability/UI                                   2004-11-17 02:53  stack-sf
summary      Stopping, pausing, checkpointing from cmdline  2004-11-15 21:33  gojomo
priority     3                                              2004-09-01 22:04  stack-sf
summary      Means of stopping the crawler (cmdline & UI)   2004-06-22 10:59  stack-sf
priority     5                                              2004-06-03 16:35  stack-sf
assigned_to  kristinn_sig                                   2004-06-03 10:47  kristinn_sig
assigned_to  nobody                                         2004-02-17 22:45  gojomo
summary      Means of stopping the crawler (cmdline & UI)   2004-02-10 22:46  kristinn_sig