Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 jmx api additions - ID: 1173597
Last Update: Comment added ( karl-ia )

Collect in this RFE what needs to be made available to
external processes via JMX.

Here's a start:

+ Adding an alert (External process judges disk full or
hosed, sends a pause with an alert describing problem).
+ List of disks Heritrix is currently writing to.
+ Listing of outstanding alerts.
+ Signal arc closed.
+ Signal job finished.


Submitted: Michael Stack ( stack-sf ) - 2005-03-30 20:43
Priority: 7
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 11 )

Date: 2007-03-14 01:40
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-912 -- please add further
comments at that location.


Date: 2005-11-06 05:26
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Closing this issue. Still to be done is notification of
opening/closing of ARCs, but that will wait till someone screams
for it. At the moment it's a little awkward to do since we
don't as yet have a general event bus that any old processor
can attach to and publish events of its own fabrication.

Here's commit:
Last of the '[ 1173597 ] jmx api' work.
* src/java/org/archive/crawler/admin/CrawlJob.java
Emit jmx notifications of CrawlStatusListener subset:
crawljob start, stop, pause and resume.
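
The four job-level notifications named in this commit could be emitted roughly as follows. This is a minimal sketch built on the standard javax.management classes, not the code in CrawlJob.java; the class name JobNotifierSketch and the notification type strings are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.Notification;
import javax.management.NotificationBroadcasterSupport;
import javax.management.NotificationListener;

// Hypothetical stand-in for a crawl job; not Heritrix's CrawlJob.
class JobNotifierSketch extends NotificationBroadcasterSupport {
    // Assumed type strings; the real ones are not given in this thread.
    static final String STARTED = "sketch.crawl.started";
    static final String PAUSED = "sketch.crawl.paused";
    static final String RESUMED = "sketch.crawl.resumed";
    static final String ENDED = "sketch.crawl.ended";

    private final AtomicLong sequence = new AtomicLong();
    private final String jobName;

    JobNotifierSketch(String jobName) {
        this.jobName = jobName;
    }

    // Send one notification to every registered listener.
    void emit(String type) {
        sendNotification(new Notification(type, jobName,
            sequence.incrementAndGet(), System.currentTimeMillis(),
            jobName + ": " + type));
    }

    // Walk a job through its lifecycle and record what a listener sees.
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        JobNotifierSketch job = new JobNotifierSketch("example-job");
        NotificationListener listener =
            (notification, handback) -> seen.add(notification.getType());
        job.addNotificationListener(listener, null, null);
        job.emit(STARTED);
        job.emit(PAUSED);
        job.emit(RESUMED);
        job.emit(ENDED);
        return seen;
    }
}
```

A monitoring client would attach its listener through an MBeanServerConnection rather than directly, but the notification objects it receives are the same.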



Date: 2005-11-05 02:46
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed below. Notifications are all that's outstanding.

More on "[ 1173597 ] jmx api"
* src/java/org/archive/crawler/Heritrix.java
Added registration of shutdown hook that will ensure all running
instances get a clean up call (Can at least deregister JMX and
JNDI registrations).
Added registration of the Heritrix cmdline "container" in JNDI.
(registerJndi, deregisterJndi, getJndiContext,
getJndiContainerName): Added.
(getShutdownThread): Added.
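
The shutdown-hook idea in this commit can be sketched as below: register an MBean, then install a JVM shutdown hook that deregisters it. This is an illustration only, not the Heritrix.java change. The ShutdownCleanupSketch class, the InstanceMBean interface and the object name are assumptions, and the JNDI side is left out.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

class ShutdownCleanupSketch {
    // Hypothetical management interface for one running instance.
    public interface InstanceMBean {
        String getStatus();
    }

    // Register a bean under 'name' and return the installed cleanup hook.
    static Thread registerWithCleanup(MBeanServer server, ObjectName name) {
        try {
            InstanceMBean bean = () -> "RUNNING";
            server.registerMBean(
                new StandardMBean(bean, InstanceMBean.class), name);
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
        Thread hook = new Thread(() -> {
            try {
                if (server.isRegistered(name)) {
                    server.unregisterMBean(name);
                }
            } catch (JMException e) {
                // Best effort: the JVM is going down anyway.
            }
        }, "sketch-shutdown-cleanup");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    // Register, run the cleanup by hand, and report whether the name is gone.
    static boolean demo() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name =
                new ObjectName("sketch.example:type=Instance,name=demo");
            Thread hook = registerWithCleanup(server, name);
            boolean wasRegistered = server.isRegistered(name);
            hook.run(); // same code the JVM would run at exit
            return wasRegistered && !server.isRegistered(name);
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The cleanup checks isRegistered first, so it is harmless if it runs again when the JVM actually exits.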



Date: 2005-11-04 22:33
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed below:

More on '[ 1173597 ] jmx api'.
* src/java/org/archive/crawler/Heritrix.java
Add log jmx method. Allows adding log messages from outside of
Heritrix (E.g. script can add explanatory SEVERE message just
before pausing crawl because disk is bad, etc.).
* src/java/org/archive/crawler/admin/CrawlJob.java
Change name of current crawl job in jmx from UID to Name+UID
(Monitoring wants to key off job name).
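
Why Name+UID helps monitoring can be shown with plain ObjectName handling: once the job name is a prefix of the registered name, a client can match on it with a property-value pattern. The sketch below is an assumption-laden illustration; the sketch.crawler domain, the key names and the "-" separator are made up, and the actual naming used in CrawlJob.java is not shown in this thread.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

class JobNameSketch {
    // Assumed domain and keys; not the names Heritrix registers.
    private static final String PREFIX = "sketch.crawler:type=CrawlJob,name=";

    // Build the object name for one job from its name and UID.
    static ObjectName jobObjectName(String jobName, String uid) {
        return parse(PREFIX + jobName + "-" + uid);
    }

    // Build a pattern matching every UID of the given job name.
    static ObjectName patternFor(String jobName) {
        return parse(PREFIX + jobName + "-*");
    }

    private static ObjectName parse(String text) {
        try {
            return new ObjectName(text);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(text, e);
        }
    }
}
```

A client would pass the pattern to queryNames on its server connection. Job names containing characters such as commas, colons or quotes would need ObjectName.quote, which this sketch skips.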



Date: 2005-11-02 18:23
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

TODO for 1.6.0 (Bulk of below come out of meeting held
November 2nd with Danny, Brad, Karl, and Stack in attendance)

+ Notifications for each of the CrawlStatusListener events
(and arc open/close).
+ The JMX crawl job name should have the crawl job name as
prefix, not just UID.
+ Add to JNDI registration/deregistration of the 'container'
-- the host in which heritrix instances can be started and
stopped.
+ Add an alert externally.

Then close this issue as done.


Date: 2005-07-20 22:23
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Added listing of completed and pending crawl jobs. Added
being able to get crawlend reports. Added being able to
pass a jar of seeds and order and settings either with
local file reference or via a URL.


Date: 2005-07-19 02:35
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

On incidentals mentioned earlier:
+ No, there is no flag on importUris that lets you pass a
file of seeds (You'll need to add them one at a time
using the string importUri method).
+ Fixed checkpoint message.
+ On bdb fix, yes, now their bean serializes but because
they don't use openmbean, client needs bdb jar on its side
and needs to have custom code undoing the bdb beans (I left
the string versions of stats methods in place).
+ On our being able to register the order file as a jmx
bean, looks like things are better; now we have a class
missing on client side rather than serialization issues.
So, in spare time, need to look at converting settings to
use openmbeans rather than use custom types (Looks like
openmbeans has all types needed. Need to study).
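
To make the openmbeans point concrete: values built from the open types in javax.management.openmbean can be read by a client that has only the JDK on its classpath. The sketch below is a hedged illustration; the OpenStatsSketch class and the two item names are hypothetical and do not describe the bdb or settings beans.

```java
import javax.management.openmbean.CompositeData;
import javax.management.openmbean.CompositeDataSupport;
import javax.management.openmbean.CompositeType;
import javax.management.openmbean.OpenDataException;
import javax.management.openmbean.OpenType;
import javax.management.openmbean.SimpleType;

class OpenStatsSketch {
    // Hypothetical item names, chosen only for this example.
    private static final String[] ITEMS = {"cacheMisses", "logWrites"};

    // Package two counters as CompositeData instead of a custom class.
    static CompositeData stats(long cacheMisses, long logWrites) {
        try {
            CompositeType type = new CompositeType(
                "SketchStats",
                "Hypothetical statistics record",
                ITEMS,
                new String[] {"Cache misses so far", "Log writes so far"},
                new OpenType<?>[] {SimpleType.LONG, SimpleType.LONG});
            return new CompositeDataSupport(
                type, ITEMS, new Object[] {cacheMisses, logWrites});
        } catch (OpenDataException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

An MBean attribute that returns CompositeData needs no custom code on the client to undo it, which is the trade-off against keeping the custom types.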


Date: 2005-07-11 23:05
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Talking with Danny and Dan just now, they want soon:

+ Currently you can add a job and specify an order file by
path or url. The url only works if its protocol is 'file' at
the moment (The settings system expects 'files'). There is
no means of passing in seeds at job start. The boys want to
be able to add a job and pass the order and seeds -- across
the net -- that this newly added job will use. We're talking
about a url that points at a zip/jar that jmx undoes locally. In
the zip/jar would be the order and seeds. Optionally, look at
addJob taking two strings: one the order, the other the seeds.
+ They need to be able to get at the summary report after
the job is finished so they can report on the finished job
(Optionally, allow fetching of order and seeds via jmx. Can
we make this a simple http GET instead?).
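
The zip/jar proposal in the first item above could look something like the following on the receiving side: read a zip stream and pull out its entries by name. This is only a sketch of the idea. The JobBundleSketch class and the entry names order.xml and seeds.txt are assumptions, and the bundle method exists only so the example can be exercised without a network.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class JobBundleSketch {
    // Read every file entry of a zip/jar stream into a name-to-text map.
    static Map<String, String> unpack(InputStream in) {
        Map<String, String> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null;
                    entry = zip.getNextEntry()) {
                if (!entry.isDirectory()) {
                    // readAllBytes stops at the end of the current entry.
                    files.put(entry.getName(), new String(
                        zip.readAllBytes(), StandardCharsets.UTF_8));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    // Build an in-memory bundle; entry names are assumed, not Heritrix's.
    static byte[] bundle(String order, String seeds) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("order.xml"));
            zip.write(order.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("seeds.txt"));
            zip.write(seeds.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

For a bundle fetched across the net, the stream passed to unpack would come from opening the url rather than from a byte array.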


Incidentals: Check if seeds flag is missing from the file of
urls import. Fix the checkpoint; it says unimplemented.
Also, the new bdbje has a fix for their serialization problem.
And last week did work to make settings objects
serializable. Maybe now it's possible to publish order in jmx.


Date: 2005-05-20 15:10
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Feedback from Kaisa:

"The idea with filters seems to be that most of them are set
before running a job. But I often add filters during crawls when
using a set of seeds for the first time. So an easy way to add
filters online to any remote controlled job is important! (think)"


Date: 2005-05-19 15:45
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

See
http://crawler.archive.org/cgi-bin/wiki.pl?JmxControlStatus
for discussion and solicitation of what needs to make it
into the jmx api. The cited doc. adds to api listed so far
here.

I upped the priority since this functionality is needed
imminently.


Date: 2005-05-17 17:22
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Will use jmx notifications for doing signals.

Would like to use openmbeans so client doesn't need to have
heritrix-particular jars locally to run.

Other things for API:

+ 'Publish' the current job's order file (A crawl job
'order' is itself, and is composed of, derivatives of dynamic
MBeans; unfortunately, the derivations are custom so
'publishing' an order will require adaptation -- unless we
redo settings MBeans).
+ queuing/dequeuing jobs.


Changes ( 6 )

Field              Old Value  Date              By
status_id          Open       2005-11-06 05:26  stack-sf
summary            jmx api    2005-11-06 05:26  stack-sf
close_date         -          2005-11-06 05:26  stack-sf
assigned_to        nobody     2005-09-29 18:27  gojomo
artifact_group_id  None       2005-09-23 20:53  gojomo
priority           5          2005-05-19 15:45  stack-sf