Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 More than one Heritrix instance in a JVM instance - ID: 1260360
Last Update: Comment added ( karl-ia )

Make it possible to run more than a single instance of
Heritrix in a JVM, so that multiple instances can be
started, each running its own job without interference
from the other running instances.


Submitted: Michael Stack ( stack-sf ) - 2005-08-15 22:24
Priority: 7
Status: Closed
Resolution: None
Assigned: Karl Thiessen
Category: multimachine
Group: 1.6.0
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 01:43
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-958 -- please add further
comments at that location.


Date: 2005-08-15 23:39
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Implemented. Passing to Karl to check it out.

TO TEST NEW FEATURE:

To test, add new heritrix instances via jmx -- the jmx
cmdline-client has been enhanced to take the name of a
non-existent bean with a 'create=CLASS' operation (see its
help output for more detail) -- or type in
/local-instances.jsp in the UI and use the primitive
functionality available there to add new instances and switch
the UI to home in on different Heritrix instances.

Commit message:

Implement '[ 1260360 ] More than one Heritrix in a JVM instance'.
Also added registration of the heritrix instance with a
(configurable) jndi provider (added a basic in-memory jndi
service provider, MirrorJNDI).
Redid the Alert system to use the new SinkHandler. Done partly
because Alert was close to what java.util.logging provides, but
mainly because, as written, Alert requires access to the current
Heritrix instance from anywhere: that would mean passing the
Heritrix instance to all classes that want to alert, now that
we've de-'staticized' Heritrix to allow more than one instance
of Heritrix per JVM.
Cleaned up property handling. All property accesses that used to
go via Heritrix.getProperty* are now changed to go via
System.getProperty* (on instantiation, all keys with an
org.archive.crawler.* or heritrix.* prefix are added to system
properties).
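
The property handling described above can be sketched roughly as follows. This is only an illustration under assumptions, not the committed Heritrix code: the class name PropertySketch, the method exportToSystemProperties, the key heritrix.example.port and all signatures are made up here. Only the helper names getPropertyOrNull, getBooleanProperty and getIntProperty come from the commit message, and their real signatures may differ.

```java
import java.util.Properties;

// Illustrative sketch only; not the actual Heritrix/PropertyUtils source.
public class PropertySketch {

    // Copy every key with one of the two prefixes into the JVM-wide
    // system properties, so later lookups need no Heritrix instance.
    static void exportToSystemProperties(Properties loaded) {
        for (String key : loaded.stringPropertyNames()) {
            if (key.startsWith("org.archive.crawler.")
                    || key.startsWith("heritrix.")) {
                System.setProperty(key, loaded.getProperty(key));
            }
        }
    }

    // Return the system property, or null if it is unset or empty.
    static String getPropertyOrNull(String key) {
        String value = System.getProperty(key);
        return (value == null || value.isEmpty()) ? null : value;
    }

    // An unset property reads as false (Boolean.parseBoolean accepts null).
    static boolean getBooleanProperty(String key) {
        return Boolean.parseBoolean(getPropertyOrNull(key));
    }

    // Fall back to the supplied default when the property is unset.
    static int getIntProperty(String key, int fallback) {
        String value = getPropertyOrNull(key);
        return value == null ? fallback : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        Properties loaded = new Properties();
        loaded.setProperty("heritrix.example.port", "8080"); // hypothetical key
        loaded.setProperty("unrelated.key", "ignored");
        exportToSystemProperties(loaded);
        System.out.println(getIntProperty("heritrix.example.port", -1));
        System.out.println(getPropertyOrNull("unrelated.key"));
    }
}
```

Because system properties are shared by the whole JVM, a scheme like this means every Heritrix instance in the same JVM sees the same values for these keys.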
* .classpath
* project.properties
* project.xml
Added very basic jndi service provider.
* maven.xml
Copy over jndi.properties file.
* src/conf/heritrix.properties
Added in SinkHandler as a handler. It's configured to act on
all WARNING/SEVERE level logs.
Upped the log level for CrawlJob so its jmx registration
of new jobs shows in heritrix_out.
* src/java/org/archive/crawler/Heritrix.java
Refactoring. If we were launched from the cmdline, we'll register
our Heritrix instance. Otherwise, if launched in another context
-- by a j2ee container -- let it manage the registration. Added
listening for mbean registration so we can catch the name we were
registered as. Also, on registration/deregistration, add/remove
ourselves from the configured jndi.
Cleaned up property handling. Add all org.archive.crawler* and
heritrix.* properties to system properties.
Made many static data members instance data members instead. Also
removed some data members, making them local to the method instead
(some, such as 'noWui' or 'selftesturl', didn't need to be class
variables). Added instance of new AlertManager interface in place
of old alert handling methods (new manager delegates to
SinkHandler). Added jndiContext. Added vector of all local
Heritrix instances. Added methods to get at instances.
Moved deleteJob jmx method from CrawlJob to here. Jmx name for
Heritrix bean now has 'host' part. Also, jmx 'name' part is no
longer always 'Heritrix', to allow for multiple Heritrix
MBeans/Instances in the one JVM. Moved property handling that used
to happen in here out to a new PropertyUtils class
(getPropertyOrNull, getBooleanProperty, getIntProperty). Redid
registration of MBeans support.
Added to JMX status the name of the current crawl job, if there is
one.
Remove alerts. Instead redo them as straight severe logs, the new
fashion.
* src/java/org/archive/crawler/WebappLifecycle.java
Formatting.
* src/java/org/archive/crawler/admin/CrawlJob.java
Refactoring. Moved job crawlcontroller in here from
CrawlJobHandler.
Also moved a bunch of methods that seemed to fit CJ better than
CJH (pause, resume, checkpoint, importUris, etc.).
Added listener for MBean registration.
Merged the bdbje MBean into the CrawlJob MBean (the bdbje
attributes and operations are now available out of this MBean.
Also fixed the database stats so they work now, so we can get the
state of any bdbje db at crawl time). Moved the terminate of job
from here to the Heritrix MBean (it better matches the addJob that
is already in Heritrix, and it means that CJ doesn't have to have
CJH knowledge). Made all constructors share common code. Changed
startCrawling. Have it register this MBean.
Moved crawljobs reports from CJH into here (getFrontierOneLine,
getFrontierReport, etc.). Redid the way I do the MBean description
to use the bdbje model; rather than keep an array of attributes
and operations, instead do lists and at the end convert to arrays.
Saves on having to renumber array members as we add/remove
elements.
Made it implement CrawlStatusListener.
Remove alerts. Instead redo them as straight severe logs, the new
fashion.
* src/java/org/archive/crawler/admin/CrawlJobErrorHandler.java
Formatting.
* src/java/org/archive/crawler/admin/CrawlJobHandler.java
Moved much of this class to CJ, where it seems to sit better
(reports, crawl job operations such as pause/resume, etc.).
Removed management of registration of jobs with jmx. Let the
MBeans add and remove themselves. Redid the CrawlStatusListener.
Now it just does upkeep of its list of jobs (other functionality
has been moved to CJ).
Remove alerts. Instead redo them as straight severe logs, the new
fashion.
* src/java/org/archive/crawler/admin/StatisticsTracker.java
Allow that the controller may be null (happens sometimes when
we're starting up). Formatting.
* src/java/org/archive/crawler/datamodel/BigMapFactory.java
Go via system properties rather than via heritrix to get
config.
* src/java/org/archive/crawler/datamodel/CrawlOrder.java
Removed constraint on user_agent and from. Was throwing a new
alert every time the page was accessed. Now we just get one alert
at the end, after submission, if there is a problem.
Make severe message carry its exception rather than just the
exception message.
*
src/java/org/archive/crawler/postprocessor/LowDiskPauseProcessor.java
Remove alerts. Instead redo them as straight severe
logs, the new fashion.
*
src/java/org/archive/crawler/selftest/SelfTestCrawlJobHandler.java
Formatting. Pass in selftesturl rather than go to get a
static from
Heritrix.
*
src/java/org/archive/crawler/settings/CrawlSettingsSAXHandler.java
Remove alerts. Instead redo them as straight severe
logs, the new fashion.
* src/java/org/archive/crawler/settings/ValueErrorHandler.java
Formatting.
* src/java/org/archive/crawler/util/BdbUriUniqFilter.java
Add to exception message state of recycle.
* src/java/org/archive/crawler/util/LogUtils.java
Go via PropertyUtils and System rather than Heritrix to
get config properties.
* src/java/org/archive/crawler/writer/ARCWriterProcessor.java
Remove alerts. Instead redo them as straight severe
logs, the new fashion.
* src/java/org/archive/util/JmxUtils.java
Additions. Log message on registration/deregistration.
Details on MBeanServer. Converters for non-openmbean attributes
and operations (so the bdbje can sit in our openmbean CJ).
* src/webapps/admin/index.jsp
* src/webapps/admin/reports.jsp
* src/webapps/admin/include/head.jsp
Reference the current heritrix instance rather than a no longer
existing static Heritrix.
* src/webapps/admin/console/alerts.jsp
* src/webapps/admin/console/readalert.jsp
Reference the current heritrix instance rather than a no longer
existing static Heritrix. Also, redone so it uses LogRecord
equivalents for former Alert functionality (id, time, level).
* src/webapps/admin/include/handler.jsp
Now add current heritrix instance to the application
context.
* src/webapps/admin/reports/frontier.jsp
* src/webapps/admin/reports/processors.jsp
* src/webapps/admin/reports/threads.jsp
API changed for accessing reports (Must go to CJ now).
* lib/MirrorJNDI-1.0.jar
Added.
Very basic in-memory jndi service provider.
* src/java/org/archive/crawler/framework/AlertManager.java
Added.
Interface.
* src/java/org/archive/io/SinkHandler.java
Added.
Log handler that keeps around reference to LogRecords
(till deleted).
* src/java/org/archive/io/SinkHandlerLogRecord.java
Added.
LogRecord with time of occurrence and support for
whether it's been read or not.
* src/java/org/archive/io/SinkHandlerTest.java

* src/java/org/archive/util/JndiUtils.java
Added.
Utilities for jndi.
* src/java/org/archive/util/PropertyUtils.java
Added.
Properties methods that used to be in Heritrix.
* src/webapps/admin/local-instances.jsp
Added.
A rough jsp page that lists all local heritrix instances
and their state.
Allows creation of new instances. Use this page to
switch the UI between
instances.
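
For readers unfamiliar with the approach, here is a minimal sketch of a java.util.logging handler that keeps references to WARNING/SEVERE LogRecords until they are deleted, in the spirit of the SinkHandler described in the commit message. It is an assumption-laden illustration, not the actual org.archive.io.SinkHandler: the class name RetainingHandler and the methods getRecords and remove are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Illustrative sketch only; not the actual SinkHandler source.
public class RetainingHandler extends Handler {
    private final List<LogRecord> records = new ArrayList<>();

    public RetainingHandler() {
        // Only act on WARNING and SEVERE records.
        setLevel(Level.WARNING);
    }

    @Override
    public synchronized void publish(LogRecord record) {
        // isLoggable applies the handler level (and any filter).
        if (isLoggable(record)) {
            records.add(record);
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered to an external sink.
    }

    @Override
    public synchronized void close() {
        records.clear();
    }

    // Snapshot of the retained records, e.g. for a UI page to list.
    public synchronized List<LogRecord> getRecords() {
        return new ArrayList<>(records);
    }

    // Drop a record once an operator has dealt with it.
    public synchronized boolean remove(LogRecord record) {
        return records.remove(record);
    }

    public static void main(String[] args) {
        RetainingHandler handler = new RetainingHandler();
        Logger logger = Logger.getLogger("sketch.example");
        logger.setUseParentHandlers(false);
        logger.addHandler(handler);
        logger.info("below the handler level, so not retained");
        // A SEVERE record carrying its exception, not just the message.
        logger.log(Level.SEVERE, "disk problem", new RuntimeException("example"));
        System.out.println(handler.getRecords().size());
    }
}
```

Since any class can obtain a Logger without a reference to a particular Heritrix instance, a handler like this avoids passing the instance around just to raise alerts.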



Attached File

No Files Currently Attached

Changes ( 5 )

Field              Old Value  Date              By
status_id          Open       2005-12-02 17:29  stack-sf
close_date         -          2005-12-02 17:29  stack-sf
artifact_group_id  None       2005-09-23 20:53  gojomo
priority           5          2005-09-23 19:00  gojomo
assigned_to        nobody     2005-08-15 23:39  stack-sf