Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

6 Untangle heritrix from jetty - ID: 1045817
Last Update: Comment added ( karl-ia )

The notion is that we'd keep the current configuration, which
has an embedded Jetty, but that we'd also have a
Heritrix config that allows us to deploy the Heritrix UI
under other containers such as Tomcat, JBoss, etc.


Submitted: Michael Stack ( stack-sf ) - 2004-10-13 00:27
Priority: 6
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: multimachine
Group: None
Visibility: Public


Comments ( 8 )

Date: 2007-03-14 01:34
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-837 -- please add further
comments at that location.


Date: 2004-11-17 18:00
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Added to FAQ and added auto-building of WAR.
Closing.


Date: 2004-11-17 02:55
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Mostly done as part of the commit below. Still need to make the
build produce the WAR file automatically, so it's easy for people to
play with it, and add an entry to the FAQ.


Start of '[ 894467 ] Stopping, pausing, checkpointing from
command line/scripts' and most of '[ 1045817 ] Untangle
heritrix from jetty' (it's possible to run
Heritrix in Tomcat now).
* .classpath
Added in the jmx client (remote) libs. Put the commons
libs all together.
* maven.xml
Bundle modules, profiles, and heritrix.* into the
heritrix jar file.
* project.properties
Added jmx client (remote) libs.
* project.xml
Added jmx client (remote) libs and bundle modules, profiles, and
heritrix.* into the heritrix jar file.
* src/conf/heritrix.properties
Jmx server defaults.
* src/java/org/archive/crawler/CommandLineParser.java
Make mention of new jmx server option.
* src/java/org/archive/crawler/Heritrix.java
Lots of refactoring. Added a new Heritrix constructor used in the webapp
context; later it will also be used to start up the heritrix MBean when we
float heritrix in jboss. Later, make it so that we use it from the command
line too (move the initialize method into the constructor). Most of the
refactoring is so we don't assume the command line or that we're running
inside jetty. We provide defaults if none can be found, and we look in
CLASSPATH for configuration files such as modules and profiles if they
can't be found on disk. It should be easier to embed Heritrix now: just
(new Heritrix()).launch() should be enough to get you going.
(doStart): Removed. Renamed ...
(doCmdLineArgs): Was doStart. Added handling of the new 'j' option.
(selftest): Made it non-static. Same for launch.
(startJmxServer, isStarted, start, stop, getStatus): Added.
* src/java/org/archive/crawler/admin/CrawlJobHandler.java
Changes so we look for profile in CLASSPATH if not found
on filesystem.
Changed name of default profile from 'Simple' to 'default'.
Refactoring: moved reusable code into its own method,
loadProfile.
* src/java/org/archive/crawler/admin/ui/JobConfigureUtils.java
Added new printOutSeeds method. Removed duplicated code
from jsp.
Made the jsp use this function instead.
* src/java/org/archive/crawler/admin/ui/WebappLifecycle.java
Use new start and stop functions from Heritrix.
* src/java/org/archive/crawler/datamodel/SeedList.java
Removed CrawlController references. They are not needed and could cause
memory retention issues. Added a function that returns the seed list as
a stream, so seeds can be read from CLASSPATH if not found on the
file system.
* src/java/org/archive/crawler/datamodel/SeedListTest.java
Adjust test so that it suits changed API.
* src/java/org/archive/crawler/framework/CrawlScope.java
Adjust to suit new SeedList constructor API.
* src/java/org/archive/crawler/settings/XMLSettingsHandler.java
Allow the settings object to be an InputStream so we are able to read
order.xml from CLASSPATH.
* src/java/org/archive/httpclient/ConfigurableX509TrustManager.java
Formatting.
* src/java/org/archive/util/OneLineSimpleLogger.java
Added setConsoleHandler static method.
* src/webapps/admin/index.jsp
Move the check for logged-in-ness into here. Tomcat doesn't like
hard references to the login page. Also, if we were not launched from
the command line, remove the shutdown button (it doesn't make sense
when launched by the container).
* src/webapps/admin/login.jsp
Moved check for logged-in-ness to index.jsp.
....
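
Several entries in the commit message above describe the same pattern: look for a configuration file (profile, seeds, order.xml) on disk first, and fall back to CLASSPATH if it is not there. A minimal sketch of that lookup order follows. ConfigLocator and its open method are hypothetical names made up for illustration; they are not taken from the Heritrix source, and the real code may well be organized differently.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not a Heritrix class: filesystem first, CLASSPATH second. */
final class ConfigLocator {
    private ConfigLocator() {}

    /**
     * Opens the named resource from baseDir if it exists there as a regular
     * file; otherwise tries the CLASSPATH. Returns null if neither has it.
     */
    static InputStream open(File baseDir, String name) throws IOException {
        File onDisk = new File(baseDir, name);
        if (onDisk.isFile()) {
            return new FileInputStream(onDisk);
        }
        // Class loader resource names are '/'-separated, with no leading slash.
        return ConfigLocator.class.getClassLoader().getResourceAsStream(name);
    }
}
```

Returning a stream rather than a File is what lets the same caller handle both cases, since a resource bundled inside a jar or WAR has no path on disk. Inside a container it may be more appropriate to consult the thread's context class loader instead; that choice is left open here.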


Date: 2004-11-11 00:40
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Upped priority.

Attached patch that finishes off the work that makes us work
as a standalone webapp. I can close this issue once I get a
chance to commit the patch.

Need to add the building of war file to crawltools config.


Date: 2004-11-10 22:58
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Upped priority.

Attached patch that finishes off the work that makes us work
as a standalone webapp. I can close this issue once I get a
chance to commit the patch.

Need to add the building of war file to crawltools config.


Date: 2004-11-10 22:51
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Upped priority.

Attached patch that finishes off the work that makes us work
as a standalone webapp. I can close this issue once I get a
chance to commit the patch.

Need to add the building of war file to crawltools config.


Date: 2004-11-10 22:50
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Upped priority.

Attached patch that finishes off the work that makes us work
as a standalone webapp. I can close this issue once I get a
chance to commit the patch.

Need to add the building of war file to crawltools config.


Date: 2004-10-21 01:27
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Have undone all hardcodings of an 'admin' context. Also
removed our custom login and put in place instead a
container managed FORM.

Next is catching webapp initialization and shutdown so that
when the webapp comes up, heritrix is ready to crawl.
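
The "next" step described above, tying the crawler's startup and shutdown to the webapp's own lifecycle, can be sketched roughly as follows. Everything here is a stand-in chosen for illustration: ContainerLifecycleListener only mimics the shape of a container callback such as a servlet context listener, and CrawlerService is a placeholder rather than the real Heritrix class. Only the method names start, stop, isStarted and getStatus echo names mentioned elsewhere in this thread.

```java
/** Hypothetical stand-in for a container's startup and shutdown callbacks. */
interface ContainerLifecycleListener {
    void contextInitialized();
    void contextDestroyed();
}

/** Hypothetical placeholder for the crawler; not the real Heritrix class. */
final class CrawlerService {
    private boolean started;

    // A real implementation would load configuration and start its threads here.
    synchronized void start() { started = true; }

    synchronized void stop() { started = false; }

    synchronized boolean isStarted() { return started; }

    synchronized String getStatus() { return started ? "started" : "stopped"; }
}

/** Starts the crawler when the webapp comes up and stops it on shutdown. */
final class CrawlerLifecycle implements ContainerLifecycleListener {
    private final CrawlerService crawler;

    CrawlerLifecycle(CrawlerService crawler) {
        this.crawler = crawler;
    }

    @Override
    public void contextInitialized() {
        if (!crawler.isStarted()) {
            crawler.start();
        }
    }

    @Override
    public void contextDestroyed() {
        if (crawler.isStarted()) {
            crawler.stop();
        }
    }
}
```

In an actual deployment the listener would be registered with the container, which then invokes the two callbacks itself; the guards on isStarted keep a repeated callback from starting or stopping the crawler twice.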


Attached Files ( 4 )

Filename      Description
webapp.patch  Patch to make heritrix work as webapp.
webapp.patch  Patch to make heritrix work as webapp.
webapp.patch  Patch to make heritrix work as webapp.
webapp.patch  Patch to make heritrix work as webapp.

Changes ( 7 )

Field       Old Value             Date              By
status_id   Open                  2004-11-17 18:00  stack-sf
close_date  -                     2004-11-17 18:00  stack-sf
File Added  108369: webapp.patch  2004-11-11 00:40  stack-sf
File Added  108366: webapp.patch  2004-11-10 22:58  stack-sf
File Added  108364: webapp.patch  2004-11-10 22:51  stack-sf
File Added  108363: webapp.patch  2004-11-10 22:50  stack-sf
priority    5                     2004-11-10 22:50  stack-sf