Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 Move BDB to core of Heritrix - ID: 1090663
Last Update: Comment added ( karl-ia )

+ Means CrawlController has the environment and
the configuring of the env.
+ Means get rid of all Frontiers but BdbFrontier (and
their supporting classes).
+ Means processors, settings, etc., whoever wants to
persist, can assume there's a bdb available to them.

Let's do this for the 1.4 release. Upped the priority.


Submitted: Michael Stack ( stack-sf ) - 2004-12-24 02:04
Priority: 7
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: refactoring
Group: None
Visibility: Public


Comments ( 5 )

Date: 2007-03-14 01:37
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-872 -- please add further
comments at that location.


Date: 2005-04-05 10:07
Sender: ck-heritrix

Logged In: YES
user_id=1220421

Sorry, the patch I mentioned is attached to RFE #1176934.

Cheers,
Christian



Date: 2005-04-05 10:03
Sender: ck-heritrix

Logged In: YES
user_id=1220421

Hi,

I think the following fits best with this issue:

I have just refactored the BdbFrontier class (and its
companion, BdbWorkQueue) into a more general, abstract
"WorkQueueFrontier" (and "WorkQueue" respectively). The
BdbFrontier is now a subclass of WorkQueueFrontier and
contains only Bdb-specific code, whereas all management
code now resides in the abstract classes. The separation
probably helps in creating new queue types (besides Sleepycat
BDB) and in integrating other frontier concepts like the
AdaptiveRevisitFrontier into a common frontier base.
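
A rough illustration of the split described above, not the code in the attached patch: the abstract classes keep the management logic (per-queue bookkeeping, picking the next URI), and a subclass supplies only the storage. Every name ending in "Sketch", the method names, and the in-memory backend are hypothetical stand-ins; a Bdb-backed subclass would implement the same two storage hooks against a Sleepycat database.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Management code shared by every backend; storage is left abstract. */
abstract class WorkQueueSketch {
    private final String name;
    private long count = 0;

    WorkQueueSketch(String name) { this.name = name; }

    String getName() { return name; }
    long getCount() { return count; }

    /** Bookkeeping lives here, so backends never duplicate it. */
    final void enqueue(String uri) {
        insertItem(uri);
        count++;
    }

    /** Returns the oldest queued URI, or null if the queue is empty. */
    final String dequeue() {
        if (count == 0) {
            return null;
        }
        count--;
        return removeFirstItem();
    }

    // Storage hooks: the only part a backend has to write.
    protected abstract void insertItem(String uri);
    protected abstract String removeFirstItem();
}

/** Hypothetical in-memory backend standing in for a Bdb-backed queue. */
class MemoryWorkQueueSketch extends WorkQueueSketch {
    private final Deque<String> items = new ArrayDeque<>();

    MemoryWorkQueueSketch(String name) { super(name); }

    @Override protected void insertItem(String uri) { items.addLast(uri); }
    @Override protected String removeFirstItem() { return items.removeFirst(); }
}

/** Frontier-level management: owns the queues, delegates their creation. */
abstract class WorkQueueFrontierSketch {
    private final Map<String, WorkQueueSketch> queues = new LinkedHashMap<>();

    final void schedule(String queueKey, String uri) {
        queues.computeIfAbsent(queueKey, this::createQueue).enqueue(uri);
    }

    /** Returns the next URI from the first non-empty queue, or null. */
    final String next() {
        for (WorkQueueSketch queue : queues.values()) {
            String uri = queue.dequeue();
            if (uri != null) {
                return uri;
            }
        }
        return null;
    }

    // Backend hook: which kind of queue to create.
    protected abstract WorkQueueSketch createQueue(String queueKey);
}

class MemoryFrontierSketch extends WorkQueueFrontierSketch {
    @Override protected WorkQueueSketch createQueue(String queueKey) {
        return new MemoryWorkQueueSketch(queueKey);
    }
}

class FrontierSketchDemo {
    public static void main(String[] args) {
        WorkQueueFrontierSketch frontier = new MemoryFrontierSketch();
        frontier.schedule("archive.org", "http://archive.org/");
        frontier.schedule("archive.org", "http://archive.org/about/");
        System.out.println(frontier.next());
    }
}
```

With this shape, a different frontier concept would only need to override the hooks it cares about, while scheduling and bookkeeping stay in one place.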

Please find the patch file attached, feel free to
use/integrate it.

Best regards,
Christian



Date: 2005-01-04 02:40
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

I marked the old Frontiers as deprecated rather than removing them.

Closing. (I'll move BdbServerCache over to the new shared
BdbEnvironment as part of the 'never OOM' work.)


Date: 2005-01-04 02:25
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Made a start on this work, moving the BdbEnvironment setup from
BdbFrontier to CrawlController.

Start of '[ 1090663 ] Move BDB to core of Heritrix' work. (This also
moves us along on the '[ 1020779 ] Never OOM' issue, making it easier
to share a single BdbEnvironment.)
CrawlController always sets up a bdb environment from here on out. Go
to CrawlController to get the shared bdb environment and classcatalog
database.
Removed the 'experimental' from BdbFrontier and made it the default
Frontier. HostQueuesFrontier has been marked as deprecated.
* src/conf/modules/urifrontiers.options
Moved the order around (probably doesn't do anything, but the intent is
to make the BdbFrontier seem like the default).
* src/conf/profiles/default/order.xml
Have BdbFrontier as default frontier.
* src/conf/selftest/order.xml
Test with BdbFrontier (it actually uncovered a serialization problem
with CrawlURIs, fixed as part of this commit).
* src/java/org/archive/crawler/datamodel/CrawlOrder.java
Added Bdb cache percentage options here from BdbFrontier.
(bdb-cache-percent): Added.
* src/java/org/archive/crawler/datamodel/CrawlURI.java
Formatting.
* src/java/org/archive/crawler/datamodel/credential/Credential.java
An Object payload was causing problems serializing a CrawlURI that had
attached credentials. The attached Credential was an instance of the
httpclient BasicAuth class, which does not implement Serializable.
Made payload dumber: it is now a String. Currently the only thing that
uses payload is BasicAuth; the form login auth doesn't use it. When
BasicAuth uses it, the only thing we want from the stored object is the
realm this auth is for, so store only that (doing this, CrawlURIs are
serializable by Bdb).
* src/java/org/archive/crawler/datamodel/credential/CredentialAvatar.java
Made this class serializable. It's stuffed into CrawlURIs, so it needs
to be serializable if it's to be persisted to a Bdb db.
Also changed the type of payload from Object to String.
* src/java/org/archive/crawler/datamodel/credential/HtmlFormCredential.java
* src/java/org/archive/crawler/datamodel/credential/Rfc2617Credential.java
Change in API; populate now takes a String, not an Object, for payload.
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
Change in API; populate now takes a String, not an Object, for payload.
For RFC2617, save the Realm.
* src/java/org/archive/crawler/filter/OrFilter.java
Formatting.
* src/java/org/archive/crawler/framework/CrawlController.java
Set up and hold references to the bdb environment (Heritrix always runs
with a bdb environment now).
(setupBdb, getBdbEnvironment, getClassCatalog): Added.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
Removed all setup of bdb environment. It's been moved to
CrawlController. Use the CrawlController bdb environment instead.
* src/java/org/archive/crawler/frontier/BdbMultipleWorkQueues.java
Use the passed-in class catalog rather than create one of my own.
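
The Credential.java and CredentialAvatar.java entries above describe a general Java serialization pitfall, which the following minimal sketch tries to reproduce with the standard library only. It is not Heritrix code: OpaqueAuthState is a hypothetical stand-in for the non-serializable auth object, the two avatar classes are simplified inventions, and plain java.io serialization is used here on the assumption that it fails in the same way the Bdb binding did.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a third-party auth object; NOT Serializable. */
class OpaqueAuthState {
    final String realm;

    OpaqueAuthState(String realm) { this.realm = realm; }
}

/** "Before" shape: the payload is an arbitrary Object. */
class ObjectPayloadAvatar implements Serializable {
    private static final long serialVersionUID = 1L;
    final Object payload;

    ObjectPayloadAvatar(Object payload) { this.payload = payload; }
}

/** "After" shape: the payload is just the realm, stored as a String. */
class StringPayloadAvatar implements Serializable {
    private static final long serialVersionUID = 1L;
    final String payload;

    StringPayloadAvatar(String payload) { this.payload = payload; }
}

class SerializationSketch {
    /** Returns true if the whole object graph can be written out. */
    static boolean canSerialize(Object candidate) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (NotSerializableException e) {
            // Some object reachable from candidate is not Serializable.
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        OpaqueAuthState auth = new OpaqueAuthState("example realm");
        System.out.println("Object payload serializable: "
            + canSerialize(new ObjectPayloadAvatar(auth)));
        System.out.println("String payload serializable: "
            + canSerialize(new StringPayloadAvatar(auth.realm)));
    }
}
```

Declaring a class Serializable is not enough on its own: every object reachable through its fields has to be serializable too, which is why narrowing the payload to the one String actually needed sidesteps the problem.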



Attached File

No Files Currently Attached

Changes ( 3 )

Field        Old Value  Date              By
assigned_to  nobody     2005-01-04 02:40  stack-sf
close_date   -          2005-01-04 02:40  stack-sf
status_id    Open       2005-01-04 02:40  stack-sf