Share

Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 Evaluate Berkeley DB Frontier - ID: 1002704
Last Update: Comment added ( karl-ia )

This RFE covers looking at making a BDB based-Frontier.

BDB has caching, serialization tricks, a collections
interface, sorting, duplicate checks, its ACID, dbs can
be moved between machines, has recovery tools, and it
has a track record.

If frontier was BDB based, its ACID'ness might make
checkpointing easier to do.

First we'd need to look at evaluating whether BDB could
work with thousands of databases of millions of records
-- how does it perform compared to disk-based queues?

Then we'd do some db design that would figure when to
create a db (per host, a metadb that keeps the state of
all known queues, secondary dbs for alternate keys),
how we'd get our objects into the dbs, size of caches.

Comparing BDB to DiskStack for the has-it-been-seen
might be the first place to look at integrating BDB.


Michael Stack ( stack-sf ) - 2004-08-03 15:55

5

Closed

None

Gordon Mohr

None

None

Public


Comments ( 5 )

Date: 2007-03-14 01:32
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-813 -- please add further
comments at that location.


Date: 2004-12-13 18:13
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Closing. We're way past the evaluation stage having built
an actual implementation.


Date: 2004-10-08 18:32
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Assigning to Gordon. He's been working on an implementation
of late.


Date: 2004-09-27 21:18
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Below is note from gordon this morning and some comments....






Sounds good. Lets design some tests and divvy up the work.

If all is in one database, I'm guessing we'll have to
maintain multiple concurrent cursors going against the one
continually updated db. Lets see how bdb performs under
these circumstances.

I'm assuming that as soon as we'd see a new host, we'd do an
insert in both the PENDING and QUEUEs db. We might want
to make this double insert transactional. There are other
benefits to transactions such as quicker startup times and
faster checkpointing -- both BDBs and the to-be heritrix's
-- but it also comes with costs. If we want to do this,
lets see how much transactions cost.

Maybe we can use the cursor search mechanism to find the
host to serve next rather than keep queue head keys in a
separate QUEUEs db.

The QUEUEs db content looks to be volatile. A read of the
PENDING db looks like it could restore the state after
crash. It doesn't look to be a structure of large size
either -- an entry per host at most. Maybe this could be an
inmemory structure.

Might simplify things if we have a db per priority.

The alreadyseen tests I've been running have 64million
entries with lookup/insert times of ~6ms
(http://crawler.archive.org/cgi-bin/wiki.pl?AlreadySeen). A
wider key with a value payload -- the alreadyseen lists
don't have values -- would not be as performant; lets see
how much worse it is.

(Here's an old RFE for this issue:
https://sourceforge.net/tracker/index.php?func=detail&aid=1002704&group_id=73833&atid=539102.
I've copied this note into it.)

Good stuff.
St.Ack





Gordon Mohr (Internet Archive) wrote:

> I'd like to work out the design of a Frontier which uses
BerkeleyDB
> Java Edition for all its queue management. (It might also
uses BDB for
> its already-seen structure, but that choice can be
independent.)


>
> The potential benefits would include:
> - no more out-of-memory problems triggered by number of
host queues
> - more simple code, built on the BDB primitives rather
than our
> own mix of disk-based queues
> - much easier Frontier checkpointing
> - potentially, much more efficient use of disk I/O and thread
> synchronizations -- BDB's thread-safety model and automatic
> caching may perform better than all but highly-optimized
> systems we would desing ourselves
>
> I've been working on three things in the existing Frontier --
> reducing lock contention, capping memory usage across an
unbound
> number of distinct queues, and simplifying the internal
queues to
> ease checkpointing -- and it's been very difficult,
especially if
> we try to retain existing features like jumping
prerequisites or
> embeds to the front of their same-host queues, or even keeping
> distinct queues for a site first approach.
>
> I think the BDB approach should make these much easier,
and so I'm
> eager to flesh it out and test it with working code before
investing
> much more in a custom approach.
>
> How it might work:
>
> Though in the future -- doing web-wide crawls or crawls
which consider
> the whole change history of a site -- we may want a
separate BDB database
> per host, I think at least initially we could have all
queues of all
> pending URIs held in a single database. Call this database
PENDING.

> The keys would be synthetic, to impose a useful grouping
and ordering
> on the entries, and the values would be the serialized
CrawlURIs.
>
> A key could be 128 bits:
>
> 60 bits: fingerprint of 'class' (either host, or full
authority)
> (would take 64-bit fingerprint and truncate off
4 bits)
> 4 bits: 'priority', where all items of lower priority
are visited
> before later
> 64 bits: serial number (monotonically increasing global
counter)
>
> This would group URIs from the same 'class'; ensure
higher-priority
> items are always sorted before lower-priority items; and
otherwise
> leave them in order-of-discovery.
>
> The trick to treating this one PENDING database as a series of
> other independent queues would be to maintain a separate
database
> of QUEUES, which has pointers to all the queue heads.
Essentially,
> this is the 60-bit 'class' fingerprints, with zeroes in the
> remaining 68 bits. Seeking to the first entry greater than
these
> points will always yield the 'next' item in the respective
queues.
>
> The QUEUES database could keep other stats on the queue --
such
> as total entries -- and would in fact at any point have some
> smaller number of entries -- 'active queues' -- which are
maintained
> in plain in-memory objects, only occasionally synced with the
> QUEUES database.
>
> Different Frontier policies would largely be determined by
what
> set of QUEUES entries are actively used as pointers to get
> CrawlURIs from the PENDING database -- ie, what queues are
active.
>
> A secondary database into the QUEUES could even implement many
> forms of per-host preferencing: prefer the queue with the most
> items in it, or the one with the least, or the queue closest
> percentage-wise to finishing, or the queue furthest from
finishing,
> etc.
>
> 'Snoozing' queues for politeness would work as currently.
>
> There's a good chance that out-of-the-box, such an
approach would
> more efficiently time-ration disk accesses than our existing
> implementation, given that BDB...
> (1) uses a compressed serialization
> (2) uses append-only logs
> (3) has tunable use-based caching
>
> The patterns of access will usually be clumpy -- writing
batches of
> URIs into the same part of PENDING at once, continuing to
read URIs
> that were written together soon after each other.
>
> Thoughts?
>

> - Gordon



Date: 2004-08-05 20:55
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

A queue per host makes for a simpler implementation but
looking here at a broad crawl, theres massive I/O going on
creating dirs to hold queues. BDB will make files per DB.
Will have another version of same prob. unless some other
notion for queues.

BDB 'checkpoints'. Everytime it startsup it goes to look at
its logs to see the difference between them and the db state
restoring difference. You can set a parameter to have BDB
checkpoint itself at greater intervals.




Attached File

No Files Currently Attached

Changes ( 4 )

Field Old Value Date By
status_id Open 2004-12-13 18:13 stack-sf
close_date - 2004-12-13 18:13 stack-sf
assigned_to nobody 2004-10-08 18:32 stack-sf
summary Berkeley DB Frontier 2004-08-03 16:12 stack-sf