Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

8 new alreadyIncluded option: Bloom filter based - ID: 1225729
Last Update: Comment added ( karl-ia )

In a big, long crawl (>50 million discovered URIs,
>25 million queued), the crawl is slowing and thread dumps
suggest a lot of time is spent in BDB 'critical evictions'.

The suspicion is that the size of the databases, plus the
dispersed lookup pattern of alreadyIncluded tests, is keeping
the BDB cache from staying usefully 'warm' with the right
records and btree nodes.
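
For a sense of scale, the standard Bloom filter sizing formulas can be applied to the URI count reported above. This is only a rough sketch: the class name `BloomSizing` and the target false-positive rate of one in a million are assumptions for illustration, not values taken from Heritrix.

```java
public class BloomSizing {
    /** Optimal bit count m = -n * ln(p) / (ln 2)^2 for n items at false-positive rate p. */
    public static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Optimal hash function count k = (m / n) * ln 2, rounded to the nearest integer. */
    public static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 50_000_000L;   // discovered URIs, from the report above
        double p = 1e-6;        // assumed target false-positive rate
        long m = optimalBits(n, p);
        int k = optimalHashes(n, m);
        System.out.printf("bits=%d (~%d MB), hashes=%d%n",
                m, m / 8 / (1024 * 1024), k);
    }
}
```

Under these assumptions the whole filter fits in a fixed block of heap, so membership tests never touch the BDB cache, at the cost of occasional false positives (a URI wrongly treated as already included).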


Gordon Mohr ( gojomo ) - 2005-06-22 18:49

8

Closed

None

Gordon Mohr

None

1.6.0

Public


Comments ( 4 )

Date: 2007-03-14 01:42
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-951 -- please add further
comments at that location.


Date: 2005-10-06 00:10
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

I've added another variant, and made it the default for the
time being, that avoids the requirement for the bloom filter
bitfield size to be a power-of-two. Commit comment:

Another option for [ 1225729 ] new alreadyIncluded option:
Bloom filter based
* BloomFilter32bitSplit.java
    an implementation that, like BloomFilter32bp2Split,
    breaks the bitfield into parts (avoiding the GC bug with
    giant arrays) -- but only rounds bitfield size up to nearest
    1MB, rather than nearest power-of-two, so it doesn't force
    (for example) a choice between a 512MB and 1GB bitfield --
    we could have a 600MB one if we wanted.
* BenchmarkBlooms.java
    add BloomFilter32bitSplit to benchmarked implementations
* BloomUriUniqFilter.java
    make BloomFilter32bitSplit the used implementation
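
The idea in the commit comment can be sketched roughly as follows. This is not the Heritrix BloomFilter32bitSplit source: the class name `SplitBloomSketch`, the 1MB part size, the FNV-1a hash, and the double-hashing scheme are all assumptions chosen to keep the example short. It shows only the two points the comment makes: the bitfield lives in several sub-arrays rather than one giant array, and its size is rounded up to the next 1MB rather than the next power of two.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; not the Heritrix implementation. */
public class SplitBloomSketch {
    private static final long BITS_PER_MB = 8L * 1024 * 1024;
    private static final int WORDS_PER_PART = (int) (BITS_PER_MB / 32);

    private final int[][] parts;  // bitfield split into 1MB sub-arrays (assumed part size)
    private final long bitCount;  // total bits, always a multiple of BITS_PER_MB
    private final int hashCount;

    public SplitBloomSketch(long requestedBits, int hashCount) {
        // Round up to the nearest 1MB instead of the nearest power of two.
        long megabytes = Math.max(1, (requestedBits + BITS_PER_MB - 1) / BITS_PER_MB);
        this.bitCount = megabytes * BITS_PER_MB;
        this.parts = new int[(int) megabytes][WORDS_PER_PART];
        this.hashCount = hashCount;
    }

    public long bitCount() {
        return bitCount;
    }

    /** 64-bit FNV-1a over the UTF-8 bytes; an assumed stand-in hash function. */
    private static long fnv1a64(CharSequence s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.toString().getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** i-th probe position via double hashing; modulo is needed since size is not a power of two. */
    private long bitIndex(long h1, long h2, int i) {
        return Math.floorMod(h1 + i * h2, bitCount);
    }

    /** Sets the bits for s; returns true if at least one bit was newly set. */
    public boolean add(CharSequence s) {
        long h1 = fnv1a64(s);
        long h2 = (h1 >>> 32) | 1L;
        boolean added = false;
        for (int i = 0; i < hashCount; i++) {
            long bit = bitIndex(h1, h2, i);
            int part = (int) (bit / BITS_PER_MB);
            int offset = (int) (bit % BITS_PER_MB);
            int mask = 1 << (offset & 31);
            if ((parts[part][offset >>> 5] & mask) == 0) {
                parts[part][offset >>> 5] |= mask;
                added = true;
            }
        }
        return added;
    }

    /** Returns true if s may have been added; false means it definitely was not. */
    public boolean contains(CharSequence s) {
        long h1 = fnv1a64(s);
        long h2 = (h1 >>> 32) | 1L;
        for (int i = 0; i < hashCount; i++) {
            long bit = bitIndex(h1, h2, i);
            int offset = (int) (bit % BITS_PER_MB);
            if ((parts[(int) (bit / BITS_PER_MB)][offset >>> 5] & (1 << (offset & 31))) == 0) {
                return false;
            }
        }
        return true;
    }
}
```

With this layout, a 600MB filter would be requested as `new SplitBloomSketch(600L * 8 * 1024 * 1024, k)` and would allocate 600 sub-arrays. The trade-off against a power-of-two size is that bit positions need a modulo rather than a simple bit mask.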


Date: 2005-09-23 21:59
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Narrowing this RFE to be specific to a Bloom filter
alternative for large crawls.


Date: 2005-08-04 20:51
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Bloom-based alreadyIncluded filters are now available as a new
option. They still need documentation and testing.


Attached File

No Files Currently Attached

Changes ( 6 )

Field              Old Value                                              Date              By
status_id          Open                                                   2005-12-02 17:29  stack-sf
close_date         -                                                      2005-12-02 17:29  stack-sf
summary            BdbUriUniqFilter hits wall: need more alreadyIncluded  2005-09-23 21:59  gojomo
artifact_group_id  None                                                   2005-09-23 20:53  gojomo
priority           7                                                      2005-09-23 20:40  gojomo
assigned_to        nobody                                                 2005-08-04 20:51  gojomo