Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 Flushing CrawlServers problematic - ID: 958096
Last Update: Comment added ( karl-ia )

When a CrawlServer instance is no longer needed, it
could be discarded to save memory, and reinstantiated
when needed later. (If they prove necessary for keeping
running stats, replace 'discarded' with 'persisted to
disk'.)


Submitted: Gordon Mohr ( gojomo ) - 2004-05-21 15:40
Priority: 7
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: None
Group: None
Visibility: Public


Comments ( 7 )

Date: 2007-03-14 00:12
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-149 -- please add further
comments at that location.


Date: 2004-11-03 18:42
Sender: gojomo (Project Admin)

Applied immediate fix: removed the WeakReference/SoftReference
flushable cache mechanism. Closing; ServerCache growth will be
revisited under [ 1020779 ] Never OOM.

Fix for [ 958096 ] Flushing CrawlServers problematic
* ServerCache.java
Revert to strong (direct) references because, for now, the
risk of OOMs is preferable to hysteresis.


Date: 2004-10-28 00:53
Sender: gojomo (Project Admin)

At the end of the ~2 week umich crawl, I definitely saw
redundant refetches of DNS and robots.txt for the remaining
sites, sometimes less than 1 minute after the last fetch. I
suspect that flushing of CrawlServer instances while queues
were briefly snoozed is the culprit.

CrawlServer cache should probably be a BDB-backed
collection, fronted by both a soft-reference cache (that
gets flushed in low-mem conditions), and perhaps also a hard
MRU cache of the last N*(# of toethreads) used CrawlServer
instances, so that nothing that's 'active' gets churned to
disk.
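The layering proposed above can be sketched roughly as follows. This is only an illustration, not Heritrix code: the class name TieredServerCache and its methods are hypothetical, a plain Map stands in for the BDB-backed collection, and the capacity argument plays the role of N*(# of toethreads).

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical three-tier cache: hard recent-use tier, soft tier, backing store. */
class TieredServerCache<V> {
    // Hard references to the most recently used entries; never flushed by GC.
    private final Map<String, V> recent;
    // Soft references the GC may clear in low-memory conditions.
    private final Map<String, SoftReference<V>> soft = new HashMap<>();
    // Assumption: stand-in for a disk-persisted (e.g. BDB-backed) collection.
    private final Map<String, V> disk;

    TieredServerCache(final int recentCapacity, Map<String, V> disk) {
        this.disk = disk;
        // Access-ordered LinkedHashMap that evicts its least recently used entry.
        this.recent = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > recentCapacity;
            }
        };
    }

    synchronized void put(String key, V value) {
        disk.put(key, value);
        soft.put(key, new SoftReference<>(value));
        recent.put(key, value);
    }

    synchronized V get(String key) {
        V value = recent.get(key);               // 1. hard recent-use tier
        if (value == null) {
            SoftReference<V> ref = soft.get(key); // 2. soft tier
            value = (ref == null) ? null : ref.get();
        }
        if (value == null) {
            value = disk.get(key);                // 3. backing store
            if (value != null) {
                soft.put(key, new SoftReference<>(value));
            }
        }
        if (value != null) {
            recent.put(key, value);               // mark as recently used
        }
        return value;
    }

    /** True if the key is currently held by a hard reference. */
    synchronized boolean isPinned(String key) {
        return recent.containsKey(key);
    }
}
```

With a real disk-backed store, entries would be serialized on write and rebuilt on read, and any mutation of a cached instance would have to be written back; the Map stand-in hides both concerns.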


Date: 2004-10-20 21:56
Sender: gojomo (Project Admin)

Should be fixed for 1.2.


Date: 2004-10-15 00:09
Sender: gojomo (Project Admin)

This enhancement has problems...

- In a broad BdbFrontier crawl starting from
directory.google.com, I've seen what appears to be
dns-flushing-hysteresis: by the time a non-DNS URI comes up,
the DNS info has been lost to a soft-reference flush,
triggering a new DNS lookup. But the info doesn't stick
around long enough to be there for the first non-DNS URI,
causing a repeat.

- HostQueuesFrontier/KeyedQueue attempts to prevent this by
having any active KeyedQueue hold a reference to the
matching CrawlServer, thus preventing its GC. However, since
the CrawlServer <-> KeyedQueue mapping is not always
one-to-one, this will eventually show problems, too.

A fix could be for BdbFrontier and HostQueuesFrontier to
hold sets of CrawlServers against GCing; better, though, would
probably be to convert the CrawlServer cache into a
disk-persisted collection like those offered by BDBJE.
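One possible reading of "hold sets of CrawlServers against GCing" is a counted pin set owned by the frontier, so that several queues mapping to the same server do not release it prematurely. The sketch below is an assumption-laden illustration: PinnedServers, pin, and unpin are hypothetical names, not part of BdbFrontier or HostQueuesFrontier.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical holder of strong references that keeps in-use servers from being GCd. */
class PinnedServers<S> {
    // Identity-keyed so that pinning depends on the instance, not on equals().
    private final Map<S, Integer> holds = new IdentityHashMap<>();

    /** Called when a queue that uses this server becomes active. */
    synchronized void pin(S server) {
        holds.merge(server, 1, Integer::sum);
    }

    /** Called when such a queue goes inactive; drops the reference at zero holds. */
    synchronized void unpin(S server) {
        holds.computeIfPresent(server, (s, n) -> n > 1 ? n - 1 : null);
    }

    synchronized boolean isPinned(S server) {
        return holds.containsKey(server);
    }
}
```

Counting holds, rather than storing one reference per queue, is what addresses the case where the server-to-queue mapping is not one-to-one.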




Date: 2004-09-01 23:04
Sender: gojomo (Project Admin)

ServerCache has been changed in HEAD to use soft references
-- all CrawlServers (and CrawlHosts) not strongly referenced
can be GCd in low-mem conditions.


Date: 2004-07-29 00:41
Sender: gojomo (Project Admin)

Should be easy to achieve this by making the ServerCache use
SoftReferences and a WeakDictionary.

To keep 'active' CrawlServers from ever being discarded,
KeyedQueue could hold a CrawlServer while ACTIVE.

One harm to discarding a CrawlServer would be needing to
redo the IP and Robot lookups... but that's minor, and the
info could be persisted to disk if necessary.

Another open issue: would losing the credentials avatars
when a CrawlServer is discarded be any concern? (Or would
they just be refreshed when needed?)
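A minimal sketch of the soft-reference idea, for illustration only. ServerInfo, SoftServerCache, and simulateFlush are hypothetical names rather than the actual ServerCache code, and a plain HashMap of SoftReference values is used instead of any weak dictionary. An entry survives as long as some caller (for example an active queue) holds it strongly; otherwise the GC may clear it, and the next lookup has to rebuild it.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for per-server state (not the real CrawlServer). */
class ServerInfo {
    final String name;
    ServerInfo(String name) { this.name = name; }
}

/** Cache whose entries the GC may clear when memory runs low. */
class SoftServerCache {
    private final Map<String, SoftReference<ServerInfo>> entries = new HashMap<>();
    private final Function<String, ServerInfo> loader;
    private int loads = 0; // how often an entry had to be (re)created

    SoftServerCache(Function<String, ServerInfo> loader) {
        this.loader = loader;
    }

    synchronized ServerInfo get(String key) {
        SoftReference<ServerInfo> ref = entries.get(key);
        ServerInfo info = (ref == null) ? null : ref.get();
        if (info == null) {
            // Missing or flushed: rebuild, which in a crawler would mean
            // repeating the IP and robots lookups.
            info = loader.apply(key);
            loads++;
            entries.put(key, new SoftReference<>(info));
        }
        return info;
    }

    /** Demonstration hook: clears every reference, as a low-memory GC might. */
    synchronized void simulateFlush() {
        for (SoftReference<ServerInfo> ref : entries.values()) {
            ref.clear();
        }
    }

    synchronized int loadCount() {
        return loads;
    }
}
```

The sketch never removes cleared entries from the map; a longer-lived version would probably register a ReferenceQueue and purge stale keys.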


Changes ( 11 )

Field          Old Value                    Date              By
close_date     2004-09-01 23:04             2004-11-03 18:42  gojomo
resolution_id  None                         2004-11-03 18:42  gojomo
status_id      Open                         2004-11-03 18:42  gojomo
summary        Flush unneeded CrawlServers  2004-10-28 00:53  gojomo
priority       6                            2004-10-20 21:56  gojomo
data_type      539099                       2004-10-15 00:09  gojomo
status_id      Closed                       2004-10-15 00:09  gojomo
close_date     -                            2004-09-01 23:04  gojomo
assigned_to    nobody                       2004-09-01 23:04  gojomo
status_id      Open                         2004-09-01 23:04  gojomo
priority       5                            2004-07-29 00:55  gojomo