Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 OOM hit very early broad-crawling - ID: 1002164
Last Update: Comment added ( karl-ia )

Ideally even one OutOfMemoryError is too many, but on
open-ended crawls we're hitting them very early. A
little slimming down of crawler structures and
non-urgent data could probably extend crawler lifetime
several times over.

For example, a broad-site-first test crawl, already
seriously slowed by issue [ 1000865 ] (Long random
pauses where no progress is made), hit OOM after 11
hours and 220000 documents. (Cannot see how many queues
were active/existent in Frontier, due to problems
scaling frontier report up.)

Some possible quick-improvements in memory footprint:
(1) In ServerCache, flush unused CrawlServers, or
otherwise collapse robots info to just what is needed.
(2) In Frontier, discard empty, inactive queues more
aggressively.
(3) Implement capped-size alreadyIncluded (See [ 999849
] alreadyIncluded as capped-size cache without disk
backing.)
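
A minimal sketch of what suggestion (3) could look like, assuming a plain in-memory LRU set with no disk backing. The class name CappedSeenSet and its methods are hypothetical and are not taken from the Heritrix codebase; the point is only that a capped structure bounds memory, at the cost of occasionally forgetting (and so possibly recrawling) old URIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical capped-size "already included" set; not Heritrix's actual class. */
class CappedSeenSet {
    private final Map<String, Boolean> seen;

    CappedSeenSet(final int maxEntries) {
        // accessOrder=true keeps the least recently touched URI at the head,
        // so that is the entry dropped once the cap is exceeded.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Records the URI; returns true if it was not already remembered. */
    synchronized boolean add(String uri) {
        return seen.put(uri, Boolean.TRUE) == null;
    }

    /** Number of URIs currently remembered (never more than the cap). */
    synchronized int size() {
        return seen.size();
    }
}
```

With a cap of N entries, memory use stays roughly proportional to N no matter how long the crawl runs. A URI evicted from the set looks new again if it is rediscovered, which is the trade-off of skipping disk backing.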



Submitted: Gordon Mohr ( gojomo ) - 2004-08-02 18:56
Priority: 7
Status: Closed
Resolution: Duplicate
Assigned: Michael Stack
Category: None
Group: None
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:15
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-215 -- please add further
comments at that location.


Date: 2005-03-02 19:48
Sender: gojomo (Project Admin)

Closing as duplicate of RFE: "Never OOM"
https://sourceforge.net/tracker/index.php?func=detail&aid=1020779&group_id=73833&atid=539102


Date: 2004-08-05 22:50
Sender: gojomo (Project Admin)

For 1.0.0, I suggest we simply document that crawls larger
than 10,000 hosts and 10,000,000 resources are discouraged
unless you have significantly more than 256MB of RAM to
assign to the Java heap; and even adding RAM only multiplies
that capacity by a few times.
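
As an illustration only, the guidance above could be turned into a startup check along these lines. The class HeapAdvice, its constants, and the choice to treat a crawl as "large" when it exceeds either limit are assumptions made for this sketch, not anything Heritrix ships; Runtime.maxMemory() is the standard JDK call for reading the configured heap ceiling (set with the JVM's -Xmx option).

```java
/** Hypothetical pre-crawl check based on the limits suggested in this comment. */
class HeapAdvice {
    // Thresholds taken from the comment; treating them as hard cutoffs is an assumption.
    static final long SUGGESTED_HEAP_BYTES = 256L * 1024 * 1024;
    static final long HOST_LIMIT = 10_000L;
    static final long RESOURCE_LIMIT = 10_000_000L;

    /** True if the planned crawl exceeds either limit and the heap is no bigger than 256MB. */
    static boolean isDiscouraged(long hosts, long resources, long maxHeapBytes) {
        boolean large = hosts > HOST_LIMIT || resources > RESOURCE_LIMIT;
        return large && maxHeapBytes <= SUGGESTED_HEAP_BYTES;
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap the JVM will attempt to use.
        long maxHeap = Runtime.getRuntime().maxMemory();
        // Example planned scope; these numbers are placeholders.
        if (isDiscouraged(50_000L, 20_000_000L, maxHeap)) {
            System.err.println("Warning: crawl scope is large for a heap of "
                    + (maxHeap / (1024 * 1024)) + "MB");
        }
    }
}
```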




Date: 2004-08-02 23:33
Sender: gojomo (Project Admin)

FYI: the same crawl as mentioned above, run not-site-first,
hit OOM in just an hour / 30,000 fetches, because of the much
larger fanout of sites discovered. Memory profiling at the
time of OOM would probably highlight a number of key
candidates for slimming.


Changes ( 7 )

Field          Old Value           Date              By
status_id      Open                2005-03-02 19:48  gojomo
close_date     -                   2005-03-02 19:48  gojomo
resolution_id  None                2005-03-02 19:48  gojomo
summary        OOM hit very early  2004-12-10 00:20  stack-sf
assigned_to    nobody              2004-12-03 22:51  gojomo
priority       6                   2004-12-03 22:50  gojomo
priority       5                   2004-09-01 21:57  gojomo