Ideally any OutOfMemoryError is one too many, but on
open-ended crawls we're hitting them very early, and a
little slimming down of crawler structures and
non-urgent data could probably make crawler lifetime
several times longer.
For example, a broad-site-first test crawl, already
seriously slowed by issue [ 1000865 ] (Long random
pauses where no progress is made), hit OOM after 11
hours and 220,000 documents. (We cannot see how many queues
were active or even existed in the Frontier, because the
frontier report has trouble scaling up.)
Some possible quick improvements in memory footprint:
(1) In ServerCache, flush unused CrawlServers, or
otherwise collapse robots info to just what is needed.
(2) In Frontier, discard empty, inactive queues more
aggressively.
(3) Implement a capped-size alreadyIncluded (see [ 999849
] alreadyIncluded as capped-size cache without disk
backing).
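As a rough illustration of idea (3), here is a minimal sketch of a
capped-size "already seen" set with no disk backing. The class name
CappedSeenSet and its methods are hypothetical and are not part of
Heritrix; the sketch assumes URIs are tracked as plain strings and
that forgetting the least recently used entries is acceptable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical capped-size "already seen" set; not a Heritrix class. */
class CappedSeenSet {
    private final Map<String, Boolean> seen;

    CappedSeenSet(final int maxEntries) {
        // accessOrder=true keeps entries ordered least recently used first,
        // so the "eldest" entry is the one touched longest ago.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                // Evict once the cap is exceeded; memory use stays bounded.
                return size() > maxEntries;
            }
        };
    }

    /** Remembers the URI; returns true if it was not already remembered. */
    synchronized boolean add(String uri) {
        return seen.put(uri, Boolean.TRUE) == null;
    }

    /** Returns true if the URI is still remembered (also refreshes its recency). */
    synchronized boolean contains(String uri) {
        return seen.get(uri) != null;
    }

    synchronized int size() {
        return seen.size();
    }
}
```

The trade-off is that an evicted URI looks new again, so a crawl using
something like this could refetch some documents in exchange for a
fixed memory ceiling.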
Michael Stack
Date: 2007-03-14 00:15
Date: 2005-03-02 19:48
Date: 2004-08-05 22:50
Date: 2004-08-02 23:33
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2005-03-02 19:48 | gojomo |
| close_date | - | 2005-03-02 19:48 | gojomo |
| resolution_id | None | 2005-03-02 19:48 | gojomo |
| summary | OOM hit very early | 2004-12-10 00:20 | stack-sf |
| assigned_to | nobody | 2004-12-03 22:51 | gojomo |
| priority | 6 | 2004-12-03 22:50 | gojomo |
| priority | 5 | 2004-09-01 21:57 | gojomo |