Heritrix: Internet Archive Web Crawler

Tracker: Bugs

old crawls stick around, consuming memory - ID: 1020770
Last Update: Comment added ( karl-ia )

See:

http://groups.yahoo.com/group/archive-crawler/message/907

Seemed OK before 1.0.0, but is a problem in 1.0.0 and
HEAD, apparently.

May be mainly a problem with crawls that end themselves
(via reaching exhaustion or max-limits).

I believe this may be due to the following:

After a crawl runs, this reference chain still exists:

CrawlJobHandler -> CrawlJob (completed) -> SettingsHandler ->
SettingsCache -> CrawlerSettings (in the non-soft globalSettings
field of SettingsCache) -> Map of localComplexTypes ->->->
DataContainer -> Frontier -> alreadyIncluded

At some point pre-1.0.0, the chain was being broken; now it
isn't, and big things are sticking around.
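
To make the suspected chain concrete, here is a minimal sketch of how a completed job can pin a large set in memory. The classes are hypothetical stand-ins that only borrow names from the chain above (the intermediate SettingsHandler, CrawlerSettings and DataContainer links are collapsed into one map); this is not Heritrix code, and the "frontier" map key is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-ins for the classes in the chain; not Heritrix code. */
public class RetentionChainSketch {
    static class Frontier {
        // The large structure that should become garbage once the crawl ends.
        final Set<String> alreadyIncluded = new HashSet<>();
    }

    static class SettingsCache {
        // A plain (strong) field: nothing here lets the GC reclaim the entry.
        final Map<String, Object> globalSettings = new HashMap<>();
    }

    static class CrawlJob {
        final SettingsCache settingsCache = new SettingsCache();
        boolean completed;
    }

    static class CrawlJobHandler {
        // Completed jobs are kept so the UI can list them.
        final List<CrawlJob> completedJobs = new ArrayList<>();
    }

    /** Simulates a crawl that records urisSeen URIs and then finishes. */
    static CrawlJob runJob(int urisSeen) {
        CrawlJob job = new CrawlJob();
        Frontier frontier = new Frontier();
        job.settingsCache.globalSettings.put("frontier", frontier); // assumed key
        for (int i = 0; i < urisSeen; i++) {
            frontier.alreadyIncluded.add("http://example.com/" + i);
        }
        job.completed = true; // nothing is cleared when the job completes
        return job;
    }

    /** Counts URIs still strongly reachable through the completed-jobs list. */
    static int retainedUris(CrawlJobHandler handler) {
        int total = 0;
        for (CrawlJob job : handler.completedJobs) {
            Object f = job.settingsCache.globalSettings.get("frontier");
            if (f instanceof Frontier) {
                total += ((Frontier) f).alreadyIncluded.size();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        CrawlJobHandler handler = new CrawlJobHandler();
        handler.completedJobs.add(runJob(1000));
        handler.completedJobs.add(runJob(1000));
        System.out.println("URIs still reachable from completed jobs: "
                + retainedUris(handler));
    }
}
```

In this sketch every finished job adds its whole set to what the handler keeps alive, which is the growth pattern described in this report.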


Submitted: Gordon Mohr ( gojomo ) - 2004-09-01 22:00
Priority: 9
Status: Closed
Resolution: Fixed
Assigned To: Michael Stack
Category: None
Group: 1.0.1
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:16
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-241 -- please add further
comments at that location.


Date: 2004-09-08 03:49
Sender: stack-sf (Project Admin)

Closing this issue. Can start up 5 jobs one after the other
now in a 256m heap where before I'd get an OOME a little way
into the 3rd.

Here is the commit message.

Workaround for "[ 1020770 ] old crawls stick around, consuming memory".
Needs to be better addressed in HEAD. It probably has the same problem.
* src/java/org/archive/crawler/frontier/Frontier.java
This patch adds crawlEnded to Frontier and registers Frontier with the
CrawlController for crawlEnded events. On end, it clears its alreadySeen
queue. Need to do this because settings has a reference to Frontier.
CrawlJob has a reference to settings. CrawlJob is kept around. It needs
settings or any subsequent use by the UI throws an NPE. Settings doesn't
have any developed means of clearing its cache. All needs to be done in
HEAD. Also added checks for crawl ended in next and finished, the two
methods all the ToeThreads hang on, so that end-of-crawl is noticed sooner.
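
The shape of the workaround described in the commit message can be sketched roughly as follows. This is a simplified illustration, not the actual patch: the Controller, CrawlStatusListener and Frontier types below are cut-down stand-ins written for this example, and initialize, schedule, endCrawl and alreadySeenSize are hypothetical helper names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified stand-ins for illustration; not the real Heritrix classes. */
public class CrawlEndedSketch {
    /** Only the end-of-crawl event is modelled here. */
    interface CrawlStatusListener {
        void crawlEnded(String exitMessage);
    }

    static class Controller {
        private final List<CrawlStatusListener> listeners = new ArrayList<>();

        void addCrawlStatusListener(CrawlStatusListener l) {
            listeners.add(l);
        }

        /** Notifies every registered listener that the crawl is over. */
        void endCrawl(String exitMessage) {
            for (CrawlStatusListener l : listeners) {
                l.crawlEnded(exitMessage);
            }
        }
    }

    static class Frontier implements CrawlStatusListener {
        private final Set<String> alreadySeen = new HashSet<>();
        private final Deque<String> pending = new ArrayDeque<>();
        private volatile boolean ended;

        /** Without this registration, crawlEnded is never called. */
        void initialize(Controller controller) {
            controller.addCrawlStatusListener(this);
        }

        void schedule(String uri) {
            if (!ended && alreadySeen.add(uri)) {
                pending.addLast(uri);
            }
        }

        /** Returns null once the crawl has ended so workers stop promptly. */
        String next() {
            if (ended) {
                return null;
            }
            return pending.pollFirst();
        }

        @Override
        public void crawlEnded(String exitMessage) {
            ended = true;
            // Drop the big structures even though this object stays reachable.
            alreadySeen.clear();
            pending.clear();
        }

        int alreadySeenSize() {
            return alreadySeen.size();
        }
    }

    public static void main(String[] args) {
        Controller controller = new Controller();
        Frontier frontier = new Frontier();
        frontier.initialize(controller);
        frontier.schedule("http://example.com/a");
        frontier.schedule("http://example.com/b");
        System.out.println("seen before end: " + frontier.alreadySeenSize());
        controller.endCrawl("finished");
        System.out.println("seen after end: " + frontier.alreadySeenSize());
    }
}
```

The point of the design is that the Frontier object itself may stay reachable through the settings cache, but the memory it owns is released as soon as the end-of-crawl event arrives.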



Date: 2004-09-08 03:26
Sender: stack-sf (Project Admin)

Had to remove the CrawlJob cleanup: if the job is
subsequently referenced in the UI (say you want to make a
new job based on it), the settings are empty and the UI
throws an NPE.

Also found that Frontier#crawlEnded was not being
called; the Frontier wasn't registering itself with the
CrawlController though it implemented the crawl status
events. I added this.

This way of addressing the problem is a workaround only.
Good enough for 1.0.0. Needs to be properly addressed
(Probably as part of [ 999839 ] revisit crawl status
signalling: controller/threads/frontier). There may be
other references being held by crawljob but if there are,
they're not easy to see -- at least they ain't the initial
64meg frontier alreadySeen size.


Date: 2004-09-08 01:39
Sender: stack-sf (Project Admin)

I've attached a chart that shows frontier references (all
the pretty images will probably be broken). The picture was
taken after running a couple of user-terminated jobs. It shows
StatisticsTracker and the settings cache holding references to
the frontier. The StatisticsTracker goes and gets the frontier
each time it uses it, so it shouldn't be holding references;
the direct culprit is the settings framework cache. The settings
framework is being held by CrawlJob. CrawlJob has no clear
'end-of-job' notion, and crawljobs are kept around after a
crawl is done as part of the completed jobs list. Meantime
it is holding references as per gordon's speculation above.

What happens on crawl end is distributed all over and is
hard to follow.

Am testing a patch that does an explicit clear of the
alreadySeen on crawlEnded, and have added to crawljobs a
clear-out of settings handler references.


Attached File ( 1 )

Filename                  Description
frontier_references.html  Jprofiler heap walk of frontier references.

Changes ( 8 )

Field              Old Value                         Date              By
artifact_group_id  None                              2004-09-14 03:52  stack-sf
assigned_to        nobody                            2004-09-14 03:52  stack-sf
resolution_id      None                              2004-09-08 03:49  stack-sf
status_id          Open                              2004-09-08 03:49  stack-sf
close_date         -                                 2004-09-08 03:49  stack-sf
File Added         100710: frontier_references.html  2004-09-08 01:39  stack-sf
priority           6                                 2004-09-01 22:06  gojomo
priority           5                                 2004-09-01 22:05  gojomo