Heritrix: Internet Archive Web Crawler

Tracker: Bugs

6 OOME guard against pages of thousands of links - ID: 1192029
Last Update: Comment added ( karl-ia )

Igor came across this page while doing the UK election crawl:

http://www.dwnconservatives.com/blog.php?sectionid=9

This page, and others from this site, cause us to
OOME, probably because of the number of links.

Defend against pages like these by cutting off
processing at an upper bound (page size or link count).


Submitted: Michael Stack ( stack-sf ) - 2005-04-28 22:18
Priority: 6
Status: Closed
Resolution: Fixed
Assigned to: Gordon Mohr
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:51
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-400 -- please add further
comments at that location.


Date: 2005-07-22 03:03
Sender: gojomo (Project Admin)

A simple-threshold OOM guard is in place; Stack
further made it property-configurable (property:
org.archive.crawler.datamodel.CrawlURI.maxOutLinks).
Closing; further configurability or robustness against
arbitrary numbers of legit outlinks should be considered as
part of new RFE [ 1242766 ] Massive outlinks: make threshold
optional, configurable.
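
As a rough illustration of a property-configurable threshold, the sketch below reads the property named above and falls back to a default. It assumes the property is exposed as a JVM system property and that the default stays at the 6000 mentioned in the earlier comment; the class name MaxOutLinksConfig and its methods are hypothetical and are not taken from the Heritrix source.

```java
/** Hypothetical helper; not part of the Heritrix code base. */
class MaxOutLinksConfig {
    /** Property name as given in the comment above. */
    static final String PROPERTY =
        "org.archive.crawler.datamodel.CrawlURI.maxOutLinks";

    /** Assumed default, taken from the initial 6000 truncation. */
    static final int DEFAULT_MAX_OUTLINKS = 6000;

    /** Read the threshold, assuming it is set as a JVM system property. */
    static int maxOutLinks() {
        return parse(System.getProperty(PROPERTY));
    }

    /** Fall back to the default for missing, malformed, or non-positive values. */
    static int parse(String raw) {
        if (raw == null) {
            return DEFAULT_MAX_OUTLINKS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_MAX_OUTLINKS;
        } catch (NumberFormatException e) {
            return DEFAULT_MAX_OUTLINKS;
        }
    }
}
```

Under that assumption the value could be overridden with a -D flag on the JVM command line. Treating non-positive values as "use the default" is a choice made for this sketch only.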



Date: 2005-06-09 02:59
Sender: gojomo (Project Admin)

Taking and upping priority as this is a possible cause of
as-yet-unexplained OOMs in giant-heap crawls.

Committing initial truncation of outlinks at 6000, plus a
CrawlURI annotation for crawl.log noting that this has been done:

First fix for [ 1192029 ] OOME guard against pages of
thousands of links
* CrawlURI.java
enforce a max number of outlinks; discard overflow,
noting count of discarded, and annotating CrawlURI so it's
identifiable for further investigation in the crawl.log
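
The behaviour described in the commit message might look roughly like the sketch below: links past the cap are dropped, the number dropped is counted, and an annotation is attached once extraction finishes. OutLinkHolder is a hypothetical stand-in for CrawlURI, and the annotation label "discardedOutLinks:" is invented for illustration; neither is taken from the actual commit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a crawled URI; not the real CrawlURI class. */
class OutLinkHolder {
    /** Cap from the comment above; the real constant name is not shown there. */
    static final int MAX_OUTLINKS = 6000;

    private final List<String> outLinks = new ArrayList<>();
    private final List<String> annotations = new ArrayList<>();
    private int discarded = 0;

    /** Keep the link while under the cap; otherwise only count it. */
    void addOutLink(String link) {
        if (outLinks.size() < MAX_OUTLINKS) {
            outLinks.add(link);
        } else {
            discarded++;
        }
    }

    /** Call once after extraction so the overflow shows up in the log line. */
    void finishExtraction() {
        if (discarded > 0) {
            // Label is an assumption made for this sketch.
            annotations.add("discardedOutLinks:" + discarded);
        }
    }

    int outLinkCount() { return outLinks.size(); }

    int discardedCount() { return discarded; }

    List<String> annotations() {
        return Collections.unmodifiableList(annotations);
    }
}
```

Counting the overflow instead of storing it keeps memory use bounded by the cap no matter how many links a page contains.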




Date: 2005-04-28 23:11
Sender: gojomo (Project Admin)

I don't know what a reasonable default max-links-to-extract
would be. Maybe 5,000? Perhaps the Extractors could remember
the most extreme counts they encounter for the Processors
report, to help us notice the outliers.
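
One way an Extractor could remember its most extreme count is sketched below. The class ExtremeLinkCountTracker and its report format are hypothetical and only illustrate the idea; how this would plug into the Processors report is not specified here.

```java
/** Hypothetical helper an extractor might hold; not a Heritrix class. */
class ExtremeLinkCountTracker {
    private int maxCount = 0;
    private String maxUri = null;

    /** Record one document's link count, keeping only the largest seen. */
    synchronized void record(String uri, int linkCount) {
        if (linkCount > maxCount) {
            maxCount = linkCount;
            maxUri = uri;
        }
    }

    synchronized int maxCount() { return maxCount; }

    synchronized String maxUri() { return maxUri; }

    /** One-line summary suitable for appending to a report. */
    synchronized String report() {
        if (maxUri == null) {
            return "max links extracted: none";
        }
        return "max links extracted: " + maxCount + " (" + maxUri + ")";
    }
}
```

The methods are synchronized on the assumption that several crawl threads may share one extractor instance.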

That extraction was cut off for a certain resource should be
noted somewhere -- an alert or at the very least a crawl.log
annotation.

If we thought any such giant link collections were
legitimate, or wanted protection against the worst-case
scenario of several max-links pages being processed at once,
we could make the link collections overflow to disk as with
other crawler data structures. At this point, I think that
step would be overkill.




Changes ( 6 )

Field              Old Value  Date              By
artifact_group_id  None       2005-09-23 18:02  gojomo
status_id          Open       2005-07-22 03:03  gojomo
resolution_id      None       2005-07-22 03:03  gojomo
close_date         -          2005-07-22 03:03  gojomo
priority           5          2005-06-09 02:59  gojomo
assigned_to        nobody     2005-06-09 02:59  gojomo