Heritrix: Internet Archive Web Crawler

Tracker: Bugs

3 Flash link extractor causes OutOfMemory exceptions. - ID: 877873
Last Update: Comment added ( karl-ia )

The ExtractorSWF requires excessive amounts of memory
to parse some documents: 100+ MB to parse a ~20 MB file.

Files that are known to cause this:
http://www.umd.umich.edu/casl/natsci/slc/slconline/RECRYS/Recrystallization.swf
http://www.umd.umich.edu/casl/natsci/slc/slconline/DIST/Distillation.swf

If the crawler is near its memory limits, it cannot
handle this, and an OutOfMemory exception occurs that can
derail the entire crawl.

Until this is fixed, using this extractor must be considered unsafe.
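
As a possible stopgap while the extractor itself is unfixed, the crawl could skip memory-hungry extraction when heap headroom looks too small. The sketch below is only an illustration under stated assumptions: `ExtractionGuard`, `MEMORY_FACTOR`, and `extractSafely` are hypothetical names and not part of Heritrix, and the factor of 6 is a guess derived from the "100+ MB for a ~20 MB file" observation above. Catching `OutOfMemoryError` is a last resort and does not guarantee that other threads in the crawl survive.

```java
public class ExtractionGuard {
    // Assumed heap-to-document ratio, with some margin over the reported ~5x.
    static final long MEMORY_FACTOR = 6;

    /** True if availableBytes is at least MEMORY_FACTOR times documentBytes. */
    static boolean hasHeadroom(long documentBytes, long availableBytes) {
        return documentBytes >= 0 && documentBytes <= availableBytes / MEMORY_FACTOR;
    }

    /** Rough estimate of the heap this JVM could still obtain. */
    static long availableHeap() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    /**
     * Runs the extractor only if headroom looks sufficient.
     * Returns false if extraction was skipped or ran out of memory.
     */
    static boolean extractSafely(long documentBytes, Runnable extractor) {
        if (!hasHeadroom(documentBytes, availableHeap())) {
            return false; // skip rather than risk derailing the crawl
        }
        try {
            extractor.run();
            return true;
        } catch (OutOfMemoryError e) {
            // Best effort only: the JVM may already be in a degraded state.
            return false;
        }
    }
}
```

A caller would pass the document's length and a `Runnable` wrapping the real extraction step, then record a skipped document for later review when `extractSafely` returns false.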


Submitted: Kristinn Sigurdsson ( kristinn_sig ) - 2004-01-15 23:04
Priority: 3
Status: Closed
Resolution: None
Assigned: Igor Ranitovic
Category: Extraction
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:07
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-51 -- please add further
comments at that location.


Date: 2004-03-25 01:28
Sender: ia_igor (Project Admin)

Fixed.


Attached File

No Files Currently Attached

Changes ( 4 )

Field        Old Value  Date              By
status_id    Open       2004-03-25 01:28  ia_igor
close_date   -          2004-03-25 01:28  ia_igor
assigned_to  nobody     2004-02-17 22:37  gojomo
priority     5          2004-02-17 22:36  gojomo