Heritrix: Internet Archive Web Crawler

Tracker: Bugs

Unnecessary toString() in ExtractorHTML.processScriptCode() - ID: 1045847
Last Update: Comment added ( karl-ia )

ExtractorHTML.processScriptCode() converts a
CharSequence into a String before passing to
ExtractorJS.considerStrings() -- even though the latter
is perfectly happy with a CharSequence.

Noticed on the NARA-MIL test crawl, which encountered a
1+MB obfuscated JavaScript segment: this conversion
triggered an OOM. An OOM might have been inevitable, but
the attempted allocation of a 2+MB temporary String (at
2 bytes per character) didn't help.

Fix is just to not convert to a String.

Treating as high-priority, low-risk fix for 1.0.x.
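
A minimal sketch of the kind of change described above, not the actual Heritrix source. The class name, the regex, and the method bodies are hypothetical stand-ins; only the idea of passing the CharSequence straight through, rather than copying it with toString(), comes from this report. It relies on java.util.regex.Pattern.matcher() accepting any CharSequence.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharSequenceSketch {
    // Hypothetical stand-in for a script string scanner; not Heritrix's real logic.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Accepts any CharSequence, so callers never need to build a String first.
    public static int considerStrings(CharSequence code) {
        Matcher m = QUOTED.matcher(code);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Before: toString() allocates a full temporary copy of the script text
    // (about 2 bytes per character) just to hand it to the scanner.
    public static int processScriptCodeCopying(CharSequence code) {
        return considerStrings(code.toString());
    }

    // After: pass the CharSequence through unchanged; no temporary copy.
    public static int processScriptCodeDirect(CharSequence code) {
        return considerStrings(code);
    }

    public static void main(String[] args) {
        // CharBuffer is one CharSequence implementation that is not a String.
        CharSequence script = CharBuffer.wrap("var a = \"x.html\"; var b = \"y.js\";");
        System.out.println(processScriptCodeCopying(script));
        System.out.println(processScriptCodeDirect(script));
    }
}
```

Both variants return the same result; the direct one simply skips the extra allocation, which matters most when the script segment is very large.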


Submitted: Gordon Mohr ( gojomo ) - 2004-10-13 02:14
Priority: 9
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: None
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:16
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-259 -- please add further
comments at that location.


Date: 2004-10-13 02:18
Sender: gojomo (Project Admin)

Fix for [ 1045847 ] Unnecessary toString() in
ExtractorHTML.processScriptCode()
* ExtractorHTML.java
Don't convert CharSequence toString before passing to
script-handling method.


Attached File

No Files Currently Attached

Changes ( 3 )

Field          Old Value  Date              By
status_id      Open       2004-10-13 02:18  gojomo
resolution_id  None       2004-10-13 02:18  gojomo
close_date     -          2004-10-13 02:18  gojomo