Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 ExtractorHTML excessive temp strings / OOM - ID: 1220714
Last Update: Comment added ( karl-ia )

ExtractorHTML, when given pathological input, sometimes
creates Strings from CharSequences that are
impractically long.

For example, this page:
http://blueliners.com.au/guestbook/guestbook.html

... had essentially a...

<a href="http://something[23MB of \0
characters]blahblah">

This 23MB CharSequence, passed as 'value' into
processLink(), was then becoming a String instance
(~46MB in size) in its "TextUtils.replaceAll()"
amp-escaping.

Generally, every CharSequence.toString(),
Matcher.group(), and TextUtils.replaceAll() in
ExtractorHTML creates a String, and we should take care
not to create Strings from excessively long junk input.

The regexps can be tightened so that where they would
have taken '+' or '*' they instead take '{1,N}' or
'{0,N}', where N is an appropriate maximum value for
the context.
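
A minimal sketch of the tightening described above. These are not the actual ExtractorHTML patterns: the href regex, the class name BoundedHrefSketch, and the cap of 2083 are assumptions made for illustration only. With the bounded quantifier, an over-long attribute value fails to match, so Matcher.group() is never asked to build a giant String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedHrefSketch {
    // Hypothetical cap for this sketch; not taken from the Heritrix source.
    static final int MAX_ATTR_LENGTH = 2083;

    // Unbounded: '+' lets the capture group grow to any length.
    static final Pattern UNBOUNDED = Pattern.compile("href=\"([^\"]+)\"");

    // Bounded: '{1,N}' limits the capture group to MAX_ATTR_LENGTH chars.
    static final Pattern BOUNDED =
        Pattern.compile("href=\"([^\"]{1," + MAX_ATTR_LENGTH + "})\"");

    /** Returns the first captured href value, or null when nothing matches. */
    static String firstHref(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Scaled-down stand-in for the 23MB run of \0 characters.
        StringBuilder junk = new StringBuilder("<a href=\"http://something");
        for (int i = 0; i < 100_000; i++) {
            junk.append('\0');
        }
        junk.append("blahblah\">");

        String big = firstHref(UNBOUNDED, junk);
        System.out.println("unbounded capture length: " + big.length());
        System.out.println("bounded capture: " + firstHref(BOUNDED, junk));
    }
}
```

The trade-off is that a link longer than the cap is dropped rather than truncated, which seems acceptable when the cap already exceeds any URL the crawler would fetch.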



Submitted: Gordon Mohr ( gojomo ) - 2005-06-14 20:23
Priority: 9
Status: Closed
Resolution: None
Assigned: Karl Thiessen
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:55
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-442 -- please add further
comments at that location.


Date: 2005-06-15 22:42
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Believed fixed; assigning to karl for verification & closing.


Date: 2005-06-14 23:16
Sender: gojomo (Project Admin)

Also evaluated ExtractorJS and ExtractorCSS for the same
risk. ExtractorJS is already protected by a max-matches
construct, but it uses a literal '2083'. As the intent here is
to match UURI.MAX_URL_LENGTH, that constant should be used by
reference.

ExtractorCSS allows an arbitrarily long match inside its URL
extractor. It should be capped at MAX_URL_LENGTH.

Changes made. Commit comment:

Followup for [ 1220714 ] ExtractorHTML excessive temp
strings / OOM
* ExtractorJS.java
adapt existing limit to refer to UURI.MAX_URL_LENGTH
* ExtractorCSS.java
add UURI.MAX_URL_LENGTH limit to url-match group


Date: 2005-06-14 20:39
Sender: gojomo (Project Admin)

Fix applied after looking over all
toString()/group()/replaceAll() uses inside ExtractorHTML,
and making sure related capturing groups are not arbitrarily
long.

Commit comment:

Fix for [ 1220714 ] ExtractorHTML excessive temp strings / OOM
* ExtractorHTML.java
tighten regexes that allowed arbitrarily-long inner
groups to limit those captures to reasonable lengths; ensure
no charsequences of arbitrary length get String-ified
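
For the "no charsequences of arbitrary length get String-ified" part, one possible shape is a length check before any toString() call. This is only a sketch under assumptions: the class SafeStringifySketch, the method unescapeAmpIfReasonable, and the MAX_CHARS value are hypothetical, and String.replace() stands in for the TextUtils.replaceAll() amp-escaping step.

```java
class SafeStringifySketch {
    // Hypothetical cap for this sketch; not taken from the Heritrix source.
    static final int MAX_CHARS = 2083;

    /**
     * Returns the value with "&amp;" unescaped, or null when the input is
     * too long to be worth copying into a String.
     */
    static String unescapeAmpIfReasonable(CharSequence value) {
        if (value.length() > MAX_CHARS) {
            // Checking length() first avoids allocating a String for junk.
            return null;
        }
        return value.toString().replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(
            unescapeAmpIfReasonable("http://example.com/?a=1&amp;b=2"));

        StringBuilder junk = new StringBuilder("http://something");
        for (int i = 0; i < 100_000; i++) {
            junk.append('\0');
        }
        System.out.println(unescapeAmpIfReasonable(junk));
    }
}
```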


Attached File

No Files Currently Attached

Changes ( 4 )

Field              Old Value   Date               By
status_id          Open        2005-12-02 17:14   stack-sf
close_date         -           2005-12-02 17:14   stack-sf
artifact_group_id  None        2005-09-23 18:29   gojomo
assigned_to        gojomo      2005-06-15 22:42   gojomo