Heritrix: Internet Archive Web Crawler

Tracker: Bugs

ExtractorJS takes forever on worst-case JS - ID: 1051916
Last Update: Comment added ( karl-ia )

Igor reports:
> ToeThread #96
> #96 http://www.ncptt.nps.gov/search_pdf/searchIndex.js (0 attempts)
> LLLLE http://www.ncptt.nps.gov/search_pdf/searchresults.cfm
> Current processor: ExtractorJS
> ACTIVE for 110h6m25s604ms
> Where: ABOUT_TO_BEGIN_PROCESSOR

Document is 6MB, primarily a database in giant (>500K
each) JS strings.

Trying a fix which tightens ExtractorJS's
JAVASCRIPT_STRING_EXTRACTOR to only accept
whitespace-free strings shorter than the
max-legal-URI-size we accept.


Submitted: Gordon Mohr ( gojomo ) - 2004-10-22 02:23
Priority: 9
Status: Closed
Resolution: Fixed
Assigned To: Gordon Mohr
Category: None
Group: 1.0.6
Visibility: Public


Comments ( 5 )

Date: 2007-03-14 00:17
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-268 -- please add further
comments at that location.


Date: 2004-10-27 00:03
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Improvement is the best we'll do failing a comprehensive
revisit of all extractors. Closing as fixed.


Date: 2004-10-23 00:46
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Comparing the old and new versions of the regex on two random
pages with JavaScript content, both extract the same speculative
URLs.


Date: 2004-10-23 00:44
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Comparing the old and new versions of the regex on two random
pages with JavaScript content, both extract the same speculative
URLs.


Date: 2004-10-22 02:34
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Old and new regexps:

// MAY HAVE EFFICIENCY PROBLEMS
// static final String JAVASCRIPT_STRING_EXTRACTOR =
//     "(\\\\*(?:\"|\'))((?:[^\\n\\r]*?[^\\n\\r\\\\])??)(?:\\1)";

// finds whitespace-free strings in Javascript
// (areas between paired ' or " characters, possibly backslash-quoted
// on the ends, but not in the middle)
static final String JAVASCRIPT_STRING_EXTRACTOR =
    "(\\\\*(?:\"|\'))(\\S{0,2083}?)(?:\\1)";

Commit comment:

Improvement for [ 1051916 ] ExtractorJS takes forever on
worst-case JS
* ExtractorJS.java
Simplified inside of quotes to be up to 2083
non-whitespace characters, reluctant. Improves a problem
case to under a minute; still needs more testing.
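
To make the difference between the two patterns concrete, here is a minimal sketch that runs both on a small, made-up JavaScript snippet. The class name JsStringExtractorSketch, the helper extract(), and the sample input are assumptions for illustration only and are not taken from ExtractorJS.java; only the two regex strings come from the comment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsStringExtractorSketch {
    // Old pattern (from the comment above): any run of non-newline
    // characters between paired quotes, not ending in a backslash.
    static final Pattern OLD = Pattern.compile(
        "(\\\\*(?:\"|\'))((?:[^\\n\\r]*?[^\\n\\r\\\\])??)(?:\\1)");

    // New pattern (from the comment above): at most 2083 non-whitespace
    // characters between paired quotes, matched reluctantly.
    static final Pattern NEW = Pattern.compile(
        "(\\\\*(?:\"|\'))(\\S{0,2083}?)(?:\\1)");

    // Hypothetical helper: returns the quoted content (group 2) of every match.
    static List<String> extract(Pattern p, CharSequence js) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(js);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented sample input: two URL-like strings and one with a space.
        String js = "var a = \"http://example.com/a.html\"; "
                  + "var b = 'two words'; var c = \"img/x.gif\";";
        // Old pattern keeps the string containing whitespace.
        System.out.println(extract(OLD, js));
        // New pattern skips it, since a URI candidate has no whitespace.
        System.out.println(extract(NEW, js));
    }
}
```

On this input the old pattern yields all three strings, while the new one drops 'two words'. A quoted string longer than 2083 characters is likewise rejected by the new pattern, which is what keeps the matcher from scanning the giant database strings described in the report.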


Changes ( 6 )

Field              Old Value  Date              By
status_id          Open       2004-10-27 00:03  gojomo
resolution_id      None       2004-10-27 00:03  gojomo
artifact_group_id  None       2004-10-27 00:03  gojomo
close_date         -          2004-10-27 00:03  gojomo
priority           5          2004-10-22 02:24  gojomo
assigned_to        nobody     2004-10-22 02:24  gojomo