Heritrix: Internet Archive Web Crawler

Tracker: Bugs

ExtractorHTML takes forever on worst-case HTML - ID: 1051072
Last Update: Comment added ( karl-ia )

Certain worst-case HTML (especially unclosed <script>
and <style> tags) in long documents (those large enough
to overflow to disk) can take hours or days to parse
using the regexp-based ExtractorHTML.

Examples:

http://neuromancer.eecs.umich.edu/dtr/twiki/bin/rdiff/TWiki/TextFormattingRules
[two open <script> tags without closing </script>]

http://www.healthypeople.gov/document/html/tracking/od01.htm
[one open <style> tag without closing </style>]

RELEVANT_TAG_EXTRACTOR pattern is open to too much
backtracking; changing several of the reluctant
all-char productions to be greedy all-but-'>'
productions seems to help significantly (bringing hours
down to seconds or minutes), and shouldn't cost any
matches.
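
To make the backtracking difference concrete, here is a minimal Java sketch. It does not use the full RELEVANT_TAG_EXTRACTOR: the two script-only patterns, the class name BacktrackingSketch, and the CountingSequence wrapper are assumptions made purely for illustration. It counts charAt() calls as a rough proxy for regex work on a document whose <script> tag is never closed. Exact counts depend on the JDK's regex implementation, so none are quoted here.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not code from Heritrix.
class BacktrackingSketch {
    // Simplified, script-only stand-ins for the two styles of production.
    static final Pattern RELUCTANT =
            Pattern.compile("(?is)<script.*?>.*?</script>");
    static final Pattern POSSESSIVE =
            Pattern.compile("(?is)<script[^>]*+>.*?</script>");

    /** CharSequence wrapper that counts charAt() calls made by the regex engine. */
    static final class CountingSequence implements CharSequence {
        private final String text;
        long reads = 0;

        CountingSequence(String text) { this.text = text; }

        @Override public int length() { return text.length(); }
        @Override public char charAt(int index) { reads++; return text.charAt(index); }
        @Override public CharSequence subSequence(int start, int end) {
            return text.subSequence(start, end);
        }
        @Override public String toString() { return text; }
    }

    /** Builds a document with one <script> tag that is never closed. */
    static String unclosedScriptDoc(int paragraphs) {
        StringBuilder sb = new StringBuilder("<script>");
        for (int i = 0; i < paragraphs; i++) {
            sb.append("<p class=\"x\">text</p>");
        }
        return sb.toString();
    }

    /** Runs a single find() and returns how many characters the engine read. */
    static long readsForFind(Pattern pattern, String doc) {
        CountingSequence seq = new CountingSequence(doc);
        pattern.matcher(seq).find();
        return seq.reads;
    }

    public static void main(String[] args) {
        String doc = unclosedScriptDoc(200);
        System.out.println("reluctant reads:  " + readsForFind(RELUCTANT, doc));
        System.out.println("possessive reads: " + readsForFind(POSSESSIVE, doc));
    }
}
```

Neither pattern matches the unclosed document, but the reluctant form retries the trailing scan from every later '>' while the possessive form gives up after one scan. On a document where the tag is closed, both forms return the same match.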


Submitted: Gordon Mohr ( gojomo ) - 2004-10-20 21:34
Priority: 9
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: Extraction
Group: 1.0.6
Visibility: Public


Comments ( 5 )

Date: 2007-03-14 00:17
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-267 -- please add further
comments at that location.


Date: 2004-10-27 00:04
Sender: gojomo (Project Admin)

This improvement is the best we'll do unless we undertake
a comprehensive revisit of the regexp-based extractors.
Closing as fixed.


Date: 2004-10-23 00:13
Sender: stack-sf (Project Admin)

Tested 5 random pages as well as the two listed above. The
new regex is much faster and extracts the same set of links.
(Had trouble with archive.org: the page kept changing on me,
so I downloaded it locally to run the comparisons.)
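
A comparison of this kind could be scripted roughly as follows. This is only a sketch: the class name ExtractorPatternCompare and the sample HTML are made up for illustration, and it compares whole regex matches rather than the links ExtractorHTML would derive from them. The two pattern strings are copied from the commit comment on this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison harness; not the test actually run for this issue.
class ExtractorPatternCompare {
    // Old pattern, as quoted in the commit comment.
    static final String OLD_PATTERN =
        "(?is)<(?:((script.*?)>.*?</script)|((style.*?)>.*?</style)|(((meta)|(?:\\w+))\\s+.*?)|(!--.*?--))>";
    // New pattern, as quoted in the commit comment.
    static final String NEW_PATTERN =
        "(?is)<(?:((script[^>]*+)>.*?</script)|((style[^>]*+)>[^<]*+</style)|(((meta)|(?:\\w+))\\s+[^>]*+)|(!--.*?--))>";

    // Invented sample page with well-formed tags (an assumption, not a tested URL).
    static final String SAMPLE =
        "<html><head><meta name=\"robots\" content=\"noindex\">"
        + "<style type=\"text/css\">p { color: red; }</style>"
        + "<script src=\"a.js\">var x = 1;</script></head>"
        + "<body><a href=\"http://example.com/\">link</a><!-- note --></body></html>";

    /** Collects every full match of the given regex, in document order. */
    static List<String> matches(String regex, CharSequence html) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> oldHits = matches(OLD_PATTERN, SAMPLE);
        List<String> newHits = matches(NEW_PATTERN, SAMPLE);
        System.out.println("old matches: " + oldHits.size());
        System.out.println("new matches: " + newHits.size());
        System.out.println("identical:   " + oldHits.equals(newHits));
    }
}
```

Agreement on one sample is not proof of equivalence. For instance, the new style production uses [^<]*+, so a <style> body containing '<' would be handled differently by the two versions.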


Date: 2004-10-21 01:55
Sender: gojomo (Project Admin)

Improvement committed. Comment:

Improvement for [ 1051072 ] ExtractorHTML takes forever on
worst-case HTML
* ExtractorHTML.java
Tighten up RELEVANT_TAG_EXTRACTOR, by using possessive
quantifiers and [^>] instead of .*, to improve performance on
tricky HTML.

For reference, old and new versions:
// version w/ problems with unclosed script tags
// static final String RELEVANT_TAG_EXTRACTOR =
// "(?is)<(?:((script.*?)>.*?</script)|((style.*?)>.*?</style)|(((meta)|(?:\\w+))\\s+.*?)|(!--.*?--))>";

// version w/ less unnecessary backtracking
static final String RELEVANT_TAG_EXTRACTOR =
"(?is)<(?:((script[^>]*+)>.*?</script)|((style[^>]*+)>[^<]*+</style)|(((meta)|(?:\\w+))\\s+[^>]*+)|(!--.*?--))>";

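The capture groups in the new pattern number off as follows, counting opening parentheses: 1 is a whole script element, 3 a whole style element, 5 a generic tag with attributes, 6 its element name, 7 the name when it is "meta", and 8 a comment. The sketch below uses that numbering to classify matches. The class name TagGroupSketch and the labels it emits are assumptions for illustration, and this is not necessarily how ExtractorHTML itself consumes the groups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the group layout; not code from Heritrix.
class TagGroupSketch {
    // New pattern, copied from the commit comment.
    static final Pattern RELEVANT_TAG = Pattern.compile(
        "(?is)<(?:((script[^>]*+)>.*?</script)|((style[^>]*+)>[^<]*+</style)|(((meta)|(?:\\w+))\\s+[^>]*+)|(!--.*?--))>");

    /** Labels each match by which alternative (capture group) matched. */
    static List<String> classify(CharSequence html) {
        List<String> kinds = new ArrayList<>();
        Matcher m = RELEVANT_TAG.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                kinds.add("script");
            } else if (m.group(3) != null) {
                kinds.add("style");
            } else if (m.group(7) != null) {
                kinds.add("meta");
            } else if (m.group(5) != null) {
                // Group 6 holds the element name of a generic tag.
                kinds.add("tag:" + m.group(6).toLowerCase(Locale.ROOT));
            } else if (m.group(8) != null) {
                kinds.add("comment");
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        String html = "<META name=\"a\" content=\"b\"><script>x()</script>"
                + "<style>p{}</style><A HREF=\"/x\">y</A><!-- c -->";
        System.out.println(classify(html));
    }
}
```

Tags without attributes, such as a bare <body>, are not matched at all, because the generic-tag alternative requires whitespace after the element name.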



Date: 2004-10-20 21:35
Sender: gojomo (Project Admin)

May be related to, or a duplicate of:

[ 951607 ] STUCK in extractorHTML(But works usually!)
https://sourceforge.net/tracker/index.php?func=detail&aid=951607&group_id=73833&atid=539099


Changes ( 6 )

Field              Old Value  Date              By
status_id          Open       2004-10-27 00:04  gojomo
resolution_id      None       2004-10-27 00:04  gojomo
close_date         -          2004-10-27 00:04  gojomo
artifact_group_id  None       2004-10-21 19:27  stack-sf
priority           7          2004-10-20 21:40  gojomo
priority           6          2004-10-20 21:40  gojomo