Certain worst-case HTML (especially unclosed <script>
and <style> tags) in long documents (which use disk
overflow) can take hours or days to parse with the
regexp-based ExtractorHTML.
Examples:
http://neuromancer.eecs.umich.edu/dtr/twiki/bin/rdiff/TWiki/TextFormattingRules
[two open <script> tags without closing </script>]
http://www.healthypeople.gov/document/html/tracking/od01.htm
[one open <style> tag without closing </style>]
The RELEVANT_TAG_EXTRACTOR pattern is open to too much
backtracking. Changing several of the reluctant
all-char productions (such as .*?) to greedy all-but-'>'
productions (such as [^>]*) seems to help significantly
(bringing hours down to seconds or minutes), and
shouldn't cost any matches.
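A minimal sketch of the difference, assuming simplified stand-in patterns. The names RELUCTANT, GREEDY_BOUNDED and findAll are hypothetical, and neither pattern is the actual RELEVANT_TAG_EXTRACTOR from ExtractorHTML. With an unclosed <script>, the reluctant form can retry the rest of the pattern from every '>' in the document, while the all-but-'>' form can only stop at the first one:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPatternSketch {
    // Hypothetical stand-ins for illustration only.
    // Reluctant all-char production: on failure, ".*?" extends past each
    // successive '>' and the remainder of the pattern is retried every time.
    static final Pattern RELUCTANT =
        Pattern.compile("(?is)<script.*?>.*?</script>");

    // Greedy all-but-'>' production: "[^>]*" cannot cross a '>', so the
    // remainder of the pattern is attempted from one position only.
    static final Pattern GREEDY_BOUNDED =
        Pattern.compile("(?is)<script[^>]*>.*?</script>");

    // Collect every full match of p in html.
    static List<String> findAll(Pattern p, CharSequence html) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Well-formed input: both patterns should report the same matches.
        String closed = "<p>a</p><script type=\"text/javascript\">"
            + "var x = 1;</script><b>c</b>";
        System.out.println(
            findAll(RELUCTANT, closed).equals(findAll(GREEDY_BOUNDED, closed)));

        // Unclosed <script> followed by many other tags: neither pattern can
        // match, but the reluctant one does far more work before giving up.
        StringBuilder unclosed = new StringBuilder("<script>");
        for (int i = 0; i < 500; i++) {
            unclosed.append("<b>x</b>");
        }
        System.out.println(findAll(RELUCTANT, unclosed).size());
        System.out.println(findAll(GREEDY_BOUNDED, unclosed).size());
    }
}
```

In this sketch the work done by the reluctant pattern on the unclosed input grows roughly with (number of '>' characters) x (document length), which is consistent with the long parse times reported above. One case worth checking before relying on the bounded form is an attribute value that contains a literal '>'.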
Gordon Mohr
Extraction
1.0.6
Public
Date: 2007-03-14 00:17
Date: 2004-10-27 00:04
Date: 2004-10-23 00:13
Date: 2004-10-21 01:55
Date: 2004-10-20 21:35
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2004-10-27 00:04 | gojomo |
| resolution_id | None | 2004-10-27 00:04 | gojomo |
| close_date | - | 2004-10-27 00:04 | gojomo |
| artifact_group_id | None | 2004-10-21 19:27 | stack-sf |
| priority | 7 | 2004-10-20 21:40 | gojomo |
| priority | 6 | 2004-10-20 21:40 | gojomo |