Heritrix: Internet Archive Web Crawler

Tracker: Bugs

6 ExtractorCSS regexp taking 'forever' on small document - ID: 1106469
Last Update: Comment added ( karl-ia )

One of the FR crawl nodes is waiting to pause, with
just one active thread:

ToeThread #12
#12
http://hotline.prem.fr/DotNetNuke/Portals/_default/Skins/skin_prem_dnn_1/style.css
(0 attempts)
RE
Current processor: ExtractorCSS
ACTIVE for 50h1m38s597ms
Where: ABOUT_TO_BEGIN_PROCESSOR for 180097394ms

It's maxing CPU. 'jstack' can't give a stack for the
exact thread, but kill -SIGQUIT does and shows it deep
in regexp matching:

[[much more omitted]]
at java.util.regex.Pattern$Curly.match(Pattern.java:4196)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4569)
at java.util.regex.Pattern$Loop.match(Pattern.java:4696)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4628)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4443)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4373)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4234)
at java.util.regex.Pattern$Curly.match(Pattern.java:4196)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4569)
at java.util.regex.Pattern$Loop.match(Pattern.java:4696)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4628)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4443)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4373)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4234)
at java.util.regex.Pattern$Curly.match(Pattern.java:4196)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4569)
at java.util.regex.Pattern$Loop.match(Pattern.java:4696)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4628)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4443)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4373)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4234)
at java.util.regex.Pattern$Curly.match(Pattern.java:4196)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4569)
at java.util.regex.Pattern$Loop.match(Pattern.java:4696)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4628)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4443)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4373)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4234)
at java.util.regex.Pattern$Curly.match(Pattern.java:4196)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4569)
at java.util.regex.Pattern$Loop.match(Pattern.java:4696)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4628)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4443)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4373)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4234)
at java.util.regex.Pattern$Curly.match(Pattern.java:4196)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4569)
at java.util.regex.Pattern$Loop.match(Pattern.java:4696)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4628)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4443)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4373)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4234)
at java.util.regex.Pattern$Curly.match(Pattern.java:4196)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4569)
at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4715)
at java.util.regex.Pattern$Prolog.match(Pattern.java:4652)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4569)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4241)
at java.util.regex.Pattern$Curly.match(Pattern.java:4196)
at java.util.regex.Pattern$BitClass.match(Pattern.java:2876)
at java.util.regex.Pattern$Slice.match(Pattern.java:3802)
at java.util.regex.Pattern$Start.match(Pattern.java:3019)
at java.util.regex.Matcher.search(Matcher.java:1092)
at java.util.regex.Matcher.find(Matcher.java:528)
at org.archive.crawler.extractor.ExtractorCSS.processStyleCode(ExtractorCSS.java:130)
at org.archive.crawler.extractor.ExtractorCSS.innerProcess(ExtractorCSS.java:112)
at org.archive.crawler.framework.Processor.process(Processor.java:102)
at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:273)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)

The CSS_URI_EXTRACTOR needs to be tightened up. For
reference, the style.css file that gave the problem is
attached (in case the website changes).
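
The repeating Curly / GroupCurly / Loop frames above are what a quantifier nested inside another quantifier tends to look like at runtime. The following is only a rough sketch of how that kind of blow-up can be measured and bounded. The class and method names (BacktrackingDemo, CountingSequence, readsFor) and both toy patterns are made up for illustration and are not taken from ExtractorCSS; how badly the nested pattern behaves depends on the JDK version.

```java
import java.util.regex.Pattern;

public class BacktrackingDemo {
    /** Thrown when the matcher has read more characters than the budget allows. */
    static final class BudgetExceeded extends RuntimeException {
        BudgetExceeded() { super("regex read budget exceeded"); }
    }

    /** CharSequence wrapper that counts charAt() calls made by the regex engine. */
    static final class CountingSequence implements CharSequence {
        private final CharSequence inner;
        private final long budget;
        long reads = 0;

        CountingSequence(CharSequence inner, long budget) {
            this.inner = inner;
            this.budget = budget;
        }

        @Override public char charAt(int index) {
            if (++reads > budget) {
                throw new BudgetExceeded(); // abort a runaway match
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new CountingSequence(inner.subSequence(start, end), budget);
        }

        @Override public String toString() { return inner.toString(); }
    }

    // Toy stand-ins (assumptions): a repeat nested in a repeat vs. a flat repeat.
    static final Pattern NESTED = Pattern.compile("(a+)+b");
    static final Pattern FLAT = Pattern.compile("a+b");

    /** Returns the number of character reads find() needed, or -1 if the budget ran out. */
    static long readsFor(Pattern p, String input, long budget) {
        CountingSequence seq = new CountingSequence(input, budget);
        try {
            p.matcher(seq).find();
        } catch (BudgetExceeded e) {
            return -1;
        }
        return seq.reads;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int n = 5; n <= 30; n += 5) {
            sb.setLength(0);
            for (int i = 0; i < n; i++) sb.append('a'); // no 'b', so the match must fail
            String input = sb.toString();
            System.out.println("n=" + n
                + " flat=" + readsFor(FLAT, input, 5_000_000L)
                + " nested=" + readsFor(NESTED, input, 5_000_000L));
        }
    }
}
```

Running main prints the read counts for growing inputs; a value of -1 means the budget was hit before the match attempt finished. Wrapping the input this way also gives a crude way to stop a stuck match instead of letting a thread spin for hours.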


Submitted: Gordon Mohr ( gojomo ) - 2005-01-21 03:56
Priority: 6
Status: Closed
Resolution: Fixed
Assigned To: Gordon Mohr
Category: None
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:20
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-342 -- please add further
comments at that location.


Date: 2005-01-21 18:35
Sender: gojomo (Project Admin)

Fix committed:

Fix for [ 1106469 ] ExtractorCSS regexp taking 'forever' on small document
* ExtractorCSS.java
    simplify CSS_URI_EXTRACTOR to avoid nested repeats, use reluctant quantifiers
    ALSO: narrow unescaping to only apply to characters named in the CSS spec
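
For readers unfamiliar with the two changes, here is a minimal sketch of what a pattern with no nested repeats and a bounded reluctant quantifier might look like, together with a narrowed unescape step. This is not the committed CSS_URI_EXTRACTOR: the class name CssUrlSketch, both patterns, and the 2047-character cap are assumptions for illustration. The escape set (parentheses, commas, whitespace, single and double quotes) follows my reading of the CSS2 rules for url() values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssUrlSketch {
    // Hypothetical pattern, NOT the real CSS_URI_EXTRACTOR from ExtractorCSS.java.
    // One optional quote, then a single bounded reluctant repeat; no repeat
    // is nested inside another repeat.
    static final Pattern CSS_URL = Pattern.compile(
        "(?i)url\\(\\s*([\"']?)([^\"'\\s)].{0,2047}?)\\1\\s*\\)");

    // Assumed escape set: backslash followed by , ' " ( ) or whitespace.
    static final Pattern CSS_ESCAPE = Pattern.compile("\\\\([,'\"()\\s])");

    /** Removes the backslash only in front of the characters listed above. */
    static String unescape(String raw) {
        return CSS_ESCAPE.matcher(raw).replaceAll("$1");
    }

    /** Collects the contents of every url(...) found in the stylesheet text. */
    static List<String> extractUrls(CharSequence css) {
        List<String> urls = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            urls.add(unescape(m.group(2)));
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = "body { background: url(\"a.png\") } "
                   + "h1 { background: URL( 'b\\,c.gif' ) }";
        System.out.println(extractUrls(css));
    }
}
```

The upper bound on the reluctant repeat keeps any single match attempt from scanning an arbitrarily long run of text, and leaving other backslash sequences alone avoids mangling URLs that happen to contain a backslash. This sketch does not handle an escaped closing parenthesis inside an unquoted URL.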



Attached File

No Files Currently Attached

Changes ( 4 )

Field          Old Value  Date              By
status_id      Open       2005-02-11 22:36  gojomo
close_date     -          2005-02-11 22:36  gojomo
resolution_id  None       2005-01-21 18:35  gojomo
assigned_to    nobody     2005-01-21 18:27  gojomo