I have a seed list of nearly 1024 entries. Not entirely
surprisingly, Heritrix behaves a little oddly with that
many seeds. First, crawls with either 0.6.0 or the
latest CVS build fail almost immediately because too
many files are opened, after which neither socket
operations nor file logging can proceed.
A typical exception:
.....
Next up, using the current CVS build, a surprising
number (like, ~70) of
java.util.ConcurrentModificationExceptions occurred in
the first moments of the crawl (and then intermittently
throughout), all with the same stack trace. An example:
20040427194255925 -5 39804 #48 http://eia.doe.gov/ 124 text/html 3t
java.util.ConcurrentModificationException
    at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:448)
    at java.util.AbstractList$Itr.next(AbstractList.java:419)
    at org.archive.crawler.scope.HostScope.focusAccepts(HostScope.java:120)
    at org.archive.crawler.framework.CrawlScope.innerAccepts(CrawlScope.java:198)
    at org.archive.crawler.framework.Filter.accepts(Filter.java:94)
    at org.archive.crawler.basic.Postselector.schedule(Postselector.java:200)
    at org.archive.crawler.basic.Postselector.handleLinkCollection(Postselector.java:262)
    at org.archive.crawler.basic.Postselector.innerProcess(Postselector.java:112)
    at org.archive.crawler.framework.Processor.process(Processor.java:106)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:205)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:135)
Looking at the code, the CrawlScope class appears to
hand out an iterator on the scope's seeds list; that
iteration needs to synchronize on the list (per the
note in
http://java.sun.com/j2se/1.4.2/docs/api/java/util/Collections.html#synchronizedCollection(java.util.Collection)
), which I guess is going to take some refactoring.
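For illustration only, here is a minimal sketch of the two usual ways to iterate a synchronized list safely. It is not Heritrix code: the class SeedIterationSketch, its methods, and the use of plain host strings as seeds are all assumptions made for the example, and it uses modern Java syntax rather than the 1.4 of the build above. The first method holds the list's lock for the whole iteration, as the Collections javadoc requires; the second copies the list under the lock and iterates the copy, which keeps the lock short at the cost of an allocation per call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a scope holding a seed list; not the real CrawlScope.
class SeedIterationSketch {
    // Assumption: seeds are kept in a synchronized wrapper around an ArrayList.
    private final List<String> seeds =
            Collections.synchronizedList(new ArrayList<String>());

    void addSeed(String host) {
        seeds.add(host); // single calls are already thread-safe on the wrapper
    }

    int size() {
        return seeds.size();
    }

    // Option 1: hold the list's own lock for the entire iteration.
    boolean acceptsLocked(String host) {
        synchronized (seeds) {
            for (String seed : seeds) {
                if (seed.equals(host)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Option 2: copy under the lock, then iterate the private snapshot.
    boolean acceptsSnapshot(String host) {
        List<String> snapshot;
        synchronized (seeds) {
            snapshot = new ArrayList<String>(seeds);
        }
        for (String seed : snapshot) {
            if (seed.equals(host)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        final SeedIterationSketch scope = new SeedIterationSketch();
        scope.addSeed("eia.doe.gov");

        // A writer thread adds seeds while the main thread keeps iterating.
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                scope.addSeed("host" + i + ".example");
            }
        });
        writer.start();
        for (int i = 0; i < 1000; i++) {
            scope.acceptsLocked("eia.doe.gov");
            scope.acceptsSnapshot("eia.doe.gov");
        }
        writer.join();
        System.out.println("seeds: " + scope.size());
    }
}
```

Handing a bare iterator out of the class, as the trace suggests happens today, leaves the locking to every caller; keeping the iteration inside a method like the ones above is one way to avoid that, though the real refactoring may well look different.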
Should it be relevant, the few changes made to the
default configuration for this crawl, other than adding
a pile of seeds, were:
- HostScope
- max-link-hops 1
- total-bandwidth-usage-KB-sec 500
Otherwise, the crawl for this large seed list seems to
be proceeding apace.
Michael Stack
Date: 2007-03-14 00:10
Date: 2004-04-30 01:01 Logged In: YES
Date: 2004-04-28 15:25 Logged In: YES

| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2004-04-30 01:01 | stack-sf |
| resolution_id | None | 2004-04-30 01:01 | stack-sf |
| close_date | - | 2004-04-30 01:01 | stack-sf |
| category_id | None | 2004-04-28 15:25 | stack-sf |
| assigned_to | nobody | 2004-04-28 15:25 | stack-sf |